FWIW, KUMC manages this with a kludge of whitelists and regexps... Though I guess I've seen worse published as "natural language processing" methods, I agree that this isn't the sort of thing we should be doing for the PCORNet CDM.
https://github.com/kumc-bmi/heron/blob/heron-deercreek/heron_load/curated_data/componentids_whitelist.csv
https://github.com/kumc-bmi/heron/blob/heron-deercreek/heron_load/curated_data/lab_regex.csv

--
Dan

________________________________
From: Gpc-dev <gpc-dev-boun...@listserv.kumc.edu> on behalf of Susan Rea <susan....@imail.org>
Sent: Monday, September 28, 2020 7:17 PM
To: Manuel, Laura S M <manue...@uthscsa.edu>; Stoddard, Alexander <astodd...@mcw.edu>; gpc-dev@listserv.kumc.edu <gpc-dev@listserv.kumc.edu>
Subject: RE: CDM 6.0 review responses from MCW

Thank you, Laura and Alex, for reviewing the changes. I have a few comments on your comments, added inline below, marked with *SR*. We put our local team together to answer their Survey a few weeks ago so had an earlier chance for input.

Thanks,
Susan Rea

-----Original Message-----
From: Gpc-dev <gpc-dev-boun...@listserv.kumc.edu> On Behalf Of Manuel, Laura S M
Sent: Friday, September 25, 2020 11:15 AM
To: Stoddard, Alexander <astodd...@mcw.edu>; gpc-dev@listserv.kumc.edu
Subject: RE: CDM 6.0 review responses from MCW

Thanks Alex for diving into this first and stating things so eloquently. I agree with everything Alex put and would add:

>If we move the records from VITAL to OBS_CLIN, we need to merge the
>valuesets for the provenance fields. If we do that, OBSCLIN_SOURCE would
>contain OD (Order/EHR), RG (Registry/ancillary system) and HC (Healthcare
>delivery setting).
>There is a fair amount of overlap between these terms. We are proposing to
>deprecate OD and RG and utilize HC instead (we will make the same change to
>OBSGEN_SOURCE as well).
>Any concerns with this change?

Registries normally contain chart abstracted data which can be useful, but also adds an additional step for human error. I believe it would be useful to keep the distinction between potentially interpreted data and raw data from the EHR.
>Addition of Result_text

This value would likely be a free text field and this may allow PHI values to slip through. We would not recommend the addition of a free text field in a limited data set.

*SR* I agree we may have privacy issues populating free text, where we do find providers may use any convenient place to put a little note. So, we would have to carefully curate whatever data were requested. It would be helpful if DRNOC would identify specific lab tests that are needed or anticipated and narrow the curation task to what may be useful lab data. The "everything you got" strategy would really be difficult for sharing text results. Also, I like Alex's comment about bloating the table and his solution. /*SR*

>Addition of Raw Condition Text

This value is a free text field at one of our institutions and a value set at another. We could use the value set, but would not be able to add a free text string to a limited data set.

*SR* We have the patient-reported reason for visit as free text; it appears to be a literal short version of what the patient tells the clerk when making an appointment, or what they or their family told the admitting clerk. We also have the coding specialists' Admitting Dx ICD code for hospital visits. We would also be suspicious of PHI in this text. Unsure of the value proposition for this versus intake nursing notes, if hospital or ED encounter. The patient may not be a reliable reporter of symptoms if they are acutely ill at admission, or making an appointment for a sensitive problem. /*SR*

-----Original Message-----
From: Stoddard, Alexander <astodd...@mcw.edu>
Sent: Thursday, September 24, 2020 10:02 PM
To: gpc-dev@listserv.kumc.edu
Cc: Taylor, Bradley <btay...@mcw.edu>; rwait...@kumc.edu; Manuel, Laura S M <manue...@uthscsa.edu>
Subject: CDM 6.0 review responses from MCW

Hello GPC-DEV,

MCW agreed to review the CDM 6.0 spec during the dev call 2020-09-22.
The replies to DRNOC, using an Excel file template (available at https://urldefense.proofpoint.com/v2/url?u=https-3A__pcornet.imeetcentral.com_drnoc-2Dworkgroups_folder_WzIwLDEzMTI2ODA5XQ_&d=DwIFAw&c=II16XUCNF0uj2WHDMBdftpHZzyfqZU4E6o4J8m7Yfh-XF5deecOtjPXuMFvj1uWy&r=MwmdyHUR1MNPWZBi1oQ_Ksh4XI39nGu45nleZO875iA&m=ZgEr_8KiuJ9caTG5rYIXlYWcbHCYl2xRU1V7DOuK2Ok&s=3_hAUe-wl_W9YElfhi1wlFIpXD6IV9AyrIpsadNQMQY&e= ), have been requested by end of day Friday 2020-09-25. Below is a text version of the responses that I will be sending on behalf of MCW.

Main questions seeking feedback
-------------------------------------

>As the CDM has grown in size, the image included in the specification (Page 9)
> conveys less and less information.
>Any concerns if it is deleted?

Not a concern, but a highlighted list of changed tables/new columns on a single page is useful.

>Suggestions on what we might consider as a replacement?

A machine readable, diff-able and version controlled schema definition would be very useful. Potentially this would allow tool assisted SQL generation for the different RDBMS, or even visualization generation. A candidate for such a schema definition format would be that used by the SQLAlchemy Python package: https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.sqlalchemy.org_en_13_core_metadata.html&d=DwIFAw&c=II16XUCNF0uj2WHDMBdftpHZzyfqZU4E6o4J8m7Yfh-XF5deecOtjPXuMFvj1uWy&r=MwmdyHUR1MNPWZBi1oQ_Ksh4XI39nGu45nleZO875iA&m=ZgEr_8KiuJ9caTG5rYIXlYWcbHCYl2xRU1V7DOuK2Ok&s=o_ogbev5sUlo1LGRfUUSH3IayPshWLSZQ5TjU2TyMfg&e=

>Are there any concerns about the strategy to deprecate VITAL and move the
>records to OBS_CLIN?

OBS_CLIN is a much better data model for vitals but transitioning distinct columns in the VITALS table to a single column requiring different value-sets for different qualitative variables will be easier with a more agile and open process for value-set definition during the transition.
Open appending of additional values to a version controlled value set reference would offer projects much greater flexibility to adopt additional tests and observations throughout the CDM lifecycle without any loss of specificity, accuracy or backwards compatibility. This is especially true of _QUAL columns that will hold values for many different results/observations, unlike domain specific columns historically defined using the current process (e.g. RACE in the DEMOGRAPHIC table and SMOKING in the VITALS table). In general, qualitative value-sets should be defined on the codes used to specify given observation rows, not the whole _QUAL column.

>If we move the records from VITAL to OBS_CLIN, we need to merge the
>valuesets for the provenance fields. If we do that, OBSCLIN_SOURCE would
>contain OD (Order/EHR), RG (Registry/ancillary system) and HC (Healthcare
>delivery setting).
>There is a fair amount of overlap between these terms. We are proposing to
>deprecate OD and RG and utilize HC instead (we will make the same change to
>OBSGEN_SOURCE as well).
>Any concerns with this change?

EHR vs Registry seems like a valid source distinction. From experience the source fields are most often useful for data tracing in QC operations on individual records, rather than research and aggregation of the data. A richer value-set may therefore be of benefit to sites.

>Is the description for Telehealth encounters sufficient, or is more detail
>needed?

Description is sufficient but the real issue is likely the specificity with which these encounters (vs routine telephone or other electronic communications) are recorded in the source systems of sites.

>If we remove the VALUESET and VALUESET DESCRIPTOR columns from the
>FIELDS tab of the parseable file, would that pose a problem? (The
>VALUESETS tab would remain unchanged)

No problem. The data in these columns is much more easily used as represented in the VALUESETS tab.
A flag or categorical value to indicate a field uses a valueset on the VALUESETS tab would be useful.

General Comments
---------------------
None

Value Sets
-----------
See comments on the VITALS transition.

LAB_HISTORY table
---------------------
No particular issues with the schema definition. But MCW remains very dubious of the utility or accuracy possible with this table versus a centrally held one maintained by DRNOC. If a lab test is stable enough and well defined enough for population reference ranges (but doesn't have individual test normal ranges defined for a particular source) then a centrally maintained reference fallback is reasonable. When an assay does not have generalizable normal ranges, e.g. when run relative to a variable arbitrary reference and/or varying from machine to machine, then you really need a per record reference for the normal range and this table will be insufficiently granular and misleading. The spec reads 'Every record in this table should be unique.' but this is trivially true given each row has an arbitrary LABHISTORYID and uniqueness is otherwise undefined.

New / Modified fields
------------------------
LAB_RESULT_CM.RESULT_TEXT - Implementation concern - in MCW's experience SAS expands varchar columns to their maximum width; this will bloat table size if a column is sparsely populated with large records. Much more efficient would be a separate relational table with text results keyed by LAB_RESULT_CM_ID.
ENCOUNTER.ENCOUNTER_TYPE - No comment
ENCOUNTER.ADMITTING_SOURCE - No comment
CONDITION.CONDITION_SOURCE - Guidance on expected source of Chief Complaint would be useful; should it always be linked to an ENCOUNTER?
CONDITION.RAW_CONDITION_TEXT - No comment
OBS_CLIN.OBSCLIN_START_DATE - No comment
OBS_CLIN.OBSCLIN_START_TIME - No comment
OBS_CLIN.OBSCLIN_STOP_DATE - No comment
OBS_CLIN.OBSCLIN_STOP_TIME - No comment
OBS_CLIN.OBSCLIN_SOURCE - May be better to maintain EHR / Registry source distinction
OBS_CLIN.OBSCLIN_ABN_IND - No comment
OBS_GEN.OBSGEN_START_DATE - No comment
OBS_GEN.OBSGEN_START_TIME - No comment
OBS_GEN.OBSGEN_STOP_DATE - No comment
OBS_GEN.OBSGEN_STOP_TIME - No comment
OBS_GEN.OBSGEN_SOURCE - May be better to maintain EHR / Registry source distinction
OBS_GEN.OBSGEN_ABN_IND - No comment
OBS_GEN.OBSGEN_TABLE_MODIFIED - No comment
HARVEST.CDM_VERSION - No comment
HARVEST.TOKEN_ENCRYPTION_KEY - Is a better name TOKEN_ENCRYPTION_KEY_NAME ? - Please give an example in guidance
HARVEST.OBSCLIN_START_DATE_MGMT - No comment
HARVEST.OBSCLIN_STOP_DATE_MGMT - No comment
HARVEST.OBSGEN_START_DATE_MGMT - No comment
HARVEST.OBSGEN_STOP_DATE_MGMT - No comment

Best regards,

Alex Stoddard
Programmer/Analyst
Biomedical Informatics
Clinical & Translational Science Institute
Medical College of Wisconsin
astodd...@mcw.edu
I am currently working remotely
--------------------------------------------------

_______________________________________________
Gpc-dev mailing list
Gpc-dev@listserv.kumc.edu
https://urldefense.proofpoint.com/v2/url?u=http-3A__listserv.kumc.edu_mailman_listinfo_gpc-2Ddev&d=DwIFAw&c=II16XUCNF0uj2WHDMBdftpHZzyfqZU4E6o4J8m7Yfh-XF5deecOtjPXuMFvj1uWy&r=MwmdyHUR1MNPWZBi1oQ_Ksh4XI39nGu45nleZO875iA&m=ZgEr_8KiuJ9caTG5rYIXlYWcbHCYl2xRU1V7DOuK2Ok&s=sqDS7rNOmnY0GFXD95Vi3s8zTcYJA-GnOYQhJsxokUU&e=

NOTICE: This e-mail is for the sole use of the intended recipient and may contain confidential and privileged information. If you are not the intended recipient, you are prohibited from reviewing, using, disclosing or distributing this e-mail or its contents. If you have received this e-mail in error, please contact the sender by reply e-mail and destroy all copies of this e-mail and its contents.
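
Editorial illustration of the point made in the MCW responses that qualitative value-sets are better defined per observation code than for a whole _QUAL column. This is only a minimal Python sketch: the codes, the value-set contents, and the helper name `check_qual` are hypothetical and are not taken from the CDM specification.

```python
# Hypothetical per-code value sets: each observation code maps to the
# qualitative results that are valid for that code only.
VALUESETS_BY_CODE = {
    "SMOKING": {"current", "former", "never", "unknown"},
    "BP_POSITION": {"sitting", "standing", "supine", "unknown"},
}


def check_qual(code, result_qual):
    """Return True if result_qual is allowed for this observation code.

    Codes without a registered value set are rejected, so supporting a new
    observation means appending a new entry rather than widening a single
    column-wide list.
    """
    allowed = VALUESETS_BY_CODE.get(code)
    return allowed is not None and result_qual in allowed


rows = [
    ("SMOKING", "former"),
    ("BP_POSITION", "sitting"),
    ("SMOKING", "sitting"),  # fine for the column as a whole, wrong for this code
]

# Rows whose qualitative result is not in the value set for their own code.
invalid = [row for row in rows if not check_qual(*row)]
print(invalid)
```

A column-wide value set would accept the third row, because "sitting" is a legal value somewhere in the column; the per-code lookup is what catches it.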
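
Editorial illustration of the alternative raised in the thread for RESULT_TEXT, namely a separate relational table of text results keyed by LAB_RESULT_CM_ID. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (the concern in the thread is about SAS); the side-table name LAB_RESULT_TEXT, the reduced column list, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE LAB_RESULT_CM (
        LAB_RESULT_CM_ID TEXT PRIMARY KEY,
        RESULT_NUM REAL
    );
    -- Hypothetical side table: a row exists only for results that have text.
    CREATE TABLE LAB_RESULT_TEXT (
        LAB_RESULT_CM_ID TEXT PRIMARY KEY
            REFERENCES LAB_RESULT_CM (LAB_RESULT_CM_ID),
        RESULT_TEXT TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO LAB_RESULT_CM VALUES (?, ?)",
    [("L1", 4.2), ("L2", None), ("L3", 7.0)],
)
conn.execute(
    "INSERT INTO LAB_RESULT_TEXT VALUES (?, ?)",
    ("L2", "specimen hemolyzed, result not reported"),
)

# The text is joined in only when a query actually needs it.
joined = conn.execute(
    """
    SELECT r.LAB_RESULT_CM_ID, r.RESULT_NUM, t.RESULT_TEXT
    FROM LAB_RESULT_CM AS r
    LEFT JOIN LAB_RESULT_TEXT AS t
        ON t.LAB_RESULT_CM_ID = r.LAB_RESULT_CM_ID
    ORDER BY r.LAB_RESULT_CM_ID
    """
).fetchall()
print(joined)
```

With this layout the wide text column never appears in the main lab table, so sparsely populated text does not widen every lab row; it would also let a site withhold the text table from a limited data set without touching LAB_RESULT_CM.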