I've finished another pass through the reader that takes the SHARP Knowtator data and reads it into the cTAKES UIMA type system. The class is:
  org.apache.ctakes.core.ae.SHARPKnowtatorXMLReader

If you take a look at that, you'll see a ton of TODO notes and warnings where I couldn't figure out how to map the Knowtator annotations to the cTAKES UIMA annotations. Here's a list of issues:

* I couldn't find an entity type for "Clinical_attribute", "Devices", "Lab", "Phenomena".

* I couldn't find a modifier type (or alternatively, an Annotation subclass) for the Knowtator annotations "generic_class", "conditional_class", "uncertainty_indicator_class", "distal_or_proximal", "Person", "negation_indicator_class", "historyOf_indicator_class", "superior_or_inferior", "medial_or_lateral", "dorsal_or_ventral", "method_class", "device_class", "allergy_indicator_class", "Route", "Form", "Strength", "Strength number", "Strength unit", "Frequency", "Frequency number", "Frequency unit", "Value", "Value number", "Value unit", "estimated_flag_indicator", "reference_range", "Date", "Status change", "Duration", "Dosage".

* I couldn't find a place for the normalized value of "generic_class", "conditional_class", "uncertainty_indicator_class", "distal_or_proximal", "Person", "negation_indicator_class", "superior_or_inferior", "medial_or_lateral", "dorsal_or_ventral", "device_class", "allergy_indicator_class", "lab_interpretation_indicator", "estimated_flag_indicator".

* I couldn't find a place for the "associatedCode" of a "Person" or "historyOf_indicator_class".

* There were several things in the Knowtator annotations where I couldn't even guess what they meant: "Attributes_lab", "Temporal", ":THING", "Entities".

After working with this data, I think we should consider having separate UIMA Annotation sub-types for each of the things that are Modifiers now. For example, if we have a real Severity Annotation for textual mentions of severity, then the CAS makes it easy to select these. We have exactly this use case in the relation extractor: we need just the Severity modifiers, excluding all the other modifiers.
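To make the Severity case concrete, here is a minimal sketch of the difference between the two designs. It is plain Java, not UIMA: the list standing in for the CAS, the classes Annotation, Modifier, Severity and Negation, the "kind" string feature, and the helper methods are all hypothetical names made up for illustration, and none of them are the actual cTAKES types.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a List stands in for the CAS. Nothing here is UIMA or cTAKES API.
class TypedSelectDemo {

    // Stand-in for a UIMA Annotation: a span of text.
    static class Annotation {
        final int begin;
        final int end;
        Annotation(int begin, int end) { this.begin = begin; this.end = end; }
    }

    // Current style (assumed): one generic Modifier, distinguished by a string feature.
    static class Modifier extends Annotation {
        final String kind;
        Modifier(int begin, int end, String kind) { super(begin, end); this.kind = kind; }
    }

    // Proposed style: one Annotation sub-type per kind of modifier.
    static class Severity extends Annotation {
        Severity(int begin, int end) { super(begin, end); }
    }

    static class Negation extends Annotation {
        Negation(int begin, int end) { super(begin, end); }
    }

    // Select everything of a given type, the way a type-indexed store would.
    static <T extends Annotation> List<T> select(List<Annotation> cas, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (Annotation a : cas) {
            if (type.isInstance(a)) {
                result.add(type.cast(a));
            }
        }
        return result;
    }

    // With a generic Modifier, the caller selects all modifiers and then
    // filters on the string feature by hand.
    static List<Modifier> selectModifiersOfKind(List<Annotation> cas, String kind) {
        List<Modifier> result = new ArrayList<>();
        for (Modifier m : select(cas, Modifier.class)) {
            if (kind.equals(m.kind)) {
                result.add(m);
            }
        }
        return result;
    }

    // A small mixed "CAS" containing both styles side by side.
    static List<Annotation> sampleCas() {
        List<Annotation> cas = new ArrayList<>();
        cas.add(new Modifier(0, 4, "severity"));
        cas.add(new Modifier(5, 7, "negation"));
        cas.add(new Severity(10, 14));
        cas.add(new Negation(15, 17));
        cas.add(new Severity(20, 28));
        return cas;
    }

    public static void main(String[] args) {
        List<Annotation> cas = sampleCas();
        // Generic style: select, then filter on a string.
        System.out.println(selectModifiersOfKind(cas, "severity").size());
        // Sub-type style: the type itself is the query.
        System.out.println(select(cas, Severity.class).size());
    }
}
```

In the sub-type style the query is just the class, so a typo in the kind name becomes a compile error rather than an empty result. Whether the real CAS indexes would behave this way for the cTAKES types depends on how the type system ends up being defined.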
Basically, I think the principle we should follow in UIMA is: "If you could imagine searching the CAS for something, then that something should have its own Annotation sub-type."

So, I think we need Annotation sub-types (not TOP sub-types) for:

// linguistic phenomena
Generic
Conditional
Negation
Uncertainty
Estimated
HistoryOf
Person

// for disease/disorder/sign/symptom
Course
BodyLaterality (covering distal_or_proximal, superior_or_inferior, etc.)
BodySide

// for procedure
ProcedureMethod
ProcedureDevice

// for medication
MedicationAllergyIndicator
MedicationDosage
MedicationDuration
MedicationForm
MedicationFrequency
MedicationRoute
MedicationStartDate (maybe?)
MedicationStatusChange
MedicationStrength

// for lab
LabValue
LabInterpretation
LabReferenceRange

Steve

P.S. SHARPKnowtatorXMLReader can parse all the UMLS_CEM data that's on the cloud right now. So once all these type system issues get sorted out, it should be pretty much ready to go.
