The "cTAKESParser" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/cTAKESParser?action=diff&rev1=2&rev2=3

  = Signing up for a UMLS account =
  To use cTAKES and the cTAKES Tika Parser you need a Unified Medical Language System (UMLS) account.
- You can sign up for one [[https://uts.nlm.nih.gov/home.html|here]].
+ You can sign up for one [[https://uts.nlm.nih.gov/home.html|here]]. It can take up to 3 business
+ days to get an account, so be patient. Once your account is approved you can use the cTAKESParser
+ and read on. A planned improvement is to provide a means to include the offline UMLS dictionary.
+ = Prepare your cTAKES configuration properties file =
+
+ The cTAKESParser requires a configuration properties file. You can find an example [[https://issues.apache.org/jira/secure/attachment/12737116/CTAKESConfig.properties|here]] on [[https://issues.apache.org/jira/browse/TIKA-1645|TIKA-1645]].
+
+ Edit it as follows (replace $HOME below with your actual home path).
+
+ {{{
+ aeDescriptorPath=$HOME/desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
+ text=false
+ annotationProps=BEGIN,END,ONTOLOGY_CONCEPT_ARR
+ separatorChar=:
+ metadata=Study Title,Study Description
+ UMLSUser=your_username
+ UMLSPass=your_password
+ }}}
+
+ Analysis is performed on the text and/or metadata extracted by the AutoDetectParser. cTAKESParser decorates AutoDetectParser: it takes the extracted metadata, the extracted text, or both, and adds `ctakes:`-prefixed metadata for procedures, medications, diseases and other extracted information.
+ To use the cTAKESParser, set the `metadata` property to a comma-separated list of the metadata fields to search for medical terminology. If you would also like the parser to search the text extracted by Tika, set `text=true`.
+
+ The `annotationProps` property is a comma-separated list of the cTAKES properties to extract, and `separatorChar` is the character used to separate them in the extracted field. With the configuration above, we are telling cTAKES to extract first the begin and end offsets of the found text (BEGIN,END), and then the ontology concept array (ONTOLOGY_CONCEPT_ARR). UMLS assigns an identifier to each term, which can then be used to search UMLS for more information about that term; this array includes the UMLS pointer to the term, plus any similar terms that were also identified. An example of the extracted annotationProps would be:
+
+ `mantle cell lymphoma:40592:40612:C0334634,C0334634,C0334634,C0334634`
+
+ In this example, a cTAKES DiseaseDisorderMention of `mantle cell lymphoma` is identified, followed by its associated annotation props: the text begins at position 40592 and ends at position 40612 (which could be used for highlighting), and then an associated array of medical ontology concept identifiers is provided, i.e., C0334634,C0334634,C0334634,C0334634.
+
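As a rough illustration of how a client might consume such a value, here is a minimal Python sketch that splits it back into its parts. It assumes the configuration shown above (`annotationProps=BEGIN,END,ONTOLOGY_CONCEPT_ARR`, `separatorChar=:`) and that the concept array is comma-joined as in the example. The `Annotation` class and `parse_annotation` helper are hypothetical names for illustration only; they are not part of Tika or cTAKES.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    text: str               # the matched span, e.g. "mantle cell lymphoma"
    begin: int              # BEGIN offset in the analysed text
    end: int                # END offset in the analysed text
    concept_ids: List[str]  # ONTOLOGY_CONCEPT_ARR entries (UMLS identifiers)


def parse_annotation(value: str, separator: str = ":") -> Annotation:
    """Split one extracted value into text, offsets and concept identifiers.

    Assumes the property order BEGIN,END,ONTOLOGY_CONCEPT_ARR. Splitting from
    the right keeps any separator that occurs inside the matched text itself.
    """
    text, begin, end, concepts = value.rsplit(separator, 3)
    ids = [c for c in concepts.split(",") if c]
    return Annotation(text, int(begin), int(end), ids)


example = "mantle cell lymphoma:40592:40612:C0334634,C0334634,C0334634,C0334634"
ann = parse_annotation(example)

# The same identifier can appear several times, so de-duplicate before lookup.
unique_ids = sorted(set(ann.concept_ids))
print(ann.text, ann.begin, ann.end, unique_ids)
```

If `annotationProps` or `separatorChar` is changed in the properties file, the number of fields and the separator passed to the helper would need to change to match.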
