Wow, great work. Thank you for sharing.
John Green

— Sent from Mailbox

On Thu, Dec 18, 2014 at 6:08 PM, Bruce Tietjen <bruce.tiet...@perfectsearchcorp.com> wrote:

> Actually, we are working on a similar tool to compare the results to the
> human adjudicated standard for the set we tested against. I didn't mention
> it before because the tool isn't complete yet, but initial results for the
> set (excluding those marked as "CUI-less") were as follows:
>
> Human adjudicated annotations: 4591 (excluding CUI-less)
>
> Annotations found matching the human adjudicated standard:
> UMLSProcessor        2245
> FastUMLSProcessor     215
>
> Bruce Tietjen
> Senior Software Engineer
> IMAT Solutions <http://imatsolutions.com>
> Mobile: 801.634.1547
> bruce.tiet...@imatsolutions.com
>
> On Thu, Dec 18, 2014 at 3:37 PM, Chen, Pei <pei.c...@childrens.harvard.edu> wrote:
>>
>> Bruce,
>> Thanks for this -- very useful.
>> Perhaps Sean Finan can comment more,
>> but it is probably also worth comparing to an adjudicated human
>> annotated gold standard.
>>
>> --Pei
>>
>> -----Original Message-----
>> From: Bruce Tietjen [mailto:bruce.tiet...@perfectsearchcorp.com]
>> Sent: Thursday, December 18, 2014 1:45 PM
>> To: dev@ctakes.apache.org
>> Subject: cTakes Annotation Comparison
>>
>> With the recent release of cTAKES 3.2.1, we were very interested in
>> checking for any differences in annotations between using the
>> AggregatePlaintextUMLSProcessor pipeline and the
>> AggregatePlaintextFastUMLSProcessor pipeline within this release of
>> cTAKES with its associated set of UMLS resources.
>>
>> We chose to use the SHARE 14-a-b training data, which consists of 199
>> documents (Discharge 61, ECG 54, Echo 42, and Radiology 42), as the
>> basis for the comparison.
>>
>> We decided to share a summary of the results with the development
>> community.
>>
>> Documents processed: 199
>>
>> Processing time:
>> UMLSProcessor        2,439 seconds
>> FastUMLSProcessor    1,837 seconds
>>
>> Total annotations reported:
>> UMLSProcessor        20,365 annotations
>> FastUMLSProcessor     8,284 annotations
>>
>> Annotation comparisons:
>> Annotations common to both sets: 3,940
>> Annotations reported only by the UMLSProcessor: 16,425
>> Annotations reported only by the FastUMLSProcessor: 4,344
>>
>> If anyone is interested, the following was our test procedure:
>>
>> We used the UIMA CPE to process the document set twice, once using the
>> AggregatePlaintextUMLSProcessor pipeline and once using the
>> AggregatePlaintextFastUMLSProcessor pipeline. We used the WriteCAStoFile
>> CAS consumer to write the results to output files.
>>
>> We then used a tool we recently developed to analyze and compare the
>> annotations generated by the two pipelines. The tool compares the two
>> outputs for each file and reports any differences in the annotations
>> (MedicationMention, SignSymptomMention, ProcedureMention,
>> AnatomicalSiteMention, and DiseaseDisorderMention) between the two
>> output sets. The tool reports the number of 'matches' and 'misses'
>> between the annotation sets. A 'match' is defined as an identified
>> source text interval with its associated CUI appearing in both
>> annotation sets. A 'miss' is defined as an identified source text
>> interval and its associated CUI appearing in one annotation set with no
>> matching identified source text interval and CUI in the other. The tool
>> also reports the total number of annotations (source text intervals with
>> associated CUIs) reported in each annotation set.
>>
>> The compare tool is in our GitHub repository at
>> https://github.com/perfectsearch/cTAKES-compare
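
For anyone who wants to experiment before trying the tool above, here is a minimal sketch of the span + CUI matching logic described in the thread. It is not the perfectsearch/cTAKES-compare tool itself; the class and record names are made up, and it assumes the annotations (source text interval plus CUI) have already been extracted from the CAS output files.

    // Hypothetical sketch of a span + CUI comparison between two pipeline outputs.
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class AnnotationCompare {

        /** One identified source text interval with its associated CUI. */
        record Annotation(int begin, int end, String cui) {}

        /** Counts of matches and misses for one direction of the comparison. */
        record Result(int matches, int misses) {}

        /**
         * A 'match' is an annotation (interval + CUI) present in both sets;
         * a 'miss' is an annotation present in 'left' but absent from 'right'.
         */
        static Result compare(List<Annotation> left, List<Annotation> right) {
            Set<Annotation> rightSet = new HashSet<>(right);
            int matches = 0, misses = 0;
            for (Annotation a : left) {
                if (rightSet.contains(a)) {
                    matches++;
                } else {
                    misses++;
                }
            }
            return new Result(matches, misses);
        }

        public static void main(String[] args) {
            // Toy example with made-up spans and CUIs.
            List<Annotation> umls = List.of(
                new Annotation(10, 18, "C0020538"),
                new Annotation(42, 47, "C0011849"));
            List<Annotation> fast = List.of(
                new Annotation(10, 18, "C0020538"));

            Result r = compare(umls, fast);
            System.out.println("matches=" + r.matches() + " misses=" + r.misses());
        }
    }

Running compare() in both directions (UMLS vs. Fast, then Fast vs. UMLS) would give the "reported only by" counts in each direction, with the shared matches corresponding to the annotations common to both sets.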