Re: UmlsOverlapLookupAnnotator + BsvRareWordDictionary: # tokens skipped varies?

Kean Kaufmann Wed, 07 Mar 2018 09:29:19 -0800

P.S. Extra config bit:  I also removed "CD" from the exclusionTags in the
UmlsOverlapLookupAnnotator.



On Wed, Mar 7, 2018 at 10:58 AM, Kean Kaufmann <[email protected]> wrote:

> Hi Sean,
>
> I'm perplexed. It seems as if the number of tokens that the
> UmlsOverlapLookupAnnotator will skip varies with the content of the
> RareWordDictionary.
>
> Here's my setup.  I think I've included enough information to replicate my
> perplexity, if you have time/inclination to do that; let me know if I've
> left anything out.
>
> I have a custom dictionary built from UMLS sources including SNOMEDCT_US:
>
> sql> select cui,text from cui_terms where text='chronic kidney disease' or
>> cui in (2316786,2316787);
>>     CUI  TEXT
>> -------  --------------------------------
>> 1561643  chronic kidney disease
>> 2316787  stage 3 chronic kidney disease
>> 2316787  chronic kidney disease stage 3
>> 2316787  chronic kidney disease , stage 3
>> 2316787  ckd stage 3
>> 2316786  chronic kidney disease stage 2
>> 2316786  chronic kidney disease , stage 2
>> 2316786  stage 2 chronic kidney disease
>> 2316786  ckd stage 2
>> Fetched 9 rows.
>> sql>
>
>
> My documents contain acronym expansions and Roman numerals for stages,
> like this:
>
> Problem List:
>> CKD (chronic kidney disease), stage II
>> Decubitus ulcer - grade II
>
>
> So I create a BSV RareWordDictionary to capture the Roman numerals.
> I don't want to have to guess at all the possible punctuation variations,
> so I try to make my entries as general as safely possible,
> using the UmlsOverlapLookupAnnotator with consecutiveSkips set to 2.
>
> C2316786|chronic kidney disease II
> C2316787|chronic kidney disease III
>
> I add dictionary and dictionaryConceptPair entries for my BSV file to
> cTakesHsql.xml as shown in the example/ directory, using
> SemanticCleanupTermConsumer as rareWordConsumer.
>
> Success! Now "chronic kidney disease), stage II" gets annotated as a
> DiseaseDisorderMention with CUI C2316786.
>
> But a couple of things confuse me.
>
> *1. Removing an entry*
>
> If I remove the other BSV entry, "chronic kidney disease III",
> "chronic kidney disease), stage II" isn't identified anymore:
> suddenly it only annotates "chronic kidney disease", with C1561643.
>
> *2. Adding an entry*
>
> My documents also have staging language for ulcers, e.g. "Decubitus ulcer
> - grade II".
>
> If I add an entry for this to my BSV dictionary, so now I have:
>
> C2316786|chronic kidney disease II
> C2316787|chronic kidney disease III
> C1720518|decubitus ulcer II
>
> and annotate this text:
>
> Problem List:
>> CKD (chronic kidney disease), stage II
>> Decubitus ulcer - grade II
>
>
> then "Decubitus ulcer - grade II" gets annotated as a
> DiseaseDisorderMention with C1720518, as hoped.
> But on the CKD line only "chronic kidney disease" is identified, as in
> case 1 above; "stage II" gets left out.
>
> *3. Adding a comma*
>
> If I add an entry with a comma in it:
>
> C2316786|chronic kidney disease , II
>
> then "chronic kidney disease), stage II" gets picked up, no matter what.
>
> Without the comma entry, matching means skipping three consecutive tokens
> (")", ",", and "stage"), one more than my consecutiveSkips of 2. Sometimes
> the annotator seems willing to do that, and sometimes it isn't.
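
To make the token counting concrete, here is a rough Python sketch. It is not cTAKES code and does not reproduce the annotator's lookup logic: max_consecutive_skips is a hypothetical helper, and the token lists assume the tokenizer splits ")" and "," into separate tokens and keeps "II" as a single token. It only reports the longest run of text tokens that has to be skipped for a dictionary entry to line up with the text.

```python
def max_consecutive_skips(text_tokens, entry_tokens):
    """Longest run of text tokens skipped between matched entry tokens.

    Hypothetical helper for illustration only. Matching is greedy,
    left to right, and case-insensitive. Returns None if the entry
    tokens do not all occur in order in the text.
    """
    longest = 0   # longest run of skipped tokens seen so far
    run = 0       # current run of skipped tokens
    pos = 0       # index of the next entry token to match
    for tok in text_tokens:
        if pos == len(entry_tokens):
            break
        if tok.lower() == entry_tokens[pos].lower():
            longest = max(longest, run)
            run = 0
            pos += 1
        elif pos > 0:
            # Only count skips after the first entry token has matched.
            run += 1
    return longest if pos == len(entry_tokens) else None


# Assumed tokenization of the two Problem List lines.
ckd_text = ["chronic", "kidney", "disease", ")", ",", "stage", "II"]
ulcer_text = ["Decubitus", "ulcer", "-", "grade", "II"]

print(max_consecutive_skips(ckd_text, "chronic kidney disease II".split()))    # 3
print(max_consecutive_skips(ckd_text, "chronic kidney disease , II".split()))  # 1
print(max_consecutive_skips(ulcer_text, "decubitus ulcer II".split()))         # 2
```

Under those assumptions, only the comma-free CKD entry needs more skips than a consecutiveSkips of 2 allows, which is the case where the results are inconsistent.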
>
> Is this expected behavior?
> If so, can you help me understand what to expect?
> At this point I hesitate to add anything to the BSV dictionary!
>
> Many thanks,
> Kean
>
>
