Hi James,

>> Will the new dictionary lookup use the canonicalForm?

It does use WordToken.getCanonicalForm(). Usually this seems to be empty, but as long as it is present it will be used.

-----Original Message-----
From: andy mcmurry [mailto:mcmurry.a...@gmail.com]
Sent: Tuesday, April 22, 2014 4:23 AM
To: dev@ctakes.apache.org
Subject: Re: new dictionary lookup {was RE: lvg entries]

Highly Relevant

*DNorm: disease name normalization*
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3810844/

"Disease names are often created by combining roots and affixes from Greek or Latin (e.g. ‘hemochromatosis’)" ....

On Mon, Apr 21, 2014 at 8:57 AM, Masanz, James J. <masanz.ja...@mayo.edu> wrote:

> Sean,
>
> Will the new dictionary lookup use the canonicalForm? If not, perhaps
> you can remove LVG from at least some of the pipelines (drug-ner does
> not include the dependency parser)
>
> -----Original Message-----
> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
> Sent: Thursday, April 17, 2014 12:52 PM
> To: dev@ctakes.apache.org
> Subject: RE: lvg entries
>
> Those variants are not used by the dictionary lookup. I did look at
> them to see if it was worthwhile for the new dictionary, but they are
> all over the place so I passed.
> ________________________________________
> From: Miller, Timothy [timothy.mil...@childrens.harvard.edu]
> Sent: Thursday, April 17, 2014 1:25 PM
> To: dev@ctakes.apache.org
> Subject: Re: lvg entries
>
> Pei and I had a similar discussion in person -- mapping from lexical
> variants to a stem might be useful. Pei also mentioned that one
> intended use might have been searching the dictionary with lexical
> variants, but I don't think that is done. Looking at the precision of
> the variants, I think it's highly unlikely the speed tradeoff would be
> worth any improvements in recall.
>
> Finally, at least in Eclipse, doing a search on references to the
> method to retrieve the lemma entries turns up nothing.
>
> Tim
>
>
> On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
> > I don't know of any applications within cTAKES that make use of this...
> > The reverse (mapping from these "variants" to the normal form) may be
> > useful though.
> >
> > Dima
> >
> >
> > On Apr 17, 2014, at 11:50, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
> >
> >> Sure, just as an example, I gave it a note with about 1000 words.
> >> It generates 11500 NonEmptyFSList elements (each is basically one
> >> lexical variant).
> >>
> >> For the word "symptomatic", these are the first 10 of 20 lexical variants:
> >> Symptomaticer/JJ
> >> Symptomaticer/RB
> >> Symptomaticed/VB
> >> Symptomaticcing/VB
> >> Symptomatics/VB
> >> Symptomatics/NN
> >> Symptomaticked/VB
> >> Symptomatic/VB
> >> Symptomatic/JJ
> >> Symptomatic/RB
> >>
> >> Tim
> >>
> >>
> >> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
> >>> Tim, this is a very interesting observation. Could you please send
> >>> a few examples of what LVG generates? Both sensical and non :)
> >>>
> >>> Dima
> >>>
> >>>
> >>> On Apr 17, 2014, at 11:28, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
> >>>
> >>>> The LVG annotator creates an enormous number of "lemmas" for
> >>>> every WordToken in the CAS, and I'm wondering what the original
> >>>> purpose was? I think this is probably a minor bottleneck for speed
> >>>> but mostly a pretty big space hog (at least 50% of the space of
> >>>> xmi files in my tests).
> >>>>
> >>>> As of right now I'm not sure if any downstream components are
> >>>> using these lemmas, and on a manual inspection the precision
> >>>> seems to be pretty abysmal (meaning most of them are nonsensical
> >>>> as lexical variants), so as I said, just wondering if we can
> >>>> revisit why cTAKES generates so many and whether that component can be
> >>>> optimized.
> >>>>
> >>>> Thanks
> >>>> Tim
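For anyone reading this thread later, here is a rough sketch of the behaviour described in the reply at the top: use the canonical form when the token has one, otherwise use something else. The thread does not say what the lookup falls back to, so the fallback to the token's covered text is an assumption. TokenSketch is a hypothetical stand-in for cTAKES' WordToken, reduced to two accessors so the example runs without cTAKES or UIMA on the classpath, and lookupText is an illustrative name, not a method from the actual dictionary lookup.

```java
// Hypothetical stand-in for cTAKES' WordToken, reduced to the two
// accessors this sketch needs. Not the real type.
final class TokenSketch {
    private final String coveredText;
    private final String canonicalForm; // may be null or empty

    TokenSketch(String coveredText, String canonicalForm) {
        this.coveredText = coveredText;
        this.canonicalForm = canonicalForm;
    }

    String getCoveredText() {
        return coveredText;
    }

    String getCanonicalForm() {
        return canonicalForm;
    }
}

final class LookupTextSketch {
    private LookupTextSketch() {
    }

    // Returns the canonical form when it is present and non-empty.
    // Assumption: otherwise fall back to the token's covered text.
    static String lookupText(TokenSketch token) {
        String canonical = token.getCanonicalForm();
        if (canonical != null && !canonical.isEmpty()) {
            return canonical;
        }
        return token.getCoveredText();
    }

    public static void main(String[] args) {
        // Token with a canonical form set (e.g. by an upstream annotator).
        System.out.println(lookupText(new TokenSketch("symptoms", "symptom")));
        // Token whose canonical form is empty, the usual case per the reply.
        System.out.println(lookupText(new TokenSketch("symptoms", "")));
    }
}
```

Since the canonical form is reported to be empty most of the time, the fallback branch would be the common path in practice.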