Re: Best combination of analysis engines to consider negation, family history, uncertainty, etc.

Zuo Yiming Thu, 20 Oct 2016 06:56:40 -0700

Hi Sean and Guergana,

Thanks for your reply about the fast and non-fast dictionary look-up, and
the testing dataset. Originally, I thought the fast annotator is fast
because it only takes a portion of the whole dictionary. Now I realize the
fast annotator is the more powerful one. That's very helpful.


For Guergana,

Were you also trying to attach the exec summary? I couldn't see it in the
email.

Best,
Yiming

On Wed, Oct 19, 2016 at 1:03 PM, Savova, Guergana <
guergana.sav...@childrens.harvard.edu> wrote:

> Hi Yiming,
> Re your question about gold standard datasets. In parallel with releasing
> best performing methods in cTAKES, we have generated several gold standard
> datasets. Our plan is to start distributing them through a unified effort
> -- a health NLP Center. See attached exec summary. We hope to have the
> Center running in the very near future.
>
> Cheers,
> --Guergana
>
> -----Original Message-----
> From: Zuo Yiming [mailto:yiming...@gmail.com]
> Sent: Wednesday, October 19, 2016 12:22 PM
> To: dev@ctakes.apache.org
> Subject: Re: Best combination of analysis engines to consider negation,
> family history, uncertainty, etc.
>
> Hi Sean and Timothy,
>
> Thanks for your clarification about ClearTK tools. I'm amazed by the power
> of cTAKES and by the resources and community you have worked to build. I
> will certainly be happy to provide more feedback as my project moves on.
>
> For Timothy,
>
> By rule-based system, do you refer to the assertion annotator? How about
> the old negation annotator and the status annotator, are they also
> rule-based systems? I get the feeling that the assertion annotator and the
> ClearTK system are currently favored over the negation annotator and the
> status annotator in cTAKES for some reason.
>
> Regarding the ClearTK system on my test files, the negation, history, and
> uncertainty modules work just as well as the assertion annotator. I have
> only a few test files, so it's really hard to tell which one is better. The
> main difference appears when detecting the subject and generic properties.
> On my limited test files, the ClearTK system doesn't work at all for these.
> It assigns patient as the subject for all detected phrases, even when it is
> the patient's family member who has diabetes. The same problem applies to
> the generic property: the ClearTK system assigns false as the generic
> property for all detected phrases. The paper mentioned by you and Sean
> seems interesting; I will take a look later.
>
> As a further question, can you give me some suggestions on where to find
> public gold standard datasets, so I can conduct an independent evaluation
> of cTAKES with metrics like precision, recall, and F1 score?
>
> Lastly, a minor suggestion from the user perspective would be to add the
> preferred words property to the AggregatePlaintextUMLSProcessor. As I
> pointed out briefly in my first email, with
> AggregatePlaintextFastUMLSProcessor we can get the preferred words for
> detected phrases, but not with AggregatePlaintextUMLSProcessor. This
> property is very helpful when the detected phrases are acronyms such as pt
> for patient. In my experience, AggregatePlaintextUMLSProcessor tends to
> detect more clinically relevant phrases than
> AggregatePlaintextFastUMLSProcessor. It would be really nice to have the
> same preferred words property in AggregatePlaintextUMLSProcessor in a
> future cTAKES release.
>
> Best,
> Yiming
>
> On Wed, Oct 19, 2016 at 11:11 AM, Miller, Timothy <
> timothy.mil...@childrens.harvard.edu> wrote:
>
> > I can second Sean's thank you, it is good to have this feedback. The
> > ClearTK machine learning models were made the default after we ran
> > some experiments that found it performed better across a range of
> > standard datasets than rule-based algorithms or the existing cTAKES
> > module ( http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112774 ).
> > Since making them the default, though, we have heard from people and
> > had our own experience conflict with those experiments. And certainly
> > the errors in the rule-based system are easier to understand.
> >
> > Just curious, are you able to characterize the errors you see from the
> > ClearTK system? I did some experiments recently on a new dataset
> > comparing negex with the cleartk negation module and found that there
> > was a precision/recall tradeoff but almost identical F1 scores. But
> > for that dataset the tradeoff negex provided was preferred by our
> > collaborators. (I think negex had better recall of negated terms but
> > worse precision).
> >
> > Tim
> >
> >
> >
> > ________________________________________
> > From: Finan, Sean <sean.fi...@childrens.harvard.edu>
> > Sent: Wednesday, October 19, 2016 10:53 AM
> > To: dev@ctakes.apache.org
> > Subject: RE: Best combination of analysis engines to consider
> > negation, family history, uncertainty, etc.
> >
> > Hi Yiming,
> >
> >
> >
> > Thank you very much for letting the community know what has and has
> > not worked for you.  I have also had better results with the Assertion
> > annotators than the ClearTk alternatives, but that could be because of
> > the note types/formats that I am using.
> >
> >
> >
> > Regarding the "Clear" in names, it is because ClearTk (Clear ToolKit)
> > is used to train machine learning models for detection of the
> > indicated property.  You can find information on ClearTk starting here:
> > http://clear.colorado.edu/compsem/
> >
> >
> >
> > If you prefer to read a paper, you can check out
> > http://www.lrec-conf.org/proceedings/lrec2014/pdf/218_Paper.pdf
> >
> >
> >
> > Others on the dev list can provide much more information than I can, so
> > you could post a question if you like.
> >
> >
> >
> > Cheers,
> >
> > Sean
> >
> >
> >
> > -----Original Message-----
> >
> > From: Zuo Yiming [mailto:yiming...@gmail.com]
> >
> > Sent: Wednesday, October 19, 2016 10:04 AM
> >
> > To: u...@ctakes.apache.org; dev@ctakes.apache.org
> >
> > Subject: Best combination of analysis engines to consider negation,
> > family history, uncertainty, etc.
> >
> >
> >
> > Hi everyone,
> >
> >
> >
> > I've spent the last few months working on a clinical NLP project
> > using cTAKES. It's a very complex system to me, and every time I dig
> > into it I make some new discoveries. Since last week, I have been trying
> > to figure out which analysis engines do a good job of handling cases
> > like negation, family history, uncertainty, etc. I now have some
> > experience that I would like to share with the community.
> >
> >
> >
> > The best combination for me is to use
> > assertionMiniPipelineAnalysisEngine for negation, uncertainty, generic
> > and subject detection, and HistoryCleartkAnalysisEngine for history
> > detection. Both engines are in the desc/ctakes-assertion folder. The
> > assertionMiniPipelineAnalysisEngine also claims to be useful for
> > conditional detection, which I haven't verified using my test files yet.
> >
> >
> >
> > I'm using the AggregatePlaintextFastUMLSProcessor on the higher level.
> > The default analysis engines in AggregatePlaintextFastUMLSProcessor
> > for negation, uncertainty, generic, etc. are StatusAnnotator +
> > NegationAnnotator + PolarityCleartkAnalysisEngine +
> > SubjectCleartkAnalysisEngine + UncertaintyCleartkAnalysisEngine +
> > GenericCleartkAnalysisEngine + HistoryCleartkAnalysisEngine. It looks
> > like in the node part, StatusAnnotator and NegationAnnotator are
> > commented out, so only the remaining five analysis engines are
> > actually used and all of them are in the same desc/ctakes-assertion
> > folder. These five analysis engines were not effective in my test
> > files and I'm still confused by their relationship to the
> > assertionaAnalysisEngine, conceptConverterAnalysisEngine,
> > GenericAttributeAnalysisEngine and SubjectAttributeAnalysisEngine used
> > in assertionMiniPipelineAnalysisEngine.
> >
> > It looks to me like the Clear in their names indicates something, but I
> > couldn't figure it out without going through the Java code, which I
> > intend not to do at this level.
> >
> >
> >
> > That's pretty much all of it for now. Anyone familiar with this topic
> > is welcome to jump in with insights or corrections. Hopefully, we can
> > have a nice discussion that is useful to other users and developers.
> >
> >
> >
> > ps. The reason for using AggregatePlaintextFastUMLSProcessor rather
> > than AggregatePlaintextProcessor is that I find the preferred words
> > property in the former very useful, while it is not available in the
> > latter.
> >
> >
> >
> > Best,
> >
> > Yiming
>
>
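The precision/recall tradeoff with nearly identical F1 that Tim describes
in the thread above can be illustrated with a minimal Python sketch. The
function name prf and the counts for the two systems are hypothetical,
invented purely for illustration. They are not results from cTAKES, negex,
ClearTK, or any dataset mentioned in this thread.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from true positive, false positive
    and false negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (assumption, not measured data):
# system A misses more negated terms, system B raises more false alarms.
system_a = prf(tp=80, fp=20, fn=40)  # higher precision, lower recall
system_b = prf(tp=80, fp=40, fn=20)  # lower precision, higher recall

for name, (p, r, f1) in (("A", system_a), ("B", system_b)):
    print(f"system {name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With these made-up counts the two systems swap precision and recall but
end up with the same F1, which is why F1 alone may not say which tradeoff
a given group of users would prefer.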
