Hi Sean and Guergana,

Thanks for your replies about the fast and non-fast dictionary lookup and the test datasets. Originally, I thought the fast annotator was fast because it only used a portion of the whole dictionary. Now I realize the fast annotator is the more powerful one. That's very helpful.

For Guergana,

Were you also trying to attach the exec summary? I couldn't see it in the email.

Best,
Yiming

On Wed, Oct 19, 2016 at 1:03 PM, Savova, Guergana <guergana.sav...@childrens.harvard.edu> wrote:

> Hi Yiming,
>
> Re your question about gold standard datasets. In parallel with releasing best performing methods in cTAKES, we have generated several gold standard datasets. Our plan is to start distributing them through a unified effort -- a health NLP Center. See attached exec summary. We hope to have the Center running in the very near future.
>
> Cheers,
> --Guergana
>
> -----Original Message-----
> From: Zuo Yiming [mailto:yiming...@gmail.com]
> Sent: Wednesday, October 19, 2016 12:22 PM
> To: dev@ctakes.apache.org
> Subject: Re: Best combination of analysis engines to consider negation, family history, uncertainty, etc.
>
> Hi Sean and Timothy,
>
> Thanks for your clarification about the ClearTK tools. I'm amazed by the power of cTAKES and by the resources and community you have worked to build. I will certainly be happy to provide more feedback as my project moves on.
>
> For Timothy,
>
> By rule-based system, do you mean the assertion annotator? How about the old negation annotator and the status annotator: are they also rule-based systems? I get the feeling that the assertion annotator and the ClearTK system are currently favored over the negation annotator and the status annotator in cTAKES for some reason.
>
> Regarding the ClearTK system on my test files, the negation, history, and uncertainty modules work just fine, as does the assertion annotator. I have only a few test files, so it's really hard to tell which one is better. The main difference comes when detecting the subject and generic properties. On my limited test files, the ClearTK system doesn't work at all for these. It assigns the patient as the subject for all detected phrases, even when it is the patient's family member who has diabetes.
> The same problem applies to the generic property: the ClearTK system assigns false as the generic property for all detected phrases. The paper mentioned by you and Sean seems interesting; I will take a look later.
>
> As a further question, can you give me some suggestions on where to find public gold standard datasets, so I can conduct an independent evaluation of cTAKES with metrics like precision, recall, and F1 score?
>
> Lastly, a minor suggestion from the user perspective would be to add the preferred words property to the AggregatePlaintextUMLSProcessor. As I pointed out briefly in my first email, using AggregatePlaintextFastUMLSProcessor we can get the preferred words for detected phrases, but not with AggregatePlaintextUMLSProcessor. This is very helpful when the detected phrases are acronyms such as pt for patient. In my experience, AggregatePlaintextUMLSProcessor tends to detect more clinically relevant phrases than AggregatePlaintextFastUMLSProcessor. It would be really nice to have the same preferred words property in AggregatePlaintextUMLSProcessor in a future cTAKES release.
>
> Best,
> Yiming
>
> On Wed, Oct 19, 2016 at 11:11 AM, Miller, Timothy <timothy.mil...@childrens.harvard.edu> wrote:
>
> > I can second Sean's thank you, it is good to have this feedback. The ClearTK machine learning models were made the default after we ran some experiments that found it performed better across a range of standard datasets than rule-based algorithms or the existing cTAKES module ( http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0112774 ).
> > Since making them the default, though, we have heard from people, and have seen in our own experience, results that conflict with those experiments. And certainly the errors in the rule-based system are easier to understand.
> >
> > Just curious, are you able to characterize the errors you see from the ClearTK system? I did some experiments recently on a new dataset comparing negex with the cleartk negation module and found that there was a precision/recall tradeoff but almost identical F1 scores. But for that dataset the tradeoff negex provided was preferred by our collaborators. (I think negex had better recall of negated terms but worse precision.)
> >
> > Tim
> >
> > ________________________________________
> > From: Finan, Sean <sean.fi...@childrens.harvard.edu>
> > Sent: Wednesday, October 19, 2016 10:53 AM
> > To: dev@ctakes.apache.org
> > Subject: RE: Best combination of analysis engines to consider negation, family history, uncertainty, etc.
> >
> > Hi Yiming,
> >
> > Thank you very much for letting the community know what has and has not worked for you. I have also had better results with the Assertion annotators than the ClearTk alternatives, but that could be because of the note types/formats that I am using.
> >
> > Regarding the "Clear" in names, it is because ClearTk (Clear ToolKit) is used to train machine learning models for detection of the indicated property. You can find information on ClearTk starting here: http://clear.colorado.edu/compsem/
> >
> > If you prefer to read a paper, you can check out http://www.lrec-conf.org/proceedings/lrec2014/pdf/218_Paper.pdf
> >
> > Others on the dev list can provide much more information than I can, so you could post a question if you like.
> >
> > Cheers,
> >
> > Sean
> >
> > -----Original Message-----
> > From: Zuo Yiming [mailto:yiming...@gmail.com]
> > Sent: Wednesday, October 19, 2016 10:04 AM
> > To: u...@ctakes.apache.org; dev@ctakes.apache.org
> > Subject: Best combination of analysis engines to consider negation, family history, uncertainty, etc.
> >
> > Hi everyone,
> >
> > I've spent the last few months working on a clinical NLP project using cTAKES. It's a very complex system to me, and every time I dig into it I make new discoveries. Since last week, I have been trying to figure out which analysis engines do a good job of handling cases like negation, family history, uncertainty, etc. I now have some experience that I would like to share with the community.
> >
> > The best combination for me is to use assertionMiniPipelineAnalysisEngine for negation, uncertainty, generic and subject detection, and HistoryCleartkAnalysisEngine for history detection. Both engines are in the desc/ctakes-assertion folder. The assertionMiniPipelineAnalysisEngine also claims to be useful for conditional detection, which I haven't verified using my test files yet.
> >
> > I'm using the AggregatePlaintextFastUMLSProcessor at the higher level. The default analysis engines in AggregatePlaintextFastUMLSProcessor for negation, uncertainty, generic, etc. are StatusAnnotator + NegationAnnotator + PolarityCleartkAnalysisEngine + SubjectCleartkAnalysisEngine + UncertaintyCleartkAnalysisEngine + GenericCleartkAnalysisEngine + HistoryCleartkAnalysisEngine. It looks like StatusAnnotator and NegationAnnotator are commented out in the node part, so only the remaining five analysis engines are actually used, and all of them are in the same desc/ctakes-assertion folder. These five analysis engines were not effective on my test files, and I'm still confused about their relationship to the assertionaAnalysisEngine, conceptConverterAnalysisEngine, GenericAttributeAnalysisEngine and SubjectAttributeAnalysisEngine used in assertionMiniPipelineAnalysisEngine.
> >
> > It looks to me like the "Clear" in their names indicates something, but I couldn't figure it out without going through the Java code, which I don't intend to do at this level.
> >
> > That's pretty much all for now. Anyone familiar with this topic is welcome to jump in with insights or corrections. Hopefully, we can have a nice discussion that is useful to other users and developers.
> >
> > P.S. The reason for using AggregatePlaintextFastUMLSProcessor rather than AggregatePlaintextProcessor is that I find the preferred words property in the former very useful, while it can't be obtained using the latter.
> >
> > Best,
> >
> > Yiming