I've spent the last few months working on a clinical NLP project using
cTAKES. It's a very complex system to me, and every time I dig into it I
discover something new. Over the past week, I have been trying to figure out
which analysis engines do a good job of handling cases like negation,
family history, uncertainty, etc. I now have some experience that I would
like to share with the community.
The best combination for me is to use assertionMiniPipelineAnalysisEngine
for negation, uncertainty, generic and subject detection, and
HistoryCleartkAnalysisEngine for history detection. Both engines are in
the desc/ctakes-assertion folder. The assertionMiniPipelineAnalysisEngine also
claims to be useful for conditional detection, which I haven't verified
using my test files yet.
I'm using AggregatePlaintextFastUMLSProcessor as the top-level pipeline. The
default analysis engines in AggregatePlaintextFastUMLSProcessor for
negation, uncertainty, generic, etc. are StatusAnnotator +
NegationAnnotator + PolarityCleartkAnalysisEngine +
SubjectCleartkAnalysisEngine + UncertaintyCleartkAnalysisEngine +
GenericCleartkAnalysisEngine + HistoryCleartkAnalysisEngine. It looks like
StatusAnnotator and NegationAnnotator are commented out in the <node> part
of the descriptor, so only the remaining five analysis engines are actually
used, and all of them are in the same desc/ctakes-assertion folder. These
five analysis engines were not effective on my test files, and I'm still
confused about their relationship to the assertionAnalysisEngine,
conceptConverterAnalysisEngine, GenericAttributeAnalysisEngine and
SubjectAttributeAnalysisEngine used in assertionMiniPipelineAnalysisEngine.
It looks to me like the "Clear" in their names indicates something, but I
couldn't figure it out without going through the Java code, which I don't
intend to do at this level.
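
One way to check which delegates are really active, without reading any
Java, is to list the <node> entries in the fixedFlow section of the aggregate
descriptor and see which ones sit inside XML comments. The Python script
below is only a minimal sketch: the embedded descriptor is a hypothetical,
trimmed-down stand-in for the real AggregatePlaintextFastUMLSProcessor file
(the node keys are assumed to match the engine names above), and
list_flow_nodes is a helper name made up for this illustration.

```python
import re
from xml.dom import Node, minidom

# Hypothetical, trimmed-down aggregate descriptor. The real file is much
# longer, and its node keys may differ from the names assumed here.
SAMPLE_DESCRIPTOR = """<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <analysisEngineMetaData>
    <flowConstraints>
      <fixedFlow>
        <!-- <node>StatusAnnotator</node> -->
        <!-- <node>NegationAnnotator</node> -->
        <node>PolarityCleartkAnalysisEngine</node>
        <node>SubjectCleartkAnalysisEngine</node>
        <node>UncertaintyCleartkAnalysisEngine</node>
        <node>GenericCleartkAnalysisEngine</node>
        <node>HistoryCleartkAnalysisEngine</node>
      </fixedFlow>
    </flowConstraints>
  </analysisEngineMetaData>
</analysisEngineDescription>
"""

# Matches a <node>...</node> entry that appears inside an XML comment.
COMMENTED_NODE = re.compile(r"<node>\s*([^<]+?)\s*</node>")


def list_flow_nodes(descriptor_xml):
    """Return (active, commented_out) node names found in <fixedFlow>."""
    doc = minidom.parseString(descriptor_xml)
    active, commented_out = [], []
    for flow in doc.getElementsByTagName("fixedFlow"):
        for child in flow.childNodes:
            if child.nodeType == Node.ELEMENT_NODE and child.tagName == "node":
                # An active entry: collect the element's text content.
                text = "".join(
                    c.data for c in child.childNodes
                    if c.nodeType == Node.TEXT_NODE
                )
                active.append(text.strip())
            elif child.nodeType == Node.COMMENT_NODE:
                # A disabled entry: pull node names out of the comment text.
                commented_out.extend(COMMENTED_NODE.findall(child.data))
    return active, commented_out


if __name__ == "__main__":
    active, commented_out = list_flow_nodes(SAMPLE_DESCRIPTOR)
    print("active:       ", ", ".join(active))
    print("commented out:", ", ".join(commented_out))
```

To run it against an actual descriptor, read the file's contents and pass
them to list_flow_nodes instead of SAMPLE_DESCRIPTOR.
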
That's pretty much all of it for now. Anyone familiar with this topic is
welcome to jump in with insights or corrections. Hopefully, we can
have a nice discussion that will be useful to other users and developers.
P.S. The reason for using AggregatePlaintextFastUMLSProcessor rather than
AggregatePlaintextProcessor is that I find the preferred words property in
the former very useful, and it is not available with the latter.
Yiming Zuo <https://sites.google.com/site/yimingzuo/>
Georgetown U. Medical Center:
Dr. Ressom's Omics Lab <http://omics.georgetown.edu/>
ECE Department of Virginia Tech:
Computational Bioinformatics & Bio-imaging Laboratory