Johnsd11 commented on issue #56:
URL: https://github.com/apache/ctakes/issues/56#issuecomment-2835961166
> I am looking for a NLP to read pathology reports and extract cancer
> related site, histology, stage and any other DX/RX data available. In
> looking at CTakes, I have a few questions;
>
> - Is CTakes an appropriate tool to automate this task?
I wrote a commercial surgical-pathology coding module some years ago, and
could imagine doing it in cTAKES.
Here's my two cents to add to the wealth of information Peter has already
provided.
Best of luck.
> Where can I find an "executive overview" (30,000 foot view) of how the
> CTakes works?
As Peter said, there's a lot of documentation out there!
Videos here: https://ctakes.apache.org/tutorials.html
Key point: it's built on top of UIMA (https://uima.apache.org/),
which ingests and annotates data from any source, letting you mix, match,
and create your own annotators to build chains of analyses.
The cTAKES value-adds include a clinical type system and a spiffy
dictionary (see below).
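To make the "chain of annotators" idea concrete, here is a toy sketch in plain Python. It is not the UIMA or cTAKES API (those are Java); the names `Annotation`, `Document`, `sentence_annotator`, `make_keyword_annotator` and `run_pipeline`, and the labels in the tiny dictionary, are all made up for illustration. The point is only that each annotator reads a shared document and adds typed spans that later steps can build on.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str        # e.g. "Sentence" or "Keyword"
    begin: int       # character offset where the span starts
    end: int         # character offset just past the span
    label: str = ""  # optional code attached to the span


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        """Return all annotations of one kind, in the order they were added."""
        return [a for a in self.annotations if a.kind == kind]


def sentence_annotator(doc):
    # Naive splitter: a run of characters up to the next '.', '!' or '?'.
    for m in re.finditer(r"[^.!?]+[.!?]?", doc.text):
        if m.group().strip():
            doc.annotations.append(Annotation("Sentence", m.start(), m.end()))


def make_keyword_annotator(dictionary):
    # dictionary maps a surface term to a (hypothetical) label.
    def annotate(doc):
        for term, label in dictionary.items():
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, doc.text, re.IGNORECASE):
                doc.annotations.append(
                    Annotation("Keyword", m.start(), m.end(), label)
                )
    return annotate


def run_pipeline(doc, annotators):
    # Each annotator mutates the shared document in turn.
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(
    Document("Invasive ductal carcinoma of the left breast. Margins are negative."),
    [
        sentence_annotator,
        make_keyword_annotator({"carcinoma": "HISTOLOGY", "breast": "SITE"}),
    ],
)
for a in doc.annotations:
    print(a.kind, a.label, repr(doc.text[a.begin:a.end]))
```

Swapping, reordering or adding steps only means changing the list passed to `run_pipeline`, which is roughly the flexibility the real framework offers.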
> My ignorance regarding NLP algorithms like CTakes is whether it is
> keyword driven, or it is self learning.
cTAKES is *not* "self-learning"; you have to tell it exactly what
information you want to extract from where.
Pro: High precision; explainable; you won't get the right answer for the
wrong reason.
Con: Low recall; brittle; you may not get answers at all! If you're
processing unpredictable document formats from many different facilities,
it can be hard to generalize over them.
> I currently have a homegrown application which looks for keywords and
> negation modifiers within a certain distance from the keywords
cTAKES can certainly help with that.
- *Keywords*
cTAKES lets you use the NLM's UMLS Metathesaurus
<https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html>,
using the dictionary framework Peter mentioned:
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+-+Fast+Dictionary+Lookup
These sources may be useful in building your custom dictionary:
- the NCI Thesaurus:
https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/NCI/index.html
- CPT, if you want codes from there:
https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CPT/index.html
- For anatomy, I'm not familiar with the "anatomical site annotator"
Peter alludes to, but the FMA is better structured than SNOMED:
https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/FMA/index.html
- *Negation*
Several annotators available:
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+-+Negation+Annotators
Distance-from-keywords is a start, but sentence detection and shallow
parsing both help.
I like the ctakes-ytex-uima NegexAnnotator and SentenceDetector.
- *Document structure*
I found header detection to be crucial in processing pathology reports:
tracking specimens through a document, extracting tumor info from
tables, etc.
The cTAKES RegexSectionizer might work for you.
https://ctakes.apache.org/apidocs/4.0.0/org/apache/ctakes/core/ae/RegexSectionizer.html
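
As a rough illustration of the negation and document-structure points above, the sketch below splits a report on header lines and then only lets a negation trigger affect keywords that follow it in the same sentence. It is plain Python, not the cTAKES RegexSectionizer or the ytex NegexAnnotator; the header pattern, trigger list, keyword list and sample report are all assumptions made up for the example, and the trigger rule is a much-simplified NegEx-style check.

```python
import re

# Assumed header style: an all-caps label at the start of a line, then a colon.
HEADER = re.compile(r"^([A-Z][A-Z ]+):[ \t]*", re.MULTILINE)
# Hypothetical, deliberately tiny trigger and keyword lists.
NEGATION = re.compile(r"\b(no|not|negative for|without)\b", re.IGNORECASE)
KEYWORDS = ["carcinoma", "metastasis", "invasion"]


def split_sections(text):
    """Return (header, body) pairs; text before the first header is 'PREAMBLE'."""
    sections = []
    name, start = "PREAMBLE", 0
    for m in HEADER.finditer(text):
        body = text[start:m.start()].strip()
        if body:
            sections.append((name, body))
        name, start = m.group(1).strip(), m.end()
    body = text[start:].strip()
    if body:
        sections.append((name, body))
    return sections


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(text):
    """Return (section, keyword, negated) for each keyword hit, per sentence."""
    mentions = []
    for section, body in split_sections(text):
        for sentence in split_sentences(body):
            trigger = NEGATION.search(sentence)
            for kw in KEYWORDS:
                m = re.search(r"\b" + kw + r"\b", sentence, re.IGNORECASE)
                if m:
                    # Negated only if a trigger precedes it in this sentence.
                    negated = trigger is not None and trigger.start() < m.start()
                    mentions.append((section, kw, negated))
    return mentions


report = """FINAL DIAGNOSIS: Invasive ductal carcinoma. No lymphovascular invasion.
COMMENT: Lymph nodes are negative for metastasis."""

for mention in find_mentions(report):
    print(mention)
```

A fixed distance window would also catch these cases, but it can spill a trigger into the next sentence; scoping to the sentence is the simplest guard against that, and the section name tells you which part of the report a finding came from.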