RE: Ctakes to process 5000K recoreds

Finan, Sean Tue, 09 Sep 2014 14:25:17 -0700

There is a tool to generate a dictionary in the new format using the UMLS MR*** 
files.


The module can also read directly from a file with bar-separated values:  
CUI|Text or CUI|TUI|Text which could be useful for small custom dictionaries.
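
As a rough illustration of that bar-separated layout, a small custom dictionary could be read with something like the sketch below. The class, method names, and sample rows are hypothetical (the CUI and TUI values are made-up placeholders), and this is not the module's actual reader; it only shows how the two-field and three-field forms might be told apart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of cTAKES: parses lines of the form
// CUI|Text or CUI|TUI|Text into simple entries.
class BsvDictionarySketch {

    // One dictionary row; tui is null when the line has only two fields.
    static final class Entry {
        final String cui;
        final String tui;
        final String text;

        Entry(String cui, String tui, String text) {
            this.cui = cui;
            this.tui = tui;
            this.text = text;
        }
    }

    // Parses a single line; returns null for blank or malformed lines.
    static Entry parseLine(String line) {
        if (line == null || line.trim().isEmpty()) {
            return null;
        }
        // The -1 limit keeps trailing empty fields so they are counted.
        String[] fields = line.split("\\|", -1);
        if (fields.length == 2) {
            return new Entry(fields[0].trim(), null, fields[1].trim());
        }
        if (fields.length == 3) {
            return new Entry(fields[0].trim(), fields[1].trim(), fields[2].trim());
        }
        return null;
    }

    // Parses every line, silently skipping the ones that do not fit either form.
    static List<Entry> parseAll(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            Entry entry = parseLine(line);
            if (entry != null) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Placeholder rows: the CUI and TUI values here are invented.
        List<String> lines = Arrays.asList(
                "C0000001|aspirin",
                "C0000002|T000|ibuprofen");
        for (Entry e : parseAll(lines)) {
            System.out.println(e.cui + " " + e.tui + " " + e.text);
        }
    }
}
```

Skipping malformed lines rather than failing is only one possible choice; a stricter reader might report the offending line instead.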

I can send a copy of the dictionary creator jar and instructions tomorrow.

Sean
________________________________________
From: Bruce Tietjen [[email protected]]
Sent: Tuesday, September 09, 2014 5:17 PM
To: [email protected]
Subject: Re: Ctakes to process 5000K recoreds

Sean,

If that is a script for generating a dictionary for use with
dictionary-lookup-fast, I would also be very interested in checking it out.

Thanks,

Bruce


Bruce Tietjen
Senior Software Engineer
IMAT Solutions <http://imatsolutions.com>
Mobile: 801.634.1547
[email protected]

On Tue, Sep 9, 2014 at 2:40 PM, Nick Nikandish <
[email protected]> wrote:

> Great. I will do that. Thanks again.
>
> Nick
>
> -----Original Message-----
> From: Finan, Sean [mailto:[email protected]]
> Sent: Tuesday, September 09, 2014 4:39 PM
> To: [email protected]
> Subject: RE: Ctakes to process 5000K recoreds
>
> Just use it with cTakes.  Instead of removing other modules from the
> pipeline, replace the dictionary-lookup with dictionary-lookup-fast.
>
> For the
> desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml
> , you would modify:
>
>     <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>       <import
> location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
>     </delegateAnalysisEngine>
>
> To be:
>
>     <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>       <import
> location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
>     </delegateAnalysisEngine>
>
>
> That should be it.  You can then leave the rest of the module
> specifications alone.
>
> Sean
>
> ________________________________________
> From: Nick Nikandish [[email protected]]
> Sent: Tuesday, September 09, 2014 4:32 PM
> To: [email protected]
> Subject: RE: Ctakes to process 5000K recoreds
>
> Hi Sean,
>
> Many thanks, I will try it tomorrow. Are there any special instructions
> for running that script, or do I have to use it with cTakes?
>
> Thanks,
> Nick
>
> -----Original Message-----
> From: Finan, Sean [mailto:[email protected]]
> Sent: Tuesday, September 09, 2014 4:24 PM
> To: [email protected]
> Subject: RE: Ctakes to process 5000K recoreds
>
> Hi Nick,
>
> I think that the bottleneck is probably the lookup module itself.  So, I
> just sent you a secure email/ftp link.  It contains a build of the new
> dictionary-lookup-fast module.  Should you choose to try it, let me know
> how things turn out.
>
> Sean
> ________________________________________
> From: Nick Nikandish [[email protected]]
> Sent: Tuesday, September 09, 2014 4:10 PM
> To: [email protected]
> Subject: RE: Ctakes to process 5000K recoreds
>
> Thanks, let me try it.
> Nick
>
> -----Original Message-----
> From: Masanz, James J. [mailto:[email protected]]
> Sent: Tuesday, September 09, 2014 4:08 PM
> To: '[email protected]'
> Subject: RE: Ctakes to process 5000K recoreds
>
> If you just need the medication names, you can remove these:
>  <node>ContextDependentTokenizerAnnotator</node>
>  <node>DependencyParser</node>
>  <node>AssertionAnnotator</node>
>
> You might be able to get rid of the LvgAnnotator and still get decent
> results since variations of word form should not affect medication names. I
> would try with it and without it on a smaller set of files and see if you
> see a difference.
>
> I believe the others are needed by the default configs for medication
> lookup. For example, POS is used to get phrase type. Phrases are used to
> remove verb phrases from the lookup and also therefore to keep the lookup
> windows from getting too big.  I'm more familiar with the other types of
> named entities (diseases, symptoms, etc) than with medications.
>
> -----Original Message-----
> From: Nick Nikandish [mailto:[email protected]]
> Sent: Tuesday, September 09, 2014 3:01 PM
> To: [email protected]
> Subject: RE: Ctakes to process 5000K recoreds
>
> James,
>
> Do you have any suggestions for running cTakes with the minimum set of
> annotators that still returns medications in DictionaryLookupAnnotator?
> Thanks,
> Nick
>
> -----Original Message-----
> From: Masanz, James J. [mailto:[email protected]]
> Sent: Tuesday, September 09, 2014 3:05 PM
> To: '[email protected]'
> Subject: RE: Ctakes to process 5000K recoreds
>
> I suspect that when you take out the SimpleSegmentAnnotator, nothing is
> getting processed, and that is why it appears so fast. At least some of the
> annotators loop through the list of sections/segments, which is why there
> is a simple segment annotator - so that there is at least one
> section/segment identified. Are you getting any annotations at all?
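
The point about annotators looping over segments can be pictured with a toy model. This is only a sketch under stated assumptions: the class and methods below are hypothetical stand-ins, not cTAKES classes, and the "lookup" is a plain substring match. It shows that a downstream step which iterates over segments produces nothing, and finishes almost instantly, when no segment was ever created.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only; these types are hypothetical and are not cTAKES classes.
class SegmentLoopSketch {

    // Stands in for a segment annotator: marks the whole document as one segment.
    static List<String> wholeDocumentAsOneSegment(String documentText) {
        return Collections.singletonList(documentText);
    }

    // Stands in for a downstream annotator that only looks inside segments.
    static List<String> findTerms(List<String> segments, List<String> dictionary) {
        List<String> found = new ArrayList<>();
        for (String segment : segments) {
            String lower = segment.toLowerCase();
            for (String term : dictionary) {
                if (lower.contains(term)) {
                    found.add(term);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String document = "Patient takes aspirin daily.";
        List<String> dictionary = Arrays.asList("aspirin", "ibuprofen");

        // With a segment step, the loop body runs and the term is found.
        System.out.println(findTerms(wholeDocumentAsOneSegment(document), dictionary));

        // Without it, there are no segments, so the loop never runs.
        System.out.println(findTerms(Collections.<String>emptyList(), dictionary));
    }
}
```

In this toy model the first call prints `[aspirin]` and the second prints `[]`, which matches the symptom described: very fast, but no annotations.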
>
> -----Original Message-----
> From: Nick Nikandish [mailto:[email protected]]
> Sent: Tuesday, September 09, 2014 2:02 PM
> To: [email protected]
> Subject: RE: Ctakes to process 5000K recoreds
>
> Pei,
> I need the names of the medications for an application that I wrote that
> uses cTakes, so I cache the medications in DictionaryLookupAnnotator (in
> performLookup()) and use them in my program. But when I have
> SimpleSegmentAnnotator in the pipeline, it just takes forever. After taking
> SimpleSegmentAnnotator out, no medication names are returned in
> DictionaryLookupAnnotator. So I was wondering if there is a way to
> eliminate SimpleSegmentAnnotator but still be able to get the medication
> names in that class?
>
> Nick
>
> -----Original Message-----
> From: Pei Chen [mailto:[email protected]]
> Sent: Tuesday, September 09, 2014 2:54 PM
> To: [email protected]
> Subject: Re: Ctakes to process 5000K recoreds
>
> Nick,
> When you say no medication is being annotated, I presume you mean the
> medication attributes (i.e. dosage, frequency, etc.) are not being
> annotated?  I think the DrugNER needs a list of section names in the
> config; I think it includes SIMPLE_SEGMENT.  I am very surprised that
> SimpleSegmentAnnotator is the bottleneck though; all it does is assume
> the entire document is a single section called SIMPLE_SEGMENT.
> Have you tried commenting out the DependencyParser if you're not using
> those features?
>
> --Pei
>
>
> On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish <
> [email protected]> wrote:
> >
> > Hi there,
> >
> > I am using cTakes to process 5000K free-text records, where each record
> > has several medications.
> > This is the fixed flow that it goes through:
> >
> > <node>SimpleSegmentAnnotator</node>
> > <node>SentenceDetectorAnnotator</node>
> > <node>TokenizerAnnotator</node>
> > <node>LvgAnnotator</node>
> > <node>ContextDependentTokenizerAnnotator</node>
> > <node>POSTagger</node>
> > <node>Chunker</node>
> > <node>LookupWindowAnnotator</node>
> > <node>DictionaryLookupAnnotatorDB</node>
> > <node>DependencyParser</node>
> > <node>AssertionAnnotator</node>
> > <node>ExtractionPrepAnnotator</node>
> >
> > But it takes a very long time to process that much data (maybe a week
> > or so) when I use SimpleSegmentAnnotator. By eliminating
> > SimpleSegmentAnnotator the process is very fast, but no medication is being
> > annotated. Do you have any suggestions?
> >
> > Thanks,
> > Nick
> >
>
