>(Trying to avoid passing individual jars via email)

Understood.  I sent the latest (Saturday) build of the dictionary module that I 
haven't yet checked in.  Its dictionary format is incompatible with the format 
produced by the creator in sandbox.  I will check in all of the code changes 
once I've had a chance to clean them up and write some documentation on the 
preferred terms, icd9s, etc.

Sean
________________________________________
From: Chen, Pei [pei.c...@childrens.harvard.edu]
Sent: Tuesday, September 09, 2014 5:29 PM
To: <dev@ctakes.apache.org>
Subject: Re: Ctakes to process 5000K recoreds

(Trying to avoid passing individual jars via email)

Sent from my iPhone

> On Sep 9, 2014, at 5:26 PM, "Chen, Pei" <pei.c...@childrens.harvard.edu> 
> wrote:
>
> Sean-
> Aren't the scripts to generate the DB already available in the sandbox area?
>
> Sent from my iPhone
>
>> On Sep 9, 2014, at 5:24 PM, "Finan, Sean" <sean.fi...@childrens.harvard.edu> 
>> wrote:
>>
>> There is a tool to generate a dictionary in the new format using the UMLS 
>> MR*** files.
>>
>> The module can also read directly from a file with bar-separated values:  
>> CUI|Text or CUI|TUI|Text which could be useful for small custom dictionaries.
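To make the two bar-separated layouts mentioned above concrete, here is a minimal sketch of reading such a file. It is not the module's actual reader: the `DictionaryEntry` type, the `parse_bsv_line` helper, and the sample rows (placeholder CUIs, not checked against UMLS) are all assumptions for illustration.

```python
from typing import NamedTuple, Optional


class DictionaryEntry(NamedTuple):
    cui: str
    tui: Optional[str]  # None when the line has only CUI|Text
    text: str


def parse_bsv_line(line: str) -> Optional[DictionaryEntry]:
    """Parse one 'CUI|Text' or 'CUI|TUI|Text' line; return None for blank lines."""
    stripped = line.strip()
    if not stripped:
        return None
    fields = [field.strip() for field in stripped.split("|")]
    if len(fields) == 2:
        cui, text = fields
        return DictionaryEntry(cui, None, text)
    if len(fields) == 3:
        cui, tui, text = fields
        return DictionaryEntry(cui, tui, text)
    raise ValueError(
        f"expected 2 or 3 bar-separated fields, got {len(fields)}: {line!r}"
    )


# Placeholder rows: the CUIs below are invented for the example.
sample = ["C0000001|aspirin", "C0000002|T121|ibuprofen", ""]
entries = [e for e in map(parse_bsv_line, sample) if e is not None]
for entry in entries:
    print(entry)
```

Rejecting lines with any other field count keeps a malformed custom dictionary from being loaded silently.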
>>
>> I can send a copy of the dictionary creator jar and instructions tomorrow.
>>
>> Sean
>> ________________________________________
>> From: Bruce Tietjen [bruce.tiet...@perfectsearchcorp.com]
>> Sent: Tuesday, September 09, 2014 5:17 PM
>> To: dev@ctakes.apache.org
>> Subject: Re: Ctakes to process 5000K recoreds
>>
>> Sean,
>>
>> If that is a script for generating a dictionary for use with
>> dictionary-lookup-fast, I would also be very interested in checking it out.
>>
>> Thanks,
>>
>> Bruce
>>
>>
>> Bruce Tietjen
>> Senior Software Engineer
>> IMAT Solutions <http://imatsolutions.com>
>> Mobile: 801.634.1547
>> bruce.tiet...@imatsolutions.com
>>
>> On Tue, Sep 9, 2014 at 2:40 PM, Nick Nikandish <
>> snika...@emerginghealthit.com> wrote:
>>
>>> Great. I will do that. Thanks again.
>>>
>>> Nick
>>>
>>> -----Original Message-----
>>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>>> Sent: Tuesday, September 09, 2014 4:39 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> Just use it with cTakes.  Instead of removing other modules from the
>>> pipeline, replace the dictionary-lookup with dictionary-lookup-fast.
>>>
>>> For
>>> desc/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml,
>>> you would modify:
>>>
>>>   <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>>>     <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
>>>   </delegateAnalysisEngine>
>>>
>>> To be:
>>>
>>>   <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
>>>     <import location="../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"/>
>>>   </delegateAnalysisEngine>
>>>
>>>
>>> That should be it.  You can then leave the rest of the module
>>> specifications alone.
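For anyone who would rather script the descriptor change than edit it by hand, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element names, key, and locations are taken from the snippet above; the `swap_lookup_import` helper and the inline test fragment are assumptions for illustration. A real descriptor is a full file, so try this on a copy first (ElementTree does not preserve comments by default).

```python
import xml.etree.ElementTree as ET

OLD_SUFFIX = "ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"
NEW_LOCATION = "../../../ctakes-dictionary-lookup-fast/desc/analysis_engine/UmlsLookupAnnotator.xml"


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so matching works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def swap_lookup_import(root: ET.Element, key: str = "DictionaryLookupAnnotatorDB") -> int:
    """Point the delegate's <import> at the fast lookup descriptor.

    Returns the number of <import> elements that were changed.
    """
    changed = 0
    for element in root.iter():
        if local_name(element.tag) != "delegateAnalysisEngine":
            continue
        if element.get("key") != key:
            continue
        for child in element:
            is_import = local_name(child.tag) == "import"
            if is_import and child.get("location", "").endswith(OLD_SUFFIX):
                child.set("location", NEW_LOCATION)
                changed += 1
    return changed


# Hypothetical fragment standing in for the aggregate descriptor.
descriptor = """<delegateAnalysisEngineSpecifiers>
  <delegateAnalysisEngine key="DictionaryLookupAnnotatorDB">
    <import location="../../../ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml"/>
  </delegateAnalysisEngine>
</delegateAnalysisEngineSpecifiers>"""

root = ET.fromstring(descriptor)
changed = swap_lookup_import(root)
print(changed)
print(ET.tostring(root, encoding="unicode"))
```

The key stays `DictionaryLookupAnnotatorDB`, so the `<node>` entries in the flow do not need to change.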
>>>
>>> Sean
>>>
>>> ________________________________________
>>> From: Nick Nikandish [snika...@emerginghealthit.com]
>>> Sent: Tuesday, September 09, 2014 4:32 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> Hi Sean,
>>>
>>> Many thanks, I will try it tomorrow. Do you have any special instructions
>>> for running that script, or do I have to use it with cTakes?
>>>
>>> Thanks,
>>> Nick
>>>
>>> -----Original Message-----
>>> From: Finan, Sean [mailto:sean.fi...@childrens.harvard.edu]
>>> Sent: Tuesday, September 09, 2014 4:24 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> Hi Nick,
>>>
>>> I think that the bottleneck is probably the lookup module itself.  So, I
>>> just sent you a secure email/ftp link.  It contains a build of the new
>>> dictionary-lookup-fast module.  Should you choose to try it, let me know
>>> how things turn out.
>>>
>>> Sean
>>> ________________________________________
>>> From: Nick Nikandish [snika...@emerginghealthit.com]
>>> Sent: Tuesday, September 09, 2014 4:10 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> Thanks, let me try it.
>>> Nick
>>>
>>> -----Original Message-----
>>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
>>> Sent: Tuesday, September 09, 2014 4:08 PM
>>> To: 'dev@ctakes.apache.org'
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> If you just need the medication names, you can remove these:
>>> <node>ContextDependentTokenizerAnnotator</node>
>>> <node>DependencyParser</node>
>>> <node>AssertionAnnotator</node>
>>>
>>> You might be able to get rid of the LvgAnnotator and still get decent
>>> results since variations of word form should not affect medication names. I
>>> would try with it and without it on a smaller set of files and see if you
>>> see a difference.
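The with/without comparison suggested above can be set up by generating the candidate flows rather than editing them by hand. This is only a sketch: the node names come from the flow posted at the start of this thread, and `prune_flow` is a hypothetical helper, not part of cTAKES.

```python
# Fixed flow as posted in the original message of this thread.
FLOW = [
    "SimpleSegmentAnnotator",
    "SentenceDetectorAnnotator",
    "TokenizerAnnotator",
    "LvgAnnotator",
    "ContextDependentTokenizerAnnotator",
    "POSTagger",
    "Chunker",
    "LookupWindowAnnotator",
    "DictionaryLookupAnnotatorDB",
    "DependencyParser",
    "AssertionAnnotator",
    "ExtractionPrepAnnotator",
]

# Nodes suggested for removal when only medication names are needed.
DROP = {"ContextDependentTokenizerAnnotator", "DependencyParser", "AssertionAnnotator"}


def prune_flow(nodes, drop):
    """Return the flow without the dropped nodes, preserving order."""
    unknown = set(drop) - set(nodes)
    if unknown:
        raise ValueError(f"nodes not present in flow: {sorted(unknown)}")
    return [node for node in nodes if node not in drop]


pruned = prune_flow(FLOW, DROP)
without_lvg = prune_flow(pruned, {"LvgAnnotator"})

# Emit both variants as <node> lines to paste into a test descriptor.
for label, flow in (("with Lvg", pruned), ("without Lvg", without_lvg)):
    print(f"-- {label} --")
    for node in flow:
        print(f"<node>{node}</node>")
```

Running both variants on the same small set of files shows whether dropping LvgAnnotator changes the medication results.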
>>>
>>> I believe the others are needed by the default configs for medication
>>> lookup. For example, POS is used to get phrase type. Phrases are used to
>>> remove verb phrases from the lookup and also therefore to keep the lookup
>>> windows from getting too big.  I'm more familiar with the other types of
>>> named entities (diseases, symptoms, etc) than with medications.
>>>
>>> -----Original Message-----
>>> From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
>>> Sent: Tuesday, September 09, 2014 3:01 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> James,
>>>
>>> Do you have any suggestion about running cTakes with minimum annotators
>>> that can return Medications in DictionaryLookupAnnotator?
>>> Thanks,
>>> Nick
>>>
>>> -----Original Message-----
>>> From: Masanz, James J. [mailto:masanz.ja...@mayo.edu]
>>> Sent: Tuesday, September 09, 2014 3:05 PM
>>> To: 'dev@ctakes.apache.org'
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> I suspect that when you take out the simple segment annotator, nothing is
>>> getting processed, and that is why it appears so fast. At least some of the
>>> annotators loop through the list of sections/segments, which is why there
>>> is a simple segment annotator - so that there is at least one
>>> section/segment identified. Are you getting any annotations at all?
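The effect described above can be shown with a toy model. This is not cTAKES code: `Document`, `simple_segment`, `toy_lookup`, and the term list are all invented for illustration. The only thing taken from the thread is the idea that downstream annotators iterate over segments, so with no segment there is nothing to iterate over.

```python
from dataclasses import dataclass, field

TERMS = ("aspirin", "ibuprofen")  # stand-in dictionary for the toy lookup


@dataclass
class Document:
    text: str
    segments: list = field(default_factory=list)     # (id, begin, end) tuples
    annotations: list = field(default_factory=list)  # matched terms


def simple_segment(doc: Document) -> None:
    """Mark the whole document as one segment, like a single SIMPLE_SEGMENT."""
    doc.segments.append(("SIMPLE_SEGMENT", 0, len(doc.text)))


def toy_lookup(doc: Document) -> None:
    """Look for terms, but only inside segments that were identified earlier."""
    for _segment_id, begin, end in doc.segments:
        window = doc.text[begin:end].lower()
        for term in TERMS:
            if term in window:
                doc.annotations.append(term)


def run(text: str, annotators) -> list:
    """Apply each annotator in order and return the resulting annotations."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc.annotations


note = "Patient takes aspirin daily."
with_segment = run(note, [simple_segment, toy_lookup])
without_segment = run(note, [toy_lookup])
print(with_segment)
print(without_segment)
```

Without the segment step the lookup loop body never executes, which makes the run both very fast and empty.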
>>>
>>> -----Original Message-----
>>> From: Nick Nikandish [mailto:snika...@emerginghealthit.com]
>>> Sent: Tuesday, September 09, 2014 2:02 PM
>>> To: dev@ctakes.apache.org
>>> Subject: RE: Ctakes to process 5000K recoreds
>>>
>>> Pei,
>>> I need the names of the medications for an application that I wrote that
>>> uses cTakes, so I cache the medications in DictionaryLookupAnnotator (in
>>> performLookup()) and use them in my program. But when I have
>>> SimpleSegmentAnnotator in the pipeline it just takes forever. After taking
>>> SimpleSegmentAnnotator out, no medication name is returned in
>>> DictionaryLookupAnnotator. So I was wondering if there is a way to
>>> eliminate SimpleSegmentAnnotator but still get the medication names in
>>> that class?
>>>
>>> Nick
>>>
>>> -----Original Message-----
>>> From: Pei Chen [mailto:chen...@apache.org]
>>> Sent: Tuesday, September 09, 2014 2:54 PM
>>> To: dev@ctakes.apache.org
>>> Subject: Re: Ctakes to process 5000K recoreds
>>>
>>> Nick,
>>> When you say no medication is being annotated, I presume you mean the
>>> medication attributes (i.e. dosage, frequency, etc.) are not being
>>> annotated?  I think the DrugNER needs a list of section names in the
>>> config; I think it includes SIMPLE_SEGMENT.  I am very surprised that
>>> SimpleSegmentAnnotator is the bottleneck though; all it does is assume
>>> the entire document is a single section called SIMPLE_SEGMENT.
>>> Have you tried commenting out the DependencyParser if you're not using
>>> those features?
>>>
>>> --Pei
>>>
>>>
>>> On Tue, Sep 9, 2014 at 2:45 PM, Nick Nikandish <
>>> snika...@emerginghealthit.com> wrote:
>>>>
>>>> Hi there,
>>>>
>>>> I am using cTakes to process 5000K free-text records, where each record
>>>> has several medications.
>>>> This is the fixed flow that it goes through:
>>>> <node>SimpleSegmentAnnotator</node>
>>>> <node>SentenceDetectorAnnotator</node>
>>>> <node>TokenizerAnnotator</node>
>>>> <node>LvgAnnotator</node>
>>>> <node>ContextDependentTokenizerAnnotator</node>
>>>> <node>POSTagger</node>
>>>> <node>Chunker</node>
>>>> <node>LookupWindowAnnotator</node>
>>>> <node>DictionaryLookupAnnotatorDB</node>
>>>> <node>DependencyParser</node>
>>>> <node>AssertionAnnotator</node>
>>>> <node>ExtractionPrepAnnotator</node>
>>>>
>>>> But it takes a very long time to process that much data (maybe a week
>>>> or so) when I use SimpleSegmentAnnotator.  By eliminating
>>>> SimpleSegmentAnnotator the process is very fast but no medication is being
>>>> annotated.  Do you guys have any suggestions?
>>>>
>>>> Thanks,
>>>> Nick
>>>
