RE: sentence splitter & forks/branches

digital paula Sat, 18 Jan 2014 18:40:21 -0800
Thank you so much, Tim, for the prompt response. I appreciate the additional 
info and suggestions you provided. Yes, I see that it is the i2b2 
challenge 2010 dataset.  
 
Can I just ask which machine learning algorithm is being used?
 
Thanks.
 
Regards,
Paula
 
> From: [email protected]
> To: [email protected]
> Subject: Re: sentence splitter & forks/branches
> Date: Sat, 18 Jan 2014 13:02:06 +0000
> 
> Sorry Paula, it's been a busy few weeks. I'm sure everyone else has been
> busy as well.
> 
> I'm sorry to say I think at this point it might be difficult to get the
> exact fix you want out of the module. It works in 2 parts I believe:
> 1) Identify cue words
> 2) Classify entities given the identified cue words.
> 
> And you fixed 1) to recognize your cue word, but if 2) uses a machine
> learning model it may not get the right outcome sometimes and that can
> be hard to fix. It obviously wouldn't have seen any examples using that
> keyword, though I might've thought that there might be some cases it
> would get right using other features.
> 
> If you've tried a bunch of different examples and it seems like it can't
> get any of them right with new cue words, then there are a few things
> you might consider as next steps:
> 
> 1) Write your own rule-based analysis engine to follow the existing
> assertion module and use some simple algorithm to link your cue words
> with nearby entities.
> 2) Acquire training data and try to re-train the assertion module with
> your cue word additions. I believe they used the i2b2 challenge 2010
> concept assertion dataset which is available with a data use agreement.
> 
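A minimal sketch of what option 1 above might look like, written as plain Java rather than a real UIMA analysis engine so that it runs on its own. The class name CueWordLinker, the findCue method, the token window of 4, and the sample cue words are all assumptions made for illustration; nothing here is taken from the cTAKES assertion module.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical rule-based linker: NOT part of cTAKES, just an illustration.
class CueWordLinker {
    private final Set<String> cueWords;   // expected in lower case
    private final int window;             // max token distance between cue and entity

    CueWordLinker(int window, String... cueWords) {
        this.window = window;
        this.cueWords = new HashSet<>(Arrays.asList(cueWords));
    }

    /**
     * Returns the closest cue word that precedes the entity within the
     * window, or null when no cue word is close enough.
     */
    String findCue(String[] sentenceTokens, int entityTokenIndex) {
        int earliest = Math.max(0, entityTokenIndex - window);
        for (int i = entityTokenIndex - 1; i >= earliest; i--) {
            String token = sentenceTokens[i].toLowerCase();
            if (cueWords.contains(token)) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        CueWordLinker linker = new CueWordLinker(4, "not", "denies");
        String[] tokens = "patient did not have diabetes".split(" ");
        // "diabetes" is the token at index 4 in this sentence.
        System.out.println("cue for diabetes: " + linker.findCue(tokens, 4));
    }
}
```

A real annotator would read tokens and entity mentions from the CAS instead of a string array, and the window size would need tuning on actual notes.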
> Hope this helps,
> Tim
> 
> 
> 
> On 01/17/2014 10:46 PM, digital paula wrote:
> >
> >
> > Hello again cTAKES Community,
> >
> > I thought that adding the sentence splitter (with newline
> > sentence-continuation recognition) would be as simple as adding the
> > sectionizer annotator to the Eclipse environment. I see from VJ's note
> > that it's not that simple. My understanding is that the standard clinical
> > pipeline requires the assertion and dependency parser modules. I've
> > explored some of the changes needed and, at least for Assertion, it looks
> > like SentenceDetector, SentenceSpan, and likely the
> > SingleDocumentProcessor from the MITRE jar will need to be modified to
> > recognize multi-line sentences, so that the assertion and dependency
> > parser modules can be kept in the pipeline. I would love to devote the
> > time needed to fix the sentence splitter to recognize multi-line
> > sentences, but I need to focus on hacking my way through the cue word
> > issue because I've been left in the lurch with no response to my
> > posts  :-(((((
> >
> > Regards,
> > Paula
> >  
> >> Date: Wed, 15 Jan 2014 14:53:17 -0500
> >> Subject: Re: sentence splitter & forks/branches
> >> From: [email protected]
> >> To: [email protected]
> >>
> >> It is unfortunately not that trivial, as allowing newlines within sentences
> >> requires changes to the assertion and dependency parser modules.
> >>
> >> If you're not using those AEs you could theoretically build the ytex
> >> branch, and just add  ctakes-ytex-uima.jar and
> >> ctakes-ytex-uima\desc\analysis_engine\SentenceDetectorAnnotator.xml to your
> >> existing ctakes install (haven't tried it, but it should work).
> >>
> >> -vj
> >>
> >>
> >> On Wed, Jan 15, 2014 at 1:57 PM, Lingren, Todd 
> >> <[email protected]>wrote:
> >>
> >>> I have a general question about forks, specifically the YTEX branch that
> >>> Vijay mentions.
> >>> If I wanted to implement just the sentence splitter from YTEX into a
> >>> currently existing 3.1 install, how would I do that? Is it possible? Or do
> >>> I have to switch over completely to run from YTEX branch?
> >>>
> >>> Todd Lingren
> >>> Biomedical Informatics
> >>> Cincinnati Children's Hospital
> >>> [email protected]
> >>> 513-803-9032
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: vijay garla [mailto:[email protected]]
> >>> Sent: Wednesday, January 15, 2014 11:34 AM
> >>> To: [email protected]
> >>> Subject: Re: svn commit: r1551805 -
> >>> /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> >>>
> >>> The issue is indeed the sentence splitter - negation is limited to words
> >>> within the sentence, and if newlines are considered sentence boundaries, it
> >>> doesn't work properly (splitting on newlines breaks many other things as
> >>> well).  The YTEX branch includes a sentence splitter that does not
> >>> automatically split sentences on newlines.
> >>>
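A toy illustration of why the sentence boundary matters for negation. This is not how the cTAKES assertion module works internally; it only assumes, as VJ describes, that a negation cue can affect a term when both fall inside the same sentence. NegationScope and isNegated are made-up names.

```java
// Hypothetical demo class: NOT cTAKES code.
class NegationScope {

    /**
     * Returns true when some sentence contains the cue followed by the
     * target. Whitespace (including line breaks) inside a sentence is
     * normalised to single spaces before matching whole words.
     */
    static boolean isNegated(String[] sentences, String cue, String target) {
        for (String sentence : sentences) {
            String s = " " + sentence.toLowerCase().replaceAll("\\s+", " ") + " ";
            int cueAt = s.indexOf(" " + cue + " ");
            int targetAt = s.indexOf(" " + target + " ");
            if (cueAt >= 0 && targetAt > cueAt) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String note = "patient did not have\ndiabetes";

        // Newline treated as a sentence boundary: cue and term are separated.
        String[] splitOnNewline = note.split("\n");
        // Newline allowed inside the sentence: cue and term stay together.
        String[] keptWhole = { note };

        System.out.println(isNegated(splitOnNewline, "not", "diabetes")); // false
        System.out.println(isNegated(keptWhole, "not", "diabetes"));      // true
    }
}
```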
> >>> best,
> >>>
> >>> vj
> >>>
> >>>
> >>> On Wed, Jan 15, 2014 at 10:03 AM, Masanz, James J. <[email protected]> wrote:
> >>>> Hi Paula,
> >>>>
> >>>> The sentence detector in 3.1.0 and 3.1.1 (and previous releases)
> >>>> assumes sentences don't cross line boundaries.
> >>>> OpenNLP is used to find sentence breaks, but then if newlines are
> >>>> found, those are also set (within cTAKES, not OpenNLP) to be sentence breaks.
> >>>> (just FYI I haven't had a chance to look at the ytex branch, which the
> >>>> subject commit is about)
> >>>>
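To make the behaviour James describes concrete, here is a rough sketch in plain Java. It does not call OpenNLP or cTAKES; it assumes some detector has already produced [begin, end) sentence spans and shows only the second step, where every line break becomes an additional sentence boundary. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing step: NOT the cTAKES SentenceDetector.
class NewlineSplit {

    /**
     * Splits each [begin, end) span at line-break characters and drops
     * empty pieces. Offsets refer to positions in the original text.
     */
    static List<int[]> splitOnNewlines(String text, int[][] detectorSpans) {
        List<int[]> result = new ArrayList<>();
        for (int[] span : detectorSpans) {
            int start = span[0];
            for (int i = span[0]; i < span[1]; i++) {
                char c = text.charAt(i);
                if (c == '\n' || c == '\r') {
                    if (i > start) {
                        result.add(new int[] { start, i });
                    }
                    start = i + 1;
                }
            }
            if (span[1] > start) {
                result.add(new int[] { start, span[1] });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "patient did not have\ndiabetes";
        // Pretend the statistical detector returned the whole text as one sentence.
        int[][] detected = { { 0, text.length() } };
        for (int[] s : splitOnNewlines(text, detected)) {
            System.out.println("[" + text.substring(s[0], s[1]) + "]");
        }
    }
}
```

With the sample text, the single detected sentence ends up as two spans, which is the situation in which "not" and "diabetes" no longer share a sentence.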
> >>>> -- James
> >>>>
> >>>> -----Original Message-----
> >>>> From: [email protected] [mailto:
> >>>> [email protected]] On Behalf Of
> >>>> digital paula
> >>>> Sent: Tuesday, January 14, 2014 10:25 PM
> >>>> To: [email protected]
> >>>> Subject: RE: svn commit: r1551805 -
> >>>> /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Hello cTAKES Developer Community,
> >>>>  I'm a little behind on reading posts... this one is from last month.
> >>>> I think this issue is already addressed in the current release? I'm still
> >>>> running the previous release, 3.1.0.
> >>>> I just noticed something interesting: negation isn't detected when the
> >>>> negated term is on a different line. When I removed all carriage returns
> >>>> from the narratives, so that each is treated as one long string, negation
> >>>> picked it up. To better explain what I mean, here are two narrative
> >>>> comments.
> >>>>
> >>>> 1.  patient did not have diabetes
> >>>> 2. patient did not have
> >>>> diabetes
> >>>>
> >>>> Number 1 above got negated but number 2 did not. This might be related
> >>>> to the issue with the sectionizer. I noticed that when I treat the
> >>>> narrative as one string, the sectionizer never crashes with the NPE. Of
> >>>> course, the sectionizer is pointless if the narrative is one string, but
> >>>> it's helping me pinpoint the problem.
> >>>>
> >>>> Regards,
> >>>> Paula
> >>>>
> >>>>
> >>>>> Date: Thu, 19 Dec 2013 11:04:57 -0500
> >>>>> Subject: Re: FW: svn commit: r1551805 -
> >>>>> /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> >>>>> From: [email protected]
> >>>>> To: [email protected]
> >>>>>
> >>>>> Hi Pei,
> >>>>>
> >>>>> I'm not sure if that would solve the problem: the change in the ytex
> >>>>> branch causes newlines to be ignored (i.e. not treated as a token).
> >>>>> trunk's sentence splitter splits sentences on newlines, so newlines
> >>>>> would never be found in a sentence.  However, if we had a reproducer
> >>>>> we could check it fairly easily in the ytex branch.
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> VJ
> >>>>>
> >>>>>
> >>>>> On Thu, Dec 19, 2013 at 10:15 AM, Chen, Pei
> >>>>> <[email protected]>wrote:
> >>>>>
> >>>>>> Vj,
> >>>>>> Do you think this is what was causing the NPE's [1]?
> >>>>>> If so, shall we make the same fix in trunk?
> >>>>>> --Pei
> >>>>>>
> >>>>>> [1] http://mail-archives.apache.org/mod_mbox/ctakes-dev/201309.mbox/%3C924DE05C19409B438EB81DE683A942D9105A93CB%40CHEXMBX1A.CHBOSTON.ORG%3E
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: [email protected] [mailto:[email protected]]
> >>>>>> Sent: Tuesday, December 17, 2013 9:15 PM
> >>>>>> To: [email protected]
> >>>>>> Subject: svn commit: r1551805 -
> >>>>>> /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> >>>>>> Author: vjapache
> >>>>>> Date: Wed Dec 18 02:14:13 2013
> >>>>>> New Revision: 1551805
> >>>>>>
> >>>>>> URL: http://svn.apache.org/r1551805
> >>>>>> Log:
> >>>>>> add support for sentences that contain newline tokens.
> >>>>>>
> >>>>>> Modified:
> >>>>>>     ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> >>>>>>
> >>>>>> Modified: ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> >>>>>> URL: http://svn.apache.org/viewvc/ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java?rev=1551805&r1=1551804&r2=1551805&view=diff
> >>>>>> ==============================================================================
> >>>>>> --- ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java (original)
> >>>>>> +++ ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java Wed Dec 18 02:14:13 2013
> >>>>>> @@ -32,8 +32,8 @@ import org.apache.uima.jcas.tcas.Annotat
> >>>>>>  import org.mitre.medfacts.i2b2.api.ApiConcept;
> >>>>>>  import org.mitre.medfacts.zoner.CharacterOffsetToLineTokenConverter;
> >>>>>>  import org.mitre.medfacts.zoner.LineAndTokenPosition;
> >>>>>> -
> >>>>>>  import org.apache.ctakes.typesystem.type.syntax.BaseToken;
> >>>>>> +import org.apache.ctakes.typesystem.type.syntax.NewlineToken;
> >>>>>>  import org.apache.ctakes.typesystem.type.textspan.Sentence;
> >>>>>>
> >>>>>>  public class CharacterOffsetToLineTokenConverterCtakesImpl
> >>>>>> implements CharacterOffsetToLineTokenConverter
> >>>>>> @@ -78,11 +78,13 @@ public class CharacterOffsetToLineTokenC
> >>>>>>           for (Annotation current : annotationIndex)
> >>>>>>           {
> >>>>>>                   BaseToken bt = (BaseToken)current;
> >>>>>> -                 int begin = bt.getBegin();
> >>>>>> -                 int end = bt.getEnd();
> >>>>>> -
> >>>>>> -                 tokenBeginEndTreeSet.add(begin);
> >>>>>> -                 tokenBeginEndTreeSet.add(end);
> >>>>>> +                 // filter out NewlineToken
> >>>>>> +                 if (!(bt instanceof NewlineToken)) {
> >>>>>> +                         int begin = bt.getBegin();
> >>>>>> +                         int end = bt.getEnd();
> >>>>>> +                         tokenBeginEndTreeSet.add(begin);
> >>>>>> +                         tokenBeginEndTreeSet.add(end);
> >>>>>> +                 }
> >>>>>>           }
> >>>>>>    }
> >>>>>>
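The same filtering idea as the commit above, reduced to a self-contained sketch. Token and NewlineTok are hypothetical stand-ins for the cTAKES BaseToken and NewlineToken types, so this compiles without UIMA on the classpath; it is an illustration, not the committed code.

```java
import java.util.TreeSet;

// Hypothetical stand-alone version of the boundary collection: NOT cTAKES code.
class TokenBoundaries {

    /** Stand-in for a token annotation with character offsets. */
    static class Token {
        final int begin;
        final int end;

        Token(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Stand-in for a token that represents a line break. */
    static class NewlineTok extends Token {
        NewlineTok(int begin, int end) {
            super(begin, end);
        }
    }

    /** Collects begin/end offsets of all tokens except newline tokens. */
    static TreeSet<Integer> collect(Token[] tokens) {
        TreeSet<Integer> boundaries = new TreeSet<>();
        for (Token t : tokens) {
            if (t instanceof NewlineTok) {
                continue;   // a line break inside a sentence adds no boundary
            }
            boundaries.add(t.begin);
            boundaries.add(t.end);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // Offsets for the text "have \n diabetes".
        Token[] tokens = {
            new Token(0, 4), new NewlineTok(5, 6), new Token(7, 15)
        };
        System.out.println(collect(tokens));   // [0, 4, 7, 15]
    }
}
```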
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >                                       
>