RE: sentence splitter & forks/branches

digital paula Fri, 17 Jan 2014 19:46:31 -0800

Hello again cTAKES Community,

I thought that adding the sentence splitter (with newline
sentence-continuation recognition) would be as simple as adding the sectionizer
annotator to the Eclipse environment was. I see from VJ's note that it's not
that simple: my understanding is that the standard clinical pipeline requires
the assertion and dependency parsers. I've explored some of the changes needed,
and at least for assertion it looks like SentenceDetector, SentenceSpan, and
likely the SingleDocumentProcessor from the MITRE jar will need to be modified
to recognize multi-line sentences, so that the assertion and dependency parsers
can be kept in the pipeline. I would love to devote the time needed to fix the
sentence splitter so it recognizes multi-line sentences, but I need to focus on
hacking my way through the cue word issue because I've been left in the lurch
with no response to my posts  :-(((((
 
Regards,
Paula
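
To illustrate the behaviour discussed in this thread, here is a minimal sketch
in plain Java. It is not cTAKES or OpenNLP code: the class name
SentenceSplitSketch, the split method, and the punctuation-based boundary rule
are all assumptions made for illustration, standing in for a real sentence
detector. It only shows how treating newlines as sentence boundaries cuts
"patient did not have / diabetes" into two sentences, while treating them as
ordinary whitespace keeps the negation cue and the term together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; a regex stands in for a real sentence detector.
class SentenceSplitSketch {

    // Splits after '.', '!' or '?' followed by whitespace. When
    // newlineIsBoundary is true, every run of newlines is also a boundary.
    static List<String> split(String text, boolean newlineIsBoundary) {
        String boundary = newlineIsBoundary
                ? "(?<=[.!?])\\s+|(?:\\r?\\n)+"
                : "(?<=[.!?])\\s+";
        List<String> sentences = new ArrayList<>();
        for (String piece : text.split(boundary)) {
            // Collapse any remaining newlines and repeated spaces.
            String normalized = piece.replaceAll("\\s+", " ").trim();
            if (!normalized.isEmpty()) {
                sentences.add(normalized);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String note = "patient did not have\ndiabetes. No fever.";
        // Newlines as boundaries: the negation cue and "diabetes" are separated.
        System.out.println(split(note, true));
        // prints [patient did not have, diabetes., No fever.]
        // Newlines as whitespace: the first sentence stays whole.
        System.out.println(split(note, false));
        // prints [patient did not have diabetes., No fever.]
    }
}
```

A negation step that only looks within one sentence would see "did not" and
"diabetes" together only in the second case.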
 
> Date: Wed, 15 Jan 2014 14:53:17 -0500
> Subject: Re: sentence splitter & forks/branches
> From: [email protected]
> To: [email protected]
> 
> It is unfortunately not that trivial, as allowing newlines within sentences
> requires changes to the assertion and dependency parser modules.
> 
> If you're not using those AEs you could theoretically build the ytex
> branch, and just add ctakes-ytex-uima.jar and
> ctakes-ytex-uima\desc\analysis_engine\SentenceDetectorAnnotator.xml to your
> existing ctakes install (haven't tried it, but it should work).
> 
> -vj
> 
> 
> On Wed, Jan 15, 2014 at 1:57 PM, Lingren, Todd <[email protected]> wrote:
> 
> > I have a general question about forks, specifically the YTEX branch that
> > Vijay mentions.
> > If I wanted to implement just the sentence splitter from YTEX into a
> > currently existing 3.1 install, how would I do that? Is it possible? Or do
> > I have to switch over completely to run from YTEX branch?
> >
> > Todd Lingren
> > Biomedical Informatics
> > Cincinnati Children's Hospital
> > [email protected]
> > 513-803-9032
> >
> >
> > -----Original Message-----
> > From: vijay garla [mailto:[email protected]]
> > Sent: Wednesday, January 15, 2014 11:34 AM
> > To: [email protected]
> > Subject: Re: svn commit: r1551805 -
> > /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> >
> > The issue is indeed the sentence splitter - negation is limited to words
> > within the sentence, and if newlines are considered sentence boundaries, it
> > doesn't work properly (splitting on newlines breaks many other things as
> > well).  The YTEX branch includes a sentence splitter that does not
> > automatically split sentences on newlines.
> >
> > best,
> >
> > vj
> >
> >
> > On Wed, Jan 15, 2014 at 10:03 AM, Masanz, James J. <[email protected]> wrote:
> >
> > > Hi Paula,
> > >
> > > The sentence detector in 3.1.0 and 3.1.1 (and previous releases)
> > > assumes sentences don't cross line boundaries.
> > > OpenNLP is used to find sentence breaks, but then if newlines are
> > > found, those are also set (within cTAKES, not OpenNLP) to be sentence
> > > breaks.
> > >
> > > (just FYI I haven't had a chance to look at the ytex branch, which the
> > > subject commit is about)
> > >
> > > -- James
> > >
> > > -----Original Message-----
> > > From: [email protected] [mailto:
> > > [email protected]] On Behalf Of
> > > digital paula
> > > Sent: Tuesday, January 14, 2014 10:25 PM
> > > To: [email protected]
> > > > Subject: RE: svn commit: r1551805 -
> > > > /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > > Hello cTAKES Developer Community,
> > > > I'm a little behind on reading posts... this one is from last month.
> > > > I think this issue is already addressed in the current release? I'm
> > > > still running the previous release, 3.1.0.
> > > > I just noticed something interesting: negation didn't take when the
> > > > negated term is on a different line. I removed all carriage returns
> > > > from the narratives, and negation picked it up as long as the text is
> > > > treated as one long string. To better explain what I mean, here are
> > > > two narrative comments:
> > >
> > > 1.  patient did not have diabetes
> > > 2. patient did not have
> > > diabetes
> > >
> > > > Number 1 above got negated but number 2 did not. This might be related
> > > > to the issue with the sectionizer. I noticed that when I treated the
> > > > narrative as one string, the sectionizer never crashed with the NPE.
> > > > The sectionizer is pointless if the narrative is one string, but it's
> > > > helping me pinpoint the problem.
> > >
> > > Regards,
> > > Paula
> > >
> > >
> > > > Date: Thu, 19 Dec 2013 11:04:57 -0500
> > > > > Subject: Re: FW: svn commit: r1551805 -
> > > > > /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> > > > From: [email protected]
> > > > To: [email protected]
> > > >
> > > > Hi Pei,
> > > >
> > > > > I'm not sure that would solve the problem: the change in the ytex
> > > > > branch causes newlines to be ignored (i.e., not treated as a token).
> > > > > Trunk's sentence splitter splits sentences on newlines, so newlines
> > > > > would never be found in a sentence. However, if we had a reproducer,
> > > > > we could check it fairly easily in the ytex branch.
> > > >
> > > > Best,
> > > >
> > > > VJ
> > > >
> > > >
> > > > > On Thu, Dec 19, 2013 at 10:15 AM, Chen, Pei <[email protected]> wrote:
> > > >
> > > > > Vj,
> > > > > Do you think this is what was causing the NPE's [1]?
> > > > > If so, shall we make the same fix in trunk?
> > > > > --Pei
> > > > >
> > > > > [1]
> > > > >
> > > http://mail-archives.apache.org/mod_mbox/ctakes-dev/201309.mbox/%3C924
> > > DE05C19409B438EB81DE683A942D9105A93CB%40CHEXMBX1A.CHBOSTON.ORG%3E
> > > > >
> > > > > -----Original Message-----
> > > > > From: [email protected] [mailto:[email protected]]
> > > > > Sent: Tuesday, December 17, 2013 9:15 PM
> > > > > To: [email protected]
> > > > > > Subject: svn commit: r1551805 -
> > > > > > /ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> > > > >
> > > > > Author: vjapache
> > > > > Date: Wed Dec 18 02:14:13 2013
> > > > > New Revision: 1551805
> > > > >
> > > > > URL: http://svn.apache.org/r1551805
> > > > > Log:
> > > > > add support for sentences that contain newline tokens.
> > > > >
> > > > > > Modified:
> > > > > >     ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> > > > > >
> > > > > > Modified: ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java
> > > > > > URL: http://svn.apache.org/viewvc/ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesImpl.java?rev=1551805&r1=1551804&r2=1551805&view=diff
> > > > > > ==============================================================================
> > > > > ---
> > > > >
> > > ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctakes/
> > > assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCtakesI
> > > mpl.java
> > > > > (original)
> > > > > +++
> > > ctakes/branches/ytex/ctakes-assertion/src/main/java/org/apache/ctake
> > > > > +++
> > > s/assertion/medfacts/i2b2/api/CharacterOffsetToLineTokenConverterCta
> > > > > +++ kesImpl.java Wed Dec 18 02:14:13 2013
> > > > > @@ -32,8 +32,8 @@ import org.apache.uima.jcas.tcas.Annotat  import
> > > > > org.mitre.medfacts.i2b2.api.ApiConcept;
> > > > >  import
> > > > > org.mitre.medfacts.zoner.CharacterOffsetToLineTokenConverter;
> > > > >  import org.mitre.medfacts.zoner.LineAndTokenPosition;
> > > > > -
> > > > >  import org.apache.ctakes.typesystem.type.syntax.BaseToken;
> > > > > +import org.apache.ctakes.typesystem.type.syntax.NewlineToken;
> > > > >  import org.apache.ctakes.typesystem.type.textspan.Sentence;
> > > > >
> > > > >  public class CharacterOffsetToLineTokenConverterCtakesImpl
> > > > > implements CharacterOffsetToLineTokenConverter
> > > > > @@ -78,11 +78,13 @@ public class CharacterOffsetToLineTokenC
> > > > >           for (Annotation current : annotationIndex)
> > > > >           {
> > > > >                   BaseToken bt = (BaseToken)current;
> > > > > -                 int begin = bt.getBegin();
> > > > > -                 int end = bt.getEnd();
> > > > > -
> > > > > -                 tokenBeginEndTreeSet.add(begin);
> > > > > -                 tokenBeginEndTreeSet.add(end);
> > > > > +                 // filter out NewlineToken
> > > > > +                 if (!(bt instanceof NewlineToken)) {
> > > > > +                         int begin = bt.getBegin();
> > > > > +                         int end = bt.getEnd();
> > > > > +                         tokenBeginEndTreeSet.add(begin);
> > > > > +                         tokenBeginEndTreeSet.add(end);
> > > > > +                 }
> > > > >           }
> > > > >    }
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
> > >
> >
> >
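
The filtering step in the r1551805 diff quoted above can be sketched outside
cTAKES as follows. This is a simplified illustration, not the committed code:
Token and Newline are hypothetical stand-ins for the cTAKES BaseToken and
NewlineToken types, and NewlineFilterSketch and tokenOffsets are names invented
here. The sketch collects token begin/end offsets into a TreeSet and skips any
token that is a newline, as the patched loop does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of the offset-collection loop from the diff.
class NewlineFilterSketch {

    // Stand-in for a generic token with character offsets.
    static class Token {
        final int begin;
        final int end;
        Token(int begin, int end) { this.begin = begin; this.end = end; }
    }

    // Stand-in for a token that represents a line break.
    static class Newline extends Token {
        Newline(int begin, int end) { super(begin, end); }
    }

    // Collects begin/end offsets of all tokens except newline tokens.
    static TreeSet<Integer> tokenOffsets(List<Token> tokens) {
        TreeSet<Integer> offsets = new TreeSet<>();
        for (Token t : tokens) {
            // filter out newline tokens, mirroring the instanceof check in the diff
            if (!(t instanceof Newline)) {
                offsets.add(t.begin);
                offsets.add(t.end);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Offsets for the text "have \n diabetes": word, newline, word.
        List<Token> tokens = Arrays.asList(
                new Token(0, 4), new Newline(5, 6), new Token(7, 15));
        System.out.println(tokenOffsets(tokens)); // prints [0, 4, 7, 15]
    }
}
```

With the filter in place, the newline's own offsets (5 and 6 in this example)
never enter the set, so only real word boundaries are recorded.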