[jira] [Updated] (CTAKES-155) SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters

Pei Chen (JIRA) Thu, 30 Apr 2015 13:15:35 -0700

     [ https://issues.apache.org/jira/browse/CTAKES-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Pei Chen updated CTAKES-155:
----------------------------
    Fix Version/s:     (was: 3.2.2)
                   3.2.3

> SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters
> -------------------------------------------------------------------------
>
>                 Key: CTAKES-155
>                 URL: https://issues.apache.org/jira/browse/CTAKES-155
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-core
>    Affects Versions: 3.0-incubating
>            Reporter: Steven Bethard
>             Fix For: 3.2.3
>
>
> The code in SimpleSegmentWithTagsAnnotator is a bit hard to follow, but I 
> believe it assumes all sections are 5 characters long here:
> {code:java}
>       fileReader.read(sectIdArr, 0, 5);
> {code}
> As a result, when the section name is longer than that, some part of the 
> section heading (e.g. for a 6-letter section name, the final "]") is left in 
> the text of the next section. This results, for example, in the dependency 
> parser choking:
> {code:java}
> Caused by: java.lang.NullPointerException
>       at clear.pos.PosEnLib.isNoun(PosEnLib.java:56)
>       at clear.morph.MorphEnAnalyzer.getException(MorphEnAnalyzer.java:273)
>       at clear.morph.MorphEnAnalyzer.getLemma(MorphEnAnalyzer.java:247)
> {code}
> I would fix this but:
> (1) There are no tests for SimpleSegmentWithTagsAnnotator and its 
> documentation actually says "Creates a single segment annotation that spans 
> the entire document" which is just untrue, so I'm not really sure what this 
> annotator is intended to do.
> (2) Even if I make some assumptions about what it's intended to do, the code 
> is written in an extremely brittle fashion, and I'm afraid to make changes to 
> that. For what it's worth, here's what I think the annotator should really 
> look like:
> {code:java}
>   public static class SegmentsFromBracketedSectionTagsAnnotator
>       extends JCasAnnotator_ImplBase {
>     private static final Pattern SECTION_PATTERN = Pattern.compile(
>         "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
>         Pattern.DOTALL);
>     @Override
>     public void process(JCas jCas) throws AnalysisEngineProcessException {
>       Matcher matcher = SECTION_PATTERN.matcher(jCas.getDocumentText());
>       while (matcher.find()) {
>         Segment segment = new Segment(jCas);
>         segment.setBegin(matcher.start() + matcher.group(1).length());
>         segment.setEnd(matcher.end() - matcher.group(3).length());
>         segment.setId(matcher.group(2));
>         segment.addToIndexes();
>       }
>     }
>   }
> {code}
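
The regular expression proposed above can be tried without a UIMA pipeline. The following is a minimal sketch, assuming only the tag format implied by that pattern; the class name BracketedSectionDemo, the method extractSections, and the sample note are hypothetical and not part of cTAKES. It applies the same begin/end arithmetic to a plain string, so section ids of any length (5 characters or more, quoted or unquoted) are stripped cleanly from the section text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; not part of cTAKES.
final class BracketedSectionDemo {
  // Same expression as in the issue description:
  // group 1 = start tag, group 2 = section id, group 3 = end tag.
  private static final Pattern SECTION_PATTERN = Pattern.compile(
      "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
      Pattern.DOTALL);

  // Returns section id -> section text, in document order.
  static Map<String, String> extractSections(String text) {
    Map<String, String> sections = new LinkedHashMap<>();
    Matcher matcher = SECTION_PATTERN.matcher(text);
    while (matcher.find()) {
      // Skip past the whole start tag, however long the id is.
      int begin = matcher.start() + matcher.group(1).length();
      // Stop right before the end tag.
      int end = matcher.end() - matcher.group(3).length();
      sections.put(matcher.group(2), text.substring(begin, end));
    }
    return sections;
  }

  public static void main(String[] args) {
    // Made-up sample text: one 5-character id and one 7-character id.
    String note = "[start section id=\"20101\"]History text.[end section id=\"20101\"]\n"
        + "[start section id=\"2010112\"]Plan text.[end section id=\"2010112\"]";
    extractSections(note).forEach((id, body) -> System.out.println(id + " -> " + body));
  }
}
```

Because the offsets are derived from the matched tag lengths rather than a fixed count of 5 characters, no part of the heading leaks into the section text. Note that a map keeps only the last section for a repeated id; the annotator sketched in the issue creates one Segment per match instead.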



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
