[ https://issues.apache.org/jira/browse/OPENNLP-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876444#comment-13876444 ]

Joern Kottmann commented on OPENNLP-602:
----------------------------------------

The Sentence Detector already supports that. A user can specify at training
time which end-of-sentence chars should be used. The problem with the current
design is that a newline char is used as the sentence separator in our training
file format. Therefore a user can't use a newline char as part of the training
data, and also not as an end-of-sentence char.

To fix this I suggest that we specify that newline chars can be encoded as <CR> 
and <LF> in the training data.
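
A minimal sketch of what such an encoding could look like, assuming a plain
string replacement is enough. The class and method names here are hypothetical
and not part of OpenNLP, and the sketch does not handle text that already
contains a literal "<CR>" or "<LF>":

```java
public class NewlineEscapeSketch {

    // Hypothetical helper, not an OpenNLP API: replaces real line breaks with
    // the <CR> and <LF> tokens so a sentence fits on one training-file line.
    static String encode(String sentence) {
        return sentence.replace("\r", "<CR>").replace("\n", "<LF>");
    }

    // Reverses encode(): turns the tokens back into real line-break chars,
    // e.g. when a training line is read back in.
    static String decode(String line) {
        return line.replace("<CR>", "\r").replace("<LF>", "\n");
    }

    public static void main(String[] args) {
        String sentence = "A heading without punctuation\r\n";
        String encoded = encode(sentence);
        // The encoded form contains no raw line breaks, so it can be written
        // as a single line of the training file.
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(sentence));
    }
}
```

With something like this, the one-sentence-per-line file format stays as it
is, and a newline can still appear inside a sentence or as its last char.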

Any opinions about that?

> SentenceDetector should support new line as an end of sentence char
> -------------------------------------------------------------------
>
>                 Key: OPENNLP-602
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-602
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Sentence Detector
>    Affects Versions: tools-1.5.3
>            Reporter: Joern Kottmann
>            Assignee: Joern Kottmann
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> The Sentence Detector should support treating new line chars as the end of a
> sentence. This will probably require special handling in the training code to
> assume that there is a new line char if any other eos is missing.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
