[
https://issues.apache.org/jira/browse/OPENNLP-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045886#comment-13045886
]
Jörn Kottmann commented on OPENNLP-197:
---------------------------------------
I confused things. The tokenizer does whitespace tokenization first; for some
reason I thought the sentence detector does too.
The code which deals with the whitespace feature generation is here:
http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/DefaultSDContextGenerator.java?view=markup
The sentence detector generates features for the whitespace between sentences,
and if there is none in the training data it seems to make a difference. I
guess we have to fix the code a little to handle this more robustly.
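
To illustrate why missing whitespace in the training data can matter, here is a
minimal, self-contained sketch. It is not the code from
DefaultSDContextGenerator; the class name, the method whitespaceFeatures and the
feature names "sp", "sn" and "eos=" are assumptions chosen for illustration. It
only shows that a candidate sentence end followed by whitespace yields a
different feature set than one that is glued to the next sentence.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceFeatureSketch {

    // Hypothetical helper: collects whitespace-related features for a
    // candidate end-of-sentence character at the given position.
    static List<String> whitespaceFeatures(CharSequence text, int position) {
        List<String> feats = new ArrayList<>();
        // "sp" (assumed name): whitespace directly before the candidate
        if (position > 0 && Character.isWhitespace(text.charAt(position - 1))) {
            feats.add("sp");
        }
        // "sn" (assumed name): whitespace directly after the candidate
        if (position < text.length() - 1
                && Character.isWhitespace(text.charAt(position + 1))) {
            feats.add("sn");
        }
        // The candidate character itself
        feats.add("eos=" + text.charAt(position));
        return feats;
    }

    public static void main(String[] args) {
        String spaced = "First sentence. Second one.";
        String joined = "First sentence.Second one.";
        int pos = spaced.indexOf('.'); // same index in both strings

        System.out.println(whitespaceFeatures(spaced, pos)); // [sn, eos=.]
        System.out.println(whitespaceFeatures(joined, pos)); // [eos=.]
    }
}
```

If the training text never contains whitespace after a sentence end, a feature
like "sn" is never observed during training, so a model trained that way may
behave differently on normally spaced input.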
> The UIMA "Sentence Detector Trainer" may build erratic models depending on
> the covered text format of the sentence annotations.
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: OPENNLP-197
> URL: https://issues.apache.org/jira/browse/OPENNLP-197
> Project: OpenNLP
> Issue Type: Bug
> Components: UIMA Integration
> Reporter: Nicolas Hernandez
> Attachments: fr-sent.zip
>
>
> In the opennlp-uima subproject, the "Sentence Detector Training" component
> asks for a Sentence annotation type as a parameter.
> The component does not check whether each corresponding sentence is written
> on its own line.
> As a result, the built model does not work as expected.