[ 
https://issues.apache.org/jira/browse/OPENNLP-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045421#comment-13045421
 ] 

Nicolas Hernandez commented on OPENNLP-197:
-------------------------------------------

Actually, the system works as intended.

Models are built regardless of what the sentences look like. For example, 
imagine a text with an arbitrary amount of whitespace (including newlines) 
between tokens. That is not a problem if you only handle sentence and token 
annotations, but it may be a problem if you use the covered text of the 
sentences.
Such texts are not rare: they typically come from stripping XML tags or from 
pdf2text conversions.

Maybe this is not an OpenNLP UIMA trainer issue, but the user should be warned 
about it.
In practice, the OpenNLP UIMA trainer cannot be used as is on conventional 
texts. The text used as a training resource has to be formatted 
appropriately first.
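
As a rough illustration of the kind of formatting meant here, the sketch below 
collapses whitespace runs (including newlines) in a sentence's covered text so 
that each sentence ends up on a single line. The class and method names are 
hypothetical and are not part of OpenNLP or its UIMA integration; the input is 
assumed to be the covered text of each Sentence annotation, already extracted 
as plain strings.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of OpenNLP or its UIMA integration.
final class SentenceLineNormalizer {

    // Matches one or more whitespace characters (space, tab, newline, ...).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private SentenceLineNormalizer() {}

    // Collapses every whitespace run to a single space and trims the ends,
    // so the sentence no longer contains any line break.
    static String normalize(String coveredText) {
        return WHITESPACE_RUN.matcher(coveredText).replaceAll(" ").trim();
    }

    // Normalizes each covered text, drops those that become empty,
    // and joins the rest with '\n': one sentence per line.
    static String toTrainingLines(List<String> coveredTexts) {
        return coveredTexts.stream()
                .map(SentenceLineNormalizer::normalize)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.joining("\n"));
    }
}
```

Whether such a normalization belongs in the trainer itself or in a 
preprocessing step before it is a separate question; the sketch only shows the 
transformation.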



> The UIMA "Sentence Detector Trainer" may build erratic models depending on 
> the covered text format of the sentence annotations.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-197
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-197
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: UIMA Integration
>            Reporter: Nicolas Hernandez
>
> In the opennlp-uima subproject, the "Sentence Detector Training" component 
> asks for a Sentence annotation type as a parameter. 
> The component does not check whether each corresponding sentence is written 
> on its own line. 
> As a result, the built model may not work as expected.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
