[ 
https://issues.apache.org/jira/browse/OPENNLP-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045868#comment-13045868
 ] 

Nicolas Hernandez commented on OPENNLP-197:
-------------------------------------------

Yes.

Before answering, I tried several training configurations: I tested 2 
different corpora (French Treebank 
http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php and a subpart of 
Europarl http://www.statmt.org/europarl/), with or without extra whitespace 
characters in the lines, via the command line and via UIMA.

Here are the resulting model files and their sizes:

21K frenchTreebank-uima-fr-sent-wi-wst.bin (one sentence per line, tokens 
separated by a whitespace character)
3.1K frenchTreebank-uima-fr-sent-wo-wst.bin (untagged text, tokens at the 
same offsets as in the XML version, i.e. with newline and whitespace characters 
inside sentences)
291K europarl-uima-fr-sent.bin (the corpus comes with one sentence or tag per 
line; I just filtered out the tags)
16K europarl-cmdline-fr-sent.bin (the same corpus in the same format as the 
previous one, but trained via the command line)
237K europarl-cmdline-ws-at-the-end-fr-sent.bin (the same corpus in the same 
format as the previous one, except that one whitespace character has been added 
at the end of each line; trained via the command line)

It was not easy to test with the French Treebank via the command line, so I did 
not do it.

The best model seems to be europarl-cmdline-ws-at-the-end-fr-sent.bin

First, you can see that training with or without whitespace characters, via the 
command line or via UIMA, leads to different model file sizes. 
In practice, the sentence detection is indeed different with each of these 
models.

Why is there a difference between training via the command line and via UIMA? 
Here are two or three ideas and questions. 

First, looking at the code, I could not find where the whitespace-based 
tokenization is performed.
I found that this line in 
http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/sentdetect/SentenceDetectorTrainer.java?view=markup
126         sentenceSamples.add(new SentenceSample(cas.getDocumentText(), 
sentSpans));
calls the constructor in
http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/SentenceSample.java?view=markup
46       public SentenceSample(String document, Span... sentences) {
which does not tokenize the lines as you mentioned.

Am I right?

The sentences in the examples of the Sentence Detection API
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.sentdetect.detection.api
seem to have extra whitespace characters at the beginning and at the end. Is 
that important? 
I did not dig into the OpenNLP code or into the command line process, but I 
read that a whitespace character is added at the end of each sentence. See line 
51 of
http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/SentenceSampleStream.java?view=markup
Maybe this is not the right code, since I see that the sentences are trimmed, 
and I do not understand why adding a whitespace character at the end of each 
line changes anything...
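
To make the question concrete, here is a minimal sketch of the behaviour 
described above: each line is trimmed, appended to the document, and followed 
by a single whitespace character. It does not use OpenNLP; the class 
TrimJoinSketch, its Sample holder and the method fromLines are hypothetical 
names for illustration only, and the real SentenceSampleStream may differ. 
Under these assumptions, a trailing whitespace on each input line should not 
change the resulting document or spans.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the OpenNLP implementation.
public class TrimJoinSketch {

    /** Joined document text plus the [begin, end) offsets of each sentence. */
    public static final class Sample {
        public final String document;
        public final List<int[]> spans;

        Sample(String document, List<int[]> spans) {
            this.document = document;
            this.spans = spans;
        }
    }

    /** Builds one document from one-sentence-per-line input. */
    public static Sample fromLines(String[] lines) {
        StringBuilder doc = new StringBuilder();
        List<int[]> spans = new ArrayList<>();
        for (String line : lines) {
            String sentence = line.trim();   // drops leading/trailing whitespace
            if (sentence.isEmpty()) {
                continue;                    // skip blank lines
            }
            int begin = doc.length();
            doc.append(sentence);
            spans.add(new int[] {begin, doc.length()});
            doc.append(' ');                 // separator added after every sentence
        }
        return new Sample(doc.toString(), spans);
    }

    public static void main(String[] args) {
        Sample plain = fromLines(new String[] {"Bonjour le monde .", "Ca va ?"});
        Sample padded = fromLines(new String[] {"Bonjour le monde . ", "Ca va ? "});
        // With trimming, both inputs give the same document text.
        System.out.println(plain.document.equals(padded.document));
    }
}
```

If the command line training really worked this way, the two europarl-cmdline 
models would be built from identical samples, so the size difference suggests 
that something else in the pipeline is sensitive to the trailing whitespace.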

Furthermore, I noticed another "funny" thing. I am not sure whether I should 
open a separate issue for it or whether it is linked to the previous item. The 
models trained via UIMA (and without whitespace characters) manage to split 
text into sentences but give wrong begin/end offsets. I observed that the begin 
offset of a sentence is equal to the end offset of the previous one, and that 
the sentence includes the final punctuation character of the previous one as 
its first token, while the previous one does not include it. 
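
The offset symptom described above can be checked mechanically. The following 
is a minimal sketch, assuming sentence spans are given as [begin, end) offset 
pairs into the text; the class SpanOffsetCheck and its method are hypothetical 
names, and the set of end-of-sentence characters ('.', '!', '?') is an 
assumption for illustration.

```java
// Hypothetical sketch for spotting the offset symptom; not part of OpenNLP.
public class SpanOffsetCheck {

    /**
     * Returns true if the current span begins exactly where the previous one
     * ended and its first character is end-of-sentence punctuation.
     * Spans are {begin, end} pairs with an exclusive end offset.
     */
    public static boolean startsWithPreviousPunctuation(String text, int[] prev, int[] cur) {
        if (cur[0] != prev[1] || cur[0] >= cur[1]) {
            return false;
        }
        char first = text.charAt(cur[0]);
        // Assumed set of sentence-final characters.
        return first == '.' || first == '!' || first == '?';
    }

    public static void main(String[] args) {
        String text = "Il pleut. Il neige.";
        // Spans as the faulty models would produce them: the period of the
        // first sentence lands at the start of the second one.
        int[] wrongFirst = {0, 8};
        int[] wrongSecond = {8, 19};
        System.out.println(startsWithPreviousPunctuation(text, wrongFirst, wrongSecond));
        // Expected spans: the period stays with the first sentence.
        int[] rightFirst = {0, 9};
        int[] rightSecond = {10, 19};
        System.out.println(startsWithPreviousPunctuation(text, rightFirst, rightSecond));
    }
}
```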

For testing the sentence detection, I attach the trained models and a text 
example.


> The UIMA "Sentence Detector Trainer" may build erratic models depending on 
> the covered text format of the sentence annotations.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-197
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-197
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: UIMA Integration
>            Reporter: Nicolas Hernandez
>         Attachments: fr-sent.zip
>
>
> In the opennlp-uima subproject, the "Sentence Detector Training" component 
> asks for a Sentence annotation type as a parameter. 
> The component does not check whether each corresponding sentence is written 
> in its own line. 
> As a matter of fact the built model would not work as expected.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
