[
https://issues.apache.org/jira/browse/OPENNLP-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13045868#comment-13045868
]
Nicolas Hernandez commented on OPENNLP-197:
-------------------------------------------
Yes.
Before answering, I tried several training configurations: I tested two
different corpora (the French Treebank
http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php and a subset of
Europarl http://www.statmt.org/europarl/), with or without extra whitespace
characters in the lines, via the command line and via UIMA.
21K frenchTreebank-uima-fr-sent-wi-wst.bin (one sentence per line, with tokens
separated by a whitespace character)
3.1K frenchTreebank-uima-fr-sent-wo-wst.bin (untagged text with tokens at the
same offsets as in the XML version, i.e. with newline and whitespace characters
inside sentences)
291K europarl-uima-fr-sent.bin (the corpus comes with one sentence or tag per
line; I just filtered out the tags)
16K europarl-cmdline-fr-sent.bin (the same corpus in the same format as the
previous one, but trained via the command line)
237K europarl-cmdline-ws-at-the-end-fr-sent.bin (the same corpus in the same
format as the previous one, except that one whitespace character has been added
at the end of each line; trained via the command line)
It was not easy to test the French Treebank via the command line, so I did not
do it.
The best model seems to be europarl-cmdline-ws-at-the-end-fr-sent.bin
First, you can observe that training with or without whitespace characters, via
the command line or via UIMA, leads to different model file sizes.
In practice, sentence detection does indeed behave differently with each of
these models.
Why is there a difference between training via the command line and via UIMA?
Here are a few ideas and questions.
First, looking at the code, I did not find where the whitespace-based
tokenization is performed.
I found that line 126 of
http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-uima/src/main/java/opennlp/uima/sentdetect/SentenceDetectorTrainer.java?view=markup
    sentenceSamples.add(new SentenceSample(cas.getDocumentText(), sentSpans));
calls the constructor at line 46 of
http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/SentenceSample.java?view=markup
    public SentenceSample(String document, Span... sentences) {
which does not tokenize the lines as you mentioned.
Am I right?
The sentences in the examples of the Sentence Detection API
http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.sentdetect.detection.api
seem to have extra whitespace characters at the beginning and at the end. Is
that important?
I have not dug into the OpenNLP code or into the command-line process, but I
read that a whitespace character is added at the end of each sentence. See line
51 of
http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/SentenceSampleStream.java?view=markup
Maybe this is not the right code, since I see that the sentences are trimmed,
and I do not understand why adding a whitespace character at the end of each
line changes anything...
Furthermore, I have noticed another "funny" thing. I am not sure whether I
should open a separate issue for it or whether it is linked to the previous
item. The models trained via UIMA (and without whitespace characters) manage
to split text into sentences but give wrong begin/end offsets. The begin offset
of the current sentence is equal to the end offset of the previous one, and
the current sentence includes, as its first token, the final punctuation
character of the previous sentence, while the previous sentence does not
include it.
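
The symptom can be checked mechanically. The following is a minimal sketch
using only the Java standard library: it flags every sentence whose begin
offset equals the previous sentence's end offset and whose first character is
sentence-final punctuation. The class SpanBoundaryCheck, its Span type, and the
choice of ".", "!" and "?" as punctuation are assumptions made for
illustration; they are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: detects the offset symptom
// described above in a list of detected sentence spans.
class SpanBoundaryCheck {

    /** Hypothetical character span, end exclusive. */
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Returns the indices of sentences that start on the previous sentence's punctuation. */
    static List<Integer> suspectSentences(String text, List<Span> spans) {
        List<Integer> suspects = new ArrayList<>();
        for (int i = 1; i < spans.size(); i++) {
            Span prev = spans.get(i - 1);
            Span cur = spans.get(i);
            boolean touching = cur.begin == prev.end;
            boolean startsWithPunctuation = cur.begin >= 0
                    && cur.begin < cur.end
                    && cur.begin < text.length()
                    && ".!?".indexOf(text.charAt(cur.begin)) >= 0;
            if (touching && startsWithPunctuation) {
                suspects.add(i);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        String text = "Bonjour. Au revoir.";
        // Offsets showing the symptom: the "." of the first sentence opens the second.
        List<Span> shifted = Arrays.asList(new Span(0, 7), new Span(7, 19));
        // Expected offsets: each sentence keeps its own final punctuation.
        List<Span> expected = Arrays.asList(new Span(0, 8), new Span(9, 19));
        System.out.println(suspectSentences(text, shifted));  // prints "[1]"
        System.out.println(suspectSentences(text, expected)); // prints "[]"
    }
}
```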
To test the sentence detection, I have attached the trained models and a sample
text.
> The UIMA "Sentence Detector Trainer" may build erratic models depending on
> the covered text format of the sentence annotations.
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: OPENNLP-197
> URL: https://issues.apache.org/jira/browse/OPENNLP-197
> Project: OpenNLP
> Issue Type: Bug
> Components: UIMA Integration
> Reporter: Nicolas Hernandez
> Attachments: fr-sent.zip
>
>
> In the opennlp-uima subproject, the "Sentence Detector Training" component
> asks for a Sentence annotation type as a parameter.
> The component does not check whether each corresponding sentence is written
> in its own line.
> As a matter of fact the built model would not work as expected.