[ 
https://issues.apache.org/jira/browse/OPENNLP-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046012#comment-13046012
 ] 

Nicolas Hernandez commented on OPENNLP-197:
-------------------------------------------

To observe the difference in model file size when training is performed 
from the command line, I used two corpora that differ only in a whitespace 
character added at the end of each line of one of them. 

I ran the test with version 6 of the Europarl corpus:
http://www.statmt.org/europarl/

Here is the procedure to reproduce the problem:

# Download and unpack the French-English part of the corpus
wget http://www.statmt.org/europarl/v6/fr-en.tgz
tar xvzf fr-en.tgz

# Train on the first 1M lines, unmodified
cat europarl-v6.fr-en.fr | head -n 1000000 > europarl-v6.fr-en.fr.1M
opennlp SentenceDetectorTrainer -encoding UTF-8 -lang fr \
  -data europarl-v6.fr-en.fr.1M -model europarl6-cmdLine-1M-fr-sent.bin

# Train on the same 1M lines with one space appended to each line
cat europarl-v6.fr-en.fr | head -n 1000000 | perl -ne 'chomp(); print "$_ \n";' \
  > europarl-v6.fr-en.fr.1M.ws
opennlp SentenceDetectorTrainer -encoding UTF-8 -lang fr \
  -data europarl-v6.fr-en.fr.1M.ws -model europarl6-cmdLine-1M-ws-fr-sent.bin

# Compare the sizes of the two model files
ls -lh

17K europarl6-cmdLine-1M-fr-sent.bin
470K europarl6-cmdLine-1M-ws-fr-sent.bin

I used only 1 million sentences because of a Java heap size issue. I used 
opennlp 1.5., Java 1.6, and ran the programs on Ubuntu Linux 10.04.

> The UIMA "Sentence Detector Trainer" may build erratic models depending on 
> the covered text format of the sentence annotations.
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-197
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-197
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: UIMA Integration
>            Reporter: Nicolas Hernandez
>         Attachments: fr-sent.zip
>
>
> In the opennlp-uima subproject, the "Sentence Detector Training" component 
> asks for a Sentence annotation type as a parameter. 
> The component does not check whether each corresponding sentence is written 
> on its own line. 
> As a result, the built model may not work as expected.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
