[ 
https://issues.apache.org/jira/browse/OPENNLP-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149517#comment-13149517
 ] 

Joern Kottmann commented on OPENNLP-371:
----------------------------------------

The trained model cannot work at all because it will always make the same 
decision. As James said, the trainer should recognize that the training data 
contains only one outcome and then report an appropriate error message.
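
A minimal sketch of such a check, under stated assumptions: the class name 
OutcomeCheck, the method requireOutcomes, and the outcome labels "SPLIT" and 
"NO_SPLIT" are hypothetical placeholders for illustration, not the OpenNLP API 
or its actual outcome strings. The idea is only to compare the outcomes seen 
in the training events against the outcomes the tool needs, and to name the 
missing ones in the error message.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of OpenNLP: validates training outcomes
// before a model is built, so the failure points at the training data.
final class OutcomeCheck {

    // observed: the outcome label of every training event
    // required: the outcome labels the tool cannot work without
    static void requireOutcomes(List<String> observed, Set<String> required) {
        // Start from the required labels and drop every label that occurs.
        Set<String> missing = new TreeSet<>(required);
        missing.removeAll(observed);

        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                "Training data is not usable for the tokenizer: outcome(s) "
                    + missing + " never occur in the training events");
        }
    }
}
```

Called with training events that contain only "NO_SPLIT" and a required set of 
"SPLIT" and "NO_SPLIT", this throws an IllegalArgumentException whose message 
names the missing outcome instead of reporting a generic incompatibility.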
                
> Confusing error message in tokenizer training
> ---------------------------------------------
>
>                 Key: OPENNLP-371
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-371
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Tokenizer
>    Affects Versions: tools-1.5.3-incubating
>            Reporter: Aliaksandr Autayeu
>            Priority: Minor
>              Labels: model, tokenizer, training
>
> The following error message
> java.lang.IllegalArgumentException: The maxent model is not compatible with the tokenizer!
>       at opennlp.tools.util.model.BaseModel.checkArtifactMap(BaseModel.java:275)
>       at opennlp.tools.tokenize.TokenizerModel.<init>(TokenizerModel.java:73)
>       at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:267)
>       at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:231)
>       at opennlp.tools.tokenize.TokenizerME.train(TokenizerME.java:293)
>       at opennlp.tools.tokenize.TokenizerTestUtil.createMaxentTokenModel(TokenizerTestUtil.java:67)
>       at opennlp.tools.tokenize.TokenizerMETest.testTokenizer(TokenizerMETest.java:54)
> ... cut
> might be confusing. 
> Due to an error in my conversion tool, I tried to train a tokenizer model on 
> data without <SPLIT>s, which resulted in a model with only one outcome. This 
> model did not pass validation in ModelUtil.validateOutcomes(), which is 
> correct. However, the error message is a bit confusing, and it took some time 
> to understand what was going on. 
> I would agree that a model with different outcomes than expected is 
> incompatible with the tool, but with fewer outcomes? Is a model with fewer 
> outcomes than expected really incompatible? For example, with the POS tagger 
> I have corpora and models which use a subset of the PTB tagset. 
> However, in the case of the tokenizer this incompatibility makes sense (a 
> model with 1 outcome does not work), and in this case the message might be 
> improved to indicate the cause better. Something like: "The maxent model is 
> not compatible with the tokenizer: outcome XXX is not found". 
> Please advise. Thank you!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
