[ 
https://issues.apache.org/jira/browse/OPENNLP-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989813#comment-12989813
 ] 

Jörn Kottmann commented on OPENNLP-33:
--------------------------------------

There are a few questions inside the attached document.

1. The maxent jar is still necessary, since it contains all the maxent 
classes. These are mainly used by DoccatModel to serialize the embedded 
maxent binary model and by DocumentCategorizerME to perform training and 
categorization.

2. The training format is one document per line: the first token is the 
category, and all remaining whitespace-separated tokens are the document 
tokens. The DocumentSample constructor also expects whitespace-tokenized 
input text.

3. Most of the parsing code you describe already exists in 
DocumentSampleStream, which can parse the format described above.
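
To make the format from point 2 concrete, here is a minimal sketch of how one 
training line could be split into a category and its document tokens. It is 
plain Java with no OpenNLP dependency. The names DoccatLineParser, Sample and 
parse are hypothetical and chosen for illustration only; this is not the 
actual DocumentSampleStream code, and the choice to reject empty lines is an 
assumption.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: illustrates the
// "category token token ..." training line format described above.
final class DoccatLineParser {

    // Simple holder for one parsed training line.
    static final class Sample {
        final String category;
        final String[] tokens;

        Sample(String category, String[] tokens) {
            this.category = category;
            this.tokens = tokens;
        }
    }

    // Splits a line on whitespace: the first token is the category,
    // all remaining tokens are the document tokens.
    static Sample parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            // Assumption: this sketch rejects empty lines.
            throw new IllegalArgumentException("empty training line");
        }
        String[] parts = trimmed.split("\\s+");
        String category = parts[0];
        // A line holding only a category yields an empty token array.
        String[] tokens = Arrays.copyOfRange(parts, 1, parts.length);
        return new Sample(category, tokens);
    }
}
```

For a line such as "sports the team won the match", parse would return the 
category "sports" and the five tokens that follow it.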

> Write documentation for the document categorizer component
> ----------------------------------------------------------
>
>                 Key: OPENNLP-33
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-33
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Jörn Kottmann
>         Attachments: doccat_documentation.rtf
>
>
> Write initial documentation for the document categorizer component.
> The issue is migrated from SourceForge:
> https://sourceforge.net/tracker/?func=detail&aid=3028436&group_id=3368&atid=103368

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

