[
https://issues.apache.org/jira/browse/OPENNLP-33?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989813#comment-12989813
]
Jörn Kottmann commented on OPENNLP-33:
--------------------------------------
There are a few questions inside the attached document.
1. The maxent jar is still necessary because it contains the maxent classes,
which are used mainly by DoccatModel to serialize the embedded maxent binary
model and by DocumentCategorizerME to perform training and categorization.
2. The training format is one document per line: the first token is the
category, and all other whitespace-separated tokens are document tokens. The
DocumentSample constructor also expects whitespace-tokenized input text.
3. The parsing code you describe is mostly already in DocumentSampleStream,
which can parse the format described above.
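
To make the training format from point 2 concrete, here is a minimal sketch that splits one line into a category and its document tokens. It uses only the JDK and is not the DocumentSampleStream code. The names DoccatLineParser, ParsedSample, and parseLine are hypothetical and exist only for illustration. Rejecting lines that carry no document tokens is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of OpenNLP.
class DoccatLineParser {

    // Hypothetical holder for one parsed training line.
    static final class ParsedSample {
        final String category;
        final List<String> tokens;

        ParsedSample(String category, List<String> tokens) {
            this.category = category;
            this.tokens = Collections.unmodifiableList(tokens);
        }
    }

    // Parses "category token1 token2 ..." where fields are separated by whitespace.
    static ParsedSample parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("Empty line");
        }
        String[] parts = trimmed.split("\\s+");
        // Assumption of this sketch: a line needs a category plus at least one token.
        if (parts.length < 2) {
            throw new IllegalArgumentException("Line has a category but no document tokens");
        }
        String category = parts[0];
        List<String> tokens = new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
        return new ParsedSample(category, tokens);
    }

    public static void main(String[] args) {
        // One document per line: the first token is the category.
        ParsedSample sample = parseLine("sports the team won the final match");
        System.out.println(sample.category + " -> " + sample.tokens);
    }
}
```

In a real setup the resulting category and tokens would be handed to a DocumentSample, or the whole file would be read through DocumentSampleStream instead of a hand-written parser.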
> Write documentation for the document categorizer component
> ----------------------------------------------------------
>
> Key: OPENNLP-33
> URL: https://issues.apache.org/jira/browse/OPENNLP-33
> Project: OpenNLP
> Issue Type: Improvement
> Components: Documentation
> Reporter: Jörn Kottmann
> Attachments: doccat_documentation.rtf
>
>
> Write initial documentation for the document categorizer component.
> The issue is migrated from SourceForge:
> https://sourceforge.net/tracker/?func=detail&aid=3028436&group_id=3368&atid=103368