[
https://issues.apache.org/jira/browse/OPENNLP-543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487425#comment-13487425
]
James Kosin commented on OPENNLP-543:
-------------------------------------
Hmm...
I thought the Corpus Server was going to generate the corpus data in the
correct formats? Was I wrong, or is this still a work in progress?
If so, it would be better to write up a document on the OpenNLP formats and get
the Corpus Server to implement them.
Marc, the POS and NER models rely on at least the sentence detector and the
tokenizer having been run first. So, it is best to get familiar with those two
before jumping into the NER or POS models. You can use the same training data
for all four if you like.
The sentence detector only requires that each complete sentence start on a new
line.
The tokenizer requires a <SPLIT> between the tokens.
e.g.: "Wow!" ==>becomes==> " <SPLIT> Wow <SPLIT> ! <SPLIT> " <==in the training
file.
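As a rough illustration of the two formats described above, here is a minimal
Python sketch. It is not part of OpenNLP; the function names and the naive
"every punctuation character is its own token" rule are assumptions made only
for this example. It also follows the convention, as I understand it from the
OpenNLP manual, that tokens already separated by whitespace need no tag and
<SPLIT> is only inserted where two tokens touch, so it emits no extra spaces
around the tag.

```python
import re

SPLIT = "<SPLIT>"

def to_sentence_training_text(sentences):
    """Sentence detector data: one complete sentence per line."""
    return "\n".join(s.strip() for s in sentences) + "\n"

def to_tokenizer_training_line(sentence):
    """Tokenizer data: mark token boundaries that have no whitespace.

    Hypothetical rule for illustration: a token is either a run of word
    characters or a single punctuation character.
    """
    out = []
    prev_end = None
    for m in re.finditer(r"\w+|[^\w\s]", sentence):
        if prev_end is not None:
            # Adjacent tokens get <SPLIT>; tokens already separated by
            # whitespace are joined with a single space.
            out.append(SPLIT if m.start() == prev_end else " ")
        out.append(m.group())
        prev_end = m.end()
    return "".join(out)

if __name__ == "__main__":
    print(to_sentence_training_text(["Wow!", "It works."]), end="")
    print(to_tokenizer_training_line('"Wow!"'))
```

Real training data would normally come from a hand-corrected corpus rather
than a regular expression; the sketch is only meant to show what a line in
each file looks like.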
> Documentation of OpenNLP Training Format
> ----------------------------------------
>
> Key: OPENNLP-543
> URL: https://issues.apache.org/jira/browse/OPENNLP-543
> Project: OpenNLP
> Issue Type: Bug
> Reporter: Marc Schreiber
>
> Is there any documentation about the training formats which OpenNLP supports?
> I'm working on a project where we need our own models because the project
> concentrates on specific domains. It would be really great if there were any
> help for building your own models.
> If there is no documentation, I would offer my help in creating it, but I
> need someone to help me with the training formats.