[
https://issues.apache.org/jira/browse/OPENNLP-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156419#comment-13156419
]
Joern Kottmann commented on OPENNLP-402:
----------------------------------------
How does this new format support work?
Let's say I am training the name finder; this is how I do it:
bin/opennlp TokenNameFinderCrossValidator -lang x -data x.train -featuregen xyz.xml -params xyz.txt
After applying your patch I get this error and help message:
Error: Missing mandatory parameter: -format
Usage: opennlp TokenNameFinderCrossValidator [-resources resourcesDir] [-type
modelType] [-featuregen featuregenFile] [-params paramsFile] [-iterations num]
[-cutoff num] [-misclassified true|false] [-folds num] [-detailedF true|false]
-format formatName
Arguments description:
-resources resourcesDir
The resources directory
-type modelType
The type of the token name finder model
-featuregen featuregenFile
The feature generator descriptor file
-params paramsFile
Training parameters file.
-iterations num
specifies the number of training iterations. It is ignored if a
parameters file is passed.
-cutoff num
specifies the min number of times a feature must be seen. It is ignored
if a parameters file is passed.
-misclassified true|false
if true will print false negatives and false positives
-folds num
The number of folds. Default is 10
-detailedF true|false
if true will print detailed FMeasure results
-format formatName
the format of the data, for example conllx, defaults to opennlp. Format
might have its own parameters.
opennlp format usage: -lang language -data sampleData [-encoding charsetName]
Arguments description:
-lang language
specifies the language which is being processed.
-data sampleData
the data to be used
-encoding charsetName
specifies the encoding which should be used for reading and writing
text. If not specified the system default will be used.
bionlp2004 format usage: -types DNA,protein,cell_type,cell_line,RNA -lang
language -data sampleData [-encoding charsetName]
Arguments description:
-types DNA,protein,cell_type,cell_line,RNA
-lang language
specifies the language which is being processed.
-data sampleData
the data to be used
-encoding charsetName
specifies the encoding which should be used for reading and writing
text. If not specified the system default will be used.
conll03 format usage: -lang en|de -types per,loc,org,misc -data sampleData
[-encoding charsetName]
Arguments description:
-lang en|de
-types per,loc,org,misc
-data sampleData
the data to be used
-encoding charsetName
specifies the encoding which should be used for reading and writing
text. If not specified the system default will be used.
conll02 format usage: -lang es|nl -types per,loc,org,misc -data sampleData
[-encoding charsetName]
Arguments description:
-lang es|nl
-types per,loc,org,misc
-data sampleData
the data to be used
-encoding charsetName
specifies the encoding which should be used for reading and writing
text. If not specified the system default will be used.
ad format usage: -lang language -data sampleData [-encoding charsetName]
-----------
Specifying -format didn't help. Do the arguments need to be in a certain order?
And shouldn't -format default to opennlp, as the help text above says?
I don't mind having support for a format directly, but it should not make the
default case difficult to use.
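
To illustrate the behaviour I would expect, here is a minimal sketch of
order-independent lookup of -format with a fallback to opennlp. It is not the
code from the patch or from the OpenNLP CLI; the class FormatArgSketch and the
method resolveFormat are hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of OpenNLP. */
public class FormatArgSketch {

    /** Assumed default, taken from the help text ("defaults to opennlp"). */
    public static final String DEFAULT_FORMAT = "opennlp";

    /**
     * Returns the value that follows -format, wherever the flag appears,
     * or the default format when the flag is absent.
     */
    public static String resolveFormat(String[] args) {
        List<String> list = Arrays.asList(args);
        int i = list.indexOf("-format");
        if (i < 0) {
            return DEFAULT_FORMAT; // flag omitted: keep the default case easy
        }
        if (i + 1 >= args.length) {
            throw new IllegalArgumentException("-format requires a value");
        }
        return args[i + 1];
    }

    public static void main(String[] args) {
        String[] noFormat = {"-lang", "x", "-data", "x.train"};
        String[] withFormat = {"-data", "x.train", "-format", "conll03", "-lang", "en"};
        System.out.println(resolveFormat(noFormat));   // prints "opennlp"
        System.out.println(resolveFormat(withFormat)); // prints "conll03"
    }
}
```

With a lookup like this, the command line shown earlier would run unchanged,
and -format could be given at any position.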
One disadvantage of this feature is that it makes mistakes easy to miss,
e.g. encoding issues, problems in the format, etc. That may sound a bit
strange, but some bugs in the UIMA training integration only came to my
attention after I added a feature to dump the training data to a file in the
opennlp format.
Anyway, we still have the converters for debugging, or for the case where a
user just wants to see the training data.
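
As a rough illustration of what such a dump looks like, the sketch below
renders one tokenized sentence with a single name span as a line in the
opennlp name finder format (<START:type> ... <END>). It is a standalone
example and not the UIMA integration code; the class NameSampleDumpSketch and
its render method are hypothetical, and it handles only one span per sentence.

```java
import java.util.StringJoiner;

/** Hypothetical helper for illustration only; not part of OpenNLP. */
public class NameSampleDumpSketch {

    /**
     * Renders the tokens as one space-separated line, wrapping the tokens
     * in [start, end) with <START:type> and <END> markers.
     */
    public static String render(String[] tokens, int start, int end, String type) {
        StringJoiner line = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (i == end) {
                line.add("<END>"); // close the span before the next token
            }
            if (i == start) {
                line.add("<START:" + type + ">");
            }
            line.add(tokens[i]);
        }
        if (end == tokens.length) {
            line.add("<END>"); // span runs to the end of the sentence
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Pierre", "Vinken", ",", "61", "years", "old"};
        // prints "<START:person> Pierre Vinken <END> , 61 years old"
        System.out.println(render(tokens, 0, 2, "person"));
    }
}
```

Writing one such line per sentence to a file makes the data that the trainer
actually receives easy to inspect by eye.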
> CLI tools and formats refactored
> --------------------------------
>
> Key: OPENNLP-402
> URL: https://issues.apache.org/jira/browse/OPENNLP-402
> Project: OpenNLP
> Issue Type: Improvement
> Components: Command Line Interface, Formats
> Affects Versions: tools-1.5.3-incubating
> Reporter: Aliaksandr Autayeu
> Labels: patch
> Attachments: 0016-CLI-tools-and-formats-refactored.patch
>
>
> Proposed patch refactors CLI tools and simplifies the code by introducing
> hierarchy and removing a lot of code duplication. It also introduces better
> error and help messages, including help for formats and listing available
> formats in various tools, which are now able to work with formats directly.
> This, in turn, eliminates the need to keep converted files on disk.