[ 
https://issues.apache.org/jira/browse/OPENNLP-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156419#comment-13156419
 ] 

Joern Kottmann commented on OPENNLP-402:
----------------------------------------

How does this new format support work?

Let's say I am training the name finder; this is how I do it:
bin/opennlp TokenNameFinderCrossValidator -lang x -data x.train -featuregen xyz.xml -params xyz.txt

After applying your patch I get this error and help message:
Error: Missing mandatory parameter: -format
Usage: opennlp TokenNameFinderCrossValidator [-resources resourcesDir] [-type 
modelType] [-featuregen featuregenFile] [-params paramsFile] [-iterations num] 
[-cutoff num] [-misclassified true|false] [-folds num] [-detailedF true|false] 
-format formatName

Arguments description:
    -resources resourcesDir
        The resources directory
    -type modelType
        The type of the token name finder model
    -featuregen featuregenFile
        The feature generator descriptor file
    -params paramsFile
        Training parameters file.
    -iterations num
        specifies the number of training iterations. It is ignored if a 
parameters file is passed.
    -cutoff num
        specifies the min number of times a feature must be seen. It is ignored 
if a parameters file is passed.
    -misclassified true|false
        if true will print false negatives and false positives
    -folds num
        The number of folds. Default is 10
    -detailedF true|false
        if true will print detailed FMeasure results
    -format formatName
        the format of the data, for example conllx, defaults to opennlp. Format 
might have its own parameters.

opennlp format usage: -lang language -data sampleData [-encoding charsetName]

Arguments description:
    -lang language
        specifies the language which is being processed.
    -data sampleData
        the data to be used
    -encoding charsetName
        specifies the encoding which should be used for reading and writing 
text. If not specified the system default will be used.

bionlp2004 format usage: -types DNA,protein,cell_type,cell_line,RNA -lang 
language -data sampleData [-encoding charsetName]

Arguments description:
    -types DNA,protein,cell_type,cell_line,RNA
    -lang language
        specifies the language which is being processed.
    -data sampleData
        the data to be used
    -encoding charsetName
        specifies the encoding which should be used for reading and writing 
text. If not specified the system default will be used.

conll03 format usage: -lang en|de -types per,loc,org,misc -data sampleData 
[-encoding charsetName]

Arguments description:
    -lang en|de
    -types per,loc,org,misc
    -data sampleData
        the data to be used
    -encoding charsetName
        specifies the encoding which should be used for reading and writing 
text. If not specified the system default will be used.

conll02 format usage: -lang es|nl -types per,loc,org,misc -data sampleData 
[-encoding charsetName]

Arguments description:
    -lang es|nl
    -types per,loc,org,misc
    -data sampleData
        the data to be used
    -encoding charsetName
        specifies the encoding which should be used for reading and writing 
text. If not specified the system default will be used.

ad format usage: -lang language -data sampleData [-encoding charsetName]

-----------

Specifying -format didn't help. Do the arguments need to be in a certain 
order? And shouldn't it default to opennlp?

I don't mind having support for a format directly, but it should not make the 
default case difficult to use.
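
To illustrate the fallback asked for here, a minimal sketch in plain Java, 
independent of the patch: if -format is absent, the tool acts as if 
-format opennlp had been given, and the position of the argument does not 
matter. The names FormatDefault and resolveFormat are hypothetical and are 
not taken from the OpenNLP CLI code.

```java
public class FormatDefault {

    // Assumed default, matching the help text ("defaults to opennlp").
    public static final String DEFAULT_FORMAT = "opennlp";

    // Returns the value that follows "-format" anywhere in the argument list.
    // Falls back to the default when the flag is missing or has no value after it.
    public static String resolveFormat(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-format".equals(args[i])) {
                return args[i + 1];
            }
        }
        return DEFAULT_FORMAT;
    }

    public static void main(String[] args) {
        // Old-style invocation without -format.
        String[] oldStyle = {"-lang", "x", "-data", "x.train"};
        // Invocation that names a format in the middle of the argument list.
        String[] newStyle = {"-data", "x.train", "-format", "conll03", "-lang", "en"};

        System.out.println(resolveFormat(oldStyle));
        System.out.println(resolveFormat(newStyle));
    }
}
```

With a fallback like this, the old command line keeps working unchanged, and 
only users who want another format need to pass -format.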

One disadvantage of this feature is that it is easy to make mistakes that 
go unnoticed, e.g. encoding issues or problems in the format. That may 
sound a bit strange, but some bugs in the UIMA training integration only 
came to my attention after I added a feature to dump the training data to a 
file in the opennlp format.
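
A dump of this kind can be sketched as follows. This is only an illustration 
in plain Java with no OpenNLP dependency: the class NameSampleDump, the 
NameSpan holder and the method toOpenNlpLine are hypothetical. The only thing 
taken from OpenNLP is the name finder training format itself, where a name is 
wrapped as <START:type> ... <END> and tokens are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NameSampleDump {

    // Hypothetical span holder: tokens in [start, end) carry the given type.
    public static final class NameSpan {
        final int start;
        final int end;
        final String type;

        public NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Renders one sentence as a single line of name finder training data.
    // Assumes the spans are sorted by start and do not overlap.
    public static String toOpenNlpLine(String[] tokens, List<NameSpan> spans) {
        List<String> out = new ArrayList<>();
        int next = 0; // index of the next span that has not been closed yet
        for (int i = 0; i < tokens.length; i++) {
            if (next < spans.size() && spans.get(next).start == i) {
                out.add("<START:" + spans.get(next).type + ">");
            }
            out.add(tokens[i]);
            if (next < spans.size() && spans.get(next).end == i + 1) {
                out.add("<END>");
                next++;
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String[] tokens = {"Pierre", "Vinken", ",", "61", "years", "old"};
        List<NameSpan> spans = Arrays.asList(new NameSpan(0, 2, "person"));
        // prints: <START:person> Pierre Vinken <END> , 61 years old
        System.out.println(toOpenNlpLine(tokens, spans));
    }
}
```

Writing such lines to a file makes it possible to inspect exactly what the 
trainer sees, which is where encoding or tokenization mistakes tend to show up.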

Anyway, we still have the converters for debugging, or for the case where a 
user just wants to see the training data.
                
> CLI tools and formats refactored
> --------------------------------
>
>                 Key: OPENNLP-402
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-402
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Command Line Interface, Formats
>    Affects Versions: tools-1.5.3-incubating
>            Reporter: Aliaksandr Autayeu
>              Labels: patch
>         Attachments: 0016-CLI-tools-and-formats-refactored.patch
>
>
> Proposed patch refactors CLI tools and simplifies the code by introducing 
> hierarchy and removing a lot of code duplication. It also introduces better 
> error and help messages, including help for formats and listing available 
> formats in various tools, which are now able to work with formats directly. 
> This, in turn, eliminates the need to keep converted files on disk.


        
