[
https://issues.apache.org/jira/browse/OPENNLP-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156655#comment-13156655
]
Aliaksandr Autayeu commented on OPENNLP-402:
--------------------------------------------
I have tested it with POS tagger and Tokenizer (and with my own formats as
well, to avoid conversion, but I didn't include the support of third-party
formats in this patch). Here is the command for conllx, which I also used for
testing:
opennlp POSTaggerTrainer -model pt.bin -format conllx -data portuguese_bosque_train.conll -encoding UTF-8 -lang pt
The idea is that the parameters which belong to the training itself, such as
model, iterations, etc., come first. After that, you choose the format of your
data and specify everything that relates to the data: location (might be a
database...), encoding, language, etc. Of course, this may vary from format to
format, which is why I list the available formats and their help separately.
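To illustrate the ordering, here is a minimal sketch of splitting the command
line at -format. It is not taken from the patch: the class and method names
are hypothetical, and it assumes that everything after the format name belongs
to the format.

```java
import java.util.Arrays;

// Hypothetical illustration only; the patch may organize this differently.
class FormatArgSplitSketch {

    /**
     * Splits the arguments at "-format".
     * Returns { toolArgs, { formatName }, formatArgs }.
     * If "-format" is absent or has no value, all arguments are treated as
     * tool arguments and the other two arrays are empty.
     */
    static String[][] splitAtFormat(String[] args) {
        int i = Arrays.asList(args).indexOf("-format");
        if (i < 0 || i + 1 >= args.length) {
            return new String[][] { args, new String[0], new String[0] };
        }
        String[] toolArgs = Arrays.copyOfRange(args, 0, i);
        String[] formatName = { args[i + 1] };
        String[] formatArgs = Arrays.copyOfRange(args, i + 2, args.length);
        return new String[][] { toolArgs, formatName, formatArgs };
    }

    public static void main(String[] args) {
        // The arguments of the conllx command quoted above.
        String[] cmd = { "-model", "pt.bin", "-format", "conllx",
                         "-data", "portuguese_bosque_train.conll",
                         "-encoding", "UTF-8", "-lang", "pt" };
        String[][] parts = splitAtFormat(cmd);
        System.out.println("tool args:   " + Arrays.toString(parts[0]));
        System.out.println("format:      " + Arrays.toString(parts[1]));
        System.out.println("format args: " + Arrays.toString(parts[2]));
    }
}
```

With a split like this, the tool only sees its own parameters and the rest can
be handed to whatever handles the chosen format.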
I was thinking about opennlp as a default format, but didn't find an elegant
solution which applies well to all tools. As far as I know, the changes
compared to the default case are:
1) -format opennlp
2) putting format-related parameters after the format (so they are passed to
the factory)
One of the reasons why I did not go for full backward compatibility (although
it can be done) and did not make the opennlp format the default one is that I
saw a trend toward separating parameters anyway: iterations and cutoffs being
grouped, support for the parameter file... So there are already groups of
parameters which are logically different and even validated in different
places. Having an explicit separator between them will, for example, make
users aware of these groups and might ease understanding of possible
parameter values and error messages.
====
One disadvantage of this feature is, that it is easy to make mistakes which
cannot be noticed,
e.g. encoding issues, problems in the format, etc.
===
Can you elaborate exactly how this is a disadvantage of having direct format
support? I agree that conversion might reveal something (see below), but that's
the advantage of the conversion, not the disadvantage of having direct format
support.
====
That may sound a bit strange, but some bugs in the UIMA
training integration came only to my attention after I added a feature to dump
the training data to a file in the opennlp format.
====
Here I fully understand and agree with you, it is aligned with my own
experience. Format conversion itself often reveals some bugs in the data, or in
the parser, whether on the source or the target side. When possible, I even did
a round-trip conversion for the datasets I have just to discover some special
cases in the data.
===
Do you know a cli tool which has a similar problem?
===
I'm sorry, I didn't get you here. Which problem are you referring to?
===
We need to adjust the possible arguments based on the format the user is
choosing.
===
Yes, that's true, and the factory for the format checks the parameters and
prints an error message. I incorporated your propagation of
validate....Loudly here. Did I misunderstand something?
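As a rough sketch of what per-format checking could look like (this is an
illustration, not the code from the patch): the class name, the table of
allowed parameters and the message texts are all assumptions, and every
parameter is assumed to take exactly one value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; names and messages are not from OpenNLP.
class FormatParamCheckSketch {

    // Assumed table of parameters per format, based on the conllx example.
    static final Map<String, Set<String>> ALLOWED =
            Map.of("conllx", Set.of("-data", "-encoding", "-lang"));

    /** Returns a list of error messages; an empty list means the arguments are fine. */
    static List<String> validate(String format, String[] formatArgs) {
        List<String> errors = new ArrayList<>();
        Set<String> allowed = ALLOWED.get(format);
        if (allowed == null) {
            errors.add("Unknown format: " + format);
            return errors;
        }
        // Assumption: arguments come in "-name value" pairs.
        for (int i = 0; i < formatArgs.length; i += 2) {
            String name = formatArgs[i];
            if (!allowed.contains(name)) {
                errors.add("Parameter " + name + " is not supported by format " + format);
            } else if (i + 1 >= formatArgs.length) {
                errors.add("Missing value for " + name);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // -iterations is a training parameter, so the format rejects it here.
        for (String e : validate("conllx", new String[] { "-data", "x.conll", "-iterations", "100" })) {
            System.err.println(e);
        }
    }
}
```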
===
Just some idea, but couldn't we do it like this:
bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ...
===
In general, implicit format choice removes the need for one extra argument.
And although I like it and thought of it, the problem is that sometimes it is
not possible to rename the files: they are on a CD-ROM, or on a read-only
network share, or in a database which has a JDBC URL.
Oops, now I get you on the last point. I left the paragraph above in, since it
might illustrate some of my thoughts.
===
bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ...
===
Yes, adding the format to the tool name is an option. It'll change the parser a
bit and will introduce two different types of parameters. But if the goal is to
maximize backward compatibility (which was broken anyway with the renaming
of -model-type to -type) then it might be considered.
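For illustration, the parser change could be as small as the sketch below.
The class and method names are hypothetical, and falling back to "opennlp"
when no suffix is given is an assumption about how the default would work.

```java
// Hypothetical illustration only; not part of the patch.
class ToolNameSketch {

    /**
     * Splits "Tool.format" into { tool, format }.
     * Without a suffix (or with an empty one), the format falls back
     * to "opennlp".
     */
    static String[] parse(String toolArg) {
        int dot = toolArg.indexOf('.');
        if (dot < 0 || dot == toolArg.length() - 1) {
            String tool = dot < 0 ? toolArg : toolArg.substring(0, dot);
            return new String[] { tool, "opennlp" };
        }
        return new String[] { toolArg.substring(0, dot), toolArg.substring(dot + 1) };
    }

    public static void main(String[] args) {
        String[] parsed = parse("TokenNameFinderCrossValidator.bionlp2004");
        System.out.println("tool=" + parsed[0] + ", format=" + parsed[1]);
    }
}
```

The remaining arguments would then still need to be divided between the tool
and the format, which is where the two different types of parameters come in.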
Let me know if the patch needs elaboration.
Also, if you have ideas on how to ease the integration of the formats
(checking, validation), to make it easier to work with them - I'm interested.
But yes, in general there are converters for debugging or data screening.
> CLI tools and formats refactored
> --------------------------------
>
> Key: OPENNLP-402
> URL: https://issues.apache.org/jira/browse/OPENNLP-402
> Project: OpenNLP
> Issue Type: Improvement
> Components: Command Line Interface, Formats
> Affects Versions: tools-1.5.3-incubating
> Reporter: Aliaksandr Autayeu
> Labels: patch
> Attachments: 0016-CLI-tools-and-formats-refactored.patch
>
>
> Proposed patch refactors CLI tools and simplifies the code by introducing
> hierarchy and removing a lot of code duplication. It also introduces better
> error and help messages, including help for formats and listing available
> formats in various tools, which are now able to work with formats directly.
> This, in turn, eliminates the need to keep converted files on disk.