[jira] [Commented] (OPENNLP-402) CLI tools and formats refactored

Aliaksandr Autayeu (Commented) (JIRA) Thu, 24 Nov 2011 04:23:06 -0800

    [ 
https://issues.apache.org/jira/browse/OPENNLP-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156655#comment-13156655
 ]


Aliaksandr Autayeu commented on OPENNLP-402:
--------------------------------------------

I have tested it with the POS tagger and the Tokenizer (and also with my own 
formats, to avoid conversion, although I didn't include support for 
third-party formats in this patch). Here is the command for conllx, which I 
also used for testing:

opennlp POSTaggerTrainer -model pt.bin -format conllx -data portuguese_bosque_train.conll -encoding UTF-8 -lang pt

The idea is that the parameters belonging to the training itself (model, 
iterations, etc.) come first. Then you choose the format of your data and 
specify everything related to the data: location (which might be a 
database...), encoding, language, etc. Of course, this may vary from format to 
format, which is why I list the available formats and their help separately.
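
To illustrate the ordering described above, here is a minimal sketch of how an 
argument list could be split at -format. It is not the code from the patch: the 
class FormatArgSplitter and its fields are hypothetical names made up for this 
example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): splits CLI arguments at "-format". */
final class FormatArgSplitter {

    /** Arguments before "-format", e.g. training parameters such as -model. */
    final List<String> toolArgs;
    /** The name following "-format", or null when the flag is absent. */
    final String format;
    /** Arguments after the format name, e.g. -data, -encoding, -lang. */
    final List<String> formatArgs;

    private FormatArgSplitter(List<String> toolArgs, String format, List<String> formatArgs) {
        this.toolArgs = toolArgs;
        this.format = format;
        this.formatArgs = formatArgs;
    }

    static FormatArgSplitter split(String[] args) {
        List<String> all = Arrays.asList(args);
        int idx = all.indexOf("-format");
        if (idx < 0) {
            // No explicit format: everything is treated as a tool argument.
            return new FormatArgSplitter(all, null, Collections.<String>emptyList());
        }
        if (idx + 1 >= all.size()) {
            throw new IllegalArgumentException("-format requires a format name");
        }
        return new FormatArgSplitter(
                all.subList(0, idx),
                all.get(idx + 1),
                all.subList(idx + 2, all.size()));
    }
}
```

With the conllx command above, toolArgs would hold "-model pt.bin", format 
would be "conllx", and formatArgs would hold the -data, -encoding and -lang 
pairs.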

I was thinking about making opennlp the default format, but didn't find an 
elegant solution that applies well to all tools. As far as I know, the changes 
to the default case are:
1) adding -format opennlp
2) putting format-related parameters after the format (so they are passed to 
the factory)

One of the reasons why I did not go for full backward compatibility (although 
it can be done) and did not make the opennlp format the default is that I saw a 
trend toward separating parameters anyway: iterations and cutoffs being 
grouped, support for the parameter file... So there are already groups of 
parameters which are logically different and even validated in different 
places. Having an explicit separator between them will, for example, make users 
aware of these groups and might ease understanding of possible parameter values 
and error messages.

====
One disadvantage of this feature is, that it is easy to make mistakes which 
cannot be noticed, 
e.g. encoding issues, problems in the format, etc. 
====
Can you elaborate on exactly how this is a disadvantage of having direct format 
support? I agree that conversion might reveal something (see below), but that's 
an advantage of the conversion, not a disadvantage of having direct format 
support.

====
That may sound a bit strange, but some bugs in the UIMA 
training integration came only to my attention after I added a feature to dump 
the training data to a file in the opennlp format. 
====
Here I fully understand and agree with you; it is in line with my own 
experience. Format conversion itself often reveals bugs in the data or in the 
parser, whether on the source or the target side. When possible, I even did a 
round-trip conversion for the datasets I have, just to discover special cases 
in the data.


===
Do you know a cli tool which has a similar problem? 
===
I'm sorry, I didn't get you here. Which problem are you referring to?

===
We need to adjust the possible arguments based on the format the user is 
choosing.
===
Yes, that's true, and the factory for the format checks the parameters and 
prints an error message. I incorporated your propagation of 
validate....Loudly here. Did I misunderstand something?
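
As a rough sketch of what per-format parameter checking could look like (this 
is not the code from the patch): the class FormatParamValidator is a 
hypothetical name, and the allowed set for conllx is only an assumption taken 
from the example command in this comment, so it may be incomplete.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical validator (not part of OpenNLP): checks flags against a per-format list. */
final class FormatParamValidator {

    // Assumption: flags taken from the conllx example command; the real set may differ.
    private static final Map<String, Set<String>> ALLOWED = new HashMap<>();
    static {
        ALLOWED.put("conllx", new HashSet<>(Arrays.asList("-data", "-encoding", "-lang")));
    }

    /** Returns null when the arguments are acceptable, otherwise an error message. */
    static String validate(String format, List<String> formatArgs) {
        Set<String> allowed = ALLOWED.get(format);
        if (allowed == null) {
            return "Unknown format '" + format + "'; available formats: "
                    + new TreeSet<>(ALLOWED.keySet());
        }
        // This sketch expects every flag to be followed by exactly one value.
        if (formatArgs.size() % 2 != 0) {
            return "Every parameter of format '" + format + "' needs a value";
        }
        for (int i = 0; i < formatArgs.size(); i += 2) {
            String flag = formatArgs.get(i);
            if (!allowed.contains(flag)) {
                return "Parameter " + flag + " is not supported by format '" + format
                        + "'; supported: " + new TreeSet<>(allowed);
            }
        }
        return null;
    }
}
```

The point of the sketch is that a training parameter placed after the format 
name is reported against that format, so the error message can name the group 
the parameter was checked in.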

===
Just some idea, but couldn't we do it like this: 
bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ... 
===
In general, implicit format choice removes the need for one extra argument. 
Although I like it and thought of it, the problem is that sometimes it is not 
possible to rename the files: they are on a CD-ROM, on a read-only network 
share, or in a database which has a JDBC URL.

Oops, now I get you on the last point. I left the paragraph above in, as it 
might illustrate some of my thoughts.
===
bin/opennlp TokenNameFinderCrossValidator.bionlp2004 -data ...
===
Yes, adding the format to the tool name is an option. It will change the parser 
a bit and will introduce two different types of parameters. But if the goal is 
to maximize backward compatibility (which was broken anyway by the renaming 
of -model-type to -type), then it might be considered.
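
For comparison, a minimal sketch of the tool-name variant: the first argument 
is split at the dot into a tool and a format. The class ToolSpec is a 
hypothetical name, and falling back to "opennlp" when no suffix is given is an 
assumption based on the default-format idea discussed earlier in this comment.

```java
/** Hypothetical helper (not part of OpenNLP): parses "Tool.format" into its two parts. */
final class ToolSpec {

    final String tool;
    final String format;

    private ToolSpec(String tool, String format) {
        this.tool = tool;
        this.format = format;
    }

    static ToolSpec parse(String name) {
        int dot = name.indexOf('.');
        if (dot < 0) {
            // Assumption: no suffix means the native opennlp format.
            return new ToolSpec(name, "opennlp");
        }
        if (dot == 0 || dot == name.length() - 1) {
            throw new IllegalArgumentException("Expected Tool or Tool.format, got: " + name);
        }
        return new ToolSpec(name.substring(0, dot), name.substring(dot + 1));
    }
}
```

Parsing "TokenNameFinderCrossValidator.bionlp2004" this way yields the tool 
"TokenNameFinderCrossValidator" and the format "bionlp2004"; the remaining 
arguments would then mix training and data parameters without a separator.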

Let me know if the patch needs elaboration.

Also, if you have ideas on how to ease the integration of the formats 
(checking, validation), to make it easier to work with them - I'm interested. 
But yes, in general there are converters for debugging or data screening.
                
> CLI tools and formats refactored
> --------------------------------
>
>                 Key: OPENNLP-402
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-402
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Command Line Interface, Formats
>    Affects Versions: tools-1.5.3-incubating
>            Reporter: Aliaksandr Autayeu
>              Labels: patch
>         Attachments: 0016-CLI-tools-and-formats-refactored.patch
>
>
> Proposed patch refactors CLI tools and simplifies the code by introducing 
> hierarchy and removing a lot of code duplication. It also introduces better 
> error and help messages, including help for formats and listing available 
> formats in various tools, which are now able to work with formats directly. 
> This, in turn, eliminates the need to keep converted files on disk.

