On 6/15/11 4:46 PM, Nicolas Hernandez wrote:
Hello
Has anyone already used the UIMA TokenizerTrainer component? I
am a bit confused since it does not create any model file.
In my stdout I got this:
Indexing events using cutoff of 5
Computing event counts...
done. 69669 events
Indexing... done.
Sorting and merging events... done. Reduced 69669 events to 16467.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 16467
Number of Outcomes: 1
Number of Predicates: 5624
...done.
Computing model parameters...
Performing 100 iterations.
1: .. loglikelihood=0.0 1.0
2: .. loglikelihood=0.0 1.0
This looks like a problem I had when I trained the model on the command
line without using the '<SPLIT>' tag. The difference is that on the
command line I also got the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: The
maxent model is not compatible!
I solved this problem by adding the tag, as mentioned in the post
"maxent model is not compatible with Tokenizer training" (Fri, 13 May,
09:33):
http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
Does anyone know if it is the same problem? If so, how do I
specify the '<SPLIT>' tag in the UIMA version? As far as I understand
its role, it is important to give the user the possibility of setting
it.
The <SPLIT> tag is not supported by the UIMA version of the trainer; there
you simply annotate your tokens with a UIMA annotation. The training code
does not work when you annotate white space tokenized text, since it then
cannot figure out which tokens were written together and which were not.
In UIMA you usually want to work with the original text, which is usually
not white space tokenized. To track the tokens, token annotations can be
added to the CAS.
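
To make the relation between token annotations and the command line format
concrete, here is a minimal sketch in plain Java. It is not UIMA or OpenNLP
code: the Span class is a hypothetical stand-in for a token annotation
(begin/end offsets into the original text), and the class name, method name
and sample sentence are assumptions made for illustration. It only shows that
offsets on the original text carry enough information to tell which tokens
were written together.

```java
import java.util.Arrays;
import java.util.List;

class SplitFormatSketch {
    // Hypothetical stand-in for a token annotation: offsets into the original text.
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    // Rebuilds a training line from the original text and its token spans:
    // tokens that touch each other are joined with "<SPLIT>", all others with a space.
    static String toTrainingLine(String text, List<Span> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            Span t = tokens.get(i);
            if (i > 0) {
                Span prev = tokens.get(i - 1);
                sb.append(prev.end == t.begin ? "<SPLIT>" : " ");
            }
            sb.append(text, t.begin, t.end);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Hello, world!";
        List<Span> tokens = Arrays.asList(
                new Span(0, 5), new Span(5, 6), new Span(7, 12), new Span(12, 13));
        System.out.println(toTrainingLine(text, tokens)); // Hello<SPLIT>, world<SPLIT>!
    }
}
```

If the same tokens were annotated on "Hello , world !" instead, no two spans
would touch and the sketch would never emit a <SPLIT>.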
I guess in your test the serialization code failed because the model only
had one outcome. That can be considered a bug and should be fixed in some
way.
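
As a rough illustration of how "Number of Outcomes: 1" can come about, the
following plain Java sketch labels every position inside a white space
separated chunk as either a token boundary or not. This is a simplified
model, not the actual OpenNLP event generation: the labels "T" and "F", the
choice of candidate positions and all names are assumptions of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class OutcomeSketch {
    // Returns the distinct labels seen over all positions strictly inside
    // white space separated chunks: "T" if a token ends there, "F" otherwise.
    static Set<String> outcomes(String text, Set<Integer> tokenEnds) {
        Set<String> seen = new TreeSet<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            // Candidate positions inside the chunk [start, i).
            for (int p = start + 1; p < i; p++) {
                seen.add(tokenEnds.contains(p) ? "T" : "F");
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Original text: "Hello" and "," are written together.
        Set<Integer> endsOriginal = new HashSet<>(Arrays.asList(5, 6, 12, 13));
        System.out.println(outcomes("Hello, world!", endsOriginal)); // [F, T]

        // White space tokenized text: every token is its own chunk.
        Set<Integer> endsTokenized = new HashSet<>(Arrays.asList(5, 7, 13, 15));
        System.out.println(outcomes("Hello , world !", endsTokenized)); // [F]
    }
}
```

With only one label in the data there is nothing to discriminate, which would
also fit the loglikelihood=0.0 lines in the log: a model that always predicts
the single outcome with probability 1 has log-likelihood log(1) = 0.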
Jörn