On 6/15/11 4:46 PM, Nicolas Hernandez wrote:
Hello

Has anyone already used the UIMA TokenizerTrainer component? I
am a bit confused since it does not create any model file.

In my stdout I got this:
Indexing events using cutoff of 5
        Computing event counts...

done. 69669 events
        Indexing...  done.
Sorting and merging events... done. Reduced 69669 events to 16467.
Done indexing.
Incorporating indexed data for training...
done.
        Number of Event Tokens: 16467
            Number of Outcomes: 1
          Number of Predicates: 5624
...done.
Computing model parameters...
Performing 100 iterations.
   1:  .. loglikelihood=0.0     1.0
   2:  .. loglikelihood=0.0     1.0

This looks like a problem I ran into when I trained the model on the
command line without using the '<SPLIT>' tag. It differed slightly
there, since on the command line I also got the following exception:

Exception in thread "main" java.lang.IllegalArgumentException: The
maxent model is not compatible!

I solved that problem by adding the tag, as mentioned in the post
"maxent model is not compatible with Tokenizer training" (Fri, 13 May,
09:33):

http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
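
(For reference: the command-line tokenizer training data is one sentence per line, with <SPLIT> inserted wherever two tokens touch without intervening whitespace. A made-up training line, in the style shown in the OpenNLP manual, might look like:)

```
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.
```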

Does anyone know if it is the same problem? If so, how do you
specify the '<SPLIT>' tag in the UIMA version? As far as I understand
its role, it is important to give the user the possibility of setting
it.
The <SPLIT> tag is not supported by the UIMA trainer; there you simply annotate your tokens with a UIMA annotation. The training code does not work when you annotate whitespace-tokenized text, since then it cannot
figure out which tokens were written together and which were not.

In UIMA you usually want to work with the original text, which is typically not whitespace tokenized. To track the tokens, token annotations can be added to
the CAS.
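
To make the difference concrete, here is a minimal, hypothetical sketch (plain Java, no UIMA dependency; the class and method names are made up) of how offset-based token annotations relate to the <SPLIT> notation: adjacent tokens whose offsets touch with no whitespace in between correspond to a <SPLIT> marker in the command-line training format.

```java
import java.util.List;

public class SplitFormat {
    // Given the original text and token (begin, end) offsets — as a UIMA
    // token annotation would carry them — emit one OpenNLP-style training
    // line. Tokens separated by whitespace get a plain space between them;
    // tokens that touch directly are joined with "<SPLIT>".
    static String toTrainingLine(String text, List<int[]> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            int[] t = tokens.get(i);
            if (i > 0) {
                int prevEnd = tokens.get(i - 1)[1];
                // Gap between offsets means whitespace in the original text
                sb.append(prevEnd < t[0] ? " " : "<SPLIT>");
            }
            sb.append(text, t[0], t[1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "He said \"hello\".";
        // Token offsets into the original (non-tokenized) text
        List<int[]> tokens = List.of(
            new int[]{0, 2},   // He
            new int[]{3, 7},   // said
            new int[]{8, 9},   // "
            new int[]{9, 14},  // hello
            new int[]{14, 15}, // "
            new int[]{15, 16}  // .
        );
        System.out.println(toTrainingLine(text, tokens));
        // -> He said "<SPLIT>hello<SPLIT>"<SPLIT>.
    }
}
```

This is only meant to illustrate why the UIMA trainer needs the original, untokenized text plus offsets: from whitespace-tokenized text alone, the gap-versus-no-gap distinction is lost.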

I guess in your test the serialization code failed because the model only had one
outcome; that can be considered a bug and should be fixed in some way.

Jörn
