Re: UIMA TokenizerTrainer component : the model file is not created

Nicolas Hernandez Thu, 16 Jun 2011 06:53:23 -0700

Hello Tommaso,

After some more tests, I think I have found how to reproduce my problem.


Tommaso, you're right: it works fine with the pipeline you described
(i.e. the WhitespaceTokenizer followed by the token trainer
(wst-tokenTrainer-AAE)), but only if the input texts are formatted as
'normal' texts.
I also tested the pipeline with texts already formatted in a 'wst' way
(one sentence per line, tokens separated by a whitespace character),
and in that case it no longer works (despite the presence of the
sentence and token annotations).
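
To illustrate what I suspect is happening, here is a minimal Python
sketch (not OpenNLP code). It assumes the trainer only creates
split / no-split events at character positions strictly inside
whitespace-delimited chunks; that is my reading of the behaviour, not
something confirmed in this thread. The function name and the example
offsets are made up for illustration.

```python
def count_outcomes(text, token_spans):
    """Count split / no-split decisions inside whitespace-delimited chunks.

    token_spans is a list of (begin, end) character offsets, one per token.
    ASSUMPTION: candidate boundaries are only the positions strictly inside
    a chunk, which is how I understand the tokenizer training to work.
    """
    token_ends = {end for _, end in token_spans}
    counts = {"split": 0, "no-split": 0}
    pos = 0
    for chunk in text.split():
        start = text.index(chunk, pos)
        pos = start + len(chunk)
        # Every position strictly inside the chunk is one training event.
        for i in range(start + 1, pos):
            counts["split" if i in token_ends else "no-split"] += 1
    return counts


# 'Normal' text: punctuation is attached to the preceding word.
normal = count_outcomes("Hello, world.", [(0, 5), (5, 6), (7, 12), (12, 13)])

# 'wst' text: every token is already separated by a space.
wst = count_outcomes("Hello , world .", [(0, 5), (6, 7), (8, 13), (14, 15)])
```

Under that assumption the 'normal' text yields both outcomes, while the
'wst' text yields only no-split events, which would be consistent with
the "Number of Outcomes: 1" line in my log.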

So my guess is that on the command line the token trainer needs its
input in the wst format (with '<SPLIT>' tags), whereas the OpenNLP UIMA
TokenizerTrainer needs, in some way, 'detokenized' text.
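
For reference, this is roughly how a command-line training line with
'<SPLIT>' tags could be derived from 'detokenized' text plus token
offsets. It is only a sketch in Python: the helper name and the sample
offsets are hypothetical, and it is not how the UIMA component itself
is implemented.

```python
SPLIT = "<SPLIT>"


def to_training_line(text, token_spans):
    """Render one sentence as whitespace-separated tokens with <SPLIT> tags.

    token_spans is a list of (begin, end) character offsets, one per token.
    """
    parts = []
    prev_end = None
    for begin, end in sorted(token_spans):
        if prev_end is not None:
            # Tokens that touch each other get an explicit tag;
            # tokens separated by whitespace get a single space.
            parts.append(SPLIT if begin == prev_end else " ")
        parts.append(text[begin:end])
        prev_end = end
    return "".join(parts)


line = to_training_line("Hello, world.", [(0, 5), (5, 6), (7, 12), (12, 13)])
# line is 'Hello<SPLIT>, world<SPLIT>.'
```

With text that is already whitespace-tokenized, the same function never
emits a tag, which is the case that fails for me.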

If needed, I can open a 'question' issue and attach the texts I used
to reproduce the problem.

/Nicolas

---------- Forwarded message ----------
From: Tommaso Teofili <tommaso.teof...@gmail.com>
Date: Wed, Jun 15, 2011 at 5:30 PM
Subject: Re: UIMA TokenizerTrainer component : the model file is not created
To: opennlp-users@incubator.apache.org, nicolas.hernan...@univ-nantes.fr


Hello Nicolas,
I have successfully used the OpenNLP UIMA TokenizerTrainer and also the
other trainers. As a simple proof, I created an aggregate analysis
engine descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
TokenizerTrainer in a fixed flow, then used a
FileSystemCollectionReader to feed the pipeline.
In the TokenizerTrainer I set:
<nameValuePair>
  <name>opennlp.uima.TokenType</name>
  <value>
    <string>org.apache.uima.TokenAnnotation</string>
  </value>
</nameValuePair>
<nameValuePair>
  <name>opennlp.uima.language</name>
  <value>
    <string>en-US</string>
  </value>
</nameValuePair>
<nameValuePair>
  <name>opennlp.uima.ModelName</name>
  <value>
    <string>target/Tokens.bin</string>
  </value>
</nameValuePair>

which then created the Tokens.bin model that I was able to test from
the command line and via the APIs.
Are you using it in a different way?
Regards,
Tommaso

2011/6/15 Nicolas Hernandez <nicolas.hernan...@gmail.com>
>
> Hello
>
> Has anyone already used the UIMA TokenizerTrainer component? I
> am a bit confused since it does not create any model file.
>
> On stdout I got this:
> Indexing events using cutoff of 5
>        Computing event counts...
>
> done. 69669 events
>        Indexing...  done.
> Sorting and merging events... done. Reduced 69669 events to 16467.
> Done indexing.
> Incorporating indexed data for training...
> done.
>        Number of Event Tokens: 16467
>            Number of Outcomes: 1
>          Number of Predicates: 5624
> ...done.
> Computing model parameters...
> Performing 100 iterations.
>  1:  .. loglikelihood=0.0      1.0
>  2:  .. loglikelihood=0.0      1.0
>
> This looks like a problem I had when I trained the model on the
> command line without using the '<SPLIT>' tag. The difference is that
> on the command line I also got the following exception:
> Exception in thread "main" java.lang.IllegalArgumentException: The
> maxent model is not compatible!
>
> I solved this problem by adding the tag, as mentioned in the post
> "maxent model is not compatible with Tokenizer training"
> (Fri, 13 May, 09:33):
>  http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
>
> Does anyone know if it is the same problem? In that case, how can I
> specify the '<SPLIT>' tag in the UIMA version? As far as I understand
> its role, it is important to let the user set it.
>
> More generally, I am interested in any feedback from people who have
> successfully built models with the UIMA OpenNLP * Trainer
> components. For now, I have also had some trouble with the
> SentenceTrainer, and I have not tested the others.
>
> /Nicolas
>
>
> --
> nicolas.hernan...@univ-nantes.fr
> #
> http://enicolashernandez.blogspot.com
> http://www.univ-nantes.fr/hernandez-n
> #
> Laboratoire LINA-TALN CNRS UMR 6241
> tel. +33 (0)2 51 12 58 55
> #
> Université de Nantes - Institut Universitaire de Technologie -
> Département Informatique
> tel. +33 (0)2 40 30 60 67




