Hello Tommaso, after some more tests... I think I have found how to reproduce my problem.
Tommaso, you are right: it works fine with the pipeline you described
(i.e. the WhitespaceTokenizer followed by the token trainer,
wst-tokenTrainer-AAE), but only if the input texts are formatted as
'normal' texts.

I tested the pipeline with texts already formatted in a 'wst' way (one
sentence per line, tokens separated by a whitespace character), and in
that case it no longer works, despite the presence of the sentence and
token annotations.

So my guess is that on the command line the tokenTrainer needs input in
the wst format (with '<SPLIT>' tags), whereas the OpenNLP UIMA
tokenTrainer needs, in some way, a 'detokenized' text.

If needed, I can open a 'question' issue and attach the texts I used to
reproduce the problem.

/Nicolas

---------- Forwarded message ----------
From: Tommaso Teofili <tommaso.teof...@gmail.com>
Date: Wed, Jun 15, 2011 at 5:30 PM
Subject: Re: UIMA TokenizerTrainer component : the model file is not created
To: opennlp-users@incubator.apache.org, nicolas.hernan...@univ-nantes.fr

Hello Nicolas,
I successfully used the OpenNLP UIMA TokenizerTrainer and also the other
trainers. As a simple proof, I created an aggregate analysis engine
descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
TokenizerTrainer in a fixed flow, then used a FileSystemCollectionReader
to feed the pipeline.
In the TokenizerTrainer I set:

    <nameValuePair>
      <name>opennlp.uima.TokenType</name>
      <value>
        <string>org.apache.uima.TokenAnnotation</string>
      </value>
    </nameValuePair>
    <nameValuePair>
      <name>opennlp.uima.language</name>
      <value>
        <string>en-US</string>
      </value>
    </nameValuePair>
    <nameValuePair>
      <name>opennlp.uima.ModelName</name>
      <value>
        <string>target/Tokens.bin</string>
      </value>
    </nameValuePair>

which then created the Tokens.bin model that I was able to test from the
command line and via APIs.
Are you using it in a different way?
Regards,
Tommaso

2011/6/15 Nicolas Hernandez <nicolas.hernan...@gmail.com>

> Hello
>
> Has anyone already used the UIMA TokenizerTrainer component? I
> am a bit confused since it does not create any model file.
>
> In my stdout I got this:
> Indexing events using cutoff of 5
> Computing event counts...
>
> done. 69669 events
> Indexing... done.
> Sorting and merging events... done. Reduced 69669 events to 16467.
> Done indexing.
> Incorporating indexed data for training...
> done.
> Number of Event Tokens: 16467
> Number of Outcomes: 1
> Number of Predicates: 5624
> ...done.
> Computing model parameters...
> Performing 100 iterations.
> 1: .. loglikelihood=0.0 1.0
> 2: .. loglikelihood=0.0 1.0
>
> This looks like a problem I had when I trained the model on the command
> line without using the '<SPLIT>' tag. The command line case differs
> in that I also got the following exception:
> Exception in thread "main" java.lang.IllegalArgumentException: The
> maxent model is not compatible!
>
> I solved this problem by adding the tag, as mentioned in the post
> "maxent model is not compatible with Tokenizer training", Fri, 13 May,
> 09:33
> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
>
> Does anyone know if it is the same problem? In that case, how can I
> specify the '<SPLIT>' tag in the UIMA version? As far as I understand
> its role, it is important to give the user the possibility of setting
> it.
>
> More generally, I am interested in any feedback from people who
> have successfully built models with the UIMA OpenNLP * Trainer
> components. For now, I have also had some trouble with the
> SentenceTrainer, and I have not tested the others.
>
> /Nicolas
>
>
> --
> nicolas.hernan...@univ-nantes.fr
> #
> http://enicolashernandez.blogspot.com
> http://www.univ-nantes.fr/hernandez-n
> #
> Laboratoire LINA-TALN CNRS UMR 6241
> tel. +33 (0)2 51 12 58 55
> #
> Université de Nantes - Institut Universitaire de Technologie -
> Département Informatique
> tel. +33 (0)2 40 30 60 67

--
nicolas.hernan...@univ-nantes.fr
#
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
#
Laboratoire LINA-TALN CNRS UMR 6241
tel. +33 (0)2 51 12 58 55
#
Université de Nantes - Institut Universitaire de Technologie -
Département Informatique
tel. +33 (0)2 40 30 60 67
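
Nicolas's guess about 'normal' versus 'wst' input can be illustrated with a small sketch. It is plain Java and does not call OpenNLP; the class SplitFormatSketch and its methods are hypothetical names made up for this illustration. It assumes, without confirmation from the thread, that a trainer learns split decisions only from token boundaries that are not already marked by whitespace. Under that assumption, text that is already whitespace-tokenized yields no split examples at all, which would be consistent with the "Number of Outcomes: 1" line in the log.

```java
/**
 * Hypothetical helper (not part of OpenNLP or UIMA): renders a text and its
 * token offsets as a command-line style training line, where two tokens that
 * touch without whitespace are joined by the <SPLIT> tag.
 */
final class SplitFormatSketch {

    static final String SPLIT_TAG = "<SPLIT>";

    /** spans[i] = {start, end} character offsets of token i, end exclusive. */
    static String toTrainingLine(String text, int[][] spans) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < spans.length; i++) {
            if (i > 0) {
                // A gap between tokens means whitespace separated them;
                // no gap means the boundary has to be marked explicitly.
                boolean gap = spans[i][0] > spans[i - 1][1];
                sb.append(gap ? " " : SPLIT_TAG);
            }
            sb.append(text, spans[i][0], spans[i][1]);
        }
        return sb.toString();
    }

    /** Counts token boundaries that are not marked by whitespace. */
    static int countSplits(int[][] spans) {
        int splits = 0;
        for (int i = 1; i < spans.length; i++) {
            if (spans[i][0] == spans[i - 1][1]) {
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // 'Normal' text: punctuation is attached to the preceding word.
        int[][] normal = {{0, 5}, {5, 6}, {7, 12}, {12, 13}};
        System.out.println(toTrainingLine("Hello, world.", normal));
        System.out.println("split boundaries: " + countSplits(normal));

        // 'wst' text: every token is already separated by a space.
        int[][] wst = {{0, 5}, {6, 7}, {8, 13}, {14, 15}};
        System.out.println(toTrainingLine("Hello , world .", wst));
        System.out.println("split boundaries: " + countSplits(wst));
    }
}
```

With the 'normal' input the sketch produces a line containing two '<SPLIT>' tags, while the 'wst' input produces none, so in the second case there would be nothing for a trainer to learn a split decision from. Whether the UIMA TokenizerTrainer really builds its training events this way would need to be checked against its source.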