Nicolas,
After re-training the sentence detector with OpenNLP UIMA I noticed the problem; while using the command line tools I didn't notice it.
Regards,
Tommaso
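The offset problem discussed in this thread (a sentence span that starts on the final punctuation character of the previous sentence) can be checked mechanically. The following is only a minimal sketch using plain Java: the class `SentenceSpanCheck`, the method `suspiciousSpans`, and the `{begin, end}` span layout are hypothetical names chosen for illustration and are not part of OpenNLP or UIMA. It also assumes that a sentence never legitimately starts with `.`, `!` or `?`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of OpenNLP or UIMA. */
class SentenceSpanCheck {

    /**
     * Returns the indices of spans that look wrong: either malformed
     * (out of range or empty) or starting with sentence-final punctuation.
     * Each span is {begin, end} with an exclusive end offset.
     */
    static List<Integer> suspiciousSpans(String text, int[][] spans) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < spans.length; i++) {
            int begin = spans[i][0];
            int end = spans[i][1];
            if (begin < 0 || end > text.length() || begin >= end) {
                bad.add(i); // malformed span
                continue;
            }
            char first = text.charAt(begin);
            // Assumption: a sentence should not begin with '.', '!' or '?'.
            if (first == '.' || first == '!' || first == '?') {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String text = "It works. It fails.";
        // Expected offsets: "It works." is [0,9), "It fails." is [10,19).
        int[][] expected = {{0, 9}, {10, 19}};
        // Shifted offsets as described in the thread: the second span
        // swallows the '.' that the first span lost.
        int[][] shifted = {{0, 8}, {8, 19}};
        System.out.println(suspiciousSpans(text, expected)); // prints []
        System.out.println(suspiciousSpans(text, shifted));  // prints [1]
    }
}
```

Running such a check over the sentence annotations produced by a model trained through each route (UIMA component versus command line tools) would be one way to compare them.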
2011/6/22 Nicolas Hernandez <nicolas.hernan...@gmail.com>

> Tommaso,
>
> Concerning the sentence boundaries detection problem: After asking
> Jörn, I opened the following jira [1]
>
> Regards
>
> /Nicolas
>
> [1] https://issues.apache.org/jira/browse/OPENNLP-203
>
>
> On Mon, Jun 20, 2011 at 11:14 AM, Tommaso Teofili
> <tommaso.teof...@gmail.com> wrote:
> > Hello Nicolas,
> >
> > 2011/6/17 Nicolas Hernandez <nicolas.hernan...@gmail.com>
> >
> >> Tommaso you said you successfully used the OpenNLP UIMA trainers.
> >>
> >> I am currently attempting to build French models for the various tasks
> >> OpenNLP can deal with. But since I am also involved in UIMA stuff, I
> >> wanted to test the OpenNLP UIMA components for doing that.
> >> My goal is to donate the models to the OpenNLP community (i.e. in
> >> http://opennlp.sourceforge.net/models-1.5/)
> >>
> >> Before testing the tokenizerTrainer, I tested the SentenceDetector. I
> >> found at least two problems with the UIMA component
> >> https://issues.apache.org/jira/browse/OPENNLP-197
> >> One of them is not yet referenced in the jira. But I am curious to
> >> know whether you encountered it.
> >>
> >> I noted that models trained with the UIMA component give wrong
> >> begin/end offsets despite the fact they manage to split text into
> >> sentences. I observed that the current sentence begins by
> >> including, as its first token, the punctuation character of the
> >> previous one, while the previous one does not include it as its
> >> last token.
> >>
> >> Have you noticed the problem ?
> >>
> >
> > I didn't notice that but I will rerun my tests to check it out, I may
> > have missed that.
> > I'll let you know how it goes.
> > Regards,
> > Tommaso
> >
> >
> >>
> >> I think that, most of all, my problems are due to the lack of
> >> documentation for the uima integration. I plan to blog post about my
> >> experience.
> >> Since I see there is an open issue for that
> >> https://issues.apache.org/jira/browse/OPENNLP-49, if I manage to find
> >> the time to write the blog post, I can try to write it in such a way
> >> that it can also be used to contribute to the documentation (if you
> >> are interested).
> >>
> >>
> >>
> >> On Thu, Jun 16, 2011 at 3:52 PM, Nicolas Hernandez
> >> <nicolas.hernan...@gmail.com> wrote:
> >> > Hello Tommaso,
> >> >
> >> > after some more tests... I think I have found how to reproduce my
> >> > problem.
> >> >
> >> > Tommaso, you're right, it works fine with the pipeline you described
> >> > (i.e. with the WhitespaceTokenizer followed by the token trainer
> >> > (wst-tokenTrainer-AAE)) but only if the input texts are formatted as
> >> > 'normal' texts...
> >> > I tested the pipeline with texts already formatted in a 'wst' way (a
> >> > sentence per line and tokens separated by a whitespace character) and
> >> > in that case it no longer works (despite the presence of the
> >> > sentence and token annotations).
> >> >
> >> > So my guess is that on the command line the tokenTrainer needs input
> >> > in wst format (with '<SPLIT>' tags) but the opennlp uima tokenTrainer
> >> > needs (in some way) a 'detokenized' text.
> >> >
> >> > If needed, I can open a 'question' issue and attach the texts I used
> >> > to reproduce the problem.
> >> >
> >> > /Nicolas
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: Tommaso Teofili <tommaso.teof...@gmail.com>
> >> > Date: Wed, Jun 15, 2011 at 5:30 PM
> >> > Subject: Re: UIMA TokenizerTrainer component : the model file is not created
> >> > To: opennlp-users@incubator.apache.org, nicolas.hernan...@univ-nantes.fr
> >> >
> >> >
> >> > Hello Nicolas,
> >> > I successfully used the OpenNLP UIMA TokenizerTrainer and also the
> >> > other trainers. As a simple proof I created an aggregate analysis
> >> > engine descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
> >> > TokenizerTrainer in a fixed flow, then used a
> >> > FileSystemCollectionReader to feed the pipeline.
> >> > In the TokenizerTrainer I set:
> >> > <nameValuePair>
> >> >   <name>opennlp.uima.TokenType</name>
> >> >   <value>
> >> >     <string>org.apache.uima.TokenAnnotation</string>
> >> >   </value>
> >> > </nameValuePair>
> >> > <nameValuePair>
> >> >   <name>opennlp.uima.language</name>
> >> >   <value>
> >> >     <string>en-US</string>
> >> >   </value>
> >> > </nameValuePair>
> >> > <nameValuePair>
> >> >   <name>opennlp.uima.ModelName</name>
> >> >   <value>
> >> >     <string>target/Tokens.bin</string>
> >> >   </value>
> >> > </nameValuePair>
> >> >
> >> > which then created the Tokens.bin model that I was able to test from
> >> > the command line and via APIs.
> >> > Are you using it in a different way?
> >> > Regards,
> >> > Tommaso
> >> >
> >> > 2011/6/15 Nicolas Hernandez <nicolas.hernan...@gmail.com>
> >> >>
> >> >> Hello
> >> >>
> >> >> Has anyone already used the UIMA TokenizerTrainer component? I
> >> >> am a bit confused since it does not create any model file.
> >> >>
> >> >> In my stdout I got this:
> >> >> Indexing events using cutoff of 5
> >> >> Computing event counts...
> >> >>
> >> >> done. 69669 events
> >> >> Indexing... done.
> >> >> Sorting and merging events... done. Reduced 69669 events to 16467.
> >> >> Done indexing.
> >> >> Incorporating indexed data for training...
> >> >> done.
> >> >> Number of Event Tokens: 16467
> >> >> Number of Outcomes: 1
> >> >> Number of Predicates: 5624
> >> >> ...done.
> >> >> Computing model parameters...
> >> >> Performing 100 iterations.
> >> >> 1: .. loglikelihood=0.0 1.0
> >> >> 2: .. loglikelihood=0.0 1.0
> >> >>
> >> >> This looks like a problem I got when I trained the model on the
> >> >> command line without using the '<SPLIT>' tag. On the command line it
> >> >> differs, since there I also got the following exception:
> >> >> Exception in thread "main" java.lang.IllegalArgumentException: The
> >> >> maxent model is not compatible!
> >> >>
> >> >> I solved this problem by adding the tag, as mentioned in the post
> >> >> "maxent model is not compatible with Tokenizer training" (Fri, 13
> >> >> May, 09:33)
> >> >> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
> >> >>
> >> >> Does anyone know if it is the same problem? In that case, how can I
> >> >> specify the '<SPLIT>' tag in the UIMA version? As far as I understand
> >> >> its role, it is important to give the user the possibility of setting
> >> >> it.
> >> >>
> >> >> More globally, I am interested in any feedback from people who
> >> >> successfully managed to build models with the UIMA OpenNLP *Trainer
> >> >> components. For now, I also had some trouble with the SentenceTrainer
> >> >> and I have not tested the others.
> >> >>
> >> >> /Nicolas
> >> >>
> >> >>
> >> >> --
> >> >> nicolas.hernan...@univ-nantes.fr
> >> >> #
> >> >> http://enicolashernandez.blogspot.com
> >> >> http://www.univ-nantes.fr/hernandez-n
> >> >> #
> >> >> Laboratoire LINA-TALN CNRS UMR 6241
> >> >> tel. +33 (0)2 51 12 58 55
> >> >> #
> >> >> Université de Nantes - Institut Universitaire de Technologie -
> >> >> Département Informatique
> >> >> tel. +33 (0)2 40 30 60 67
> >> >
> >> >
> >>
> >
> >
> --
> nicolas.hernan...@univ-nantes.fr
> #
> http://enicolashernandez.blogspot.com
> http://www.univ-nantes.fr/hernandez-n
> #
> Laboratoire Informatique de Nantes Atlantique CNRS UMR 6241
> tel. +33 (0)2 51 12 58 55
> #
> Université de Nantes - Institut Universitaire de Technologie -
> Département Informatique
> tel. +33 (0)2 40 30 60 67
>
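For readers following the '<SPLIT>' discussion in the quoted messages: the command line tokenizer training data has one sentence per line, with tokens separated either by whitespace or by a `<SPLIT>` tag where the original text has no whitespace between them. The sketch below shows one way such a line could be derived from a raw sentence and its tokens. It is plain Java with no OpenNLP dependency; the class `SplitTagFormatter` and the method `toTrainingLine` are hypothetical names used only for illustration, and this is not how the UIMA trainer itself builds its training events.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of OpenNLP or UIMA. */
class SplitTagFormatter {

    static final String SPLIT = "<SPLIT>";

    /**
     * Builds one training line from the raw sentence and its tokens
     * (given in order). Adjacent tokens are joined with <SPLIT>,
     * whitespace-separated tokens with a single space.
     */
    static String toTrainingLine(String sentence, List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int pos = 0; // offset just after the previous token
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            int start = sentence.indexOf(tok, pos);
            if (start < 0) {
                throw new IllegalArgumentException("Token not found in order: " + tok);
            }
            String gap = sentence.substring(pos, start);
            if (!gap.trim().isEmpty()) {
                throw new IllegalArgumentException("Untokenized text before: " + tok);
            }
            if (i > 0) {
                // No whitespace in the raw text means a split point.
                out.append(gap.isEmpty() ? SPLIT : " ");
            }
            out.append(tok);
            pos = start + tok.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Hello", ",", "world", ".");
        // 'Normal' text: punctuation is attached, so split points exist.
        System.out.println(toTrainingLine("Hello, world.", tokens));
        // prints Hello<SPLIT>, world<SPLIT>.
        // Text already in 'wst' form: every token is whitespace-separated.
        System.out.println(toTrainingLine("Hello , world .", tokens));
        // prints Hello , world .
    }
}
```

The second call produces no `<SPLIT>` at all, which means the trainer would see only one kind of decision. That would be consistent with the "Number of Outcomes: 1" line in the log above and with the failure on text that was already whitespace-tokenized, although the thread does not confirm this as the cause.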