Tommaso,

Concerning the sentence boundary detection problem: after asking Jörn, I opened the following JIRA issue [1].
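To make the symptom from the quoted message below concrete, here is a minimal sketch of a check for it. It is plain Python and not OpenNLP or UIMA code; the helper name, the terminator set and the sample offsets are assumptions chosen for illustration only. Given a text and sentence (begin, end) offsets, it flags every sentence whose covered text starts with a sentence terminator.

```python
# Hypothetical helper, not part of OpenNLP: flags sentence spans whose
# covered text starts with sentence-final punctuation.
TERMINATORS = ".!?"  # assumed terminator set, adjust per language


def misaligned_spans(text, spans):
    """Return the (begin, end) spans whose text begins with a terminator."""
    bad = []
    for begin, end in spans:
        covered = text[begin:end].lstrip()
        if covered and covered[0] in TERMINATORS:
            bad.append((begin, end))
    return bad


text = "Il pleut. Il neige."
# Made-up offsets showing the reported symptom: the first span stops before
# its final period, and the second span starts on that period.
observed = [(0, 8), (8, 18)]
# Offsets one would expect instead.
expected = [(0, 9), (10, 19)]

print(misaligned_spans(text, observed))
print(misaligned_spans(text, expected))
```

Running such a check over the sentence annotations produced by a trained model would show quickly whether every boundary is shifted or only some of them.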
Regards
/Nicolas

[1] https://issues.apache.org/jira/browse/OPENNLP-203

On Mon, Jun 20, 2011 at 11:14 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> Hello Nicolas,
>
> 2011/6/17 Nicolas Hernandez <nicolas.hernan...@gmail.com>
>
>> Tommaso you said you successfully used the OpenNLP UIMA trainers.
>>
>> I am currently attempting to build French models for the various tasks
>> OpenNLP can deal with. But since I am also involved in UIMA stuff, I
>> wanted to test the OpenNLP UIMA components for doing that.
>> My goal is to donate the models to the OpenNLP community (i.e. in
>> http://opennlp.sourceforge.net/models-1.5/)
>>
>> Before testing the tokenizerTrainer, I tested the SentenceDetector. I
>> found at least two problems with the UIMA component
>> https://issues.apache.org/jira/browse/OPENNLP-197
>> One of them is not yet referenced in the jira. But I am curious to
>> know whether you encountered it.
>>
>> I noted that models trained with the UIMA component give wrong
>> begin/end offsets even though they manage to split text into
>> sentences. I observed that the current sentence begins with the
>> punctuation character of the previous one as its first token, while
>> the previous one does not include it as its last one.
>>
>> Have you noticed the problem ?
>>
>
> I didn't notice that but I will rerun my tests to check it out, I may have
> missed that.
> I'll let you know how it goes.
> Regards,
> Tommaso
>
>
>>
>> I think that, most of all, my problems are due to the lack of
>> documentation for the uima integration. I plan to write a blog post
>> about my experience. Since I see there is an open issue for that
>> https://issues.apache.org/jira/browse/OPENNLP-49, if I manage to find
>> the time to write the post, I can try to write it in such a way that
>> it can also be contributed to the documentation (if you are
>> interested).
>>
>>
>>
>> On Thu, Jun 16, 2011 at 3:52 PM, Nicolas Hernandez
>> <nicolas.hernan...@gmail.com> wrote:
>> > Hello Tommaso,
>> >
>> > after some more tests... I think I have found how to reproduce my
>> > problem.
>> >
>> > Tommaso, you're right, it works fine with the pipeline you described
>> > (i.e. with the WhitespaceTokenizer followed by the token trainer
>> > (wst-tokenTrainer-AAE)) but only if the input texts are formatted as
>> > 'normal' texts...
>> > I tested the pipeline with texts already formatted in a 'wst' way (a
>> > sentence per line and tokens separated by a whitespace character) and
>> > like that it does not work any longer (despite the presence of the
>> > sentence and token annotations).
>> >
>> > So my guess is that in command line the tokenTrainer needs a wst
>> > format as input (with '<SPLIT>' tags) but the opennlp uima
>> > tokenTrainer needs (in some way) a 'detokenized' text.
>> >
>> > If needed, I can open a 'question' issue and attach the texts I used
>> > to produce the problem.
>> >
>> > /Nicolas
>> >
>> > ---------- Forwarded message ----------
>> > From: Tommaso Teofili <tommaso.teof...@gmail.com>
>> > Date: Wed, Jun 15, 2011 at 5:30 PM
>> > Subject: Re: UIMA TokenizerTrainer component : the model file is not created
>> > To: opennlp-users@incubator.apache.org, nicolas.hernan...@univ-nantes.fr
>> >
>> >
>> > Hello Nicolas,
>> > I successfully used the OpenNLP UIMA TokenizerTrainer and also the
>> > other trainers, for a simple proof I created an aggregate analysis
>> > engine descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
>> > TokenizerTrainer in a fixed flow, then used a
>> > FileSystemCollectionReader to feed the pipeline.
>> > In the TokenizerTrainer I set:
>> > <nameValuePair>
>> >   <name>opennlp.uima.TokenType</name>
>> >   <value>
>> >     <string>org.apache.uima.TokenAnnotation</string>
>> >   </value>
>> > </nameValuePair>
>> > <nameValuePair>
>> >   <name>opennlp.uima.language</name>
>> >   <value>
>> >     <string>en-US</string>
>> >   </value>
>> > </nameValuePair>
>> > <nameValuePair>
>> >   <name>opennlp.uima.ModelName</name>
>> >   <value>
>> >     <string>target/Tokens.bin</string>
>> >   </value>
>> > </nameValuePair>
>> >
>> > which then created the Tokens.bin model that I was able to test from
>> > command line and via APIs.
>> > Are you using it in a different way?
>> > Regards,
>> > Tommaso
>> >
>> > 2011/6/15 Nicolas Hernandez <nicolas.hernan...@gmail.com>
>> >>
>> >> Hello
>> >>
>> >> Has anyone already used the UIMA TokenizerTrainer component ? I
>> >> am a bit confused since it does not create any model file.
>> >>
>> >> In my stdout I got this :
>> >> Indexing events using cutoff of 5
>> >> Computing event counts...
>> >>
>> >> done. 69669 events
>> >> Indexing... done.
>> >> Sorting and merging events... done. Reduced 69669 events to 16467.
>> >> Done indexing.
>> >> Incorporating indexed data for training...
>> >> done.
>> >> Number of Event Tokens: 16467
>> >> Number of Outcomes: 1
>> >> Number of Predicates: 5624
>> >> ...done.
>> >> Computing model parameters...
>> >> Performing 100 iterations.
>> >> 1: .. loglikelihood=0.0 1.0
>> >> 2: .. loglikelihood=0.0 1.0
>> >>
>> >> This looks like a problem I got when I trained the model in command
>> >> line without using the '<SPLIT>' tag. The command line case differs
>> >> in that I also got the following exception
>> >> Exception in thread "main" java.lang.IllegalArgumentException: The maxent model is not compatible!
>> >>
>> >> I solved this problem by adding the tag as it is mentioned in the post
>> >> "maxent model is not compatible with Tokenizer training" (Fri, 13 May,
>> >> 09:33)
>> >> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
>> >>
>> >> Does anyone know if it is the same problem ? In that case, how to
>> >> specify the '<SPLIT>' tag in the UIMA version? As far as I understand
>> >> its role, it is important to give the user the possibility of setting
>> >> it.
>> >>
>> >> More globally, I am interested in any feedback from people who
>> >> successfully managed to build models with the UIMA OpenNLP * Trainer
>> >> components. For now, I also got some trouble with the SentenceTrainer
>> >> and I have not tested the others.
>> >>
>> >> /Nicolas
>> >>
>> >>
>> >> --
>> >> nicolas.hernan...@univ-nantes.fr
>> >> #
>> >> http://enicolashernandez.blogspot.com
>> >> http://www.univ-nantes.fr/hernandez-n
>> >> #
>> >> Laboratoire LINA-TALN CNRS UMR 6241
>> >> tel. +33 (0)2 51 12 58 55
>> >> #
>> >> Université de Nantes - Institut Universitaire de Technologie -
>> >> Département Informatique
>> >> tel. +33 (0)2 40 30 60 67
>> >
>>
>

--
nicolas.hernan...@univ-nantes.fr
#
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
#
Laboratoire Informatique de Nantes Atlantique CNRS UMR 6241
tel. +33 (0)2 51 12 58 55
#
Université de Nantes - Institut Universitaire de Technologie -
Département Informatique
tel. +33 (0)2 40 30 60 67
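
A note on the '<SPLIT>' format discussed in the quoted thread: in the command line training data, tokens are separated either by whitespace or by the '<SPLIT>' tag where the original text has no whitespace between them. The following is a minimal sketch of producing such a line from a text and its token offsets. It is plain Python, not the OpenNLP or UIMA implementation; the function names and the sample sentence are hypothetical.

```python
import re

SPLIT = "<SPLIT>"


def to_split_format(text, token_spans):
    """Join tokens with a space, or with <SPLIT> when they touch in the text."""
    parts = []
    prev_end = None
    for begin, end in token_spans:
        if prev_end is not None:
            # Adjacent tokens (no gap in the source text) get the tag.
            parts.append(SPLIT if begin == prev_end else " ")
        parts.append(text[begin:end])
        prev_end = end
    return "".join(parts)


def whitespace_spans(text):
    """Token offsets obtained by splitting on whitespace only."""
    return [m.span() for m in re.finditer(r"\S+", text)]


# 'Normal' text with hand-written token offsets.
sentence = "Bonjour, le monde."
spans = [(0, 7), (7, 8), (9, 11), (12, 17), (17, 18)]
print(to_split_format(sentence, spans))

# Text already formatted one token per whitespace-separated chunk.
pretokenized = "Bonjour , le monde ."
print(to_split_format(pretokenized, whitespace_spans(pretokenized)))
```

In the second case no two tokens touch, so no tag is emitted at all. That would be consistent with the guess in the quoted thread that already tokenized input gives the trainer nothing to learn a split decision from, but it remains a guess about the UIMA component.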