Re: UIMA TokenizerTrainer component : the model file is not created

Tommaso Teofili Wed, 22 Jun 2011 03:12:07 -0700

Nicolas,
After re-training the sentence detector with OpenNLP UIMA I noticed the
problem; while using the command line tools I didn't notice it.
Regards,
Tommaso


2011/6/22 Nicolas Hernandez <nicolas.hernan...@gmail.com>

> Tommaso,
>
> Concerning the sentence boundaries detection problem: After asking
> Jörn, I opened the following jira [1]
>
> Regards
>
> /Nicolas
>
> [1] https://issues.apache.org/jira/browse/OPENNLP-203
>
>
> On Mon, Jun 20, 2011 at 11:14 AM, Tommaso Teofili
> <tommaso.teof...@gmail.com> wrote:
> > Hello Nicolas,
> >
> > 2011/6/17 Nicolas Hernandez <nicolas.hernan...@gmail.com>
> >
> >> Tommaso you said you successfully used the OpenNLP UIMA trainers.
> >>
> >> I am currently attempting to build French models for the various tasks
> >> OpenNLP can deal with. But since I am also involved in UIMA stuff, I
> >> wanted to test the OpenNLP UIMA components for doing that.
> >> My goal is to donate the models to the OpenNLP community (i.e. in
> >> http://opennlp.sourceforge.net/models-1.5/)
> >>
> >> Before testing the tokenizerTrainer, I tested the SentenceDetector. I
> >> found at least two problems with the UIMA component
> >> https://issues.apache.org/jira/browse/OPENNLP-197
> >> One of them is not yet referenced in the jira. But I am curious to
> >> know whether you encountered it.
> >>
> >> I noted that models trained with the UIMA component give wrong
> >> begin/end offsets even though they manage to split the text into
> >> sentences. I observed that the current sentence begins with the final
> >> punctuation character of the previous sentence as its first token,
> >> while the previous sentence does not include that character as its
> >> last token.
> >>
> >> Have you noticed the problem?
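To make the symptom concrete, here is a minimal sketch in plain Java. It uses no OpenNLP or UIMA classes; the class SentenceOffsetCheck and the method startsWithEndPunctuation are hypothetical names, and the example spans are my reading of the description above, not output taken from the actual component. It flags a sentence span whose covered text begins with a sentence-ending punctuation character:

```java
public class SentenceOffsetCheck {

    /**
     * Hypothetical check: returns true when the span [begin, end) of the text
     * starts with '.', '!' or '?', which is the symptom described above.
     */
    public static boolean startsWithEndPunctuation(String text, int begin, int end) {
        if (begin < 0 || end > text.length() || begin >= end) {
            return false; // empty or out-of-range span: nothing to flag
        }
        char first = text.charAt(begin);
        return first == '.' || first == '!' || first == '?';
    }

    public static void main(String[] args) {
        String text = "It works. It fails.";

        // Expected offsets: the full stop belongs to the first sentence.
        int[][] expected = { {0, 9}, {10, 19} };
        // Offsets as described in the report: the full stop is pushed
        // to the start of the second sentence.
        int[][] reported = { {0, 8}, {8, 19} };

        for (int[] span : expected) {
            System.out.println("expected [" + text.substring(span[0], span[1]) + "] flagged="
                    + startsWithEndPunctuation(text, span[0], span[1]));
        }
        for (int[] span : reported) {
            System.out.println("reported [" + text.substring(span[0], span[1]) + "] flagged="
                    + startsWithEndPunctuation(text, span[0], span[1]));
        }
    }
}
```

Running a check of this kind over the sentence annotations produced by a trained model would show quickly whether the offsets are shifted as described.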
> >>
> >
> > I didn't notice that, but I will rerun my tests to check it out; I may
> > have missed it.
> > I'll let you know how it goes.
> > Regards,
> > Tommaso
> >
> >
> >>
> >> I think that, most of all, my problems are due to the lack of
> >> documentation for the UIMA integration. I plan to write a blog post
> >> about my experience. Since I see there is an open issue for that
> >> (https://issues.apache.org/jira/browse/OPENNLP-49), if I manage to find
> >> the time to write the post, I can try to write it in such a way that it
> >> can also be contributed to the documentation (if you are interested).
> >>
> >>
> >>
> >> On Thu, Jun 16, 2011 at 3:52 PM, Nicolas Hernandez
> >> <nicolas.hernan...@gmail.com> wrote:
> >> > Hello Tommaso,
> >> >
> >> > after some more tests... I think I have found how to reproduce my
> >> > problem.
> >> >
> >> > Tommaso, you're right: it works fine with the pipeline you described
> >> > (i.e. with the WhitespaceTokenizer followed by the token trainer
> >> > (wst-tokenTrainer-AAE)) but only if the input texts are formatted as
> >> > 'normal' texts...
> >> > I tested the pipeline with texts already formatted in a 'wst' way (one
> >> > sentence per line and tokens separated by a whitespace character), and
> >> > in that case it no longer works (despite the presence of the
> >> > sentence and token annotations).
> >> >
> >> > So my guess is that on the command line the tokenTrainer needs
> >> > wst-formatted input (with '<SPLIT>' tags), whereas the OpenNLP UIMA
> >> > tokenTrainer needs (in some way) a 'detokenized' text.
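The difference between the two input styles can be sketched in plain Java. This is only an illustration under assumptions: it uses no OpenNLP or UIMA classes, the names SplitFormatSketch, Span and toTrainingLine are hypothetical, and it assumes the command line training format in which tokens are separated either by whitespace or by a '<SPLIT>' tag. It renders token spans over a raw text as one training line:

```java
import java.util.Arrays;
import java.util.List;

public class SplitFormatSketch {

    /** Hypothetical token span: [begin, end) character offsets in the raw text. */
    public static final class Span {
        public final int begin;
        public final int end;

        public Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Joins the tokens into one training line. Tokens that touch each other in
     * the raw text get an explicit <SPLIT> tag; tokens that are already
     * separated in the raw text are joined with a single space.
     */
    public static String toTrainingLine(String text, List<Span> tokens) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            Span token = tokens.get(i);
            if (i > 0) {
                Span previous = tokens.get(i - 1);
                line.append(previous.end == token.begin ? "<SPLIT>" : " ");
            }
            line.append(text, token.begin, token.end);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // 'Normal' text: punctuation is attached to the preceding word.
        String normal = "Hello, world.";
        List<Span> normalTokens = Arrays.asList(
                new Span(0, 5), new Span(5, 6), new Span(7, 12), new Span(12, 13));
        System.out.println(toTrainingLine(normal, normalTokens));

        // 'wst' text: every token is already separated by whitespace.
        String wst = "Hello , world .";
        List<Span> wstTokens = Arrays.asList(
                new Span(0, 5), new Span(6, 7), new Span(8, 13), new Span(14, 15));
        System.out.println(toTrainingLine(wst, wstTokens));
    }
}
```

With the 'normal' text the line contains '<SPLIT>' tags (Hello<SPLIT>, world<SPLIT>.). With the text that is already whitespace-tokenized, every token boundary is a whitespace and no tag is produced at all. That would be consistent with the "Number of Outcomes: 1" line in the training log quoted further down, although nothing in this thread confirms that this is the cause.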
> >> >
> >> > If needed, I can open a 'question' issue and attach the texts I used
> >> > to produce the problem.
> >> >
> >> > /Nicolas
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: Tommaso Teofili <tommaso.teof...@gmail.com>
> >> > Date: Wed, Jun 15, 2011 at 5:30 PM
> >> > Subject: Re: UIMA TokenizerTrainer component : the model file is not created
> >> > To: opennlp-users@incubator.apache.org, nicolas.hernan...@univ-nantes.fr
> >> >
> >> >
> >> > Hello Nicolas,
> >> > I successfully used the OpenNLP UIMA TokenizerTrainer and also the
> >> > other trainers. As a simple proof, I created an aggregate analysis
> >> > engine descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
> >> > TokenizerTrainer in a fixed flow, then used a
> >> > FileSystemCollectionReader to feed the pipeline.
> >> > In the TokenizerTrainer I set:
> >> > <nameValuePair>
> >> >   <name>opennlp.uima.TokenType</name>
> >> >   <value>
> >> >     <string>org.apache.uima.TokenAnnotation</string>
> >> >   </value>
> >> > </nameValuePair>
> >> > <nameValuePair>
> >> >   <name>opennlp.uima.language</name>
> >> >   <value>
> >> >     <string>en-US</string>
> >> >   </value>
> >> > </nameValuePair>
> >> > <nameValuePair>
> >> >   <name>opennlp.uima.ModelName</name>
> >> >   <value>
> >> >     <string>target/Tokens.bin</string>
> >> >   </value>
> >> > </nameValuePair>
> >> >
> >> > which then created the Tokens.bin model that I was able to test from
> >> > command line and via APIs.
> >> > Are you using it in a different way?
> >> > Regards,
> >> > Tommaso
> >> >
> >> > 2011/6/15 Nicolas Hernandez <nicolas.hernan...@gmail.com>
> >> >>
> >> >> Hello
> >> >>
> >> >> Has anyone already used the UIMA TokenizerTrainer component? I
> >> >> am a bit confused since it does not create any model file.
> >> >>
> >> >> In my stdout I got this:
> >> >> Indexing events using cutoff of 5
> >> >>        Computing event counts...
> >> >>
> >> >> done. 69669 events
> >> >>        Indexing...  done.
> >> >> Sorting and merging events... done. Reduced 69669 events to 16467.
> >> >> Done indexing.
> >> >> Incorporating indexed data for training...
> >> >> done.
> >> >>        Number of Event Tokens: 16467
> >> >>            Number of Outcomes: 1
> >> >>          Number of Predicates: 5624
> >> >> ...done.
> >> >> Computing model parameters...
> >> >> Performing 100 iterations.
> >> >>  1:  .. loglikelihood=0.0      1.0
> >> >>  2:  .. loglikelihood=0.0      1.0
> >> >>
> >> >> This looks like a problem I had when I trained the model on the
> >> >> command line without using the '<SPLIT>' tag. The difference is that
> >> >> on the command line I also got the following exception:
> >> >> Exception in thread "main" java.lang.IllegalArgumentException: The
> >> >> maxent model is not compatible!
> >> >>
> >> >> I solved this problem by adding the tag, as mentioned in the post
> >> >> "maxent model is not compatible with Tokenizer training" (Fri, 13 May,
> >> >> 09:33):
> >> >> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
> >> >>
> >> >> Does anyone know if it is the same problem? In that case, how can I
> >> >> specify the '<SPLIT>' tag in the UIMA version? As far as I understand
> >> >> its role, it is important to give the user the possibility of setting
> >> >> it.
> >> >>
> >> >> More generally, I am interested in any feedback from people who have
> >> >> successfully built models with the UIMA OpenNLP * Trainer
> >> >> components. For now, I have also had some trouble with the
> >> >> SentenceTrainer, and I have not tested the others.
> >> >>
> >> >> /Nicolas
> >> >>
> >> >>
> >> >> --
> >> >> nicolas.hernan...@univ-nantes.fr
> >> >> #
> >> >> http://enicolashernandez.blogspot.com
> >> >> http://www.univ-nantes.fr/hernandez-n
> >> >> #
> >> >> Laboratoire LINA-TALN CNRS UMR 6241
> >> >> tel. +33 (0)2 51 12 58 55
> >> >> #
> >> >> Université de Nantes - Institut Universitaire de Technologie -
> >> >> Département Informatique
> >> >> tel. +33 (0)2 40 30 60 67
> >> >
> >> >
> >>
> >>
> >
>
>
>
> --
> nicolas.hernan...@univ-nantes.fr
> #
> http://enicolashernandez.blogspot.com
> http://www.univ-nantes.fr/hernandez-n
> #
> Laboratoire Informatique de Nantes Atlantique CNRS UMR 6241
> tel. +33 (0)2 51 12 58 55
> #
> Université de Nantes - Institut Universitaire de Technologie -
> Département Informatique
> tel. +33 (0)2 40 30 60 67
>
