Re: UIMA TokenizerTrainer component : the model file is not created

Nicolas Hernandez Wed, 22 Jun 2011 03:06:04 -0700

Tommaso,

Concerning the sentence boundary detection problem: after asking
Jörn, I opened the following JIRA issue [1].


Regards

/Nicolas

[1] https://issues.apache.org/jira/browse/OPENNLP-203


On Mon, Jun 20, 2011 at 11:14 AM, Tommaso Teofili
<tommaso.teof...@gmail.com> wrote:
> Hello Nicolas,
>
> 2011/6/17 Nicolas Hernandez <nicolas.hernan...@gmail.com>
>
>> Tommaso, you said you successfully used the OpenNLP UIMA trainers.
>>
>> I am currently attempting to build French models for the various tasks
>> OpenNLP can deal with. But since I am also involved in UIMA stuff, I
>> wanted to test the OpenNLP UIMA components for doing that.
>> My goal is to donate the models to the OpenNLP community (i.e. in
>> http://opennlp.sourceforge.net/models-1.5/)
>>
>> Before testing the TokenizerTrainer, I tested the SentenceDetector. I
>> found at least two problems with the UIMA component:
>> https://issues.apache.org/jira/browse/OPENNLP-197
>> One of them is not yet referenced in JIRA, but I am curious to
>> know whether you encountered it.
>>
>> I noticed that models trained with the UIMA component give wrong
>> begin/end offsets, even though they do manage to split the text into
>> sentences. Each sentence starts with the final punctuation character
>> of the previous sentence as its first token, while the previous
>> sentence does not include that character as its last token.
>>
>> Have you noticed this problem?
>>
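A quick way to check for the offset symptom described above is to scan the sentence spans and flag any sentence whose first character is a sentence-final punctuation mark. The following is only a minimal sketch: the function name, the punctuation set, and the representation of sentences as (begin, end) character offsets are assumptions made for illustration, not part of OpenNLP or UIMA.

```python
# Hypothetical helper: sentence annotations are assumed to be available
# as (begin, end) character offsets into the document text.
SENTENCE_FINAL = ".!?"


def misplaced_punctuation(text, sentence_spans):
    """Return the indices of sentences that start with a sentence-final
    punctuation character, i.e. that appear to have taken over the
    closing punctuation of the previous sentence."""
    suspicious = []
    for i, (begin, end) in enumerate(sentence_spans):
        # The first sentence has no predecessor; empty spans are skipped.
        if i > 0 and begin < end and text[begin] in SENTENCE_FINAL:
            suspicious.append(i)
    return suspicious


text = "It works. It fails."
expected_spans = [(0, 9), (10, 19)]   # punctuation closes each sentence
shifted_spans = [(0, 8), (8, 19)]     # punctuation opens the next sentence

print(misplaced_punctuation(text, expected_spans))
print(misplaced_punctuation(text, shifted_spans))
```

Running such a check over the annotations produced by a trained model would show whether the shift affects every sentence or only some of them.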
>
> I didn't notice that, but I will rerun my tests to check it out; I may have
> missed it.
> I'll let you know how it goes.
> Regards,
> Tommaso
>
>
>>
>> I think that, most of all, my problems are due to the lack of
>> documentation for the UIMA integration. I plan to write a blog post
>> about my experience. Since there is an open issue for that,
>> https://issues.apache.org/jira/browse/OPENNLP-49, if I manage to find
>> the time to write the post, I can try to write it in such a way that it
>> can also be contributed to the documentation (if you are interested).
>>
>>
>>
>> On Thu, Jun 16, 2011 at 3:52 PM, Nicolas Hernandez
>> <nicolas.hernan...@gmail.com> wrote:
>> > Hello Tommaso,
>> >
>> > after some more tests... I think I have found how to reproduce my
>> > problem.
>> >
>> > Tommaso, you're right: it works fine with the pipeline you described
>> > (i.e. the WhitespaceTokenizer followed by the token trainer,
>> > wst-tokenTrainer-AAE), but only if the input texts are formatted as
>> > 'normal' texts...
>> > I tested the pipeline with texts already formatted in a 'wst' way (one
>> > sentence per line, tokens separated by a whitespace character), and
>> > in that case it no longer works (despite the presence of the
>> > sentence and token annotations).
>> >
>> > So my guess is that on the command line the tokenTrainer needs its
>> > input in a wst format (with '<SPLIT>' tags), whereas the OpenNLP UIMA
>> > tokenTrainer needs, in some way, a 'detokenized' text.
>> >
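To make the guess above concrete, here is a minimal sketch of how a '<SPLIT>'-tagged training line relates to a text plus token offsets. It assumes tokens are given as sorted, non-overlapping (begin, end) offsets and that any gap between two tokens is whitespace; the function name is hypothetical and this is not the code of the UIMA trainer.

```python
SPLIT_TAG = "<SPLIT>"


def to_training_line(text, token_spans):
    """Render one sentence as a tokenizer training line.

    Tokens separated by whitespace in `text` are joined with a space;
    tokens that touch each other are joined with the <SPLIT> tag.
    """
    pieces = []
    prev_end = None
    for begin, end in token_spans:
        if prev_end is not None:
            # A gap between offsets means whitespace in the original text.
            pieces.append(" " if begin > prev_end else SPLIT_TAG)
        pieces.append(text[begin:end])
        prev_end = end
    return "".join(pieces)


# 'Normal' text: punctuation is attached to the preceding word.
normal = "Hello, world."
normal_tokens = [(0, 5), (5, 6), (7, 12), (12, 13)]
print(to_training_line(normal, normal_tokens))

# Text already in 'wst' form: every token is whitespace-separated.
wst = "Hello , world ."
wst_tokens = [(0, 5), (6, 7), (8, 13), (14, 15)]
print(to_training_line(wst, wst_tokens))
```

Under these assumptions the 'normal' text yields lines containing '<SPLIT>' tags, while the pre-tokenized text yields none, which would leave a trainer with no example of a split inside a whitespace-delimited chunk.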
>> > If needed, I can open a 'question' issue and attach the texts I used
>> > to reproduce the problem.
>> >
>> > /Nicolas
>> >
>> > ---------- Forwarded message ----------
>> > From: Tommaso Teofili <tommaso.teof...@gmail.com>
>> > Date: Wed, Jun 15, 2011 at 5:30 PM
>> > Subject: Re: UIMA TokenizerTrainer component : the model file is not
>> > created
>> > To: opennlp-users@incubator.apache.org, nicolas.hernan...@univ-nantes.fr
>> >
>> >
>> > Hello Nicolas,
>> > I successfully used the OpenNLP UIMA TokenizerTrainer and also the
>> > other trainers. As a simple proof, I created an aggregate analysis
>> > engine descriptor with the UIMA WhitespaceTokenizer and the OpenNLP
>> > TokenizerTrainer in a fixed flow, then used a
>> > FileSystemCollectionReader to feed the pipeline.
>> > In the TokenizerTrainer I set:
>> > <nameValuePair>
>> >   <name>opennlp.uima.TokenType</name>
>> >   <value>
>> >     <string>org.apache.uima.TokenAnnotation</string>
>> >   </value>
>> > </nameValuePair>
>> > <nameValuePair>
>> >   <name>opennlp.uima.language</name>
>> >   <value>
>> >     <string>en-US</string>
>> >   </value>
>> > </nameValuePair>
>> > <nameValuePair>
>> >   <name>opennlp.uima.ModelName</name>
>> >   <value>
>> >     <string>target/Tokens.bin</string>
>> >   </value>
>> > </nameValuePair>
>> >
>> > which then created the Tokens.bin model that I was able to test from
>> > the command line and via the API.
>> > Are you using it in a different way?
>> > Regards,
>> > Tommaso
>> >
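When a model file does not show up, it can help to confirm which settings the descriptor really carries. The sketch below reads the three name/value pairs quoted above into a dictionary. It wraps them in a `configurationParameterSettings` element without an XML namespace, which is an assumption made to keep the example short; a full UIMA descriptor declares a namespace, and the lookups would have to be adjusted accordingly. The function name is hypothetical.

```python
import os
import xml.etree.ElementTree as ET

# Fragment from the message above, wrapped in a root element so that it
# parses on its own (no namespace: an assumption for this sketch).
FRAGMENT = """
<configurationParameterSettings>
  <nameValuePair>
    <name>opennlp.uima.TokenType</name>
    <value><string>org.apache.uima.TokenAnnotation</string></value>
  </nameValuePair>
  <nameValuePair>
    <name>opennlp.uima.language</name>
    <value><string>en-US</string></value>
  </nameValuePair>
  <nameValuePair>
    <name>opennlp.uima.ModelName</name>
    <value><string>target/Tokens.bin</string></value>
  </nameValuePair>
</configurationParameterSettings>
"""


def read_settings(xml_text):
    """Map each parameter name to its string value."""
    root = ET.fromstring(xml_text.strip())
    settings = {}
    for pair in root.iter("nameValuePair"):
        name = pair.findtext("name").strip()
        value = pair.findtext("value/string").strip()
        settings[name] = value
    return settings


settings = read_settings(FRAGMENT)
model_path = settings["opennlp.uima.ModelName"]
print(model_path)
# The path is relative, so the directory it points into is worth checking.
print(os.path.dirname(model_path))
```

Since `target/Tokens.bin` is a relative path, the file would presumably be written relative to the working directory of the process running the pipeline, so that is the first place to look for it.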
>> > 2011/6/15 Nicolas Hernandez <nicolas.hernan...@gmail.com>
>> >>
>> >> Hello
>> >>
>> >> Has anyone already used the UIMA TokenizerTrainer component? I
>> >> am a bit confused since it does not create any model file.
>> >>
>> >> On stdout I got this:
>> >> Indexing events using cutoff of 5
>> >>        Computing event counts...
>> >>
>> >> done. 69669 events
>> >>        Indexing...  done.
>> >> Sorting and merging events... done. Reduced 69669 events to 16467.
>> >> Done indexing.
>> >> Incorporating indexed data for training...
>> >> done.
>> >>        Number of Event Tokens: 16467
>> >>            Number of Outcomes: 1
>> >>          Number of Predicates: 5624
>> >> ...done.
>> >> Computing model parameters...
>> >> Performing 100 iterations.
>> >>  1:  .. loglikelihood=0.0      1.0
>> >>  2:  .. loglikelihood=0.0      1.0
>> >>
>> >> This looks like a problem I had when I trained the model on the command
>> >> line without using the '<SPLIT>' tag. The difference is that on the
>> >> command line I also got the following exception:
>> >> Exception in thread "main" java.lang.IllegalArgumentException: The
>> >> maxent model is not compatible!
>> >>
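The "Number of Outcomes: 1" line in the log above is consistent with training data that contains no split examples at all. As far as I understand the tokenizer trainer, it labels each character boundary inside a whitespace-delimited chunk as split ("T") or no split ("F"); the sketch below imitates that labelling in a simplified way (it is not the OpenNLP implementation, and the function name is hypothetical) to show how untagged data yields a single outcome.

```python
SPLIT_TAG = "<SPLIT>"


def split_outcomes(line):
    """Return one 'T'/'F' label per character boundary inside each
    whitespace-delimited chunk of a training line."""
    outcomes = []
    for chunk in line.split():
        parts = chunk.split(SPLIT_TAG)
        chunk_text = "".join(parts)
        # Offsets inside the chunk at which a token ends.
        boundaries = set()
        position = 0
        for part in parts[:-1]:
            position += len(part)
            boundaries.add(position)
        for i in range(1, len(chunk_text)):
            outcomes.append("T" if i in boundaries else "F")
    return outcomes


untagged = "Hello , world ."
tagged = "Hello<SPLIT>, world<SPLIT>."

print(sorted(set(split_outcomes(untagged))))
print(sorted(set(split_outcomes(tagged))))
```

With only one outcome in the data, a classifier can assign it probability 1 everywhere, which would match the loglikelihood of 0.0 and accuracy of 1.0 reported from the first iteration.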
>> >> I solved this problem by adding the tag, as mentioned in the post
>> >> "maxent model is not compatible with Tokenizer training" (Fri, 13 May,
>> >> 09:33):
>> >> http://mail-archives.apache.org/mod_mbox/incubator-opennlp-users/201105.mbox/browser
>> >>
>> >> Does anyone know if it is the same problem? If so, how can the
>> >> '<SPLIT>' tag be specified in the UIMA version? As far as I understand
>> >> its role, it is important to give the user the possibility of setting
>> >> it.
>> >>
>> >> More generally, I am interested in feedback from anyone who has
>> >> successfully built models with the UIMA OpenNLP *Trainer
>> >> components. So far, I have also had some trouble with the SentenceTrainer,
>> >> and I have not tested the others.
>> >>
>> >> /Nicolas
>> >>
>> >>
>> >> --
>> >> nicolas.hernan...@univ-nantes.fr
>> >> #
>> >> http://enicolashernandez.blogspot.com
>> >> http://www.univ-nantes.fr/hernandez-n
>> >> #
>> >> Laboratoire LINA-TALN CNRS UMR 6241
>> >> tel. +33 (0)2 51 12 58 55
>> >> #
>> >> Université de Nantes - Institut Universitaire de Technologie -
>> >> Département Informatique
>> >> tel. +33 (0)2 40 30 60 67
>> >
>>
>



-- 
nicolas.hernan...@univ-nantes.fr
#
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
#
Laboratoire Informatique de Nantes Atlantique CNRS UMR 6241
tel. +33 (0)2 51 12 58 55
#
Université de Nantes - Institut Universitaire de Technologie -
Département Informatique
tel. +33 (0)2 40 30 60 67
