Re: Doccat : Different tokenizers for training and categorizing?

Nicolas Hernandez Fri, 12 Apr 2013 05:49:00 -0700

Hi

I have tested with the OpenNLP 1.5.3 RC 3 binary. My results have
changed and now seem more consistent with what I expected.

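For anyone who finds this thread later, here is a minimal sketch of why the
mismatch mattered. It does not call OpenNLP: the class TokenizerMismatch and
its methods whitespaceTokens, charClassTokens and unseenTokens are
hypothetical stand-ins written for this illustration, and charClassTokens is
only a simplified approximation of splitting on character-class changes. The
point is that tokens produced at categorization time may never have been seen
as features at training time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TokenizerMismatch {

    private static final int WHITESPACE = 0;
    private static final int ALPHABETIC = 1;
    private static final int NUMERIC = 2;
    private static final int OTHER = 3;

    // Stand-in for whitespace tokenization: split on runs of whitespace.
    public static String[] whitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Simplified stand-in for character-class tokenization: start a new
    // token whenever the character class changes; whitespace is dropped.
    public static String[] charClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int previousClass = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int charClass = classOf(c);
            if (charClass != previousClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (charClass != WHITESPACE) {
                current.append(c);
            }
            previousClass = charClass;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens.toArray(new String[0]);
    }

    private static int classOf(char c) {
        if (Character.isWhitespace(c)) return WHITESPACE;
        if (Character.isLetter(c)) return ALPHABETIC;
        if (Character.isDigit(c)) return NUMERIC;
        return OTHER;
    }

    // Tokens seen at categorization time that never occurred in training.
    public static Set<String> unseenTokens(String[] trainingTokens, String[] inputTokens) {
        Set<String> vocabulary = new HashSet<>(Arrays.asList(trainingTokens));
        Set<String> unseen = new LinkedHashSet<>();
        for (String token : inputTokens) {
            if (!vocabulary.contains(token)) {
                unseen.add(token);
            }
        }
        return unseen;
    }

    public static void main(String[] args) {
        String text = "state-of-the-art e-mail";
        String[] training = whitespaceTokens(text);   // tokens at training time
        String[] mismatched = charClassTokens(text);  // tokens at categorization time
        System.out.println("training tokens:       " + Arrays.toString(training));
        System.out.println("categorization tokens: " + Arrays.toString(mismatched));
        System.out.println("unseen with mismatch:  " + unseenTokens(training, mismatched));
        System.out.println("unseen when matched:   " + unseenTokens(training, whitespaceTokens(text)));
    }
}
```

With the same tokenizer on both sides, every token of a training document is
also in the vocabulary at categorization time. With different tokenizers, a
hyphenated word is a single feature on one side and several unrelated ones on
the other.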

Thanks

On Tue, Apr 2, 2013 at 10:47 PM, Jörn Kottmann <[email protected]> wrote:
> The DoccatTool now uses the WhitespaceTokenizer to tokenize
> the input text, see the issue here:
> https://issues.apache.org/jira/browse/OPENNLP-568
>
> It's fixed in trunk and will go into our next release candidate,
> please test whether that fixes your issue.
>
> Jörn
>
>
> On 03/28/2013 03:10 PM, Nicolas Hernandez wrote:
>>
>> Dear all
>>
>> I have not yet traced the whole process, but because of some unexpected
>> doccat results I had a brief look at the code.
>>
>> Can you confirm that the DoccatTrainerTool tokenizes on whitespace (when
>> creating a DocumentSample) while the DoccatTool uses the SimpleTokenizer?
>>
>> This should not be the case. Both should use the same tokenizer,
>> specifically the whitespace tokenizer.
>>
>> If not, which one is used?
>>
>> Best regards
>>
>> /Nicolas
>
>



-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
