Re: Doccat : Different tokenizers for training and categorizing?

Jörn Kottmann Tue, 02 Apr 2013 13:47:48 -0700

The DoccatTool now uses the WhitespaceTokenizer to tokenize
the input text; see the issue here:
https://issues.apache.org/jira/browse/OPENNLP-568


It's fixed in trunk and will go into our next release candidate;
please test whether that fixes your issue.

Jörn

On 03/28/2013 03:10 PM, Nicolas Hernandez wrote:

Dear all

I have not yet traced the whole process, but because of some unexpected
doccat results I looked a little at the code.

Can you confirm that the DoccatTrainerTool whitespace-tokenizes (by
creating DocumentSample) while the DoccatTool uses the SimpleTokenizer?

This should not be the case. Both should use the same tokenizer,
specifically the whitespace tokenizer!

If not, which one is used?
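
To illustrate why such a mismatch could matter, here is a minimal sketch
in Python using only the standard library. It does not call OpenNLP: the
two functions are hypothetical stand-ins that only approximate splitting
on whitespace versus splitting whenever the character class changes, and
the real WhitespaceTokenizer and SimpleTokenizer may differ in details.
The sample sentence is made up for the illustration.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached."""
    return text.split()


def _char_class(ch):
    """Classify a character as space, alpha, digit, or other."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def simple_tokenize(text):
    """Rough approximation (an assumption, not OpenNLP's code):
    start a new token whenever the character class changes."""
    tokens = []
    current = ""
    prev = "space"
    for ch in text:
        cls = _char_class(ch)
        if cls == "space":
            # Whitespace ends the current token and is never emitted.
            if current:
                tokens.append(current)
                current = ""
        elif cls != prev and current:
            # Class changed (e.g. letter -> punctuation): cut here.
            tokens.append(current)
            current = ch
        else:
            current += ch
        prev = cls
    if current:
        tokens.append(current)
    return tokens


# Hypothetical sample text, not taken from the thread.
sample = "OpenNLP 1.5.3 isn't released, yet."

train_tokens = whitespace_tokenize(sample)   # as seen at training time
classify_tokens = simple_tokenize(sample)    # as seen at categorization time

# Tokens that both tokenizations have in common.
shared = set(train_tokens) & set(classify_tokens)
print(train_tokens)
print(classify_tokens)
print(shared)
```

With this sample, only "OpenNLP" appears in both token lists. If the
features are derived from the tokens, most features produced at
categorization time would never have been seen during training, which
could explain unexpected results.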

Best regards

/Nicolas
