Re: Doccat : Different tokenizers for training and categorizing?

Jörn Kottmann Tue, 02 Apr 2013 01:42:17 -0700

Yes, that's a bug in the DoccatTool. Both tools should process the OpenNLP default format,
which is one document per line, whitespace tokenized.

The trainer seems to work fine; the DoccatTool needs to use the whitespace tokenizer instead
of the SimpleTokenizer. Thanks for figuring that out!
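
To illustrate why the mismatch matters, here is a minimal, self-contained sketch. It does not call OpenNLP; the two methods are hypothetical stand-ins that only approximate the behaviour of a whitespace tokenizer and of a character-class tokenizer such as the SimpleTokenizer, and the class name, method names, and sample text are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizerMismatch {

    private static final int WHITESPACE = 0;
    private static final int LETTER = 1;
    private static final int DIGIT = 2;
    private static final int OTHER = 3;

    // Stand-in for whitespace tokenization: split on runs of whitespace only.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    private static int classify(char c) {
        if (Character.isWhitespace(c)) return WHITESPACE;
        if (Character.isLetter(c)) return LETTER;
        if (Character.isDigit(c)) return DIGIT;
        return OTHER;
    }

    // Simplified stand-in for character-class tokenization: runs of letters
    // and runs of digits form tokens; every other non-whitespace character
    // becomes its own token.
    static List<String> charClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentClass = WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int cls = classify(c);
            // Close the pending token when the class changes, and always
            // before a punctuation-like character.
            if (current.length() > 0 && (cls != currentClass || cls == OTHER)) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != WHITESPACE) {
                current.append(c);
            }
            currentClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String document = "state-of-the-art U.S. results";
        System.out.println("training side:     " + whitespaceTokens(document));
        System.out.println("categorizing side: " + charClassTokens(document));
    }
}
```

With this sample text the whitespace split yields 3 tokens and the character-class split yields 12, so a model trained on features like "state-of-the-art" would never see that token at categorization time.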


Nicolas, would you mind opening a Jira issue?

Jörn

On 03/29/2013 03:55 AM, William Colen wrote:

In my opinion you are right. It would be safer to use the whitespace
tokenizer than the SimpleTokenizer.

But I could not check whether DoccatTrainerTool is using the whitespace
tokenizer. Actually, the only DocumentSample provider we have today is the
one that reads the Leipzig corpus, and as far as I know it uses the
SimpleTokenizer because the entries are not tokenized.



On Thu, Mar 28, 2013 at 11:10 AM, Nicolas Hernandez <
[email protected]> wrote:

Dear all

I have not yet traced the whole process, but because of some unexpected
doccat results I looked a little at the code.

Can you confirm that the DoccatTrainerTool whitespace-tokenizes (when
creating a DocumentSample) while the DoccatTool uses the SimpleTokenizer?

This should not be the case. Both should use the same tokenizer, in
particular the whitespace tokenizer!

If not, which one is used?

Best regards

/Nicolas
