In my opinion you are right. It would be safer to use the whitespace
tokenizer rather than SimpleTokenizer.

However, I have not been able to check whether DoccatTrainerTool uses the
whitespace tokenizer. The only DocumentSample provider we have today is the
one that reads the Leipzig corpus, and as far as I know it uses
SimpleTokenizer because the entries are not tokenized.
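
To illustrate why a mismatch between training and classification would
matter, here is a minimal self-contained sketch in plain Java. It does not
call the OpenNLP classes; the class TokenizerMismatch and both methods are
hypothetical names, and charClassTokens is only a rough approximation of
character-class tokenization, not SimpleTokenizer's exact behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizerMismatch {

    // Splits on runs of whitespace only; punctuation stays attached to words.
    static String[] whitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Rough approximation (an assumption, not the library code): start a new
    // token whenever the character class changes. Whitespace only separates.
    static String[] charClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int previousClass = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // 0 = whitespace, 1 = letter, 2 = digit, 3 = anything else
            int cls = Character.isWhitespace(c) ? 0
                    : Character.isLetter(c) ? 1
                    : Character.isDigit(c) ? 2 : 3;
            if (cls != previousClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != 0) {
                current.append(c);
            }
            previousClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String text = "Don't pay $10.50!";
        // Same input, two different token sequences.
        System.out.println(Arrays.toString(whitespaceTokens(text)));
        System.out.println(Arrays.toString(charClassTokens(text)));
    }
}
```

With this input the first method yields 3 tokens and the second yields 9, so
a model trained on one kind of token would see features at classification
time that it never saw during training.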



On Thu, Mar 28, 2013 at 11:10 AM, Nicolas Hernandez <
[email protected]> wrote:

> Dear all
>
> I have not tracked yet the whole process but because some unexpected
> doccat results I looked a little bit at the code.
>
> Do you confirm that the DoccatTrainerTool whitespace tokenize (by
> creating DocumentSample) while the DoccatTool "SimpleTokenize" ?
>
> This should not be the case. Both should use the same tokenizer; in
> particular : The whitespace tokenizer !
>
> If not which one is used ?
>
> Best regards
>
> /Nicolas
>
