In my opinion you are right. It would be safer to use the whitespace tokenizer rather than the SimpleTokenizer.
But I could not check whether DoccatTrainerTool is using the whitespace tokenizer. Actually, the only DocumentSample provider we have today is the one that reads the Leipzig corpus, and as far as I know it uses the SimpleTokenizer because the entries are not tokenized.

On Thu, Mar 28, 2013 at 11:10 AM, Nicolas Hernandez <[email protected]> wrote:

> Dear all
>
> I have not tracked the whole process yet, but because of some unexpected
> doccat results I looked a little bit at the code.
>
> Do you confirm that the DoccatTrainerTool whitespace-tokenizes (when
> creating DocumentSample) while the DoccatTool "SimpleTokenizes"?
>
> This should not be the case. Both should use the same tokenizer; in
> particular: the whitespace tokenizer!
>
> If not, which one is used?
>
> Best regards
>
> /Nicolas
>
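For reference, here is a minimal sketch of why the mismatch matters. The class name is made up for illustration, but WhitespaceTokenizer.INSTANCE and SimpleTokenizer.INSTANCE are the standard OpenNLP tokenizer entry points. If training tokenizes one way and categorization the other, the categorizer looks up features (tokens) that never occurred at training time:

    import opennlp.tools.tokenize.SimpleTokenizer;
    import opennlp.tools.tokenize.WhitespaceTokenizer;

    public class TokenizerMismatchDemo {

        public static void main(String[] args) {
            String text = "Doesn't the tokenizer choice matter for doccat?";

            // WhitespaceTokenizer splits on whitespace only, so punctuation
            // stays attached to the word: "Doesn't", "doccat?"
            String[] whitespaceTokens = WhitespaceTokenizer.INSTANCE.tokenize(text);

            // SimpleTokenizer splits on character-class changes, so punctuation
            // becomes separate tokens: "Doesn", "'", "t", ..., "doccat", "?"
            String[] simpleTokens = SimpleTokenizer.INSTANCE.tokenize(text);

            System.out.println("whitespace: " + String.join(" | ", whitespaceTokens));
            System.out.println("simple:     " + String.join(" | ", simpleTokens));
        }
    }

Whichever tokenizer we settle on, trainer and tool should call the same one.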
