The DoccatTool now uses the WhitespaceTokenizer to tokenize the input text; see the issue here: https://issues.apache.org/jira/browse/OPENNLP-568

It's fixed in trunk and will go into our next release candidate. Please test whether that fixes your issue.

Jörn

On 03/28/2013 03:10 PM, Nicolas Hernandez wrote:
Dear all,

I have not yet traced the whole process, but because of some unexpected doccat results I looked a little at the code. Can you confirm that the DoccatTrainerTool tokenizes on whitespace (when creating a DocumentSample) while the DoccatTool uses the SimpleTokenizer? This should not be the case. Both should use the same tokenizer, in particular the whitespace tokenizer. If not, which one is used?

Best regards,
/Nicolas
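
To illustrate why a tokenizer mismatch between training and classification can produce unexpected results, here is a minimal, self-contained sketch. It does not call OpenNLP. The two methods are hypothetical stand-ins that only approximate whitespace tokenization and character-class tokenization, and the class and method names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizerMismatch {

    // Hypothetical stand-in for whitespace tokenization:
    // tokens are separated by runs of whitespace and nothing else.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Hypothetical stand-in for character-class tokenization:
    // a new token starts whenever the character class changes
    // (letter, digit, other); whitespace only separates tokens.
    static List<String> charClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int previousClass = -1;
        for (char c : text.toCharArray()) {
            int charClass = classify(c);
            boolean boundary = charClass == 0 || charClass != previousClass;
            if (boundary && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (charClass != 0) {
                current.append(c);
            }
            previousClass = charClass;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // 0 = whitespace, 1 = letter, 2 = digit, 3 = anything else.
    static int classify(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    public static void main(String[] args) {
        String text = "don't stop";
        System.out.println(whitespaceTokens(text)); // [don't, stop]
        System.out.println(charClassTokens(text));  // [don, ', t, stop]
    }
}
```

If a model is trained on features like "don't" but the input is split into "don", "'" and "t" at classification time, those features never match what the model saw in training, which is one plausible source of the unexpected results described above.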
