Yes, that's a bug in the DoccatTool. Both tools should process the
OpenNLP default format, which is one document per line, whitespace tokenized.
The trainer seems to work fine; the DoccatTool needs to use the
whitespace tokenizer instead of the SimpleTokenizer.
Thanks for figuring that out!
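To illustrate why the mismatch matters, here is a minimal sketch in plain Java. It does not call OpenNLP: `whitespaceTokenize` and `charClassTokenize` are hypothetical stand-ins, the first splitting only on whitespace and the second splitting whenever the character class (letter, digit, other) changes, which is roughly the idea behind a character-class tokenizer. The exact behavior of the real classes may differ in details.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerMismatch {

    // Stand-in for a whitespace tokenizer: split on runs of whitespace only.
    public static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Character classes: 0 = whitespace, 1 = letter, 2 = digit, 3 = other.
    private static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3;
    }

    // Simplified stand-in for a character-class tokenizer (an assumption,
    // not OpenNLP's code): start a new token whenever the class changes.
    public static List<String> charClassTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;   // start index of the current token, -1 if none
        int prev = 0;     // class of the previous character
        for (int i = 0; i < text.length(); i++) {
            int cls = charClass(text.charAt(i));
            if (start >= 0 && cls != prev) {
                tokens.add(text.substring(start, i));
                start = -1;
            }
            if (start < 0 && cls != 0) {
                start = i;
            }
            prev = cls;
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "don't stop";
        System.out.println(whitespaceTokenize(doc)); // [don't, stop]
        System.out.println(charClassTokenize(doc));  // [don, ', t, stop]
    }
}
```

If the model was trained on features like "don't" but the tool produces "don", "'" and "t" at categorization time, the features no longer line up, which could explain unexpected results.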
Nicolas, would you mind opening a Jira issue?
Jörn
On 03/29/2013 03:55 AM, William Colen wrote:
In my opinion you are right. It would be safer to use the whitespace tokenizer
than the SimpleTokenizer.
However, I could not check whether DoccatTrainerTool uses the whitespace tokenizer.
Actually, the only DocumentSample provider we have today is the one that
reads the Leipzig corpus, and as far as I know it uses the SimpleTokenizer
because the entries are not tokenized.
On Thu, Mar 28, 2013 at 11:10 AM, Nicolas Hernandez <
[email protected]> wrote:
Dear all,
I have not yet traced the whole process, but because of some unexpected
doccat results I looked a bit at the code.
Can you confirm that the DoccatTrainerTool tokenizes on whitespace (when
creating the DocumentSample) while the DoccatTool uses the SimpleTokenizer?
This should not be the case. Both should use the same tokenizer, in
particular the whitespace tokenizer!
If not, which one is used?
Best regards
/Nicolas