Hi,

I have tested with the binary of OpenNLP 1.5.3 RC 3. My results have changed and now seem more consistent with what I expected.
Thanks

On Tue, Apr 2, 2013 at 10:47 PM, Jörn Kottmann <[email protected]> wrote:
> The DoccatTool now uses the WhitespaceTokenizer to tokenize
> the input text, see the issue here:
> https://issues.apache.org/jira/browse/OPENNLP-568
>
> It's fixed in trunk and will go into our next release candidate,
> please test if that fixes your issue.
>
> Jörn
>
>
> On 03/28/2013 03:10 PM, Nicolas Hernandez wrote:
>>
>> Dear all
>>
>> I have not yet traced the whole process, but because of some unexpected
>> doccat results I looked a little bit at the code.
>>
>> Can you confirm that the DoccatTrainerTool whitespace-tokenizes (by
>> creating DocumentSample) while the DoccatTool uses the SimpleTokenizer?
>>
>> This should not be the case. Both should use the same tokenizer; in
>> particular: the whitespace tokenizer!
>>
>> If not, which one is used?
>>
>> Best regards
>>
>> /Nicolas
>
>

-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
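
PS: For anyone reading this thread later, here is a minimal sketch of why the tokenizer mismatch discussed above can change doccat results. It does not use OpenNLP at all: the class TokenizerMismatchDemo and both methods are hypothetical stand-ins written only for illustration. whitespaceTokens splits on whitespace, and charClassTokens is a simplified approximation of a tokenizer that splits whenever the character class (letter, digit, other) changes. It is not the actual SimpleTokenizer code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from OpenNLP.
public class TokenizerMismatchDemo {

    private static final int WHITESPACE = 0;
    private static final int LETTER = 1;
    private static final int DIGIT = 2;
    private static final int OTHER = 3;

    // Split on runs of whitespace only; punctuation stays attached to words.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Assumed approximation of a character-class tokenizer:
    // start a new token whenever the character class changes.
    static List<String> charClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int previousClass = -1;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int cls = classify(c);
            if (cls != previousClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != WHITESPACE) {
                current.append(c);
            }
            previousClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    private static int classify(char c) {
        if (Character.isWhitespace(c)) return WHITESPACE;
        if (Character.isLetter(c)) return LETTER;
        if (Character.isDigit(c)) return DIGIT;
        return OTHER;
    }

    public static void main(String[] args) {
        String text = "don't stop";
        // Features seen at training time (whitespace tokenization).
        System.out.println(whitespaceTokens(text));
        // Features seen at classification time if a different tokenizer is used.
        System.out.println(charClassTokens(text));
    }
}
```

With whitespace tokenization "don't" stays a single token, while the character-class variant breaks it into three. A model trained on the first kind of token never sees the second kind as known features, which would explain results that look incoherent until both tools use the same tokenizer.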
