Praveena B created OPENNLP-697:
----------------------------------
Summary: Tokenizer class is hardcoded in the DocumentSampleStream
class.
Key: OPENNLP-697
URL: https://issues.apache.org/jira/browse/OPENNLP-697
Project: OpenNLP
Issue Type: Bug
Components: Doccat, Tokenizer
Affects Versions: 1.6.0
Reporter: Praveena B
When training the DocumentCategorizerME, it is possible to set the
Tokenizer that the categorizer should use, e.g.
doccatFactory.setTokenizer(SemicolonTokenizer.INSTANCE);
However, the tokenizer is hardcoded to WhitespaceTokenizer in the
DocumentSampleStream class.
As a result, the default tokenizing behaviour cannot be changed, even after
a different tokenizer has been set on the doccatFactory.
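
The mismatch described above can be illustrated with a small standalone sketch. It does not use the OpenNLP classes: the Tokenizer interface, the WHITESPACE and SEMICOLON instances, and the two parse methods are hypothetical stand-ins, and the sample line format (a category followed by the document text) is an assumption made for the example. It only shows why a tokenizer configured on the factory has no effect when the sample stream picks its own.

```java
import java.util.Arrays;

// Hypothetical stand-ins for illustration only; these are NOT the OpenNLP classes.
class TokenizerSketch {

    /** Minimal tokenizer contract (assumed shape: text in, tokens out). */
    interface Tokenizer {
        String[] tokenize(String text);
    }

    // Splits on runs of whitespace.
    static final Tokenizer WHITESPACE = text -> text.trim().split("\\s+");

    // Splits on semicolons, dropping whitespace around each separator.
    static final Tokenizer SEMICOLON = text -> text.trim().split("\\s*;\\s*");

    /** Reported behaviour: the configured tokenizer is accepted but never used. */
    static String[] parseHardcoded(String line, Tokenizer configured) {
        String[] parts = line.split("\\s+", 2); // [category, document text]
        return WHITESPACE.tokenize(parts[1]);   // tokenizer choice is fixed here
    }

    /** Desired behaviour: the sample text is split with the configured tokenizer. */
    static String[] parseConfigurable(String line, Tokenizer configured) {
        String[] parts = line.split("\\s+", 2); // [category, document text]
        return configured.tokenize(parts[1]);
    }

    public static void main(String[] args) {
        String line = "sports world cup;final score;extra time";
        // prints [world, cup;final, score;extra, time]
        System.out.println(Arrays.toString(parseHardcoded(line, SEMICOLON)));
        // prints [world cup, final score, extra time]
        System.out.println(Arrays.toString(parseConfigurable(line, SEMICOLON)));
    }
}
```

One possible direction for a fix, under the same assumptions, is for the sample stream to take the tokenizer as a parameter and fall back to whitespace splitting only when none is supplied.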
--
This message was sent by Atlassian JIRA
(v6.2#6252)