Praveena B created OPENNLP-697:
----------------------------------

             Summary: Tokenizer class is hardcoded in the DocumentSampleStream class.
                 Key: OPENNLP-697
                 URL: https://issues.apache.org/jira/browse/OPENNLP-697
             Project: OpenNLP
          Issue Type: Bug
          Components: Doccat, Tokenizer
    Affects Versions: 1.6.0
            Reporter: Praveena B


When training the DocumentCategorizerME, it is possible to set the type of 
Tokenizer that the categorizer should use, e.g.
doccatFactory.setTokenizer(SemicolonTokenizer.INSTANCE);

However, the Tokenizer is hardcoded to WhitespaceTokenizer in the 
DocumentSampleStream class, so the training samples are always tokenized on 
whitespace. As a result, the default tokenizing behaviour cannot be changed, 
even after a different Tokenizer has been set in the doccatFactory.
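
To illustrate the mismatch, here is a minimal, self-contained sketch. It does not use the OpenNLP classes themselves: SketchTokenizer, SketchWhitespaceTokenizer, SketchSemicolonTokenizer and SketchSampleParser are hypothetical stand-ins, and the "category followed by text" line layout is an assumption made for the example. It contrasts a reader that always splits on whitespace with one that accepts the tokenizer as a parameter, which is one possible shape for a fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a tokenizer interface (not the OpenNLP type).
interface SketchTokenizer {
    String[] tokenize(String text);
}

// Splits on runs of whitespace.
final class SketchWhitespaceTokenizer implements SketchTokenizer {
    static final SketchWhitespaceTokenizer INSTANCE = new SketchWhitespaceTokenizer();

    @Override
    public String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}

// Splits on semicolons and trims each piece (illustrative only).
final class SketchSemicolonTokenizer implements SketchTokenizer {
    static final SketchSemicolonTokenizer INSTANCE = new SketchSemicolonTokenizer();

    @Override
    public String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split(";")) {
            String token = part.trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }
}

// Hypothetical sample reader; assumes each line is "<category> <text>".
final class SketchSampleParser {
    // Mirrors the reported behaviour: the tokenizer cannot be changed.
    static String[] textTokensHardcoded(String line) {
        return textTokens(line, SketchWhitespaceTokenizer.INSTANCE);
    }

    // Possible fix: the caller supplies the tokenizer.
    static String[] textTokens(String line, SketchTokenizer tokenizer) {
        String[] parts = line.trim().split("\\s+", 2);
        String text = parts.length > 1 ? parts[1] : "";
        return tokenizer.tokenize(text);
    }
}

final class TokenizerSketchDemo {
    public static void main(String[] args) {
        String line = "sports football match;world cup";
        System.out.println(Arrays.toString(
                SketchSampleParser.textTokensHardcoded(line)));
        System.out.println(Arrays.toString(
                SketchSampleParser.textTokens(line, SketchSemicolonTokenizer.INSTANCE)));
    }
}
```

With the hardcoded variant the semicolon-delimited phrases are broken apart at whitespace, whereas the parameterised variant keeps "football match" and "world cup" as single tokens. In OpenNLP the equivalent change would presumably let DocumentSampleStream receive the Tokenizer from its caller, but that is a suggestion rather than a description of the existing API.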




--
This message was sent by Atlassian JIRA
(v6.2#6252)
