[ https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547774#comment-13547774 ]
Harald Wellmann commented on LUCENE-1227:
-----------------------------------------

As long as this issue is not fixed, please mention the 1024-character truncation in the Javadoc. The combination of KeywordTokenizer and NGramTokenFilter does not scale well for large inputs, because KeywordTokenizer reads the entire input stream into a character buffer.

> NGramTokenizer to handle more than 1024 chars
> ---------------------------------------------
>
>                 Key: LUCENE-1227
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1227
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Hiroaki Kawai
>            Priority: Minor
>         Attachments: LUCENE-1227.patch, NGramTokenizer.patch, NGramTokenizer.patch
>
>
> The current NGramTokenizer can't handle a character stream longer than 1024 characters. This is too short for non-whitespace-separated languages.
> I have created a patch for this issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
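To make the reported truncation concrete, here is a minimal plain-Java sketch (not the actual Lucene implementation; the class name, buffer constant, and gram logic are illustrative assumptions) showing how a fixed 1024-character read buffer silently drops n-grams from the tail of a longer input:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of n-gram emission over a fixed-size read buffer,
// mirroring the 1024-char truncation described in this issue. This is
// NOT Lucene's NGramTokenizer; it only illustrates the effect.
public class NGramSketch {
    static final int BUFFER_SIZE = 1024; // the limit this issue asks to lift

    public static List<String> ngrams(String input, int minGram, int maxGram) {
        // Simulate the buffer limit: only the first 1024 chars are ever seen.
        String seen = input.length() > BUFFER_SIZE
                ? input.substring(0, BUFFER_SIZE)
                : input;
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= seen.length(); i++) {
                grams.add(seen.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) sb.append('a'); // 2000-char input
        List<String> grams = ngrams(sb.toString(), 2, 2);
        // Only bigrams from the first 1024 chars are produced:
        // 1024 - 2 + 1 = 1023, instead of the 1999 a full read would yield.
        System.out.println(grams.size()); // prints 1023
    }
}
```

This is also why routing large documents through KeywordTokenizer first does not help: that tokenizer buffers the whole input as a single token before the n-gram filter ever runs, so memory grows with document size even when the downstream consumer only needs a sliding window.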