[ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834471#action_12834471 ]
Robert Muir commented on LUCENE-1224:
-------------------------------------

I too think it's really important we fix this. I have to agree with Hiroaki's analysis of the situation, and the problems can be seen by looking at the code in both the filters/tokenizers and the tests themselves.

Currently the tokenizers are limited to 1024 characters (LUCENE-1227), which is closely related to this issue. Look at the test for 1..3-grams of "abcde":

{code}
public void testNgrams() throws Exception {
  NGramTokenizer tokenizer = new NGramTokenizer(input, 1, 3);
  assertTokenStreamContents(tokenizer,
    new String[]{"a","b","c","d","e", "ab","bc","cd","de", "abc","bcd","cde"});
}
{code}

In my opinion the output should instead be: a, ab, ... (a toy sketch of this emission order follows the quoted issue description below). Otherwise the tokenizer must either always be limited to 1024 chars or read the entire document into RAM. The same problem exists for the EdgeNGram variants.

I agree with Grant's comment about the philosophical discussion of token positions; perhaps we need an option for this (where either all tokens get posInc=1, or posInc=0 is generated based on whitespace; a small posInc example also follows below). I guess we could accommodate both needs by having tokenizer/filter variants too, but I'm not sure.

The general problem I have with trying to determine a fix is that it will break backwards compatibility, and I also know that EdgeNGram is being used for purposes such as "auto-suggest". So I don't really have any idea beyond making new filters/tokenizers, as I think there is another use case where the old behavior fits.

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
>
>
> With current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef", but I can't query it with "abc"; if I query with "ab", I do get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. Queries depend on the token order in the TokenStream: how stemming or phrases should be analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter would instead generate: ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)
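To make the positionIncrement=0 idea from the description concrete, here's a toy sketch (plain Java, no Lucene classes; PosIncDemo is just a made-up name for illustration) of how the expected stream for the query string "abc" stacks "abc" on top of "ab":

{code}
// Toy illustration of the expected stream for "abc" with min=2, max=4:
// "abc" shares its start offset with "ab", so it gets positionIncrement=0
// and stacks at the same position as "ab".
public class PosIncDemo {
  public static void main(String[] args) {
    String[] terms   = {"ab", "abc", "bc"};
    int[]    posIncs = { 1,    0,     1  };
    int pos = -1; // position counter starts before the first token
    for (int i = 0; i < terms.length; i++) {
      pos += posIncs[i];
      System.out.println("position " + pos + ": " + terms[i]);
    }
    // prints: position 0: ab / position 0: abc / position 1: bc
    // i.e. "query a string that has (ab|abc) bc in this order"
  }
}
{code}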
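And here's a toy sketch of the emission order I mean by "a, ab, ..." (again plain Java, not Lucene code; StreamingNGramDemo is a made-up name). Grouping grams by start position means at most max chars ever need to be buffered, so neither the 1024-char limit nor reading the whole document into RAM is necessary:

{code}
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Emit 1..3-grams of "abcde" grouped by start position, reading the
// input one char at a time and buffering at most 'max' chars.
public class StreamingNGramDemo {
  public static void main(String[] args) throws IOException {
    Reader input = new StringReader("abcde");
    final int min = 1, max = 3;
    StringBuilder buf = new StringBuilder();
    boolean eof = false;
    int ch;
    while (true) {
      // fill the lookahead window up to 'max' chars
      while (!eof && buf.length() < max) {
        if ((ch = input.read()) == -1) eof = true;
        else buf.append((char) ch);
      }
      if (buf.length() == 0) break;
      // emit every gram that starts at the current position
      for (int len = min; len <= Math.min(max, buf.length()); len++) {
        System.out.println(buf.substring(0, len));
      }
      buf.deleteCharAt(0); // slide the window to the next start position
    }
    // prints: a ab abc b bc bcd c cd cde d de e
  }
}
{code}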