[ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596729#action_12596729 ]
Grant Ingersoll commented on LUCENE-1224:
-----------------------------------------

Hi Hiroaki,

I have been reviewing the tests for this and have a couple of comments. First, I don't see why you need to bring indexing into the equation. Second, the changes to testNGrams still don't test the issue: they don't verify that the output ngrams are actually in the correct position. I think you deduce this later in testIndexAndQuery, but it is never explicitly stated. I'd drop testIndexAndQuery and just fix testNGrams so that it checks the positions appropriately.

On a more philosophical level, it is a bit curious to me that, given the string "abcde fghi", we are fine with "b" being at position 1 and not at position 0, but "ab" needs to be at position 0. I wonder if there are any thoughts on what the relative positions of ngrams should be. Should they all occur at the same position? It seems to me that it doesn't make sense for the "f" ngrams not to start until some position other than 1. This would currently prevent doing phrase queries such as "ab fg" with no slop.

I'm assuming this applies to LUCENE-1225 as well. I will link 1225 to this issue, and you can attach a single patch.

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
> Key: LUCENE-1224
> URL: https://issues.apache.org/jira/browse/LUCENE-1224
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/*
> Reporter: Hiroaki Kawai
> Assignee: Grant Ingersoll
> Priority: Critical
> Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
>
>
> With current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef"
> into an index, but I can't query it with "abc". If I query with "ab", I
> get a hit.
> The reason is that the NGramTokenFilter generates a badly ordered TokenStream.
> A query is based on the Token order in the TokenStream; how stemming or
> phrases should be analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to : ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter will generate : ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order"
> I'd like to submit a patch for this issue. :-)
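
To make the expected ordering concrete, here is a minimal sketch in plain Java of the token order and position increments described in the issue. It is not the attached patch and does not use any Lucene classes; the names NGramPositionSketch, Gram, ngrams and positions are hypothetical and exist only for illustration. It assumes that grams sharing a start offset should share a position (the first gram at each offset gets an increment of 1, the rest get 0), and that the first token lands on position 0.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for the expected filter output, not Lucene code.
public class NGramPositionSketch {

    /** A gram and its position increment relative to the previous gram. */
    static final class Gram {
        final String text;
        final int positionIncrement;

        Gram(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits grams grouped by start offset; grams at the same offset share a position. */
    static List<Gram> ngrams(String input, int minGram, int maxGram) {
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start < input.length(); start++) {
            boolean firstAtStart = true;
            for (int n = minGram; n <= maxGram && start + n <= input.length(); n++) {
                // First gram at this offset advances the position; later ones stack on it.
                out.add(new Gram(input.substring(start, start + n), firstAtStart ? 1 : 0));
                firstAtStart = false;
            }
        }
        return out;
    }

    /** Turns increments into absolute positions, with the first token at position 0. */
    static int[] positions(List<Gram> grams) {
        int[] pos = new int[grams.size()];
        int current = -1;
        for (int i = 0; i < grams.size(); i++) {
            current += grams.get(i).positionIncrement;
            pos[i] = current;
        }
        return pos;
    }

    public static void main(String[] args) {
        List<Gram> grams = ngrams("abc", 2, 4);
        int[] pos = positions(grams);
        for (int i = 0; i < grams.size(); i++) {
            System.out.println(grams.get(i).text + " @ position " + pos[i]);
        }
    }
}
```

For the input "abc" with min=2 and max=4 this prints "ab @ position 0", "abc @ position 0" and "bc @ position 1", which corresponds to the "(ab|abc) bc" reading above. A position-aware testNGrams could compare absolute positions in the same way rather than inferring them from an index-and-query round trip.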