[ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597107#action_12597107 ]
Grant Ingersoll commented on LUCENE-1224:
-----------------------------------------

{quote}Umm..., if you don't like indexing and querying in the unit test, where should I place the join test that uses NGramTokenizer? It might be nice if we could place that join test in a proper place.{quote}

My point is, I don't think the test needs to do any indexing/querying at all to satisfy the change. It adds absolutely nothing to the test and only complicates the matter.

{quote}I placed the testIndexAndQuery in the code because other code, like the KeywordAnalyzer (in the core) test code, has index & query test code in its unit tests.{quote}

Just because another test does it doesn't make it right.

{quote}
If we tokenize with the whitespace tokenizer, the tokens are "This", "is", "an", "example" and the positions are 0,1,2,3.
If we tokenize with 2-grams, the tokens are "Th" "hi" "is" "s " " i" "is" "s " " a" "an" "n " " e" "ex" "xa" "am" "mp" "pl" "le" and the positions are 0,1,2,3,4,...
{quote}

Yes, I understand how it currently works. My question is more along the lines of: is this the right way of doing it? I don't know that it is, but it is a bigger question than you and me. I mean, if we are willing to accept that this issue is a bug, then it presents plenty of other problems in terms of position-related queries. For example, I think it makes sense to search for "th ex" as a phrase query, but that is not possible due to the positions (at least not w/o a lot of slop).

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>       Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
>
>
> With the current trunk NGramTokenFilter(min=2, max=4), I index the string "abcdef", but I can't query it with "abc". If I query with "ab", I do get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. Queries depend on the token order in the TokenStream; how stemming or phrase queries are analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)
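For reference, the token order and position increments under discussion can be inspected directly from the TokenStream, with no indexing or querying involved. The following is only a minimal sketch against the Lucene 2.x-era contrib analyzers API of the time (TokenStream.next(), Token.termText()); the class NGramPositionDump is an illustrative name, and exact signatures differ in later Lucene releases:

{code}
import java.io.StringReader;

import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

// Illustrative sketch only: dumps the grams NGramTokenFilter emits for "abcdef".
public class NGramPositionDump {
  public static void main(String[] args) throws Exception {
    // Wrap "abcdef" in a KeywordTokenizer so the filter sees a single token,
    // then split it into 2..4-grams.
    TokenStream ts = new NGramTokenFilter(
        new KeywordTokenizer(new StringReader("abcdef")), 2, 4);

    // Print each gram with its position increment; this order and these
    // increments are exactly what a position-based query would see.
    Token token;
    while ((token = ts.next()) != null) {
      System.out.println("\"" + token.termText() + "\" (posIncr="
          + token.getPositionIncrement() + ")");
    }
    ts.close();
  }
}
{code}

Asserting on output like this (rather than on a hit count from a throwaway index) would pin down whether the stream comes out as "ab bc abc" or as "ab abc(positionIncrement=0) bc".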