[ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597107#action_12597107 ]

Grant Ingersoll commented on LUCENE-1224:
-----------------------------------------

{quote}Umm..., if you don't like indexing and querying in the unit test, where 
should I place the join test that uses NGramTokenizer? It might be nice if we 
could place that join test in a proper place.{quote}

My point is that I don't think the test needs to do any indexing/querying at 
all to satisfy the change.  It adds absolutely nothing to the test and only 
complicates the matter.

{quote}I placed testIndexAndQuery in the code because other code, like the 
KeywordAnalyzer (in the core) test code, has index-and-query test code in its 
unit tests.{quote}

Just because another test does it doesn't make it right.

{quote}
If we want to tokenize with white space tokenizer, the tokens are
"This", "is", "an", "example"
positions are 0,1,2,3

If we want to tokenize with 2-gram, the tokens are
"Th" "hi" "is" "s " " i" "is" "s " " a" "an" "n " " e" "ex" "xa" "am" "mp" "pl" 
"le"
positions are 0,1,2,3,4,...
{quote}
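The 2-gram stream quoted above can be sketched in a few lines (a minimal illustration of sliding-window bigrams with one position per character offset, not the actual Lucene NGramTokenizer):

```python
def char_ngrams(text, n=2):
    """Sliding-window character n-grams. Each gram is emitted at the next
    position (positionIncrement=1), mirroring the behavior quoted above.
    Illustrative sketch only, not the Lucene tokenizer itself."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]

tokens = char_ngrams("This is an example")
print([gram for gram, _ in tokens])  # 'Th', 'hi', 'is', 's ', ' i', ...
print([pos for _, pos in tokens])    # 0, 1, 2, 3, 4, ...
```

This reproduces exactly the token and position lists in the quote: 17 bigrams at positions 0 through 16.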

Yes, I understand how it currently works.  My question is more along the lines 
of: is this the right way of doing it?  I don't know that it is, but it is a 
bigger question than you and me.  I mean, if we are willing to accept that this 
issue is a bug, then it presents plenty of other problems in terms of position-
related queries.  For example, I think it makes sense to search for "th ex" as 
a phrase query, but that is not possible due to the positions (at least not 
without a lot of slop).
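To make the slop point concrete, here is a small sketch (my own illustration, not project code) of where the bigrams "th" and "ex" land when every gram gets its own position:

```python
def bigram_offsets(text, gram):
    # Character offsets where the 2-gram occurs; with positionIncrement=1
    # on every gram, these offsets are also the token positions.
    return [i for i in range(len(text) - 1) if text[i:i + 2].lower() == gram]

text = "This is an example"
th_pos = bigram_offsets(text, "th")  # [0]
ex_pos = bigram_offsets(text, "ex")  # [11]

# A phrase query "th ex" expects "ex" one position after "th"; the actual
# gap means the query only matches with a slop of roughly ex - th - 1:
slop_needed = ex_pos[0] - th_pos[0] - 1
print(slop_needed)  # 10
```

So even though "th" and "ex" begin adjacent words, the per-character positions put them 11 apart, which is why the phrase query fails without a large slop.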




> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, 
> NGramTokenFilter.patch
>
>
> With the current trunk NGramTokenFilter(min=2,max=4), I index the string 
> "abcdef" into an index, but I can't query it with "abc". If I query with 
> "ab", I get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. 
> Querying is based on the Token order in the TokenStream; how stemming or 
> phrases should be analyzed depends on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc, 
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc, 
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

