[jira] Commented: (LUCENE-1224) NGramTokenFilter creates bad TokenStream

Grant Ingersoll (JIRA) Wed, 14 May 2008 04:26:23 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596729#action_12596729
 ]


Grant Ingersoll commented on LUCENE-1224:
-----------------------------------------

Hi Hiroaki,

I have been reviewing the tests for this and have a couple of comments.  First, 
I don't see why you need to bring indexing into the equation.  Second, the 
changes to testNGrams still don't test the issue: they don't verify 
that the output ngrams are actually in the correct positions.  I think you 
deduce this later in testIndexAndQuery, but it is never explicitly stated.  I'd 
drop testIndexAndQuery and just fix testNGrams so that it checks the 
positions directly.
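
As a rough illustration of a position check that needs no index, here is a 
plain-Java sketch. It does not use Lucene classes: Gram, ngrams and positions 
are hypothetical stand-ins, and the ordering (grams grouped by start offset, 
with positionIncrement=0 for the longer grams at the same offset) is the one 
the issue description calls "expected", not necessarily what the patch does.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramPositions {
    /** Hypothetical stand-in for a Lucene Token: term text plus position increment. */
    static final class Gram {
        final String term;
        final int positionIncrement;

        Gram(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Emits grams grouped by start offset. The shortest gram at each offset
     * advances the position by 1; longer grams at the same offset get 0.
     */
    static List<Gram> ngrams(String text, int min, int max) {
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start + min <= text.length(); start++) {
            boolean first = true;
            for (int len = min; len <= max && start + len <= text.length(); len++) {
                out.add(new Gram(text.substring(start, start + len), first ? 1 : 0));
                first = false;
            }
        }
        return out;
    }

    /** Absolute positions: start at -1 and add each increment in stream order. */
    static List<Integer> positions(List<Gram> grams) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (Gram g : grams) {
            pos += g.positionIncrement;
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Gram> grams = ngrams("abc", 2, 4);
        System.out.println(grams);            // [ab(+1), abc(+0), bc(+1)]
        System.out.println(positions(grams)); // [0, 0, 1]
    }
}
```

A test along these lines would compare the increments (or the absolute 
positions) of the filter's output against a fixed list, instead of inferring 
them from query hits.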

On a more philosophical level, it is a bit curious to me that, given the 
string "abcde fghi", we are fine with "b" being at position 1 rather than 
position 0, but "ab" needs to be at position 0.  I wonder if there are any 
thoughts on what the relative positions of ngrams should be.  Should they all 
occur at the same position?  It seems to me that it doesn't make sense that 
the "f" ngrams don't start until some position other than 1.  This 
currently prevents phrase queries such as "ab fg" with no slop.
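
To make the phrase-query concern concrete, the following plain-Java sketch 
computes where the first gram at each start offset would land. It assumes, 
purely for illustration, that every start offset advances the position by one 
and that counting continues across whitespace-separated words; 
firstGramPositions is a hypothetical helper, not Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PhraseGapSketch {
    /**
     * Maps the shortest gram at each start offset to its position.
     * Assumption: each start offset advances the position by one, and the
     * count carries over from one word to the next.
     */
    static Map<String, Integer> firstGramPositions(String text, int min) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int pos = 0;
        for (String word : text.split("\\s+")) {
            for (int start = 0; start + min <= word.length(); start++) {
                positions.putIfAbsent(word.substring(start, start + min), pos);
                pos++;
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<String, Integer> p = firstGramPositions("abcde fghi", 2);
        int gap = p.get("fg") - p.get("ab");
        // A phrase query "ab fg" with no slop needs a gap of exactly 1.
        System.out.println("ab@" + p.get("ab") + " fg@" + p.get("fg") + " gap=" + gap);
        // prints: ab@0 fg@4 gap=4
    }
}
```

Under that assumption "fg" sits four positions after "ab", so an exact phrase 
match on "ab fg" cannot succeed.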

I'm assuming this applies to LUCENE-1225 as well.

I will link 1225 to this issue, and you can attach a single patch.

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, 
> NGramTokenFilter.patch
>
>
> With current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef" 
> into an index, but I can't query it with "abc". If I query with "ab", I 
> get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. 
> Queries depend on the Token order in the TokenStream: how stemming or 
> phrases are analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc 
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


