[jira] Commented: (LUCENE-1224) NGramTokenFilter creates bad TokenStream

Hiroaki Kawai (JIRA) Thu, 15 May 2008 08:16:17 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597156#action_12597156
 ]


Hiroaki Kawai commented on LUCENE-1224:
---------------------------------------

About the test code: I'm not claiming that I'm right. I just wanted to 
address the issue and share what we need to solve. If you don't like the code, 
please tell me a better way to do it. I initially put the code there because 
I thought it was reasonable and proper, but I'm fine with changing it.

{quote}
For example, I think it makes sense to search for "th ex" as a phrase query
{quote}

For example, I think it makes sense to search for "example" as a phrase query 
instead.

I want to point out that NGramTokenizer is very useful for languages that 
are not separated by white space, for example Japanese. In that case, we 
won't search for "th ex", because that query assumes the text is separated by 
white space. I want to search by a fragment of a text sequence.

I agree that this might be a big problem. IMHO, the issue comes from a 
conceptual mismatch between TokenFilter and TermPosition. Should the 
discussion be moved to the mailing list?

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, 
> NGramTokenFilter.patch
>
>
> With the current trunk NGramTokenFilter(min=2,max=4), I index the string 
> "abcdef" into an index, but I can't query it with "abc". If I query with 
> "ab", I get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. 
> A query depends on the Token order in the TokenStream; how stemming or 
> phrases are analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc, 
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc, 
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)
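
To make the expected ordering in the description concrete, here is a minimal sketch in plain Java. It does not use the Lucene API and it is not the attached patch. The class NGramOrderSketch, the Gram holder and the describe helper are hypothetical names made up for this illustration. It assumes that grams starting at the same offset share one position, so only the shortest gram at each offset gets positionIncrement=1 and the longer ones get 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's NGramTokenFilter.
class NGramOrderSketch {

    /** Hypothetical stand-in for a token: term text plus position increment. */
    static final class Gram {
        final String text;
        final int positionIncrement;

        Gram(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Emits grams grouped by start offset, shortest first within each group. */
    static List<Gram> ngrams(String input, int min, int max) {
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start + min <= input.length(); start++) {
            int longest = Math.min(max, input.length() - start);
            for (int len = min; len <= longest; len++) {
                // Assumption: the first gram at an offset advances the position,
                // the longer grams at the same offset are stacked on it.
                int inc = (len == min) ? 1 : 0;
                out.add(new Gram(input.substring(start, start + len), inc));
            }
        }
        return out;
    }

    /** Renders the stream in the notation used in the issue description. */
    static String describe(String input, int min, int max) {
        StringBuilder sb = new StringBuilder();
        for (Gram g : ngrams(input, min, max)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(g.text);
            if (g.positionIncrement == 0) {
                sb.append("(positionIncrement=0)");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: ab abc(positionIncrement=0) bc
        System.out.println(describe("abc", 2, 4));
    }
}
```

With this ordering, the grams of the query "abc" line up with the first grams of the indexed "abcdef", which is the "(ab|abc) bc" reading given above.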

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


