[ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834471#action_12834471 ]

Robert Muir commented on LUCENE-1224:
-------------------------------------

I too think it's really important that we fix this. I agree with Hiroaki's 
analysis of the situation, and the problems can be seen in the code of both 
the filters/tokenizers and the tests themselves.

Currently the tokenizers are limited to 1024 characters (LUCENE-1227), which is 
closely related to this issue.
Look at the test for (1,3) ngrams of "abcde":
{code}
public void testNgrams() throws Exception {
  NGramTokenizer tokenizer = new NGramTokenizer(input, 1, 3);
  assertTokenStreamContents(tokenizer,
      new String[]{"a","b","c","d","e", "ab","bc","cd","de", "abc","bcd","cde"});
}
{code}

In my opinion the output should instead be: a, ab, ... (all grams for one start 
offset emitted before moving on to the next start offset). Otherwise the 
tokenizer will either always be limited to 1024 chars or must read the entire 
document into RAM.
This same problem exists for the EdgeNGram variants.
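
To make that concrete, here is a rough sketch of what the streaming expectation 
might look like (illustrative only, not a patch, reusing the tokenizer from the 
test above):
{code}
// All grams for start offset 0 come first (a, ab, abc), then offset 1, and so
// on. A tokenizer emitting in this order can work through an arbitrarily long
// Reader without buffering the whole input.
assertTokenStreamContents(tokenizer,
    new String[]{"a","ab","abc", "b","bc","bcd", "c","cd","cde", "d","de", "e"});
{code}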

I agree with Grant's comment about the philosophical discussion on token 
positions; perhaps we need an option for this (either all grams get posInc=1, 
or posInc=0 is generated based on whitespace). I think we could accommodate 
both needs by having tokenizer/filter variants too, but I'm not sure.
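
For example, with min=2,max=3 on "abc", the two options would look roughly like 
this (a sketch, assuming the assertTokenStreamContents overload that takes 
expected position increments):
{code}
// Option A: every gram simply advances the position (posInc=1 for all).
// Option B: grams that start at the same offset stack on one position:
//           ab(posInc=1) abc(posInc=0) bc(posInc=1),
//           which is the output Hiroaki describes in the issue.
assertTokenStreamContents(filter,
    new String[]{"ab", "abc", "bc"},
    new int[]{1, 0, 1});   // position increments for option B
{code}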

The general problem I have with trying to determine a fix is that it will break 
backwards compatibility, and I also know that EdgeNGram is being used for 
purposes such as "auto-suggest". So I don't really have any ideas beyond adding 
new filters/tokenizers, since I think there are use cases where the old 
behavior still fits.


> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, 
> NGramTokenFilter.patch
>
>
> With current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef" 
> into an index, but I can't query it with "abc". If I query with "ab", I get a 
> hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. 
> Querying depends on the Token order in the TokenStream; how stemming or a 
> phrase is analyzed is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc, 
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc, 
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

