[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Hiroaki Kawai (JIRA) Tue, 17 Jun 2008 22:07:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605836#action_12605836
 ]


Hiroaki Kawai commented on LUCENE-1306:
---------------------------------------

First of all, my comment No.3 was not wrong, sorry. We don't have to insert a $^ 
token in the ngram stream.

{quote}
I don't want separate fields for the prefix, inner and suffix grams, I want to 
use the same single filter at query time. 
{quote}

I agree with that. :)

Next, let's consider a phrase query.
1. At store time, we want to store a sentence "This is a pen"
2. At query time, we want to query with "This is"

At store time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$ ^a a$ ^p pe en n$

At query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$

We can find the stored sequence because it contains the query sequence.
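
To make the comparison above concrete, here is a minimal sketch in plain Java. It does not use Lucene or the attached patch. The class and method names (CombinedNGramSketch, perWordGrams, containsPhrase) are made up for illustration, and it assumes the filter simply wraps each whitespace-separated word in ^ and $ before cutting n-grams.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: simulates the token stream described above, not the real filter.
final class CombinedNGramSketch {

    // Split on whitespace, then emit the n-grams of each word wrapped in ^ and $.
    static List<String> perWordGrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            String marked = "^" + word + "$";
            for (int i = 0; i + n <= marked.length(); i++) {
                out.add(marked.substring(i, i + n));
            }
        }
        return out;
    }

    // A phrase match needs the query grams to appear as one contiguous run.
    static boolean containsPhrase(List<String> stored, List<String> query) {
        return Collections.indexOfSubList(stored, query) >= 0;
    }

    public static void main(String[] args) {
        List<String> stored = perWordGrams("This is a pen", 2);
        List<String> query = perWordGrams("This is", 2);
        System.out.println(stored);
        System.out.println(query);
        System.out.println(containsPhrase(stored, query));
    }
}
```

With per-word markers, the query grams form a prefix of the stored grams, so the phrase check succeeds.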

{quote}
If you are creating ngrams over multiple words, say a sentence, then I state 
that there should only be a prefix in the start of the sentence and a suffix 
in the end of the sentence and that grams will contain whitespace.
{quote}

If so, at query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), 
we get:
"^T","Th","hi","is","s "," i","is","s$"

We can't find the stored sequence because it does not contain the query 
sequence: the query ends with "s$", but the stored stream has "s " at that 
position. An n-gram query is always a phrase query at the micro level.
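
The mismatch can be reproduced with a small plain-Java sketch. Again this is only an illustration, not the patch. It assumes the sentence-wide variant puts one ^ before and one $ after the whole input and cuts grams across whitespace. The names SentenceNGramSketch and sentenceGrams are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: sentence-wide markers, grams may contain whitespace.
final class SentenceNGramSketch {

    // Wrap the whole input in a single ^ ... $ pair and emit every n-gram.
    static List<String> sentenceGrams(String text, int n) {
        String marked = "^" + text + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= marked.length(); i++) {
            out.add(marked.substring(i, i + n));
        }
        return out;
    }

    // A phrase match needs the query grams to appear as one contiguous run.
    static boolean containsPhrase(List<String> stored, List<String> query) {
        return Collections.indexOfSubList(stored, query) >= 0;
    }

    public static void main(String[] args) {
        List<String> stored = sentenceGrams("This is a pen", 2);
        List<String> query = sentenceGrams("This is", 2);
        System.out.println(stored);
        System.out.println(query);
        System.out.println(containsPhrase(stored, query));
    }
}
```

Here the query ends in "s$", which never occurs in the stored grams, so the phrase check fails.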

+1 for prefix and suffix markers in the token.

{quote}
Note, also, that one could use the "flags" to indicate what the token is. I 
know that's a little up in the air just yet, but it does exist. 
{quote}

Yes, there are flags, and of course we can use them. But I can't find a way to 
use them efficiently in THIS CASE right now.

{quote}
This would mean that no stripping of special chars is required.
{quote}

Unfortunately, stripping is done outside of the ngram filter by 
WhitespaceTokenizer.

> CombinedNGramTokenFilter
> ------------------------
>
>                 Key: LUCENE-1306
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1306
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1306.txt
>
>
> Alternative NGram filter that produces tokens with composite prefix and suffix 
> markers.
> {code:java}
> ts = new WhitespaceTokenizer(new StringReader("hello"));
> ts = new CombinedNGramTokenFilter(ts, 2, 2);
> assertNext(ts, "^h");
> assertNext(ts, "he");
> assertNext(ts, "el");
> assertNext(ts, "ll");
> assertNext(ts, "lo");
> assertNext(ts, "o$");
> assertNull(ts.next());
> {code}
