[
https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605836#action_12605836
]
Hiroaki Kawai commented on LUCENE-1306:
---------------------------------------
First of all, my comment No.3 was not wrong, sorry. We don't have to insert a $^
token into the n-gram stream.
{quote}
I don't want separate fields for the prefix, inner and suffix grams, I want to
use the same single filter at query time.
{quote}
I agree with that. :)
Next, let's consider phrase queries.
1. At store time, we want to store a sentence "This is a pen"
2. At query time, we want to query with "This is"
At store time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$ ^a a$ ^p pe en n$
At query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$
We can find the stored sequence because it contains the query sequence.
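The containment above can be checked with a small sketch. This is plain Java, not the Lucene API: the class WordMarkedNGrams and its methods are hypothetical stand-ins for WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), and the emission order is assumed to match the lists above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for WhitespaceTokenizer + CombinedNGramTokenFilter(n, n).
// Plain Java for illustration only; it does not use the Lucene API.
class WordMarkedNGrams {

    // Wrap each whitespace-separated word in ^...$ and emit its n-grams in order.
    static List<String> grams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            String marked = "^" + word + "$";
            for (int i = 0; i + n <= marked.length(); i++) {
                out.add(marked.substring(i, i + n));
            }
        }
        return out;
    }

    // A phrase match needs the query grams to appear as one contiguous run.
    static boolean containsPhrase(List<String> stored, List<String> query) {
        return Collections.indexOfSubList(stored, query) >= 0;
    }

    public static void main(String[] args) {
        List<String> stored = grams("This is a pen", 2);
        List<String> query = grams("This is", 2);
        System.out.println(stored);
        System.out.println(query);
        System.out.println(containsPhrase(stored, query));
    }
}
```

With per-word markers, the query grams for "This is" form a contiguous run at the start of the stored grams, so the phrase check succeeds.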
{quote}
If you are creating ngrams over multiple words, say a sentence, then I state
that there should only be a prefix at the start of the sentence and a suffix
at the end of the sentence, and that grams will contain whitespace.
{quote}
If so, at query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2),
we get:
"^T","Th","hi","is","s "," i","is","s$"
We can't find the stored sequence because it does not contain the query
sequence. An n-gram query is always a phrase query at the micro level.
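A minimal sketch of that failure, assuming the markers are placed only around the whole input and that whitespace is kept inside the grams. The class SentenceMarkedNGrams is hypothetical plain Java, not the Lucene API or the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sentence-level variant: one ^ at the start and one $ at the end
// of the whole input, with whitespace kept inside the grams.
class SentenceMarkedNGrams {

    // Emit the n-grams of "^" + text + "$" in order.
    static List<String> grams(String text, int n) {
        String marked = "^" + text + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= marked.length(); i++) {
            out.add(marked.substring(i, i + n));
        }
        return out;
    }

    // A phrase match needs the query grams to appear as one contiguous run.
    static boolean containsPhrase(List<String> stored, List<String> query) {
        return Collections.indexOfSubList(stored, query) >= 0;
    }

    public static void main(String[] args) {
        List<String> stored = grams("This is a pen", 2);
        List<String> query = grams("This is", 2);
        System.out.println(stored);
        System.out.println(query);
        // The query ends in "s$", but the stored stream has "s " at that position.
        System.out.println(containsPhrase(stored, query));
    }
}
```

Only the final query gram differs ("s$" against the stored "s "), and that single gram is enough to break the phrase match.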
+1 for prefix and suffix markers in the token.
{quote}
Note, also, that one could use the "flags" to indicate what the token is. I
know that's a little up in the air just yet, but it does exist.
{quote}
Yes, there are flags, and of course we can use them. But I can't see a way to
use them efficiently in THIS CASE right now.
{quote}
This would mean that no stripping of special chars is required.
{quote}
Unfortunately, stripping is done outside of the ngram filter by
WhitespaceTokenizer.
> CombinedNGramTokenFilter
> ------------------------
>
> Key: LUCENE-1306
> URL: https://issues.apache.org/jira/browse/LUCENE-1306
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/analyzers
> Reporter: Karl Wettin
> Assignee: Karl Wettin
> Priority: Trivial
> Attachments: LUCENE-1306.txt
>
>
> Alternative NGram filter that produces tokens with composite prefix and suffix
> markers.
> {code:java}
> TokenStream ts = new WhitespaceTokenizer(new StringReader("hello"));
> ts = new CombinedNGramTokenFilter(ts, 2, 2);
> assertNext(ts, "^h");
> assertNext(ts, "he");
> assertNext(ts, "el");
> assertNext(ts, "ll");
> assertNext(ts, "lo");
> assertNext(ts, "o$");
> assertNull(ts.next());
> {code}