[jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter

Karl Wettin (JIRA) Tue, 17 Jun 2008 13:59:11 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605721#action_12605721
 ]


Karl Wettin commented on LUCENE-1306:
-------------------------------------

I'll refine and document this patch soon. Terribly busy though. Hasty responses:

bq. Should there be a way for the client of this class to specify the prefix 
and suffix char? 
bq. 1. prefix and suffix chars should be configurable. Because user must choose 
a char that is not used in the terms.

There are getters and setters, but nothing in the constructor.

bq. Is having, for example, "^h" as the first bi-gram token really the right 
thing to do? Would "^he" make more sense? I know that makes it 3 characters 
long, but it's 2 chars from the input string. Not sure, so I'm asking.

I have always treated 'start of word' and 'end of word' as single characters 
that count toward n. I might be wrong though. I'll have to take a look at what 
other people did. It would not be very hard to include a setting for that.
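
To make the two readings concrete, here is a minimal sketch. It is not the code 
from the attached patch: the class name CombinedGramSketch, the grams method and 
the markersCountTowardN flag are made up for this illustration, and the ordering 
of the edge grams in the second mode is just one possible choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not the CombinedNGramTokenFilter from the patch.
class CombinedGramSketch {

    // markersCountTowardN == true:  the marker is one of the n characters ("^h").
    // markersCountTowardN == false: the marker is added on top of n input chars ("^he").
    static List<String> grams(String term, int n, char prefix, char suffix,
                              boolean markersCountTowardN) {
        List<String> out = new ArrayList<>();
        if (markersCountTowardN) {
            // Wrap the term in the markers and slide a window of n chars over it.
            String padded = prefix + term + suffix;
            for (int i = 0; i + n <= padded.length(); i++) {
                out.add(padded.substring(i, i + n));
            }
            return out;
        }
        // Slide over the raw term; decorate only the first and the last gram.
        for (int i = 0; i + n <= term.length(); i++) {
            String gram = term.substring(i, i + n);
            if (i == 0) {
                out.add(prefix + gram);
            }
            out.add(gram);
            if (i + n == term.length()) {
                out.add(gram + suffix);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(grams("hello", 2, '^', '$', true));  // [^h, he, el, ll, lo, o$]
        System.out.println(grams("hello", 2, '^', '$', false)); // [^he, he, el, ll, lo, lo$]
    }
}
```

The first mode reproduces the token sequence from the test in the issue 
description; the second is the "^he" variant asked about above.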

bq. Is this primarily to distinguish between the edge and inner n-grams? If so, 
would it make more sense to just make use of Token type variable instead?
bq. one could use the "flags" to indicate what the token is. 

I might be missing something in your line of questioning. I don't understand 
how it would help to have the flag or token type, as they are not stored in 
the index.

I don't want separate fields for the prefix, inner and suffix grams; I want to 
use the same single filter at query time. I typically pass down the gram boost 
in the payload, evaluated on gram size, how far away the gram is from the 
prefix and suffix, etc.

bq. 3. If you want to do a phrase query (for example, "This is"), we have to 
generate $^ token in the gap to make the positions valid.

If you are creating n-grams over multiple words, say a sentence, then I'd say 
there should only be a prefix at the start of the sentence and a suffix at the 
end of the sentence, and that grams will contain whitespace. I have never done 
phrase queries using grams, but I'd probably want a prefix and suffix around 
each token. This is another good reason to keep them in the same field with 
prefix and suffix markers in the token, or?
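
A rough sketch of the difference between the two marker scopes, using the 
"This is" example from the question. The class MarkerScopeSketch and its 
methods are hypothetical names for this illustration and are not part of the 
patch; the markers are fixed to '^' and '$' here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not part of the attached patch.
class MarkerScopeSketch {

    // Grams of size n over "^" + text + "$"; each marker counts as one character.
    static List<String> grams(String text, int n) {
        String padded = "^" + text + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }

    // One prefix and one suffix for the whole sentence; grams may contain whitespace.
    static List<String> perSentence(String sentence, int n) {
        return grams(sentence, n);
    }

    // Prefix and suffix around every whitespace-separated token.
    static List<String> perToken(String sentence, int n) {
        List<String> out = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            out.addAll(grams(word, n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(perSentence("This is", 2)); // [^T, Th, hi, is, s ,  i, is, s$]
        System.out.println(perToken("This is", 2));    // [^T, Th, hi, is, s$, ^i, is, s$]
    }
}
```

In the per-token variant the word boundary shows up as "s$" followed by "^i" 
rather than as grams containing a space.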

> CombinedNGramTokenFilter
> ------------------------
>
>                 Key: LUCENE-1306
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1306
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1306.txt
>
>
> Alternative NGram filter that produces tokens with composite prefix and suffix 
> markers.
> {code:java}
> ts = new WhitespaceTokenizer(new StringReader("hello"));
> ts = new CombinedNGramTokenFilter(ts, 2, 2);
> assertNext(ts, "^h");
> assertNext(ts, "he");
> assertNext(ts, "el");
> assertNext(ts, "ll");
> assertNext(ts, "lo");
> assertNext(ts, "o$");
> assertNull(ts.next());
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


