[ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605721#action_12605721 ]
Karl Wettin commented on LUCENE-1306:
-------------------------------------

I'll refine and document this patch soon. Terribly busy though. Hasty responses:

bq. Should there be a way for the client of this class to specify the prefix and suffix char?

bq. 1. prefix and suffix chars should be configurable. Because user must choose a char that is not used in the terms.

There are getters and setters, but nothing in the constructor.

bq. Is having, for example, "^h" as the first bi-gram token really the right thing to do? Would "^he" make more sense? I know that makes it 3 characters long, but it's 2 chars from the input string. Not sure, so I'm asking.

I have always considered 'start of word' and 'end of word' to be single characters that count toward n. I might be wrong though. I'll have to take a look at what other people did. It would not be very hard to include a setting for that.

bq. Is this primarily to distinguish between the edge and inner n-grams? If so, would it make more sense to just make use of Token type variable instead?

bq. one could use the "flags" to indicate what the token is.

I might be missing something in your line of questioning. I don't understand how the flags or the token type would help, as they are not stored in the index. I don't want separate fields for the prefix, inner and suffix grams; I want to use the same single filter at query time. I typically pass down the gram boost in the payload, evaluated on gram size, how far away it is from the prefix and suffix, etc.

bq. 3. If you want to do a phrase query (for example, "This is"), we have to generate $^ token in the gap to make the positions valid.

If you are creating n-grams over multiple words, say a sentence, then I'd say there should only be a prefix at the start of the sentence and a suffix at the end of the sentence, and that the grams will contain whitespace. I never did phrase queries using grams, but I'd probably want prefix and suffix around each token.
This is another good reason to keep them in the same field with prefix and suffix markers in the token, or?

> CombinedNGramTokenFilter
> ------------------------
>
> Key: LUCENE-1306
> URL: https://issues.apache.org/jira/browse/LUCENE-1306
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/analyzers
> Reporter: Karl Wettin
> Assignee: Karl Wettin
> Priority: Trivial
> Attachments: LUCENE-1306.txt
>
>
> Alternative NGram filter that produces tokens with composite prefix and suffix markers.
> {code:java}
> ts = new WhitespaceTokenizer(new StringReader("hello"));
> ts = new CombinedNGramTokenFilter(ts, 2, 2);
> assertNext(ts, "^h");
> assertNext(ts, "he");
> assertNext(ts, "el");
> assertNext(ts, "ll");
> assertNext(ts, "lo");
> assertNext(ts, "o$");
> assertNull(ts.next());
> {code}
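
To make the "markers count toward n" behaviour discussed above concrete, here is a minimal stand-alone sketch. It is not the code from the attached patch and does not use the Lucene TokenFilter API. The class name CombinedNGramSketch, the method grams, the rejection of terms that already contain a marker, and the order in which gram sizes are emitted are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the filter from LUCENE-1306.txt. */
public class CombinedNGramSketch {

    /**
     * Wraps the term in the marker characters and emits every window of
     * minSize..maxSize characters, so each marker counts as one character of n.
     */
    public static List<String> grams(String term, int minSize, int maxSize,
                                     char prefix, char suffix) {
        // Assumption: a marker that also occurs in the term would be ambiguous,
        // so this sketch rejects such input instead of guessing.
        if (term.indexOf(prefix) >= 0 || term.indexOf(suffix) >= 0) {
            throw new IllegalArgumentException("term contains a marker character");
        }
        String marked = prefix + term + suffix;
        List<String> out = new ArrayList<>();
        // Assumption: smaller gram sizes are emitted before larger ones.
        for (int n = minSize; n <= maxSize; n++) {
            for (int start = 0; start + n <= marked.length(); start++) {
                out.add(marked.substring(start, start + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same input and sizes as the example in the issue description.
        System.out.println(grams("hello", 2, 2, '^', '$'));
    }
}
```

With "hello" and n = 2 this prints [^h, he, el, ll, lo, o$], the same sequence the test in the issue description expects. Treating the markers as extra characters instead (so the first token would be "^he") would need a different windowing rule, which this sketch does not attempt.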