[jira] Commented: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated

Karl Wettin (JIRA) Sun, 31 Aug 2008 06:20:39 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627308#action_12627308
 ]


Karl Wettin commented on LUCENE-1370:
-------------------------------------

It's an OK filter setting if you ask me.

However, I'm curious as to why you don't query for unigrams unless the input is 
a single token. That means you always require zero slop between any two tokens 
of the input. I know nothing about your needs, but it could be dangerous. You 
can always boost the bigrams a bit more than the unigrams if they cause a 
problem. I think you should benchmark the cost. I'm sure it's rather small and 
that you'll get better quality results by doing that. Users tend to never enter 
a query the way I want them to.
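
As a rough illustration of that suggestion, here is a minimal sketch in plain 
Java that pairs every unigram and bigram of the input with a boost. The class 
name, method name and boost values are all made up for the example; nothing 
here is a Lucene API, and each entry would still have to be turned into an 
optional clause of a real query.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: assigns a boost to each query term.
class BoostedTermsSketch {

    // Returns unigrams and bigrams in input order, each mapped to its boost.
    // A repeated token simply keeps a single entry with the same boost.
    static Map<String, Float> boostedTerms(List<String> tokens,
                                           float unigramBoost,
                                           float bigramBoost) {
        Map<String, Float> clauses = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            clauses.put(tokens.get(i), unigramBoost);
            if (i + 1 < tokens.size()) {
                // Adjacent pair joined with a space, like the shingles in the issue.
                clauses.put(tokens.get(i) + " " + tokens.get(i + 1), bigramBoost);
            }
        }
        return clauses;
    }
}
```

With boosts such as 1.0 for unigrams and 2.0 for bigrams, documents that 
match the exact word pairs would still rank higher, while documents that 
match only some of the words would no longer be excluded outright.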

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-1370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1370
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Chris Harris
>         Attachments: ShingleFilter.patch
>
>
> Currently if ShingleFilter.outputUnigrams==false and the underlying token 
> stream is only one token long, then ShingleFilter.next() won't return any 
> tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this 
> option is set and the underlying stream is only one token long, then 
> ShingleFilter will return that token, regardless of the setting of 
> outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, do index-time analysis using ShingleFilter (with 
> outputUnigrams==true), thereby expanding things as follows:
> "please divide this sentence into shingles" ->
>  "please", "please divide"
>  "divide", "divide this"
>  "this", "this sentence"
>  "sentence", "sentence into"
>  "into", "into shingles"
>  "shingles"
> Second, do query-time analysis using ShingleFilter (with 
> outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters 
> a phrase query, it will get tokenized in the following manner:
> "please divide this sentence into shingles" ->
>  "please divide"
>  "divide this"
>  "this sentence"
>  "sentence into"
>  "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very 
> considerable speedup. Without the outputUnigramIfNoNgrams option, a 
> single-word query would tokenize like this:
> "please" ->
>    [no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like 
> this:
> "please" ->
>   "please"
> ****
> The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.
> ****
> I'm not sure if the patch in this state is useful to anyone else, but I 
> thought I should throw it up here and try to find out.
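
The behaviour described in the quoted issue can be sketched in plain Java 
without Lucene. This is only an illustration under stated assumptions: the 
class ShingleSketch and its method are hypothetical, the input is an 
already-tokenized list rather than a TokenStream, and the shingle size is 
fixed at two. It is not the code in ShingleFilter.patch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the described behaviour; NOT the Lucene
// ShingleFilter API, just plain Java working on a list of tokens.
class ShingleSketch {

    // Emits bigrams, optionally interleaved with unigrams. The two boolean
    // parameters mirror the option names used in the issue description.
    static List<String> shingles(List<String> tokens,
                                 boolean outputUnigrams,
                                 boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        // Fallback: a one-token stream yields no bigram, so emit the token itself.
        if (out.isEmpty() && outputUnigramIfNoNgrams && tokens.size() == 1) {
            out.add(tokens.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence =
            List.of("please", "divide", "this", "sentence", "into", "shingles");
        System.out.println(shingles(sentence, true, false));  // index-time setting
        System.out.println(shingles(sentence, false, true));  // query-time setting
        System.out.println(shingles(List.of("please"), false, false));
        System.out.println(shingles(List.of("please"), false, true));
    }
}
```

The fallback only fires when nothing else was emitted, so it has no effect 
when outputUnigrams is already true or when the input has two or more tokens.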

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


