[jira] Commented: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated

Chris Harris (JIRA) Sun, 31 Aug 2008 14:16:37 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627348#action_12627348 ]


Chris Harris commented on LUCENE-1370:
--------------------------------------

: Are you saying it is 50x faster with shingle queries that contain only bigrams
: compared to shingle queries that contain both unigrams and bigrams? Or is it
: 50x faster using shingles compared to phrase queries? (I've myself seen
: performance gains similar to the latter.)

I'm not sure I totally understand you, so let me try rephrasing things. What I 
mean is that phrase queries that have only bigrams, like this:

PhraseQuery
{
"please divide"
"divide this"
"this sentence"
"sentence into"
"into shingles"
}

run maybe 50x as fast as phrase queries that have both bigrams and unigrams, 
like this:

PhraseQuery
{
"please", "please divide"
"divide", "divide this"
"this", "this sentence"
"sentence", "sentence into"
"into", "into shingles"
"shingles"
}

If it clarifies things any further, let me say that I'm handling all quoted 
phrase queries with the normal PhraseQuery class; I'm not, for instance, 
turning quoted phrases into some kind of BooleanQuery. (Technically it's not me 
that's making the PhraseQuery object but Solr and its query parser. But Solr is 
indeed turning my quoted phrase queries into normal PhraseQuery objects.)

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-1370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1370
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Chris Harris
>         Attachments: ShingleFilter.patch
>
>
> Currently if ShingleFilter.outputUnigrams==false and the underlying token 
> stream is only one token long, then ShingleFilter.next() won't return any 
> tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this 
> option is set and the underlying stream is only one token long, then 
> ShingleFilter will return that token, regardless of the setting of 
> outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, do index-time analysis using ShingleFilter (with 
> outputUnigrams==true), thereby expanding things as follows:
> "please divide this sentence into shingles" ->
>  "please", "please divide"
>  "divide", "divide this"
>  "this", "this sentence"
>  "sentence", "sentence into"
>  "into", "into shingles"
>  "shingles"
> Second, do query-time analysis using ShingleFilter (using 
> outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters 
> a phrase query, it will get tokenized in the following manner:
> "please divide this sentence into shingles" ->
>  "please divide"
>  "divide this"
>  "this sentence"
>  "sentence into"
>  "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very 
> considerable speedup. Without the outputUnigramIfNoNgrams option, a 
> single-word query would tokenize like this:
> "please" ->
>    [no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like 
> this:
> "please" ->
>   "please"
> ****
> The patch also slightly extends the tests for the options that predate 
> outputUnigramIfNoNgrams.
> ****
> I'm not sure if the patch in this state is useful to anyone else, but I 
> thought I should throw it up here and try to find out.
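
To make the tokenization in the description above concrete, here is a minimal stand-alone sketch in plain Java. It is not the real ShingleFilter and not the attached patch: the class ShingleSketch and the helper shingle() are hypothetical names made up for illustration, and the sketch only models bigrams (a maximum shingle size of 2) on a list of strings, ignoring positions and offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a plain-Java model of the behaviour described in the
// issue, NOT Lucene's ShingleFilter. All names here are hypothetical.
final class ShingleSketch {

    // Emits tokens in the order shown in the issue: each unigram (if enabled)
    // followed by the bigram that starts at that token.
    static List<String> shingle(List<String> tokens,
                                boolean outputUnigrams,
                                boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        // Proposed fallback: a one-token stream produces no bigrams, so pass
        // the lone token through instead of returning nothing.
        if (out.isEmpty() && outputUnigramIfNoNgrams && tokens.size() == 1) {
            out.add(tokens.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList(
                "please", "divide", "this", "sentence", "into", "shingles");
        // Index-time settings: unigrams and bigrams.
        System.out.println(shingle(phrase, true, false));
        // Query-time settings: bigrams only, with the single-token fallback.
        System.out.println(shingle(phrase, false, true));
        // Single-word query without and with the proposed option.
        System.out.println(shingle(Arrays.asList("please"), false, false));
        System.out.println(shingle(Arrays.asList("please"), false, true));
    }
}
```

With the query-time settings the six-word phrase yields five bigram terms, while a single-word query still yields one term rather than none, which is the case the new option is meant to cover.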
