[ 
https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1370:
--------------------------------

    Attachment: LUCENE-1370.patch

The previous patch predates the rewrite of ShingleFilter done under 
LUCENE-2218, so it needed to be rejiggered somewhat.

Changes from the previous patch:

# The new patch simply enables unigram output if the number of input tokens is 
less than minShingleSize.  The existing code then handles the situation 
appropriately, and reset() restores the original unigram output option.
# I renamed the option "outputUnigramIfNoNgrams" to 
"outputUnigramsIfNoShingles", because:
** Unigram -> Unigrams: the output could result in more than one unigram if 
minShingleSize is greater than the default 2; and
** Ngrams -> Shingles: for consistency with the class's name.
# I renamed "returnedAnyTokensYet" to "noShingleOutput", and reversed its 
(boolean) sense, because:
** unigrams should be output only if no *shingles* can be output, rather than 
no *tokens*; and
** reversing the sense allowed the test using it to avoid negation, and allowed 
the name to be shorter.
# I added a note to the setOutputUnigramsIfNoShingles() method javadoc to the 
effect that if outputUnigrams == true, unigrams will always be output regardless 
of the setting of outputUnigramsIfNoShingles.
# I added a test that makes sure that when minShingleSize > 2 and the number of 
input tokens is less than minShingleSize, (multiple) unigrams are output.
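
The behavior described in the list above can be sketched roughly as follows. This is a standalone simulation, not the patch and not ShingleFilter's code: the class ShingleSketch and the method shingles() are hypothetical names, it uses no Lucene API, and it sees the whole token list up front, whereas the real filter consumes a stream and relies on reset() to restore the option. It also assumes minShingleSize >= 2 and a single space as the token separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone simulation of the option's effect.
// NOT Lucene's ShingleFilter and NOT the attached patch.
class ShingleSketch {

    /**
     * Emits, for each start position, the shingles of minShingleSize..maxShingleSize
     * tokens that fit. Parameter names mirror the options discussed in this issue.
     */
    static List<String> shingles(List<String> tokens, int minShingleSize, int maxShingleSize,
                                 boolean outputUnigrams, boolean outputUnigramsIfNoShingles) {
        // Unigram output is switched on when the input is too short to form
        // even the smallest shingle; outputUnigrams == true always wins.
        boolean emitUnigrams = outputUnigrams
                || (outputUnigramsIfNoShingles && tokens.size() < minShingleSize);

        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            if (emitUnigrams) {
                out.add(tokens.get(start));
            }
            for (int size = minShingleSize; size <= maxShingleSize; size++) {
                int end = start + size;
                if (end > tokens.size()) {
                    break; // no larger shingle can fit from this position
                }
                out.add(String.join(" ", tokens.subList(start, end)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two tokens, minShingleSize 3: no shingle fits, so both unigrams come out.
        System.out.println(shingles(List.of("please", "divide"), 3, 3, false, true));
        // prints [please, divide]

        // Three tokens: one shingle fits, so no unigrams are added.
        System.out.println(shingles(List.of("please", "divide", "this"), 3, 3, false, true));
        // prints [please divide this]

        // Fallback disabled: nothing is output for the short input.
        System.out.println(shingles(List.of("please", "divide"), 3, 3, false, false));
        // prints []
    }
}
```

The first call in main() corresponds to the new test case: minShingleSize > 2 with fewer input tokens than minShingleSize yields multiple unigrams.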


> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-1370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1370
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
>            Reporter: Chris Harris
>            Assignee: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, 
> LUCENE-1370.patch, LUCENE-1370.patch, ShingleFilter.patch
>
>
> Currently if ShingleFilter.outputUnigrams==false and the underlying token 
> stream is only one token long, then ShingleFilter.next() won't return any 
> tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this 
> option is set and the underlying stream is only one token long, then 
> ShingleFilter will return that token, regardless of the setting of 
> outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, do index-time analysis using ShingleFilter (using 
> outputUnigrams==true), thereby expanding things as follows:
> "please divide this sentence into shingles" ->
>  "please", "please divide"
>  "divide", "divide this"
>  "this", "this sentence"
>  "sentence", "sentence into"
>  "into", "into shingles"
>  "shingles"
> Second, do query-time analysis using ShingleFilter (using 
> outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters 
> a phrase query, it will get tokenized in the following manner:
> "please divide this sentence into shingles" ->
>  "please divide"
>  "divide this"
>  "this sentence"
>  "sentence into"
>  "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very 
> considerable speedup. Without the outputUnigramIfNoNgrams option, a 
> single-word query would tokenize like this:
> "please" ->
>    [no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like 
> this:
> "please" ->
>   "please"
> ****
> The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.
> ****
> I'm not sure if the patch in this state is useful to anyone else, but I 
> thought I should throw it up here and try to find out.
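
For illustration, the index-time and query-time analysis in the description above might be simulated like this. It is a minimal sketch under stated assumptions: BigramPhraseSketch and analyze() are hypothetical names, input is split on whitespace only, shingles are bigrams only, and no Lucene classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical whitespace-tokenizing bigram analyzer; not a Lucene API.
class BigramPhraseSketch {

    /**
     * Returns the terms produced for text. unigramFallback plays the role of
     * the option proposed in this issue.
     */
    static List<String> analyze(String text, boolean outputUnigrams, boolean unigramFallback) {
        List<String> out = new ArrayList<>();
        if (text.trim().isEmpty()) {
            return out; // nothing to tokenize
        }
        String[] words = text.trim().split("\\s+");
        // A single word cannot form a bigram, so fall back to the word itself.
        boolean emitUnigrams = outputUnigrams || (unigramFallback && words.length < 2);
        for (int i = 0; i < words.length; i++) {
            if (emitUnigrams) {
                out.add(words[i]);
            }
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Index time: unigrams and bigrams.
        System.out.println(analyze("please divide this", true, false));
        // prints [please, please divide, divide, divide this, this]

        // Query time: bigrams only for a phrase.
        System.out.println(analyze("please divide this", false, true));
        // prints [please divide, divide this]

        // Query time, single word: the fallback keeps the query from being empty.
        System.out.println(analyze("please", false, true));
        // prints [please]
    }
}
```

Every query-time term in this sketch also appears among the index-time terms, which is what lets the bigram phrase query and the single-word query both find matches.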
