[
https://issues.apache.org/jira/browse/SOLR-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222461#comment-13222461
]
Robert Muir commented on SOLR-2660:
-----------------------------------
I think this could be a good option (in combination with shingles, as
mentioned) to accelerate the phrase queries that Solr query parsers generate
in order to boost closer matches.
Again, the idea is to omit positions entirely and instead use ShingleFilter
(unigrams and bigrams), approximating phrase queries with n-gram conjunctions.
For the sloppy case, I think we should use an n-gram disjunction, perhaps
interpreting the slop factor as minNrShouldMatch?
This basically means you are replacing Levenshtein distance with an n-gram
approximation in both cases.
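To make the two approximations concrete, here is a minimal sketch in plain
Python. It is not Lucene or Solr code: the helper names (index_terms,
matches_phrase, matches_sloppy) and the "_" separator for bigrams are
assumptions for illustration, and min_should_match is passed in directly
because how slop should map onto it is still an open question.

```python
def bigrams(tokens):
    """Adjacent word pairs joined with '_' (separator is an assumption)."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def index_terms(text):
    """Terms a positionless, shingled field would hold: unigrams plus bigrams."""
    tokens = text.lower().split()
    return set(tokens) | set(bigrams(tokens))


def query_grams(phrase):
    """Bigrams of the phrase, or the single term for a one-word phrase."""
    tokens = phrase.lower().split()
    return bigrams(tokens) if len(tokens) > 1 else tokens


def matches_phrase(doc_terms, phrase):
    """Exact phrase approximated as a conjunction: every bigram must be present."""
    return all(g in doc_terms for g in query_grams(phrase))


def matches_sloppy(doc_terms, phrase, min_should_match):
    """Sloppy phrase approximated as a disjunction with a minimum match count."""
    hits = sum(1 for g in query_grams(phrase) if g in doc_terms)
    return hits >= min_should_match


doc = index_terms("the quick brown fox")
print(matches_phrase(doc, "quick brown fox"))     # all bigrams present
print(matches_sloppy(doc, "quick brown dog", 1))  # one of two bigrams present
```

Neither check touches position data; both only test for term membership, which
is the point of the tradeoff.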
In general it's a classic indexing/search tradeoff: in my tests on Wikipedia,
indexing takes roughly twice as long with the shingles, but in return a lot of
these use cases don't need to consult the positions file at all.
As a parameter to the fieldtype it's easily pluggable without messing with any
query parsers, and ordinary queries (term, boolean, etc.) are totally
'pass-thru'. *However*, the thing I guess I don't like about this patch is
that this is really a different 'query intent'. In other words, I think it's a
perfect approach when you just want to boost the scores of close matches
(e.g. when generated by the dismax query parser), but when your 'intent' is to
actually limit matches to a phrase (e.g. when keyed in by a user directly),
this approximation isn't as good a fit.
Either way, I'm open to other opinions before doing anything (if we decide to
do it, I think the next step is to update the patch with the SloppyPhraseQuery
approximation).
> omitPositions improvements
> --------------------------
>
> Key: SOLR-2660
> URL: https://issues.apache.org/jira/browse/SOLR-2660
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 3.3, 4.0
> Reporter: Robert Muir
> Priority: Minor
> Attachments: SOLR-2660.patch
>
>
> followup to LUCENE-2048:
> Adds factory methods getPhraseQuery/getMultiPhraseQuery to QP, this way you
> can subclass it and customize behavior, particularly
> * by default, Solr throws an exception here if the fieldtype omits positions,
> rather than 3.x's silent failure of no results; even for trunk it's nicer
> to fail during query parsing rather than waiting for Lucene's failure during
> execution.
> * adds phraseAsBoolean, which allows you to downgrade these
> phrase/multiphrase queries to boolean queries: this is a nice option in
> conjunction with our word n-gram filters (shingle/commongrams/etc.) for a fast
> "approximation", if your application can tolerate some false positives, e.g.
> "foo bar" -> termQuery(foo_bar), "foo bar baz" -> BQ(foo_bar AND bar_baz)
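The phraseAsBoolean rewrite in the issue description, and the kind of false
positive it admits, can be sketched in plain Python as follows. This is an
illustration only, not the patch: phrase_as_boolean, evaluate, bigram_index
and contains_phrase are hypothetical helpers, and queries are modelled as
simple tuples rather than Lucene Query objects.

```python
def phrase_as_boolean(phrase):
    """Rewrite a phrase as ('term', gram) or ('and', [grams]) over word bigrams."""
    tokens = phrase.split()
    if not tokens:
        raise ValueError("empty phrase")
    if len(tokens) == 1:
        return ("term", tokens[0])
    grams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return ("term", grams[0]) if len(grams) == 1 else ("and", grams)


def bigram_index(text):
    """Set of word bigrams a shingled field would contain for this text."""
    tokens = text.split()
    return {f"{a}_{b}" for a, b in zip(tokens, tokens[1:])}


def evaluate(query, terms):
    """Evaluate the rewritten query against a set of indexed terms."""
    kind, payload = query
    if kind == "term":
        return payload in terms
    return all(g in terms for g in payload)


def contains_phrase(text, phrase):
    """Ground truth: do the phrase tokens occur contiguously in the text?"""
    tokens, wanted = text.split(), phrase.split()
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


doc = "foo bar qux bar baz"
query = phrase_as_boolean("foo bar baz")
print(query)                                   # ('and', ['foo_bar', 'bar_baz'])
print(evaluate(query, bigram_index(doc)))      # True: both bigrams are indexed
print(contains_phrase(doc, "foo bar baz"))     # False: a false positive
```

The document matches the boolean rewrite because foo_bar and bar_baz both
occur, even though the three words never appear together as a phrase.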