[jira] [Updated] (SOLR-11641) {!frange} / FunctionRangeQuery should default to 100==getCost() so non-cached fq's default to post-filtering

Hoss Man (JIRA) Mon, 13 Nov 2017 10:43:52 -0800

     [ https://issues.apache.org/jira/browse/SOLR-11641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hoss Man updated SOLR-11641:
----------------------------
    Attachment: SOLR-11641.patch

AFAICT regarding FunctionRangeQuery:

* if cache==true then it has to check every doc to build the DocSet anyway and 
FunctionRangeQuery is as efficient as it can be 
* if cache==false:
** if cost >= 100: the FunctionRangeQuery is used as a post filter and only 
ever asked to compute the function for docs that are already confirmed 
candidates by other clauses
** if cost < 100: the FunctionRangeQuery is used as a regular (Conjunction) 
FILTER clause, and in order to "skip to" its first/next matching document it may 
have to unnecessarily compute the function for many (non-candidate) documents 
that could have been quickly excluded by other clauses if they had been 
consulted first.
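
The three code paths listed above can be summarized in a small sketch. This is Python used purely for illustration (Solr itself is Java); the function name and the returned labels are hypothetical and are not Solr identifiers. Only the {{cache}} flag and the {{cost >= 100}} threshold come from the description above.

```python
# Threshold taken from the description above: cost >= 100 means post filter.
POST_FILTER_MIN_COST = 100


def frange_execution_path(cache: bool, cost: int) -> str:
    """Return a label for the path an frange fq would take (illustrative only)."""
    if cache:
        # Cached: every doc is checked once to build the DocSet.
        return "docset"
    if cost >= POST_FILTER_MIN_COST:
        # Non-cached and expensive: only asked about confirmed candidates.
        return "post_filter"
    # Non-cached and cheap: a regular clause in the Conjunction.
    return "conjunction_filter"
```

Under this sketch, a non-cached frange with the current default of cost=0 lands on the "conjunction_filter" path, which is the case the rest of this comment is concerned with.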

Consider two pathological situations:

* A:  the main query matches all docs, but the filter matches nothing, 
example...
** {noformat}
q=*:*&fq={!frange cache=false cost=$x u=-1}abs(some_complex_func(...))
{noformat}

* B: the main query matches exactly one doc, but the filter matches everything, 
example...
** {noformat}
q=id:0&fq={!frange cache=false cost=$x l=-1}abs(some_complex_func(...))
{noformat}

In #A, regardless of the {{x}} value (determining whether the frange is used as 
a PostFilter, or as a Conjunction FILTER) the {{abs(some_complex_func(...))}} 
function will be computed for every document:

* as a FILTER:
** the q clause is checked first, and matches docId#0
** then the frange is asked to {{advance(0)}} -- and when it confirms docId#0 
doesn't match it will start looping over all docs -- executing the function -- 
looking for the 'next' match after that
* as a PostFilter:
** the q clause is checked first
** for each doc it matches (all of them) the PostFilter is checked -- executing 
the function.

In the case of #B, the number of documents that need to have 
{{abs(some_complex_func(...))}} computed will be drastically different based on 
the code path used (ie: the value of {{x}}):

* as a FILTER:
** the q clause is checked first, and matches docId#0
** then the frange is asked to {{advance(0)}} -- and when it confirms docId#0 
doesn't match it will start looping over all docs -- executing the function -- 
looking for the 'next' match after that
* as a PostFilter:
** the q clause is checked first
** for each doc it matches (only docId#0) the PostFilter is checked -- 
executing the function


Regardless of how bad the pathological case -- or in which direction it's 
pathological -- it still seems much more efficient to run all 
FunctionRangeQuery instances as PostFilters by default.


AFAICT: even in a "50/50" type situation, where a FunctionRangeQuery matches a 
random 50% of the documents, and the other q/fq clauses match a different 
(independently) random 50% of the documents, it still seems much more efficient 
to always execute the FunctionRangeQuery as a PostFilter, because 
fundamentally the FunctionRangeQuery can't provide any efficiencies/savings in 
{{advance(docId)}} to identify candidate matches. Any time it's "behind" the 
other clauses in the Conjunction, there's no point in asking it to "advance" to 
its next match -- we should just ask it "do you match X?" and if not, let the 
other (likely more optimized) clauses "skip ahead" to their next match, rather 
than asking the FunctionRangeQuery to try.
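
To make the argument above concrete, here is a toy model that counts function evaluations on each path. It is a sketch under stated assumptions, not Lucene's actual conjunction code: the names {{conjunction_evals}}, {{post_filter_evals}} and {{fifty_fifty}} are hypothetical, the "other" clause is modeled as a sorted list of doc ids that can skip for free, and the frange-like clause can only advance by evaluating the function on each doc in turn.

```python
import bisect
import random


def conjunction_evals(num_docs, other_docs, func_matches):
    """Leapfrog a cheap clause against an frange-like clause.

    other_docs: sorted doc ids matched by the cheap clause (it can skip).
    func_matches: predicate standing in for the function range check.
    Returns (hits, number of function evaluations).
    """
    evals = 0
    hits = []
    pos = 0          # index of the cheap clause's current doc
    frange_doc = -1  # doc the frange-like clause is positioned on
    while pos < len(other_docs):
        target = other_docs[pos]
        if frange_doc < target:
            # advance(target): no skipping, test every doc in turn
            doc = target
            while doc < num_docs:
                evals += 1
                if func_matches(doc):
                    break
                doc += 1
            frange_doc = doc
        if frange_doc >= num_docs:
            break  # the frange-like clause is exhausted
        if frange_doc == target:
            hits.append(target)
            pos += 1
        else:
            # the cheap clause skips ahead without any function calls
            pos = bisect.bisect_left(other_docs, frange_doc, pos)
    return hits, evals


def post_filter_evals(other_docs, func_matches):
    """Ask "do you match X?" only for docs the cheap clause already matched."""
    evals = 0
    hits = []
    for doc in other_docs:
        evals += 1
        if func_matches(doc):
            hits.append(doc)
    return hits, evals


def fifty_fifty(num_docs, seed=0):
    """Two independent random ~50% subsets, as in the scenario above."""
    rng = random.Random(seed)
    other_docs = [d for d in range(num_docs) if rng.random() < 0.5]
    matching = {d for d in range(num_docs) if rng.random() < 0.5}
    return other_docs, matching.__contains__
```

In this model both paths return the same hits, and the post-filter path never evaluates the function more often than the conjunction path. With a single candidate doc and a function that matches nothing, {{conjunction_evals(1000, [0], lambda d: False)}} reports 1000 evaluations, while {{post_filter_evals([0], lambda d: False)}} reports 1.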

----

I'm attaching a trivial, untested patch -- hopefully this should fail some 
caching based tests, and some query equality checks, plus there should be new 
tests that it optimizes the cases we expect to optimize -- but this shows the 
bare bones of what I'm suggesting we change.

> {!frange} / FunctionRangeQuery should default to 100==getCost() so non-cached 
> fq's default to post-filtering
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11641
>                 URL: https://issues.apache.org/jira/browse/SOLR-11641
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: query parsers
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>         Attachments: SOLR-11641.patch
>
>
> While reviewing the code paths that can result in the execution of an 'fq', I 
> realized that executing an {{'{!frange cache=false ... \}'}} query (with 
> default 'cost=0' localparam) that matches "very few" documents (compared to 
> the other q/fq clauses) can result in a pathological "bad" case situation 
> where the function is computed unnecessarily for lots of documents in order 
> for the Scorer to satisfy the {{advance(int)}} API of returning the "next" 
> matching document -- making that situation benefit from using the post-filter 
> code path just as much as if the {{'{!frange\}'}} matches "very many" 
> documents (compared to the other q/fq clauses).
> In other words:  because FunctionRangeQuery has no ability to effectively 
> "skip ahead" to the next match, there is no advantage (that I can see) in 
> executing a FunctionRangeQuery as a "regular" Filter in a Conjunction with the 
> other query clauses.
> I think we should change the default behavior of {{'{!frange\}'}} so that the 
> effective default is {{cost==100}}, so that _when a user specifies cache==false_ 
> they run as post filters.



