[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Chris M. Hostetter (Jira) Thu, 11 Jun 2020 17:14:24 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133801#comment-17133801
 ]


Chris M. Hostetter commented on SOLR-13132:
-------------------------------------------

Hey Michael, I still haven't had a chance to dig into the recent commits, 
hopefully i can do that tomorrow, but in response to some of your comments 
here...
----
bq ... To support such a case, a shim is required because the code paths that 
do the actual count accumulation (in ByArrayUIF and ByArrayDV) used to directly 
increment processor.countAcc, and have now been switched to register counts via 
the SweepDocIterator and SweepDISI abstractions, ...

Right right right ... I'm really sorry, i keep forgetting: the changes in this 
issue to "support" sweeping as a concept affect the the low level impls of 
ByArrayUIF & ByArrayDV such that now they _only_ work by "sweeping" over the 
set defined by the SweepingCountSlotAcc – the only question (at run time) is 
whether that set comes from *just* the "base set" or if there are any other 
sets (provided by other CountSlotAccs, in turn provided by other 
SweepableSlotAcc) that they sweep over at the same time

So, with that in mind: please ignore/retract my earlier comments about being 
concerned about subclasses that don't want to sweep....
 * If it simplifies the code, we can certainly assume/assert that any/all 
future hypothetical ByArrayUIF & ByArrayDV subclasses *must* support sweeping & 
use a SweepingCountSlotAcc
 ** provided we make sure to spell that out in the javadocs.
 * It would be _nice_ to keep the changes to FacetFieldProcessorByArray to a 
minimum, and say "FacetFieldProcessorByArray will/should not assume all 
subclasses can sweep" – but there's no reason we _have_ to....
 ** *_If_* the code would be a lot simpler to say "all current 
FacetFieldProcessorByArray subclasses use sweeping, so we're going to document 
that from now on any additional future subclasses of FacetFieldProcessorByArray 
use sweeping and assert/assume that in the common FacetFieldProcessorByArray 
code" then we can certainly do that
 *** ie: would that allow us to remove the Shim?
 *** ie: would it allow us to refactor/merge the Shim impl into the 
DEV_NULL_COUNT_ACC impl? (IIRC the refinement code path that uses the DEV_NULL 
countAcc is in FacetFieldProcessorByArray ... correct?)

So my question to you is: What do you think? Do you think there are 
simplification gains to be had if we add assertions & assumptions about these 
classes always using SweepingCountSlotAcc? 

 
----
{quote}I've thought a bit more about the question of how to detect the 
allBuckets slot for disabling allBuckets relatedness: I don't really have any 
good answers, but a handful of thoughts: ...
{quote} # i don't think adding SlotContext to the setValues(...) API would work 
in general because in practice there's no guarantee Processors will have a 
valid SlotContext at that point in time (i'm thinking of per-segment DVs that 
use the soltNum as an ord lookup, or TermEnum that just returns the "current" 
term)
 # i do think the "papa-bear" approach would work well in the long run (both in 
terms of being a clean/consistent API and being useful for this particular 
problem), but i'm still not convinced it's worth the hassle at this point since 
we really only have this one useage where it matters
 # considering how much of an "edge case on an edge case" we're talking about, 
you're current "hack" is growing on me, provided we add some more conditional 
logic to protect against the possibility of a ClassCastEx if anyone ever adds 
"sweep" support to some other processor (ie: FacetQueryProcessor)

...either way: I'd suggest we punt for now and worry about all the other 
nocommits before worrying about the the "which slot is allBuckets when 
sweeping?" nocommit.

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Reply via email to