[ https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133801#comment-17133801 ]
Chris M. Hostetter commented on SOLR-13132: ------------------------------------------- Hey Michael, I still haven't had a chance to dig into the recent commits, hopefully i can do that tomorrow, but in response to some of your comments here... ---- bq ... To support such a case, a shim is required because the code paths that do the actual count accumulation (in ByArrayUIF and ByArrayDV) used to directly increment processor.countAcc, and have now been switched to register counts via the SweepDocIterator and SweepDISI abstractions, ... Right right right ... I'm really sorry, i keep forgetting: the changes in this issue to "support" sweeping as a concept affect the the low level impls of ByArrayUIF & ByArrayDV such that now they _only_ work by "sweeping" over the set defined by the SweepingCountSlotAcc – the only question (at run time) is whether that set comes from *just* the "base set" or if there are any other sets (provided by other CountSlotAccs, in turn provided by other SweepableSlotAcc) that they sweep over at the same time So, with that in mind: please ignore/retract my earlier comments about being concerned about subclasses that don't want to sweep.... * If it simplifies the code, we can certainly assume/assert that any/all future hypothetical ByArrayUIF & ByArrayDV subclasses *must* support sweeping & use a SweepingCountSlotAcc ** provided we make sure to spell that out in the javadocs. * It would be _nice_ to keep the changes to FacetFieldProcessorByArray to a minimum, and say "FacetFieldProcessorByArray will/should not assume all subclasses can sweep" – but there's no reason we _have_ to.... ** *_If_* the code would be a lot simpler to say "all current FacetFieldProcessorByArray subclasses use sweeping, so we're going to document that from now on any additional future subclasses of FacetFieldProcessorByArray use sweeping and assert/assume that in the common FacetFieldProcessorByArray code" then we can certainly do that *** ie: would that allow us to remove the Shim? *** ie: would it allow us to refactor/merge the Shim impl into the DEV_NULL_COUNT_ACC impl? (IIRC the refinement code path that uses the DEV_NULL countAcc is in FacetFieldProcessorByArray ... correct?) So my question to you is: What do you think? Do you think there are simplification gains to be had if we add assertions & assumptions about these classes always using SweepingCountSlotAcc? ---- {quote}I've thought a bit more about the question of how to detect the allBuckets slot for disabling allBuckets relatedness: I don't really have any good answers, but a handful of thoughts: ... {quote} # i don't think adding SlotContext to the setValues(...) API would work in general because in practice there's no guarantee Processors will have a valid SlotContext at that point in time (i'm thinking of per-segment DVs that use the soltNum as an ord lookup, or TermEnum that just returns the "current" term) # i do think the "papa-bear" approach would work well in the long run (both in terms of being a clean/consistent API and being useful for this particular problem), but i'm still not convinced it's worth the hassle at this point since we really only have this one useage where it matters # considering how much of an "edge case on an edge case" we're talking about, you're current "hack" is growing on me, provided we add some more conditional logic to protect against the possibility of a ClassCastEx if anyone ever adds "sweep" support to some other processor (ie: FacetQueryProcessor) ...either way: I'd suggest we punt for now and worry about all the other nocommits before worrying about the the "which slot is allBuckets when sweeping?" nocommit. > Improve JSON "terms" facet performance when sorted by relatedness > ------------------------------------------------------------------ > > Key: SOLR-13132 > URL: https://issues.apache.org/jira/browse/SOLR-13132 > Project: Solr > Issue Type: Improvement > Components: Facet Module > Affects Versions: 7.4, master (9.0) > Reporter: Michael Gibney > Priority: Major > Attachments: SOLR-13132-with-cache-01.patch, > SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate > {{relatedness}} for every term. > The current implementation uses a standard uninverted approach (either > {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain > base docSet, and then uses that initial pass as a pre-filter for a > second-pass, inverted approach of fetching docSets for each relevant term > (i.e., {{count > minCount}}?) and calculating intersection size of those sets > with the domain base docSet. > Over high-cardinality fields, the overhead of per-term docSet creation and > set intersection operations increases request latency to the point where > relatedness sort may not be usable in practice (for my use case, even after > applying the patch for SOLR-13108, for a field with ~220k unique terms per > core, QTime for high-cardinality domain docSets were, e.g.: cardinality > 1816684=9000ms, cardinality 5032902=18000ms). > The attached patch brings the above example QTimes down to a manageable > ~300ms and ~250ms respectively. The approach calculates uninverted facet > counts over domain base, foreground, and background docSets in parallel in a > single pass. This allows us to take advantage of the efficiencies built into > the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids > the per-term docSet creation and set intersection overhead. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org