[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Chris M. Hostetter (Jira) Fri, 01 May 2020 15:56:11 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097725#comment-17097725
 ]


Chris M. Hostetter commented on SOLR-13132:
-------------------------------------------

{quote}... If you don't think it's a concern then it's no concern to me (other 
than that preserving order seems achievable here..
{quote}
Since it's already impossible to ensure keys maintain consistent order as the 
processor/options change, Let's deliberately not concern ourselves with 
maintaining them if/when sweep is used in the interest in keeping the code 
simpler.

(If/When someone cares about key order consistency down hte road, they'll have 
to tackle it at a much higher level then just sweeping – and in theory can do 
it after the buckets have been populated by re-ordering the key/val pairs)
{quote}...what would you think about modifying the method signature of 
SlotAcc.setValues(SimpleOrderedMap<Object> bucket, int slotNum) to return 
boolean instead of void, in the spirit of the setValues(...) method recently 
added to the SweepingAcc class? Or maybe less invasive, adding an analogous 
method to CountSlotAcc that returns boolean? ...
{quote}
Given that we seem to agree moving towards {{SweepingCountSlotAcc extends 
CountSlotArrAcc}} as a replacement for {{SweepingAcc}} then id prefer we avoid 
changing the {{CountSlotAcc}} API at all.
{quote}{quote}unless there's another use for CollectSlotAccMappingAware i'm 
overlooking?
{quote}
The main/initial purpose of this was to provide an interface for notifying 
FacetProcessors when SlotAccs had been swapped out. ...
{quote}
Ugh, sorry ... that was a half thought out idea that i phrased in the form of a 
bad, and overly brief, question.

I understand the current usage of FacetFieldProcessorSlotAccMapper (as a 
concrete implementation of CollectSlotAccMappingAware) that handles remapping 
of the sortAcc – what i was mulling was...
 * Assuming:
 ** we refactor CountSlotAcc's sweeping logic + SweepingAcc into a new 
{{SweepingCountSlotAcc}}
 ** you have no other purpose/usecase in mind for CollectSlotAccMappingAware as 
an API/abstraction
 * Then:
 ** {{SweepingCountSlotAcc}} 's constructor could take in the 
{{FacetFieldProcessor}} ('this') directly
 ** {{SweepingCountSlotAcc}} 's version of {{registerMapping()}} could directly 
manipulate {{processor.sortAcc}} when/if it gets mapped

...but as i said: that idea was half thought out, it's a refactoring to 
consider down the road depending on how some of hte other more significant API 
discussions we've been having shake out. Don't worry about it.
{quote}... but I agree and I think your suggestion of "Perhaps a single 
initSweeping(CollectSlotAccMappingAware notify) for use by processors, that 
fails if called more then once? All other code paths use getBaseSweepingAcc()" 
is a good way forward. I was waiting to address that until after discussion of 
these other questions.
{quote}
But we shouldn't actually need either of those methods anymore if we go the 
{{SweepingCountSlotAcc}} route ... correct?
 * {{countAcc.initSweeping(notify)}} can now just be {{countAcc = new 
SweepingCountSlotAcc(...)}}
 * {{countAcc.getBaseSweepingAcc()}} can now just be:
 ** {{(SweepingCountSlotAcc)countAcc}} when passed as an arg to  
{{SweepableAcc.registerSweepingAccs(...)}} or when doing the actual sweeping
 ** {{countAcc}} when the purpose is to call {{setValues}} (from 
FacetFieldProcessor, which should no longer need to have any idea that sweeping 
is even a concept)

...right?
{quote}This seems reasonable. If you don't mind, I can take a crack at seeing 
what this might look like. I'm particularly interested in doing this in a way 
that keeps the "sweeping" stuff modular (under the hood, at least), since it's 
possible that other concrete CountSlotAcc subclasses (whose core "count" 
functionality is substantially different) could still use the exact same code 
for coordinating sweeping. (I'm thinking particularly about cached counts ... 
definitely conscious of not having that bleed into this issue, just trying to 
be transparent!).
{quote}
Sure, go for it – but i would urge you not to go too overboard on modularity at 
this point: if all the details are nicely encapsulated inside a single new 
{{SweepingCountSlotAcc}} we can always refactor them out into more modular 
classes later if/when there is a need for that modularity.

In the same spirit of being transparent: I'm trying to keep the number of code 
paths touched by these changes (and the number of new APIS/abstractions/etc...) 
at a minimum in order to increase confidence in the correctness and reduce the 
likelihood of introducing bugs or unforseen side effects. (I'd rather focus on 
releasing a bicycle soon, and later have a dicussion about refactoring it into 
a car; then stall on releasing a bicycle while we spend a lot of time 
pre-planning for a future car)

Also: we should update your branch to merge in the current apache/master ... 
there's at least one (small) conflict i know of in the faceting code related to 
standardizing on using "long" values for count that will probably be easier to 
resolve sooner then later, and I'd like to get the fix for the changes that bit 
me in SOLR-14452 so that i can be more confident in beasting tests as we 
refactor things. ... if you don't object (or haven't already merged master in) 
by the next time i look at your branch (hopefully monday) i'll merge & push to 
your branch.

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Reply via email to