[
https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated SOLR-9142:
-------------------------------
Attachment: SOLR_9412_FacetFieldProcessorByHashDV.patch
Updated patch to fix a bug:
While running tests I noticed an odd failure in TestRandomDVFaceting. It
doesn't explicitly use the JSON Facet API, but it does set facet.method=uif,
and it turns out SimpleFacets.java calls out to the JSON Facet API to do that.
Wow; you learn something new every day, as they say. Of course, setting the
method doesn't necessarily mean that UIF will be used, and for a single-valued
number (the score_f field, which is a float) it certainly won't be; it uses
this hash method instead. TestRandomDVFaceting is an awesome test, very
thorough. It tickled a bug in my refactoring/consolidation of findTopSlots
that occurs when there are more collected values than the top-X you want, the
sort is by count, and it falls back on index order to tie-break equal counts.
I fixed it by simplifying the PriorityQueue to a plain PriorityQueue<Integer>
instead of one holding a Slot int wrapper, and thus removed Slot altogether.
The former code was re-using Slots, but in order to do that it needed to
invoke the ordering predicate with a primitive int. The refactored version is
a bit more generic, and it'd be annoying to reuse the same predicate with the
old Slot code; I'd need to add some interface taking the primitive ints. I'm
not sure how much perf benefit the Slot reuse gave, so I'm going with code
that's easier to maintain.
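For readers who want to picture the tie-break case, here is a minimal sketch
of a top-X selection over slot numbers held as boxed Integers. It is not the
code from the patch: the class name TopSlotsSketch, the method topSlots, and
the counts array are all hypothetical, and java.util.PriorityQueue is used
purely for illustration. It assumes "sort by count" means highest count first,
with the lower slot number winning a tie.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; names and structure are assumptions.
class TopSlotsSketch {

    /** Returns up to 'limit' slot numbers, best first. */
    static List<Integer> topSlots(int[] counts, int limit) {
        // "Best first" ordering: higher count wins, and equal counts
        // fall back on index order (the lower slot number wins).
        Comparator<Integer> best = (a, b) -> {
            int c = Integer.compare(counts[b], counts[a]);
            return c != 0 ? c : Integer.compare(a, b);
        };

        // Reversed ordering puts the worst retained slot at the head,
        // so it is the one evicted when a better slot arrives.
        PriorityQueue<Integer> queue = new PriorityQueue<>(best.reversed());

        for (int slot = 0; slot < counts.length; slot++) {
            if (queue.size() < limit) {
                queue.add(slot);
            } else if (limit > 0 && best.compare(slot, queue.peek()) < 0) {
                queue.poll();
                queue.add(slot);
            }
        }

        // Heap iteration order is unspecified, so sort the survivors.
        List<Integer> result = new ArrayList<>(queue);
        result.sort(best);
        return result;
    }

    public static void main(String[] args) {
        // More collected values (5) than the top-X wanted (3), with ties.
        System.out.println(topSlots(new int[] {5, 3, 5, 1, 3}, 3));
    }
}
```

The same comparator drives both the eviction test and the final ordering,
which is the property that matters when counts tie: there is only one
definition of "better" to keep consistent.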
I'll commit later today.
> JSON Facet, add hash table method for terms
> -------------------------------------------
>
> Key: SOLR-9142
> URL: https://issues.apache.org/jira/browse/SOLR-9142
> Project: Solr
> Issue Type: Improvement
> Components: Facet Module
> Reporter: Varun Thacker
> Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: SOLR_9412_FacetFieldProcessorByHashDV.patch,
> SOLR_9412_FacetFieldProcessorByHashDV.patch,
> SOLR_9412_FacetFieldProcessorByHashDV.patch,
> SOLR_9412_FacetFieldProcessorByHashDV.patch
>
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}}, the top-level facet, has a cardinality of 1000.
> For nested facets there are two fields, {{sub_facet_unique_s}} and
> {{sub_facet_unique_td}}, which are string and double respectively and each
> have a cardinality of 2M.
> The nested query for the double field consistently returns in about 1s. The
> nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> {
>   "top_facet_s": {
>     "type": "terms",
>     "limit": -1,
>     "field": "top_facet_s",
>     "mincount": 1,
>     "excludeTags": "ANY",
>     "facet": {
>       "sub_facet_unique_s": {
>         "type": "terms",
>         "limit": 1,
>         "field": "sub_facet_unique_s",
>         "mincount": 1
>       }
>     }
>   }
> }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> {
>   "top_facet_s": {
>     "type": "terms",
>     "limit": -1,
>     "field": "top_facet_s",
>     "mincount": 1,
>     "excludeTags": "ANY",
>     "facet": {
>       "sub_facet_unique_s": {
>         "type": "terms",
>         "limit": 1,
>         "field": "sub_facet_unique_td",
>         "mincount": 1
>       }
>     }
>   }
> }
> {code}
> I tried to dig deeper to understand why nested faceting on a string field is
> so slow compared to a numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets
> for each of those buckets. The key difference is in the implementation of the
> two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call
> {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an
> array of 2M. So we create a 2M array 1000 times for this one query, which from
> what I understand makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a
> CountSlotAcc which doesn't allocate a huge array. In this query it calls
> {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and
> we use the array position as the ordinal and the value as the count. If we
> could improve on this, it should speed things up significantly. For
> sub-facets, we know the cardinality can be at most the top-level bucket count.
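
To make the allocation difference described above concrete, here is a rough
sketch contrasting a dense per-ordinal count array with a hash-based counter
sized by the documents in one bucket. It is not Solr's accumulator code: the
class SubFacetCountSketch, both methods, and the docOrds input (one ordinal per
matching document) are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Solr implementation.
class SubFacetCountSketch {

    /**
     * Dense counting: array position is the ordinal, value is the count.
     * The array is as large as the field's cardinality, no matter how few
     * documents fall in the parent bucket.
     */
    static int[] countDense(int[] docOrds, int fieldCardinality) {
        int[] counts = new int[fieldCardinality];
        for (int ord : docOrds) {
            counts[ord]++;
        }
        return counts;
    }

    /**
     * Hash counting: memory grows with the number of distinct ordinals
     * actually seen in the bucket, which is at most docOrds.length.
     */
    static Map<Integer, Integer> countSparse(int[] docOrds) {
        Map<Integer, Integer> counts =
            new HashMap<>(Math.max(16, docOrds.length * 2));
        for (int ord : docOrds) {
            counts.merge(ord, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] docOrds = {7, 1_999_999, 7}; // three docs in one parent bucket
        System.out.println(countDense(docOrds, 2_000_000).length);
        System.out.println(countSparse(docOrds).size());
    }
}
```

With 1000 parent buckets, the dense variant would allocate the full-size array
once per bucket, while the hash variant's footprint tracks each bucket's
document count.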
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)