Varun Thacker created SOLR-9142:
-----------------------------------

             Summary: Improve JSON nested facets efficiency
                 Key: SOLR-9142
                 URL: https://issues.apache.org/jira/browse/SOLR-9142
             Project: Solr
          Issue Type: Bug
            Reporter: Varun Thacker


I indexed a dataset of 2M docs.

{{top_facet_s}} is the top-level facet field and has a cardinality of 1000.
The nested facets use two fields, {{sub_facet_unique_s}} (string) and 
{{sub_facet_unique_td}} (double), each with a cardinality of 2M.


The nested query on the double field consistently returns in about 1s. The 
nested query on the string field takes roughly 10s to execute.

{code:title=nested string facet|borderStyle=solid}
q=*:*&rows=0&json.facet=
        {
                "top_facet_s": {
                        "type": "terms",
                        "limit": -1,
                        "field": "top_facet_s",
                        "mincount": 1,
                        "excludeTags": "ANY",
                        "facet": {
                                "sub_facet_unique_s": {
                                        "type": "terms",
                                        "limit": 1,
                                        "field": "sub_facet_unique_s",
                                        "mincount": 1
                                }
                        }
                }
        }
{code}

{code:title=nested double facet|borderStyle=solid}
q=*:*&rows=0&json.facet=
        {
                "top_facet_s": {
                        "type": "terms",
                        "limit": -1,
                        "field": "top_facet_s",
                        "mincount": 1,
                        "excludeTags": "ANY",
                        "facet": {
                                "sub_facet_unique_s": {
                                        "type": "terms",
                                        "limit": 1,
                                        "field": "sub_facet_unique_td",
                                        "mincount": 1
                                }
                        }
                }
        }
{code}

I tried to dig deeper to understand why nested faceting on a string field is 
so much slower than on a numeric field.

Since the top facet has a cardinality of 1000, we have to calculate sub-facets 
for each of its 1000 buckets. The key difference is in how the two cases are 
implemented.

For the string field, in {{FacetField#getFieldCacheCounts}} we call 
{{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an 
array of 2M slots. So we create a 2M-slot array 1000 times for this one query, 
which, from what I understand, is what makes this query slow.
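
To put the allocation volume in rough numbers, here is a back-of-the-envelope 
sketch. It assumes each accumulator is backed by an {{int[]}} with one 4-byte 
slot per ordinal and ignores array headers; the class and method names are 
hypothetical and are not part of Solr.

```java
class AllocationEstimate {
    // Total bytes allocated if every top-level bucket gets its own
    // int[numSlots] array (assumption: 4-byte int slots, headers ignored).
    static long totalBytes(long numBuckets, long numSlots) {
        return numBuckets * numSlots * Integer.BYTES;
    }

    public static void main(String[] args) {
        // 1000 top-level buckets, 2M slots each, as in the query above.
        long bytes = totalBytes(1_000, 2_000_000);
        System.out.println(bytes + " bytes allocated across all buckets");
    }
}
```

Under those assumptions the query allocates and zeroes on the order of 8 GB of 
short-lived arrays, even though each bucket only contains about 2k docs.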

For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a 
CountSlotAcc that doesn't allocate a huge array. In this query it calls 
{{createCollectAcc}} with numDocs=2k and numSlots=1024.

In string faceting, we create the 2M array because the cardinality is 2M and we 
use the array position as the ordinal and the value as the count. If we could 
improve on this, I think it would speed things up significantly.
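
As an illustration of the trade-off only (this is not Solr's implementation, 
and all names below are hypothetical), the following sketch contrasts a dense 
ordinal-indexed count array with a sparse map keyed by ordinal. The dense 
version sizes its array by the field's cardinality; the sparse version only 
holds entries for ordinals that actually occur in the bucket.

```java
import java.util.HashMap;
import java.util.Map;

class FacetCountSketch {
    // Dense counting: array index is the ordinal, value is the count.
    // Memory is proportional to the field cardinality, not the bucket size.
    static int[] countDense(int[] bucketOrds, int fieldCardinality) {
        int[] counts = new int[fieldCardinality];
        for (int ord : bucketOrds) {
            counts[ord]++;
        }
        return counts;
    }

    // Sparse counting: memory is proportional to the number of distinct
    // ordinals seen in this bucket.
    static Map<Integer, Integer> countSparse(int[] bucketOrds) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int ord : bucketOrds) {
            counts.merge(ord, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A tiny bucket drawn from a field with 2M distinct ordinals.
        int[] bucketOrds = {7, 1_999_999, 7, 42};
        int[] dense = countDense(bucketOrds, 2_000_000);
        Map<Integer, Integer> sparse = countSparse(bucketOrds);
        System.out.println(dense.length + " dense slots vs "
                + sparse.size() + " sparse entries");
    }
}
```

A hash map has a higher per-entry cost than an array slot, so a sparse 
structure would presumably only pay off when the bucket is small relative to 
the field cardinality, as it is in the query above.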



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
