[ 
https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453384#comment-15453384
 ] 

ASF subversion and git services commented on SOLR-9142:
-------------------------------------------------------

Commit 820ba9f868968e1deadbca06168d749dd728a51e in lucene-solr's branch 
refs/heads/branch_6x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=820ba9f ]

SOLR-9142: json.facet: new method=dvhash which works on terms. Also:
(1) method=stream now requires you set sort=index asc to work
(2) faceting on numerics with prefix or mincount=0 will give you an error
(3) refactored similar findTopSlots into one common one in FacetFieldProcessor
(4) new DocSet.collectSortedDocSet utility
(cherry picked from commit 7b5df8a)


> JSON Facet, add hash table method for terms
> -------------------------------------------
>
>                 Key: SOLR-9142
>                 URL: https://issues.apache.org/jira/browse/SOLR-9142
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>            Reporter: Varun Thacker
>            Assignee: David Smiley
>             Fix For: 6.3
>
>         Attachments: SOLR_9412_FacetFieldProcessorByHashDV.patch, 
> SOLR_9412_FacetFieldProcessorByHashDV.patch, 
> SOLR_9412_FacetFieldProcessorByHashDV.patch, 
> SOLR_9412_FacetFieldProcessorByHashDV.patch, 
> SOLR_9412_FacetFieldProcessorByHashDV.patch
>
>
> I indexed a dataset of 2M docs
> {{top_facet_s}} has a cardinality of 1000 which is the top level facet.
> For nested facets it has two fields {{sub_facet_unique_s}} and 
> {{sub_facet_unique_td}} which are string and double and have cardinality 2M
> The nested query for the double field returns in the 1s mark always. The 
> nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_s",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_td",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> I tried to dig deeper to understand why are string nested faceting that slow 
> compared to numeric field
> Since the top facet has a cardinality of 1000 we have to calculate sub facets 
> on each of them. Now the key difference was in the implementation of the two .
> For the string field, In {{FacetField#getFieldCacheCounts}} we call 
> {{createCollectAcc}} with nDocs=0 and numSlots=2M . This then initializes an 
> array of 2M. So we create a 2M array 1000 times for this one query which from 
> what I understand makes this query slow.
> For numeric fields {{FacetFieldProcessorNumeric#calcFacets}} uses a 
> CountSlotAcc which doesn't assign a huge array. In this query it calls 
> {{createCollectAcc}} with numDocs=2k and numSlots=1024 .
> In string faceting, we create the 2M array because the cardinality is 2M and 
> we use the array position as the ordinal and value as the count. If we could 
> improve on this it would speed things up significantly? For sub-facets we 
> know the maximum cardinality can be at max the top level bucket count.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to