[jira] [Comment Edited] (SOLR-9142) Improve JSON nested facets effeciency

Joel Bernstein (JIRA) Mon, 23 May 2016 03:34:43 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296206#comment-15296206
 ]


Joel Bernstein edited comment on SOLR-9142 at 5/23/16 10:33 AM:
----------------------------------------------------------------

Just curious if you've tried method:stream?

I had some brief conversations with [[email protected]] about this, and it 
seems like this would be more efficient for high cardinality faceting. I'm not 
sure if this is supported in distributed mode, but I was planning to change the 
FacetStream to use method:stream and then handle the merge within the 
FacetStream itself.

Currently the guidance with Streaming Expressions is to use the RollupStream 
which relies on MapReduce shuffling for high cardinality faceting. But it would 
be great if we could have performant high cardinality faceting through the JSON 
facet API.
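
The merge step mentioned above can be sketched as a k-way merge over per-shard bucket lists. This is only an illustration, not the FacetStream code: it assumes each shard returns its buckets already sorted by term, and the names `SortedBucketMerge`, `Bucket`, and `merge` are hypothetical. It also compares terms with `String.compareTo`, which may not match the index's byte order for every value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: merge term-sorted facet buckets from several shards.
public class SortedBucketMerge {

    /** One facet bucket as reported by a single shard. */
    public static final class Bucket {
        public final String term;
        public final long count;

        public Bucket(String term, long count) {
            this.term = term;
            this.count = count;
        }
    }

    /** Tracks the current position in one shard's sorted bucket list. */
    private static final class Cursor {
        Bucket head;
        final Iterator<Bucket> rest;

        Cursor(Iterator<Bucket> it) {
            this.rest = it;
            this.head = it.next();
        }

        boolean advance() {
            if (rest.hasNext()) {
                head = rest.next();
                return true;
            }
            return false;
        }
    }

    /** Merges term-sorted shard responses, summing counts for equal terms. */
    public static List<Bucket> merge(List<List<Bucket>> shards) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> a.head.term.compareTo(b.head.term));
        for (List<Bucket> shard : shards) {
            Iterator<Bucket> it = shard.iterator();
            if (it.hasNext()) {
                queue.add(new Cursor(it));
            }
        }

        List<Bucket> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor first = queue.poll();
            String term = first.head.term;
            long total = first.head.count;
            if (first.advance()) {
                queue.add(first);
            }
            // Fold in the same term from any other shard before emitting it.
            while (!queue.isEmpty() && queue.peek().head.term.equals(term)) {
                Cursor other = queue.poll();
                total += other.head.count;
                if (other.advance()) {
                    queue.add(other);
                }
            }
            merged.add(new Bucket(term, total));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Bucket> shard1 = Arrays.asList(new Bucket("a", 2), new Bucket("c", 1));
        List<Bucket> shard2 = Arrays.asList(new Bucket("a", 3), new Bucket("b", 4));
        for (Bucket b : merge(Arrays.asList(shard1, shard2))) {
            System.out.println(b.term + " " + b.count);
        }
    }
}
```

Because the inputs are sorted, the merge only holds one pending bucket per shard at a time. A streaming version would emit each merged bucket as soon as it is complete instead of collecting a list.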




> Improve JSON nested facets effeciency
> -------------------------------------
>
>                 Key: SOLR-9142
>                 URL: https://issues.apache.org/jira/browse/SOLR-9142
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Varun Thacker
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}}, the top-level facet, has a cardinality of 1000.
> For nested facets there are two fields, {{sub_facet_unique_s}} (string) and 
> {{sub_facet_unique_td}} (double), each with a cardinality of 2M.
> The nested query for the double field consistently returns in about 1s. The 
> nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_s",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_td",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> I tried to dig deeper to understand why nested faceting on a string field is 
> so much slower than on a numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets 
> for each of its buckets. The key difference is in the implementation of the two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call 
> {{createCollectAcc}} with nDocs=0 and numSlots=2M. This initializes an 
> array of 2M entries. So we create a 2M array 1000 times for this one query, 
> which, from what I understand, is what makes the query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a 
> CountSlotAcc which doesn't allocate a huge array. In this query it calls 
> {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and 
> we use the array position as the ordinal and the value as the count. Could we 
> improve on this to speed things up significantly? For sub-facets we 
> know the cardinality can be at most the top-level bucket count.
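
The allocation pattern described in the issue can be illustrated with a small standalone sketch. It is not Solr code: `SubFacetCounting`, `denseCounts`, and `sparseCounts` are hypothetical names, and the hash map is just one possible alternative to an ordinal-indexed array, chosen here for brevity. The sizes in `main` (2M ordinals, 1000 top-level buckets) are taken from the description above, and the even split of 2000 docs per bucket is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: dense ordinal-indexed counting vs. sparse counting.
public class SubFacetCounting {

    /** Dense: one slot per ordinal in the field, however small the bucket is. */
    public static int[] denseCounts(int[] docOrds, int fieldCardinality) {
        int[] counts = new int[fieldCardinality]; // allocated and zeroed on every call
        for (int ord : docOrds) {
            counts[ord]++; // array position is the ordinal, value is the count
        }
        return counts;
    }

    /** Sparse: memory grows with the ordinals actually seen in the bucket. */
    public static Map<Integer, Integer> sparseCounts(int[] docOrds) {
        Map<Integer, Integer> counts = new HashMap<>(Math.max(16, docOrds.length * 2));
        for (int ord : docOrds) {
            counts.merge(ord, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int fieldCardinality = 2_000_000; // unique values in the sub-facet field
        int topBuckets = 1_000;           // cardinality of the top-level facet
        int docsPerBucket = fieldCardinality / topBuckets; // assumed even split

        long denseSlots = 0;
        long sparseEntries = 0;
        for (int b = 0; b < topBuckets; b++) {
            // Each doc has its own unique ordinal, as in the 2M-cardinality field.
            int[] ords = new int[docsPerBucket];
            for (int i = 0; i < docsPerBucket; i++) {
                ords[i] = b * docsPerBucket + i;
            }
            denseSlots += denseCounts(ords, fieldCardinality).length;
            sparseEntries += sparseCounts(ords).size();
        }
        System.out.println("dense slots allocated:  " + denseSlots);
        System.out.println("sparse entries created: " + sparseEntries);
    }
}
```

With these sizes the dense approach allocates 1000 arrays of 2M slots each, while the sparse one never holds more entries than the bucket has documents, which is the bound the last sentence of the description points at.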



