[jira] [Updated] (SOLR-9142) JSON Facet, add hash table method for terms

David Smiley (JIRA) Wed, 31 Aug 2016 10:42:38 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Smiley updated SOLR-9142:
-------------------------------
    Attachment: SOLR_9412_FacetFieldProcessorByHashDV.patch

Updated patch to fix a bug:

While running tests I noticed an odd failure in TestRandomDVFaceting.  It 
doesn't explicitly use the JSON Facet API, but it does set facet.method=uif, 
and it turns out SimpleFacets.java calls out to the JSON Facet API to do that.  
Wow; you learn something new every day, as they say.  Of course, setting the 
method doesn't necessarily mean that UIF will be used; for a single-valued 
number (the score_f field, which is a float) it certainly won't be, and this 
hash method is used instead.  TestRandomDVFaceting is an awesome test, very 
thorough.  It tickled a bug in my refactoring/consolidation of findTopSlots 
that occurs when there are more collected values than the top-X you want, the 
sort is by count, and it falls back on index order to tie-break equal counts.

I fixed it by simplifying the queue to a PriorityQueue<Integer> instead of a 
queue of Slot objects (a wrapper around an int), and thus removed Slot 
altogether.  The former code was reusing Slots, but in order to do that it 
needed to invoke the ordering predicate with a primitive int.  The refactored 
version is a bit more generic, and it'd be annoying to reuse the same predicate 
with the old Slot code: I'd need to add some interface taking the primitive 
ints.  I'm not sure how much perf benefit the Slot reuse gives, so I'm going 
with code that's easier to maintain.
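
To illustrate the shape of the approach described above, here is a minimal 
sketch of top-X selection over slot numbers held in a PriorityQueue<Integer>, 
sorted by count with index order as the tie-break.  It is not the code from 
the patch: the class name TopSlots, the counts array, and the findTopSlots 
signature are assumptions made for the example, and java.util.PriorityQueue 
is used only to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; names and signature are not taken from the patch.
final class TopSlots {

    /** Returns up to `limit` slot numbers, best first. */
    static List<Integer> findTopSlots(int[] counts, int limit) {
        List<Integer> result = new ArrayList<>();
        if (limit <= 0) {
            return result;
        }
        // "Better" means a higher count; equal counts fall back on index order
        // (the lower slot number wins).
        Comparator<Integer> bestFirst = (a, b) -> {
            int byCount = Integer.compare(counts[b], counts[a]);
            return byCount != 0 ? byCount : Integer.compare(a, b);
        };
        // The head of the queue is the worst slot retained so far, so it is the
        // one to evict once more than `limit` candidates have been seen.
        PriorityQueue<Integer> queue = new PriorityQueue<>(bestFirst.reversed());
        for (int slot = 0; slot < counts.length; slot++) {
            if (counts[slot] == 0) {
                continue; // nothing was collected into this slot
            }
            if (queue.size() < limit) {
                queue.add(slot);
            } else if (bestFirst.compare(slot, queue.peek()) < 0) {
                queue.poll();
                queue.add(slot);
            }
        }
        result.addAll(queue);
        result.sort(bestFirst);
        return result;
    }
}
```

With counts {3, 5, 5, 1, 5} and a limit of 2, three slots tie at a count of 5, 
so the tie-break decides which two are kept.  Boxing the slot numbers as 
Integer gives up the object reuse of the Slot version in exchange for a single 
comparator.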

I'll commit later today.

> JSON Facet, add hash table method for terms
> -------------------------------------------
>
>                 Key: SOLR-9142
>                 URL: https://issues.apache.org/jira/browse/SOLR-9142
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>            Reporter: Varun Thacker
>            Assignee: David Smiley
>             Fix For: 6.3
>
>         Attachments: SOLR_9412_FacetFieldProcessorByHashDV.patch, 
> SOLR_9412_FacetFieldProcessorByHashDV.patch, 
> SOLR_9412_FacetFieldProcessorByHashDV.patch, 
> SOLR_9412_FacetFieldProcessorByHashDV.patch
>
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}} is the top-level facet and has a cardinality of 1000.
> The nested facets use two fields, {{sub_facet_unique_s}} and 
> {{sub_facet_unique_td}}, which are string and double and have cardinality 2M.
> The nested query for the double field consistently returns in about 1s. The 
> nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_s",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_td",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> I tried to dig deeper to understand why nested faceting on a string field is 
> that slow compared to a numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets 
> for each of those buckets. The key difference is in the implementation of the 
> two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call 
> {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an 
> array of 2M. So we create a 2M array 1000 times for this one query, which from 
> what I understand is what makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a 
> CountSlotAcc which doesn't allocate a huge array. In this query it calls 
> {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and 
> we use the array position as the ordinal and the value as the count. If we 
> could improve on this, it might speed things up significantly. For sub-facets 
> we know the cardinality can be at most the top-level bucket count.
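
The quoted description contrasts a count array sized by field cardinality with 
an accumulator sized by the documents in the bucket.  A minimal sketch of the 
second idea, assuming values can be represented as a long (for a double, e.g. 
via Double.doubleToLongBits), is an open-addressing hash table from value to 
count.  The class name LongCountTable and all of its methods are hypothetical 
and are not taken from Solr or from the attached patch.

```java
// Hypothetical illustration; not the Solr implementation.
final class LongCountTable {
    private long[] keys;
    private int[] counts; // a count of 0 marks an empty slot
    private int size;

    LongCountTable(int expectedValues) {
        // Power-of-two capacity, at least twice the expected number of values.
        int capacity = 16;
        while (capacity < expectedValues * 2L && capacity < (1 << 30)) {
            capacity <<= 1;
        }
        keys = new long[capacity];
        counts = new int[capacity];
    }

    /** Adds one to the count for `value`, inserting it if needed. */
    void increment(long value) {
        int slot = find(value);
        if (counts[slot] == 0) {
            keys[slot] = value;
            size++;
        }
        counts[slot]++;
        if (size * 2 > keys.length) {
            grow(); // keep the load factor at or below 0.5
        }
    }

    /** Returns the count for `value`, or 0 if it was never seen. */
    int count(long value) {
        return counts[find(value)];
    }

    int size() {
        return size;
    }

    // Linear probing: stop at the slot holding `value` or at the first empty slot.
    private int find(long value) {
        int mask = keys.length - 1;
        int slot = hash(value) & mask;
        while (counts[slot] != 0 && keys[slot] != value) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    private void grow() {
        long[] oldKeys = keys;
        int[] oldCounts = counts;
        keys = new long[oldKeys.length * 2];
        counts = new int[oldKeys.length * 2];
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldCounts[i] != 0) {
                int slot = find(oldKeys[i]);
                keys[slot] = oldKeys[i];
                counts[slot] = oldCounts[i];
            }
        }
    }

    private static int hash(long value) {
        long h = value * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32));
    }
}
```

Memory for such a table scales with the number of distinct values actually 
seen in a bucket rather than with the 2M cardinality of the field, which is 
the property the description asks for.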



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
