[
https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454007#comment-15454007
]
Yonik Seeley commented on SOLR-9142:
------------------------------------
Thanks David, good improvements!
bq. DocSet is not necessarily an ordered set – so says its javadocs. Yet our
collecting code assumes it is! For large ones it is, but for HashDocSet it won't be.
I think HashDocSet (as well as DocList) should be moved out of the DocSet
hierarchy. HashDocSet is currently only used as a utility class internal to
certain faceting methods.
Perhaps we could use the "Bits" interface instead when we want/require a fast
random access set.
I was surprised this adds a method (dvhash). Although perhaps convenient for
testing things out, it would be tedious in production since the best method
will depend on the domain size, which will often not be known ahead of time by
the user. For the normal "dv" method, we should definitely make it pick
hashing when the domain is much smaller than the number of unique terms in the
field. We already do stuff like this in the DV faceting to pick whether we
accumulate global ords, or accumulate local (per-seg) ords first and then do a
mapping at the end to global ords.
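
A minimal sketch of what such an automatic choice could look like, assuming a simple ratio test. The class name, enum, and the threshold of 8 are all hypothetical and chosen only for illustration; none of them come from Solr or from the patch.

```java
/** Hypothetical helper for illustration only; not part of Solr. */
final class FacetMethodChooser {

    /** The two strategies discussed above: one array slot per ordinal, or a hash table. */
    enum Method { DV_ARRAY, DV_HASH }

    // Assumed threshold: hash when the domain is less than 1/8 of the unique term count.
    // The real cut-over point would have to be found by benchmarking.
    static final long ASSUMED_RATIO = 8;

    /**
     * @param domainSize  number of docs in the facet domain (e.g. one parent bucket)
     * @param uniqueTerms number of unique terms (ordinals) in the field
     */
    static Method choose(long domainSize, long uniqueTerms) {
        if (domainSize < 0 || uniqueTerms < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        // Divide rather than multiply so large inputs cannot overflow.
        return domainSize < uniqueTerms / ASSUMED_RATIO ? Method.DV_HASH : Method.DV_ARRAY;
    }
}
```

With the numbers from the issue description (roughly 2k docs per parent bucket against 2M unique terms), `FacetMethodChooser.choose(2_000, 2_000_000)` selects `DV_HASH` under this assumed threshold.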
> JSON Facet, add hash table method for terms
> -------------------------------------------
>
> Key: SOLR-9142
> URL: https://issues.apache.org/jira/browse/SOLR-9142
> Project: Solr
> Issue Type: Improvement
> Components: Facet Module
> Reporter: Varun Thacker
> Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: SOLR_9412_FacetFieldProcessorByHashDV.patch,
> SOLR_9412_FacetFieldProcessorByHashDV.patch,
> SOLR_9412_FacetFieldProcessorByHashDV.patch,
> SOLR_9412_FacetFieldProcessorByHashDV.patch,
> SOLR_9412_FacetFieldProcessorByHashDV.patch
>
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}}, the top-level facet, has a cardinality of 1000.
> The nested facets use two fields, {{sub_facet_unique_s}} (string) and
> {{sub_facet_unique_td}} (double), each with a cardinality of 2M.
> The nested query for the double field consistently returns in about 1s. The
> nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> {
> "top_facet_s": {
> "type": "terms",
> "limit": -1,
> "field": "top_facet_s",
> "mincount": 1,
> "excludeTags": "ANY",
> "facet": {
> "sub_facet_unique_s": {
> "type": "terms",
> "limit": 1,
> "field": "sub_facet_unique_s",
> "mincount": 1
> }
> }
> }
> }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> {
> "top_facet_s": {
> "type": "terms",
> "limit": -1,
> "field": "top_facet_s",
> "mincount": 1,
> "excludeTags": "ANY",
> "facet": {
> "sub_facet_unique_s": {
> "type": "terms",
> "limit": 1,
> "field": "sub_facet_unique_td",
> "mincount": 1
> }
> }
> }
> }
> {code}
> I tried to dig deeper to understand why nested string faceting is so slow
> compared to numeric fields.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets
> for each of those buckets. The key difference is in the implementation of the two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call
> {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an
> array of 2M. So we create a 2M array 1000 times for this one query, which, from
> what I understand, is what makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a
> CountSlotAcc which doesn't allocate a huge array. In this query it calls
> {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and
> we use the array position as the ordinal and the value as the count. If we could
> improve on this, it would speed things up significantly. For sub-facets we
> know the cardinality can be at most the top-level bucket count.
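
The contrast described above, one array slot per ordinal versus a table sized by the domain, can be sketched as follows. This is an illustrative open-addressing counter under stated assumptions, not the code in the attached patch; the class name `OrdCountTable` and all of its methods are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical ordinal counter sized by the domain, not by field cardinality. */
final class OrdCountTable {
    private static final int EMPTY = -1;

    private final int[] ords;    // ordinal stored in each slot, or EMPTY
    private final int[] counts;  // count for the ordinal in the same slot
    private final int mask;      // capacity - 1 (capacity is a power of two)
    private final int maxEntries;
    private int size;

    /** @param expectedEntries upper bound on distinct ordinals, e.g. docs in the bucket */
    OrdCountTable(int expectedEntries) {
        if (expectedEntries < 0 || expectedEntries > (1 << 29)) {
            throw new IllegalArgumentException("expectedEntries out of range");
        }
        int cap = 4;
        while (cap < expectedEntries * 2L) cap <<= 1; // keep load factor at or below 0.5
        ords = new int[cap];
        counts = new int[cap];
        Arrays.fill(ords, EMPTY);
        mask = cap - 1;
        maxEntries = cap / 2;
    }

    private int slotFor(int ord) {
        int h = ord * 0x9E3779B9;       // multiplicative scramble of the ordinal
        return (h ^ (h >>> 16)) & mask;
    }

    /** Adds one to the count for the given ordinal. */
    void increment(int ord) {
        if (ord < 0) throw new IllegalArgumentException("ord must be >= 0");
        int slot = slotFor(ord);
        while (ords[slot] != EMPTY && ords[slot] != ord) {
            slot = (slot + 1) & mask;   // linear probing
        }
        if (ords[slot] == EMPTY) {
            if (size == maxEntries) throw new IllegalStateException("table full");
            ords[slot] = ord;
            size++;
        }
        counts[slot]++;
    }

    /** Returns the count for the ordinal, or 0 if it was never seen. */
    int count(int ord) {
        int slot = slotFor(ord);
        while (ords[slot] != EMPTY) {
            if (ords[slot] == ord) return counts[slot];
            slot = (slot + 1) & mask;
        }
        return 0;
    }

    int size() { return size; }

    int capacity() { return ords.length; }
}
```

For a bucket of about 2k docs, `new OrdCountTable(2000)` allocates two 4096-entry arrays instead of one 2M-entry array, which is the saving the description is after when the allocation is repeated for each of the 1000 parent buckets.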