[ 
https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated SOLR-9142:
-------------------------------
    Attachment: SOLR_9412_FacetFieldProcessorByHashDV.patch

Updated Patch:
* The default facet method is now held in a package-accessible static field 
that is toggled by a test (similar to the existing default hash table size).  I 
modified TestJsonFacets to use a RandomizedTesting feature called 
@ParametersFactory that allows all of the methods to be tested by the same test 
class. Admittedly, this approach can be a little awkward when reproducing a 
failure (particularly in an IDE).  I tend to work around that by temporarily 
editing the file while debugging a test.
* Until now, setting method=stream has effectively caused the sort order to be 
ignored.  I think that's bad; method should be a hint (or, at the very least, 
a conflicting sort should result in an error). I changed this so that 
method=stream only has an effect when sort=index asc (in addition to the 
existing requirement that the field be indexed). *This is a back-compat break* 
for anyone using method=stream who forgot to explicitly set sort=index asc.  
Given that it's not common to set this, and given the “experimental” nature of 
this module/feature, I think this change is okay to do in a point release, 
provided we're explicit in the release notes.
* Made method=enum work as an alias for method=stream. Some day we can 
distinguish the two, once we can do enum faceting that is _not_ index 
ascending.
* Some day this will support SortedSetDocValues, so I adjusted TermOrdCalc to 
not contain SortedDocValues and instead take a Function that does the ord to 
BytesRef resolution.  Annoyingly, though, this is initialized in collectDocs().
* I refactored findTopDocs() between the Array & Hash based impls to a common 
implementation in FacetFieldProcessor.  Java 8 Functional methods proved 
convenient to make this possible.
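
For illustration only, the method=stream rule from the bullets above can be sketched as follows. This is not code from the patch: the class, enum, and method names are hypothetical, and every method other than stream/enum is collapsed into a single DEFAULT value to keep the sketch short.

```java
import java.util.Locale;

/** Hypothetical sketch; names are illustrative and not taken from the patch. */
class FacetMethodSketch {

    enum Method { DEFAULT, STREAM }

    /** Parses the user-supplied method hint; "enum" is treated as an alias for "stream". */
    static Method parseHint(String hint) {
        if (hint == null) {
            return Method.DEFAULT;
        }
        switch (hint.toLowerCase(Locale.ROOT)) {
            case "stream":
            case "enum":
                return Method.STREAM;
            default:
                return Method.DEFAULT; // simplification: all other methods are lumped together
        }
    }

    /** Honors the stream hint only for sort=index asc on an indexed field. */
    static Method resolve(String hint, String sort, boolean fieldIsIndexed) {
        Method requested = parseHint(hint);
        boolean indexAscending = "index asc".equals(sort);
        if (requested == Method.STREAM && indexAscending && fieldIsIndexed) {
            return Method.STREAM;
        }
        // The hint is dropped instead of silently overriding the requested sort.
        return Method.DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(resolve("stream", "index asc", true));
        System.out.println(resolve("stream", "count desc", true));
        System.out.println(resolve("enum", "index asc", true));
    }
}
```

The point of the sketch is that the hint never changes the result order: a request that asks for stream with any other sort simply falls back to the default method.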

I think this is now committable.  There is one nocommit to remind myself to 
rename this class after I commit it.  Also, it's tempting to break up 
portions of this into discrete commits (or even a separate issue, such as 
for method=stream)... but that would be a pain, so if nobody asks me to, 
I probably won't.

I plan to commit this Wednesday morning.

> JSON Facet, add hash table method for terms
> -------------------------------------------
>
>                 Key: SOLR-9142
>                 URL: https://issues.apache.org/jira/browse/SOLR-9142
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>            Reporter: Varun Thacker
>            Assignee: David Smiley
>             Fix For: 6.3
>
>         Attachments: SOLR_9412_FacetFieldProcessorByHashDV.patch, 
> SOLR_9412_FacetFieldProcessorByHashDV.patch, 
> SOLR_9412_FacetFieldProcessorByHashDV.patch
>
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}}, the top-level facet, has a cardinality of 1000.
> For nested facets there are two fields, {{sub_facet_unique_s}} and 
> {{sub_facet_unique_td}}, which are string and double respectively and each 
> have a cardinality of 2M.
> The nested query for the double field consistently returns in about 1s. The 
> nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_s",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_td",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> I tried to dig deeper to understand why nested faceting on the string field 
> is so slow compared to the numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets 
> for each of those buckets. The key difference is in the implementations of 
> the two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call 
> {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an 
> array of 2M. So we create a 2M array 1000 times for this one query, which, 
> from what I understand, is what makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a 
> CountSlotAcc which doesn't allocate a huge array. In this query it calls 
> {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and 
> we use the array position as the ordinal and the value as the count. If we 
> could improve on this, it should speed things up significantly. For 
> sub-facets we know the cardinality can be at most the top-level bucket count.
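
To make the cost described in the report concrete, here is a minimal sketch contrasting ordinal-indexed array counting with hash-based counting. It is not Solr code: the class and method names are hypothetical, and java.util.HashMap merely stands in for whatever hash table an actual implementation would use. If the counts are 32-bit ints, a 2M-slot array is about 8 MB, so allocating one for each of 1000 parent buckets touches roughly 8 GB over a single query.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not taken from Solr or from the patch. */
class SubFacetCountSketch {

    /** Dense counting: one slot per ordinal in the field, however few docs the bucket has. */
    static int[] countDense(int[] ordsInBucket, int fieldCardinality) {
        int[] counts = new int[fieldCardinality]; // allocated and zeroed for every bucket
        for (int ord : ordsInBucket) {
            counts[ord]++;
        }
        return counts;
    }

    /** Sparse counting: memory grows with the docs in the bucket, not with the field cardinality. */
    static Map<Integer, Integer> countSparse(int[] ordsInBucket) {
        Map<Integer, Integer> counts = new HashMap<>(Math.max(16, ordsInBucket.length * 2));
        for (int ord : ordsInBucket) {
            counts.merge(ord, 1, Integer::sum);
        }
        return counts;
    }

    /** Total int slots allocated when dense counting is repeated once per parent bucket. */
    static long denseSlotsAllocated(long parentBuckets, long fieldCardinality) {
        return parentBuckets * fieldCardinality;
    }

    public static void main(String[] args) {
        int[] ords = {5, 1_999_999, 5}; // ordinals of the docs in one small bucket
        int[] dense = countDense(ords, 2_000_000);
        Map<Integer, Integer> sparse = countSparse(ords);
        System.out.println(dense[5] + " " + sparse.get(5));
        System.out.println(denseSlotsAllocated(1000, 2_000_000));
    }
}
```

Both approaches produce the same counts; the difference is that the dense version pays for the full field cardinality on every bucket, while the sparse version only pays for the ordinals that actually occur in the bucket.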



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
