[
https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated SOLR-9142:
-------------------------------
Attachment: SOLR_9412_FacetFieldProcessorByHashDV.patch
Updated Patch:
* The default facet method is now held in a package-accessible static field
that is toggled by a test. (similar to existing default hash table size). I
modified TestJsonFacets to use a feature of RandomizedTesting called
@ParametersFactory that allows all of the facet methods to be tested by the
same test class. Admittedly this approach can be a little awkward when
reproducing a failure (particularly in an IDE). I tend to work around that by
editing the file temporarily while debugging a test.
* Until now, setting method=stream has effectively caused the sort order to be
ignored. I think that's bad; method should be a hint (or at the very least an
unsupported combination should result in an error). I changed this so that
method=stream only has an effect when sort=index asc (in addition to the
existing requirement of having an index). *This is a back-compat break* for
anyone using method=stream who forgot to explicitly set sort=index asc. Given
it's not common to set this and the “experimental” nature of this
module/feature, I think this change is okay to do in a point-release, provided
we're explicit in the release notes.
* Made method=enum work as an alias for method=stream. Some day we can add
support for this distinction, which is when we can do enum faceting that is
_not_ index ascending.
* Some day this will support SortedSetDocValues, so I adjusted TermOrdCalc to
not hold a SortedDocValues and instead take a Function that does the ord to
BytesRef resolution. Annoyingly, though, this is initialized in collectDocs().
* I refactored findTopDocs() between the Array & Hash based impls to a common
implementation in FacetFieldProcessor. Java 8 Functional methods proved
convenient to make this possible.
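To illustrate the method=stream rule described in the list above, here is a minimal sketch. The class, enum, and method names are hypothetical and are not taken from the patch; it only models the stated conditions (method=stream or its alias method=enum, an indexed field, and sort=index asc) and treats everything else as "some non-streaming implementation".

```java
import java.util.Locale;

public class FacetMethodSketch {
    // Hypothetical stand-ins; the real patch chooses among concrete processor classes.
    enum Impl { STREAM, NON_STREAM }

    /** Normalizes the requested method, treating "enum" as an alias for "stream". */
    static String normalizeMethod(String method) {
        if (method == null) {
            return "default";
        }
        String m = method.toLowerCase(Locale.ROOT);
        return m.equals("enum") ? "stream" : m;
    }

    /**
     * The method is only a hint: streaming is used only when the field is
     * indexed AND the sort is explicitly "index asc". Otherwise the hint is
     * ignored so that the requested sort order is honored.
     */
    static Impl choose(String method, String sort, boolean indexed) {
        boolean wantsStream = normalizeMethod(method).equals("stream");
        boolean indexOrder = "index asc".equals(sort);
        return (wantsStream && indexed && indexOrder) ? Impl.STREAM : Impl.NON_STREAM;
    }

    public static void main(String[] args) {
        System.out.println(choose("stream", "index asc", true));  // STREAM
        System.out.println(choose("enum", "index asc", true));    // STREAM
        System.out.println(choose("stream", "count desc", true)); // NON_STREAM
        System.out.println(choose("stream", null, true));         // NON_STREAM
    }
}
```

The last case is the back-compat break: a request that sets method=stream without an explicit sort=index asc no longer streams.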
I think this is now committable. There is one nocommit to remind myself to
rename this class after I commit it. Also, it's tempting to consider breaking
up some portions of this into discrete commits (or even separate issues,
like for method=stream)... but that would be a pain, so if nobody asks me to,
I probably won't.
I plan to commit this Wednesday morning.
> JSON Facet, add hash table method for terms
> -------------------------------------------
>
> Key: SOLR-9142
> URL: https://issues.apache.org/jira/browse/SOLR-9142
> Project: Solr
> Issue Type: Improvement
> Components: Facet Module
> Reporter: Varun Thacker
> Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: SOLR_9412_FacetFieldProcessorByHashDV.patch,
> SOLR_9412_FacetFieldProcessorByHashDV.patch,
> SOLR_9412_FacetFieldProcessorByHashDV.patch
>
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}}, the top level facet, has a cardinality of 1000.
> For nested facets there are two fields, {{sub_facet_unique_s}} and
> {{sub_facet_unique_td}}, which are string and double respectively and each
> have a cardinality of 2M.
> The nested query for the double field consistently returns in about 1s. The
> nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> {
> "top_facet_s": {
> "type": "terms",
> "limit": -1,
> "field": "top_facet_s",
> "mincount": 1,
> "excludeTags": "ANY",
> "facet": {
> "sub_facet_unique_s": {
> "type": "terms",
> "limit": 1,
> "field": "sub_facet_unique_s",
> "mincount": 1
> }
> }
> }
> }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
> {
> "top_facet_s": {
> "type": "terms",
> "limit": -1,
> "field": "top_facet_s",
> "mincount": 1,
> "excludeTags": "ANY",
> "facet": {
> "sub_facet_unique_s": {
> "type": "terms",
> "limit": 1,
> "field": "sub_facet_unique_td",
> "mincount": 1
> }
> }
> }
> }
> {code}
> I tried to dig deeper to understand why nested faceting on the string field
> is so slow compared to the numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub facets
> on each of them. The key difference is in the implementation of the two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call
> {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an
> array of 2M. So we create a 2M array 1000 times for this one query, which,
> from what I understand, makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a
> CountSlotAcc which doesn't assign a huge array. In this query it calls
> {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and
> we use the array position as the ordinal and the value as the count. If we
> could improve on this, it would speed things up significantly. For sub-facets
> we know the cardinality can be at most the top level bucket count.
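The cost difference described in the issue can be sketched roughly as follows. This is an illustration only: the class and method names are hypothetical, and java.util.HashMap stands in for whatever hash table the patch actually uses. The dense version allocates one slot per ordinal in the field (2M here) no matter how few docs fall in the bucket, while the sparse version only holds the ordinals it actually sees.

```java
import java.util.HashMap;
import java.util.Map;

public class CountingSketch {
    /** Dense counting: array index is the ordinal, value is the count. */
    static int[] countDense(int[] docOrds, int cardinality) {
        int[] counts = new int[cardinality]; // sized by field cardinality, not by bucket size
        for (int ord : docOrds) {
            counts[ord]++;
        }
        return counts;
    }

    /** Sparse counting: memory grows with the number of distinct ordinals seen. */
    static Map<Integer, Integer> countSparse(int[] docOrds) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int ord : docOrds) {
            counts.merge(ord, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int cardinality = 2_000_000;             // sub-facet field cardinality from the issue
        int[] bucketOrds = {5, 42, 42, 1_999_999}; // made-up ordinals for one parent bucket

        System.out.println(countDense(bucketOrds, cardinality).length); // 2000000
        System.out.println(countSparse(bucketOrds).size());             // 3

        // Total int-array allocation if the dense array is rebuilt for each of
        // the 1000 top-level buckets: 1000 * 2M * 4 bytes.
        long totalBytes = 1000L * cardinality * Integer.BYTES;
        System.out.println(totalBytes);                                  // 8000000000
    }
}
```

Under these assumptions the dense approach allocates about 8 GB of short-lived arrays over the course of the single query, which is consistent with the slowdown reported for the string field.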