[jira] [Commented] (SOLR-9142) Improve JSON nested facets effeciency

David Smiley (JIRA) Thu, 04 Aug 2016 14:41:20 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408535#comment-15408535
 ]


David Smiley commented on SOLR-9142:
------------------------------------

Getting back to the issue here, the solution [[email protected]] recommended in 
his first comment is to go with a hash-based accumulator instead of an 
array-based one.  That makes perfect sense; I agree.
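
To make the trade-off concrete, here is a minimal sketch of the two accumulator styles. It is not Solr's code: {{CountAccSketch}}, {{countWithArray}} and {{HashCounts}} are hypothetical names, and the hash table assumes single-valued data so that the number of distinct keys never exceeds the number of docs in the bucket (no resizing). Assuming 4-byte int counters, a 2M-slot array is about 8 MB, and allocating it once per top-level bucket (1000 times) is about 8 GB of allocation for one query; the hash version is sized from the bucket's doc count instead.

```java
public class CountAccSketch {

    /** Array style: one counter per possible ordinal, allocated up front. */
    static int[] countWithArray(int[] ords, int cardinality) {
        int[] counts = new int[cardinality]; // size depends on field cardinality
        for (int ord : ords) {
            counts[ord]++;
        }
        return counts;
    }

    /** Hash style: open addressing with linear probing, sized from the doc count. */
    static final class HashCounts {
        final long[] keys;
        final int[] counts; // a count of 0 marks an empty slot
        final int mask;

        HashCounts(int expectedDocs) {
            // Power-of-two capacity of at least 2 * expectedDocs (load factor <= 0.5).
            int cap = Integer.highestOneBit(Math.max(2, 2 * expectedDocs) - 1) << 1;
            keys = new long[cap];
            counts = new int[cap];
            mask = cap - 1;
        }

        void increment(long key) {
            int slot = (int) mix(key) & mask;
            while (counts[slot] != 0 && keys[slot] != key) {
                slot = (slot + 1) & mask; // probe the next slot
            }
            keys[slot] = key;
            counts[slot]++;
        }

        int get(long key) {
            int slot = (int) mix(key) & mask;
            while (counts[slot] != 0) {
                if (keys[slot] == key) {
                    return counts[slot];
                }
                slot = (slot + 1) & mask;
            }
            return 0; // hit an empty slot: key was never counted
        }

        /** Simple bit mixer so that sequential keys spread across the table. */
        static long mix(long h) {
            h ^= h >>> 33;
            h *= 0xff51afd7ed558ccdL;
            h ^= h >>> 33;
            return h;
        }
    }

    public static void main(String[] args) {
        int[] ords = {5, 1_999_999, 5, 42}; // ordinals seen in one small bucket
        int[] arr = countWithArray(ords, 2_000_000);
        HashCounts hash = new HashCounts(ords.length);
        for (int ord : ords) {
            hash.increment(ord);
        }
        System.out.println("array slots: " + arr.length + ", hash slots: " + hash.keys.length);
    }
}
```

The point of the sketch is only the sizing: the array's cost follows the field's cardinality, while the hash table's cost follows the number of docs actually in the bucket.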

I've been looking more closely at this codebase, stepping through it in a 
debugger with Varun's sample data & queries to get more familiar with it all.  
When I look at {{FacetFieldProcessorNumeric}}, it seems very close to being 
suitable for String data, using the global ordinal as the number it works 
with.  I've thought about that, and about the name of this class and the names 
of its FacetFieldProcessor compatriots, and I think some refactoring is in 
order.  Here's a glimpse of my thoughts on that:

h4. FacetFieldProcessors refactoring
* taste: the fact that some FFPs are declared within FacetField.java and some 
are top-level is bad IMO; they should all be top-level once any subclasses 
start becoming so.
* FFPFCBase:  This is basically the base class for _array based_ accumulator 
implementations -- i.e. direct slot/value accumulators.  I suggest renaming it 
to FFPArray.  It handles terms (strings); it does not handle numbers directly, 
only those encoded as terms.  It is multi-valued capable.
* FFPDV: Rename to FFPArrayDV: accesses terms from DocValues
* FFPUIF: Rename to FFPArrayUIF: accesses terms via UIF, kind of a pseudo-DV
* FFPNumeric: Rename to FFPHashDV:  Currently this is expressly for 
single-valued numeric DocValues, but it could be made generic to handle terms 
by global ordinal.
* FFPStream: Rename to FFPEnumTerms:  This does enumeration (not hash or array 
accumulation), and it gets data from Terms.  Perhaps Stream could also go in 
the name, but I think Enum is more pertinent.  One day, once we have 
PointValues in Solr, we might add a FFPEnumPoints.  Note that such a thing 
wouldn't stream, since that API is callback-based rather than iterator-style.

Most of that should be another issue, and is basically renames/moves.

The only item above that is not a refactoring, and where there's some 
substantial work, is overhauling FFPNumeric to be FFPHashDV supporting numbers 
& terms.  It's probably multi-valued capable too.  That work could stay here.
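
As a rough illustration of why one hash-based processor could serve both numbers and terms: both can be reduced to a {{long}} key per document before counting. The sketch below is hypothetical ({{LongKeyFacetSketch}} and {{countByKey}} are made-up names, plain arrays stand in for DocValues, and {{Double.doubleToLongBits}} is only a stand-in for whatever encoding the real code would use); it uses a {{java.util.HashMap}} for brevity rather than a primitive hash table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToLongFunction;

public class LongKeyFacetSketch {

    /**
     * Counts documents per long key. The caller decides how a doc maps to a key:
     * raw bits of a numeric value, or the global ordinal of a term.
     */
    static Map<Long, Integer> countByKey(int[] docIds, IntToLongFunction keyOf) {
        Map<Long, Integer> counts = new HashMap<>();
        for (int doc : docIds) {
            counts.merge(keyOf.applyAsLong(doc), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] docs = {0, 1, 2};

        // Stand-in for single-valued numeric DocValues.
        double[] numericValues = {1.5, 2.5, 1.5};
        Map<Long, Integer> byNumber =
            countByKey(docs, doc -> Double.doubleToLongBits(numericValues[doc]));

        // Stand-in for a per-document global ordinal of a string field.
        int[] globalOrds = {3, 3, 7};
        Map<Long, Integer> byOrdinal = countByKey(docs, doc -> globalOrds[doc]);

        System.out.println(byNumber.size() + " numeric buckets, "
            + byOrdinal.size() + " term buckets");
    }
}
```

The accumulation loop is identical in both cases; only the key function differs, which is the part that would need to become pluggable.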

Thoughts?

> Improve JSON nested facets effeciency
> -------------------------------------
>
>                 Key: SOLR-9142
>                 URL: https://issues.apache.org/jira/browse/SOLR-9142
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Varun Thacker
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}}, the top-level facet, has a cardinality of 1000.
> For nested facets there are two fields, {{sub_facet_unique_s}} and 
> {{sub_facet_unique_td}}, which are string and double respectively and each 
> have a cardinality of 2M.
> The nested query for the double field consistently returns in about 1s. The 
> nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_s",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_td",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> I tried to dig deeper to understand why nested faceting on the string field 
> is so slow compared to the numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets 
> for each of its buckets. The key difference is in the implementation of the 
> two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call 
> {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an 
> array of 2M slots. So we create a 2M-slot array 1000 times for this one query, 
> which, from what I understand, is what makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a 
> CountSlotAcc which doesn't allocate a huge array. In this query it calls 
> {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and 
> we use the array position as the ordinal and the value as the count. If we 
> could improve on this, it would speed things up significantly. For sub-facets, 
> we know the cardinality can be at most the top-level bucket count.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
