[jira] [Updated] (SOLR-9142) JSON Facet, add hash table method for terms

David Smiley (JIRA) Wed, 24 Aug 2016 11:32:39 -0700

     [ 
https://issues.apache.org/jira/browse/SOLR-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Smiley updated SOLR-9142:
-------------------------------
    Attachment: SOLR_9412_FacetFieldProcessorByHashDV.patch

Here's a working patch.  The patch will be easier to digest in an IDE.
* As expected it's very fast for the use-case prompting this issue.  Given 
Varun's test program on my laptop, it produced results in ~420ms compared to 
over 9 seconds for the array approach.
* My testing thus far (which is insufficient, granted) is just to locally 
modify the facet method picking code to pick this method by default (if it 
applies), and then run TestJsonFacets.  It helped me find some bugs and known 
limitations.
** nocommit: need to add testing.  I'd like to see a way of testing that varies 
the method and then tests for equivalent results.  At least, that's how I'd 
like to approach testing this enhancement versus something explicit.
* Limitations:
** Doesn't support mincount==0. I don't think it makes sense to add that here.
** Doesn't support prefix.  It could be added.
** Doesn't support multi-valued.  It could be added.
* FacetFieldProcessorByHashNumeric still has this name in the patch but should 
be renamed to FacetFieldProcessorByHashDV. I'd like to see that done in a 
separate commit to keep the history cleaner.
* There weren't *that* many changes to this class despite whatever impression 
one may have from the diff. I added stuff but didn't really change anything 
that was already there aside from a refactoring-oriented change. The 
refactoring was mostly to structure the method names/structure of 
FacetFieldProcessorByArray so that you can read both and find your way around.
** findTopSlots has lots of code and it's *so* similar in both classes; not 
good!  I didn't introduce that mess but I'd like to fix it; perhaps in a 
follow-on commit.
* I introduced a new subclass of FacetRangeProcessor.Calc that is for ordinals. 
 Perhaps this is a little hacky... I'm open to suggestions. One possibility is 
making Calc top-level in this package -- it's not just for ranges.
* Across this facet module I keep seeing the same DocSet & IndexReader 
collection code, and sometimes with TODOs to refactor.  I took a little stab at 
a DocSet utility collector and put it in its own class for the moment.  Only 
this hash-based class uses it right now.  There are some nocommits to improve 
it further...
** DocSet is not necessarily an ordered set -- so say its javadocs.  Yet our 
collecting code assumes it is!  Large ones are ordered, but a HashDocSet won't 
be.  Maybe the JSON Facets module always assumes the DocSet has come from the 
filter cache, and maybe that cache always uses sorted ones?  I think that's a 
dangerous assumption even if it turns out to be true today.
** I propose DocSet.collect(IndexReader,Collector) exist... and we could define 
2 utility implementations to pick from -- one that's for our sorted DocSets, 
and another for unsorted that works by iterating segments first and re-scanning 
for applicable docsets.  The latter might be slow but it'd only be used on 
small DocSets.
* For numeric field faceting, we should more clearly tell the user that we 
don't really support mincount==0 or prefix so I added checks & exception 
throwing for that.
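
To make the two proposed DocSet.collect strategies concrete, here is a minimal sketch in plain Java. It is not code from the patch and uses no Solr or Lucene classes: DocSetCollectSketch, SegmentDocConsumer, and the segmentStarts array (the first global doc id of each segment) are hypothetical stand-ins chosen for illustration.

```java
public class DocSetCollectSketch {

    /** Hypothetical callback standing in for a per-segment collector. */
    public interface SegmentDocConsumer {
        void collect(int segmentIndex, int segmentLocalDoc);
    }

    /**
     * Sorted doc ids: a single pass over the docs, advancing the segment
     * pointer whenever a doc id crosses into the next segment.
     */
    public static void collectSorted(int[] sortedDocs, int[] segmentStarts,
                                     SegmentDocConsumer consumer) {
        int seg = 0;
        for (int doc : sortedDocs) {
            while (seg + 1 < segmentStarts.length && doc >= segmentStarts[seg + 1]) {
                seg++;
            }
            consumer.collect(seg, doc - segmentStarts[seg]);
        }
    }

    /**
     * Unsorted doc ids: iterate segments first and re-scan the whole set for
     * each segment. Cost is (segments x docs), so this only suits small sets.
     */
    public static void collectUnsorted(int[] docs, int[] segmentStarts, int maxDoc,
                                       SegmentDocConsumer consumer) {
        for (int seg = 0; seg < segmentStarts.length; seg++) {
            int start = segmentStarts[seg];
            int end = (seg + 1 < segmentStarts.length) ? segmentStarts[seg + 1] : maxDoc;
            for (int doc : docs) {
                if (doc >= start && doc < end) {
                    consumer.collect(seg, doc - start);
                }
            }
        }
    }
}
```

In both variants each segment is visited once and in order; only the cost differs, which is why the unsorted variant would be reserved for small sets.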

[[email protected]] can you please review this?

> JSON Facet, add hash table method for terms
> -------------------------------------------
>
>                 Key: SOLR-9142
>                 URL: https://issues.apache.org/jira/browse/SOLR-9142
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>            Reporter: Varun Thacker
>            Assignee: David Smiley
>             Fix For: 6.3
>
>         Attachments: SOLR_9412_FacetFieldProcessorByHashDV.patch
>
>
> I indexed a dataset of 2M docs.
> {{top_facet_s}} has a cardinality of 1000 and is the top-level facet.
> For nested facets there are two fields, {{sub_facet_unique_s}} and 
> {{sub_facet_unique_td}}, which are string and double and have cardinality 2M.
> The nested query for the double field consistently returns in about 1s. The 
> nested query for the string field takes roughly 10s to execute.
> {code:title=nested string facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_s",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> {code:title=nested double facet|borderStyle=solid}
> q=*:*&rows=0&json.facet=
>       {
>               "top_facet_s": {
>                       "type": "terms",
>                       "limit": -1,
>                       "field": "top_facet_s",
>                       "mincount": 1,
>                       "excludeTags": "ANY",
>                       "facet": {
>                               "sub_facet_unique_s": {
>                                       "type": "terms",
>                                       "limit": 1,
>                                       "field": "sub_facet_unique_td",
>                                       "mincount": 1
>                               }
>                       }
>               }
>       }
> {code}
> I tried to dig deeper to understand why nested faceting on a string field is 
> that slow compared to a numeric field.
> Since the top facet has a cardinality of 1000, we have to calculate sub-facets 
> on each of them. The key difference is in the implementation of the two.
> For the string field, in {{FacetField#getFieldCacheCounts}} we call 
> {{createCollectAcc}} with nDocs=0 and numSlots=2M. This then initializes an 
> array of 2M. So we create a 2M array 1000 times for this one query, which from 
> what I understand makes this query slow.
> For numeric fields, {{FacetFieldProcessorNumeric#calcFacets}} uses a 
> CountSlotAcc which doesn't allocate a huge array. In this query it calls 
> {{createCollectAcc}} with numDocs=2k and numSlots=1024.
> In string faceting, we create the 2M array because the cardinality is 2M and 
> we use the array position as the ordinal and the value as the count. If we 
> could improve on this, would it speed things up significantly? For sub-facets 
> we know the cardinality can be at most the top-level bucket count.
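
The contrast described above, one count slot per ordinal versus a table sized by the documents in the bucket, can be sketched in plain Java. This is an illustration under stated assumptions, not the Solr implementation: OrdCountSketch, OrdCountTable, and the hash function are hypothetical, and the table is a simple fixed-capacity open-addressing map.

```java
public class OrdCountSketch {

    /** Array approach: one slot per possible ordinal, however few docs are in the bucket. */
    public static int[] countByArray(int[] docOrds, int cardinality) {
        int[] counts = new int[cardinality];
        for (int ord : docOrds) {
            counts[ord]++;
        }
        return counts;
    }

    /** Hash approach: open-addressing table sized by the number of docs in the bucket. */
    public static final class OrdCountTable {
        private final int[] ords;
        private final int[] counts; // a count of 0 marks an empty slot
        private final int mask;

        public OrdCountTable(int expectedDocs) {
            int cap = 2;
            while (cap < expectedDocs * 2) { // power of two, at most half full
                cap <<= 1;
            }
            ords = new int[cap];
            counts = new int[cap];
            mask = cap - 1;
        }

        public int capacity() {
            return ords.length;
        }

        public void increment(int ord) {
            int slot = hash(ord) & mask;
            for (int probes = 0; probes <= mask; probes++) {
                if (counts[slot] == 0) { // empty slot: claim it for this ordinal
                    ords[slot] = ord;
                    counts[slot] = 1;
                    return;
                }
                if (ords[slot] == ord) {
                    counts[slot]++;
                    return;
                }
                slot = (slot + 1) & mask; // linear probing
            }
            throw new IllegalStateException("table full");
        }

        public int count(int ord) {
            int slot = hash(ord) & mask;
            for (int probes = 0; probes <= mask; probes++) {
                if (counts[slot] == 0) return 0;
                if (ords[slot] == ord) return counts[slot];
                slot = (slot + 1) & mask;
            }
            return 0;
        }

        private static int hash(int ord) {
            int h = ord * 0x9E3779B9; // multiplicative mixing; the constant is an arbitrary choice
            return h ^ (h >>> 16);
        }
    }
}
```

With 2M distinct ordinals and a bucket of a few thousand docs, the array variant allocates 2M slots per bucket while the table allocates only a small multiple of the bucket's doc count.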



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
