[ https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Toke Eskildsen updated SOLR-5894: --------------------------------- Attachment: SOLR-5894_test.zip sparse_50510000docs_20140328-152807.png SOLR-5894.patch Patch with several bugfixes and updated test scripts. The chart sparse_50510000docs_20140328-152807.png shows results form a test run on 50M documents / 150M unique tags. The black line is standard Solr, the red is a sparse counter with 100% overhead, the cyan is an attempt of compromising with 8% sparse overhead. The performance hit for passing the 8% cut-off is huge. One way to avoid it could be to make a numhits-based guess of the expected number of unique tags, when choosing facet implementation. The consequence of a wrong guess would "just" be poorer performance, with the worst-case being the max of the black and the cyan line for any given number of unique tags in the search result. > Speed up high-cardinality facets with sparse counters > ----------------------------------------------------- > > Key: SOLR-5894 > URL: https://issues.apache.org/jira/browse/SOLR-5894 > Project: Solr > Issue Type: Improvement > Components: SearchComponents - other > Affects Versions: 4.6.1, 4.7 > Reporter: Toke Eskildsen > Priority: Minor > Fix For: 4.6.1 > > Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, > SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, > author_7M_tags_1852_logged_queries_warmed.png, > sparse_50510000docs_20140328-152807.png > > > Field based faceting in Solr has two phases: Collecting counts for tags in > facets and extracting the requested tags. > The execution time for the collecting phase is approximately linear to the > number of hits and the number of references from hits to tags. This phase is > not the focus here. > The extraction time scales with the number of unique tags in the search > result, but is also heavily influenced by the total number of unique tags in > the facet as every counter, 0 or not, is visited by the extractor (at least > for count order). For fields with millions of unique tag values this means > 10s of milliseconds added to the minimum response time (see > https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/ > for a test on a corpus with 7M unique values in the facet). > The extractor needs to visit every counter due to the current counter > structure being a plain int-array of size #unique_tags. Switching to a sparse > structure, where only the tag counters > 0 are visited, makes the extraction > time linear to the number of unique tags in the result set. > Unfortunately the number of unique tags in the result set is unknown at > collect time, so it is not possible to reliably select sparse counting vs. > full counting up front. Luckily there exists solutions for sparse sets that > has the property of switching to non-sparse-mode without a switch-penalty, > when the sparse-threshold is exceeded (see > http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This > JIRA aims to implement this functionality in Solr (a proof of concept patch > will be provided shortly). -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org