[
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Toke Eskildsen updated SOLR-5894:
---------------------------------
Description:
Multiple performance enhancements to Solr String faceting.
* Sparse counters, switching the constant time overhead of extracting top-X
terms with time linear to result set size
* Counter re-use for reduced garbage collection and lower per-call overhead
* Optional counter packing, trading speed for space
* Improved distribution count logic, greatly reducing the overhead of
distributed faceting
* In-segment threaded faceting
* Regexp based white- and black-listing of facet terms
* Heuristic faceting for large result sets
Currently implemented for Solr 4.10 and available as source and directly usable
WAR at http://tokee.github.io/lucene-solr/
was:
Field based faceting in Solr has two phases: Collecting counts for tags in
facets and extracting the requested tags.
The execution time for the collecting phase is approximately linear to the
number of hits and the number of references from hits to tags. This phase is
not the focus here.
The extraction time scales with the number of unique tags in the search result,
but is also heavily influenced by the total number of unique tags in the facet
as every counter, 0 or not, is visited by the extractor (at least for count
order). For fields with millions of unique tag values this means 10s of
milliseconds added to the minimum response time (see
https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
for a test on a corpus with 7M unique values in the facet).
The extractor needs to visit every counter due to the current counter structure
being a plain int-array of size #unique_tags. Switching to a sparse structure,
where only the tag counters > 0 are visited, makes the extraction time linear
to the number of unique tags in the result set.
Unfortunately the number of unique tags in the result set is unknown at collect
time, so it is not possible to reliably select sparse counting vs. full
counting up front. Luckily there exists solutions for sparse sets that has the
property of switching to non-sparse-mode without a switch-penalty, when the
sparse-threshold is exceeded (see
http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This JIRA
aims to implement this functionality in Solr.
Current status: Sparse counting is implemented for field cache faceting, both
single- and multi-value, with and without doc-values. Sort by count only. The
patch applies cleanly to Solr 4.6.1 and should integrate well with everything
as all functionality is unchanged. After patching, the following new parameters
are possible:
* facet.sparse=true enables sparse faceting.
* facet.sparse.mintags=10000 the minimum amount of unique tags in the given
field for sparse faceting to be active. This is used for auto-selecting whether
sparse should be used or not.
* facet.sparse.fraction=0.08 the overhead used for the sparse tracker. Setting
this too low means that only very small result sets are handled as sparse.
Setting this too high will result in a large performance penalty if the result
set blows the sparse tracker. Values between 0.04 and 0.1 seems to work well.
* facet.sparse.packed=true use PackecInts for counters instead of int[]. This
saves memory, but performance will differ. Whether performance will be better
or worse depends on the corpus. Experiment with it.
* facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of
unique tags in the search result exceeds this fraction of the sparse tracker,
do not perform sparse tracking. The estimate is based on the assumption that
references from documents to tags are distributed randomly.
* facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and
keep in memory, ready for usage. Clearing and re-using a counter is faster that
allocating it fresh from the heap. Setting the pool size to 0 means than a new
sparse counter will be allocated each time, just as standard Solr faceting
works.
* facet.sparse.stats=true adds a special tag with timing statistics for sparse
faceting.
* facet.sparse.stats.reset=true resets the timing statistics and clears the
pool.
The parameters needs to be given together with standard faceting parameters,
such as facet=true&facet.field=myfield&facet.mincount=1&facet.sort=true. The
defaults should be usable, so simply appending facet.sparse=true to the URL is
a good start.
> Speed up high-cardinality facets with sparse counters
> -----------------------------------------------------
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
> Issue Type: Improvement
> Components: SearchComponents - other
> Affects Versions: 4.7.1
> Reporter: Toke Eskildsen
> Priority: Minor
> Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch,
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch,
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip,
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip,
> author_7M_tags_1852_logged_queries_warmed.png,
> sparse_2000000docs_fc_cutoff_20140403-145412.png,
> sparse_5000000docs_20140331-151918_multi.png,
> sparse_5000000docs_20140331-151918_single.png,
> sparse_50510000docs_20140328-152807.png
>
>
> Multiple performance enhancements to Solr String faceting.
> * Sparse counters, switching the constant time overhead of extracting top-X
> terms with time linear to result set size
> * Counter re-use for reduced garbage collection and lower per-call overhead
> * Optional counter packing, trading speed for space
> * Improved distribution count logic, greatly reducing the overhead of
> distributed faceting
> * In-segment threaded faceting
> * Regexp based white- and black-listing of facet terms
> * Heuristic faceting for large result sets
> Currently implemented for Solr 4.10 and available as source and directly
> usable WAR at http://tokee.github.io/lucene-solr/
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]