[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2017-09-01 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151068#comment-16151068
 ] 

Toke Eskildsen commented on SOLR-5894:
--

I am currently doing a port to Solr master. Primarily to see if the changes are 
still beneficial with the new code base, secondarily to get to know the new 
hooks better. First take on the port is _not likely_ to be fit for inclusion in 
the main code base.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Assignee: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: author_7M_tags_1852_logged_queries_warmed.png, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Multiple performance enhancements to Solr String faceting.
> * Sparse counters, switching the constant time overhead of extracting top-X 
> terms with time overhead linear to result set size
> * Counter re-use for reduced garbage collection and lower per-call overhead
> * Optional counter packing, trading speed for space
> * Improved distribution count logic, greatly improving the performance of 
> distributed faceting
> * In-segment threaded faceting
> * Regexp based white- and black-listing of facet terms
> * Heuristic faceting for large result sets
> Currently implemented for Solr 4.10. Source, detailed description and 
> directly usable WAR at http://tokee.github.io/lucene-solr/
> This project has grown beyond a simple patch and will require a fair amount 
> of co-operation with a committer to get into Solr. Splitting into smaller 
> issues is a possibility.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2017-02-16 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869899#comment-15869899
 ] 

Toke Eskildsen commented on SOLR-5894:
--

[~alley]: Short answer: No concrete progress, sorry. We still use the 4.10 
version in production, where it continue to work well for us.

Longer answer: The primary update is that I am now a committer and that this is 
(naturally) one of the things I will pursue. But since I just became committer 
this week, I am _not_ going to touch this pile right away - I lned abunch of 
baby-steps first.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: author_7M_tags_1852_logged_queries_warmed.png, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Multiple performance enhancements to Solr String faceting.
> * Sparse counters, switching the constant time overhead of extracting top-X 
> terms with time overhead linear to result set size
> * Counter re-use for reduced garbage collection and lower per-call overhead
> * Optional counter packing, trading speed for space
> * Improved distribution count logic, greatly improving the performance of 
> distributed faceting
> * In-segment threaded faceting
> * Regexp based white- and black-listing of facet terms
> * Heuristic faceting for large result sets
> Currently implemented for Solr 4.10. Source, detailed description and 
> directly usable WAR at http://tokee.github.io/lucene-solr/
> This project has grown beyond a simple patch and will require a fair amount 
> of co-operation with a committer to get into Solr. Splitting into smaller 
> issues is a possibility.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2017-02-16 Thread Alessandro Benedetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869820#comment-15869820
 ] 

Alessandro Benedetti commented on SOLR-5894:


Hi Toke, 
I read your very interesting blog, but i assume we don't have any update on 
this.
It would be quite an interesting improvement ( especially for big Solr 
instances).
Is there any update on this topic or everything is still frozen ?

Cheers

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: author_7M_tags_1852_logged_queries_warmed.png, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Multiple performance enhancements to Solr String faceting.
> * Sparse counters, switching the constant time overhead of extracting top-X 
> terms with time overhead linear to result set size
> * Counter re-use for reduced garbage collection and lower per-call overhead
> * Optional counter packing, trading speed for space
> * Improved distribution count logic, greatly improving the performance of 
> distributed faceting
> * In-segment threaded faceting
> * Regexp based white- and black-listing of facet terms
> * Heuristic faceting for large result sets
> Currently implemented for Solr 4.10. Source, detailed description and 
> directly usable WAR at http://tokee.github.io/lucene-solr/
> This project has grown beyond a simple patch and will require a fair amount 
> of co-operation with a committer to get into Solr. Splitting into smaller 
> issues is a possibility.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2016-12-05 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724715#comment-15724715
 ] 

Toke Eskildsen commented on SOLR-5894:
--

No, those are fully separate issues.

Faceting is a bit of a mess with multiple implementations at multiple levels. 
See SOLR-7296 for more on this.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Multiple performance enhancements to Solr String faceting.
> * Sparse counters, switching the constant time overhead of extracting top-X 
> terms with time overhead linear to result set size
> * Counter re-use for reduced garbage collection and lower per-call overhead
> * Optional counter packing, trading speed for space
> * Improved distribution count logic, greatly improving the performance of 
> distributed faceting
> * In-segment threaded faceting
> * Regexp based white- and black-listing of facet terms
> * Heuristic faceting for large result sets
> Currently implemented for Solr 4.10. Source, detailed description and 
> directly usable WAR at http://tokee.github.io/lucene-solr/
> This project has grown beyond a simple patch and will require a fair amount 
> of co-operation with a committer to get into Solr. Splitting into smaller 
> issues is a possibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2016-12-05 Thread Yago Riveiro (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722603#comment-15722603
 ] 

Yago Riveiro commented on SOLR-5894:


Are facets with sparse counters faster that current JSON facets?

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Multiple performance enhancements to Solr String faceting.
> * Sparse counters, switching the constant time overhead of extracting top-X 
> terms with time overhead linear to result set size
> * Counter re-use for reduced garbage collection and lower per-call overhead
> * Optional counter packing, trading speed for space
> * Improved distribution count logic, greatly improving the performance of 
> distributed faceting
> * In-segment threaded faceting
> * Regexp based white- and black-listing of facet terms
> * Heuristic faceting for large result sets
> Currently implemented for Solr 4.10. Source, detailed description and 
> directly usable WAR at http://tokee.github.io/lucene-solr/
> This project has grown beyond a simple patch and will require a fair amount 
> of co-operation with a committer to get into Solr. Splitting into smaller 
> issues is a possibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2015-09-29 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935297#comment-14935297
 ] 

Markus Jelsma commented on SOLR-5894:
-

Hey - any news on this? 

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Multiple performance enhancements to Solr String faceting.
> * Sparse counters, switching the constant time overhead of extracting top-X 
> terms with time overhead linear to result set size
> * Counter re-use for reduced garbage collection and lower per-call overhead
> * Optional counter packing, trading speed for space
> * Improved distribution count logic, greatly improving the performance of 
> distributed faceting
> * In-segment threaded faceting
> * Regexp based white- and black-listing of facet terms
> * Heuristic faceting for large result sets
> Currently implemented for Solr 4.10. Source, detailed description and 
> directly usable WAR at http://tokee.github.io/lucene-solr/
> This project has grown beyond a simple patch and will require a fair amount 
> of co-operation with a committer to get into Solr. Splitting into smaller 
> issues is a possibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2015-09-29 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935340#comment-14935340
 ] 

Toke Eskildsen commented on SOLR-5894:
--

Yes. It is migrated to 4.10.x (check http://tokee.github.io/lucene-solr/) and I 
will be talking about the optimizations at Lucene/Solr Revolution 2015 next 
month. I hope to discuss the ideas with other people there and figure out where 
to go with this.

> Speed up high-cardinality facets with sparse counters
> -
>
> Key: SOLR-5894
> URL: https://issues.apache.org/jira/browse/SOLR-5894
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 4.7.1
>Reporter: Toke Eskildsen
>Priority: Minor
>  Labels: faceted-search, faceting, memory, performance
> Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
> SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
> author_7M_tags_1852_logged_queries_warmed.png, 
> sparse_200docs_fc_cutoff_20140403-145412.png, 
> sparse_500docs_20140331-151918_multi.png, 
> sparse_500docs_20140331-151918_single.png, 
> sparse_5051docs_20140328-152807.png
>
>
> Multiple performance enhancements to Solr String faceting.
> * Sparse counters, switching the constant time overhead of extracting top-X 
> terms with time overhead linear to result set size
> * Counter re-use for reduced garbage collection and lower per-call overhead
> * Optional counter packing, trading speed for space
> * Improved distribution count logic, greatly improving the performance of 
> distributed faceting
> * In-segment threaded faceting
> * Regexp based white- and black-listing of facet terms
> * Heuristic faceting for large result sets
> Currently implemented for Solr 4.10. Source, detailed description and 
> directly usable WAR at http://tokee.github.io/lucene-solr/
> This project has grown beyond a simple patch and will require a fair amount 
> of co-operation with a committer to get into Solr. Splitting into smaller 
> issues is a possibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2015-04-13 Thread Manuel Lenormand (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493153#comment-14493153
 ] 

Manuel Lenormand commented on SOLR-5894:


I'll be pleased to be updated about the 4.10.x migration, I'll be watching the 
issue. We have a 40 shards collection, 3TB/100M docs. As you can notice from 
JIRA issues I've opened, scalability and performance is our main concern and 
it's nice seeing others dealing with harder use cases than ours and writing 
about it.

 Speed up high-cardinality facets with sparse counters
 -

 Key: SOLR-5894
 URL: https://issues.apache.org/jira/browse/SOLR-5894
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 4.7.1
Reporter: Toke Eskildsen
Priority: Minor
  Labels: faceted-search, faceting, memory, performance
 Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
 SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
 SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
 SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
 author_7M_tags_1852_logged_queries_warmed.png, 
 sparse_200docs_fc_cutoff_20140403-145412.png, 
 sparse_500docs_20140331-151918_multi.png, 
 sparse_500docs_20140331-151918_single.png, 
 sparse_5051docs_20140328-152807.png


 Field based faceting in Solr has two phases: Collecting counts for tags in 
 facets and extracting the requested tags.
 The execution time for the collecting phase is approximately linear to the 
 number of hits and the number of references from hits to tags. This phase is 
 not the focus here.
 The extraction time scales with the number of unique tags in the search 
 result, but is also heavily influenced by the total number of unique tags in 
 the facet as every counter, 0 or not, is visited by the extractor (at least 
 for count order). For fields with millions of unique tag values this means 
 10s of milliseconds added to the minimum response time (see 
 https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
  for a test on a corpus with 7M unique values in the facet).
 The extractor needs to visit every counter due to the current counter 
 structure being a plain int-array of size #unique_tags. Switching to a sparse 
 structure, where only the tag counters  0 are visited, makes the extraction 
 time linear to the number of unique tags in the result set.
 Unfortunately the number of unique tags in the result set is unknown at 
 collect time, so it is not possible to reliably select sparse counting vs. 
 full counting up front. Luckily there exists solutions for sparse sets that 
 has the property of switching to non-sparse-mode without a switch-penalty, 
 when the sparse-threshold is exceeded (see 
 http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
 JIRA aims to implement this functionality in Solr.
 Current status: Sparse counting is implemented for field cache faceting, both 
 single- and multi-value, with and without doc-values. Sort by count only. The 
 patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
 as all functionality is unchanged. After patching, the following new 
 parameters are possible:
 * facet.sparse=true enables sparse faceting.
 * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
 field for sparse faceting to be active. This is used for auto-selecting 
 whether sparse should be used or not.
 * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
 Setting this too low means that only very small result sets are handled as 
 sparse. Setting this too high will result in a large performance penalty if 
 the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
 work well.
 * facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
 saves memory, but performance will differ. Whether performance will be better 
 or worse depends on the corpus. Experiment with it.
 * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
 unique tags in the search result exceeds this fraction of the sparse tracker, 
 do not perform sparse tracking. The estimate is based on the assumption that 
 references from documents to tags are distributed randomly.
 * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
 keep in memory, ready for usage. Clearing and re-using a counter is faster 
 that allocating it fresh from the heap. Setting the pool size to 0 means than 
 a new sparse counter will be allocated each time, just as standard Solr 
 faceting works.
 * facet.sparse.stats=true adds a special tag with timing statistics for 
 sparse faceting.
 * 

[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-08-31 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116786#comment-14116786
 ] 

Toke Eskildsen commented on SOLR-5894:
--

Status update

The code seems to be in working state for Solr 4.8 and 4.9 for Field Cache 
faceting (single- and multi-value) and DocValue faceting (single- and 
multi-value) for Strings. It really needs testing by someone else than me, so 
that the validity of the responses and the speed upc can be verified. I am 
willing to port this to trunk if a committer is willing to work on getting the 
patch into Solr, but until then I will stick to stable versions as that is what 
we use locally.

Sparse faceting introduces some architectural changes that needs to be 
addressed.

1) The core sparse counting itself is surprisingly non-invasive. It could be a 
standalone patch. However, this only works really well in a non-distributed 
setting and has less effect in a distributed one.

2) Avoiding re-allocation of counter structures by maintaining a pool of 
structures lowers the minimum faceting time and lowers the need for GC. It is 
lowered quite a lot, I might add, as faceting is normally one of the big 
GC-triggers, so I would strongly recommend this feature.
Such a pool is very much like a cache and as such must play nice with index 
updates and other caching structures in Solr. The current state of the counter 
pool is that it works well, but is implemented independently of the usual 
caches. It is probably better to latch on to the standard caching mechanisms in 
Solr, but I have no experience with that part of the code base. The code for 
the pool itself is fairly simple and can be implemented as an experimental 
feature by setting the default pool size to 0.

3) Fast distributed faceting is about speeding up the secondary phase of 
distrubuted faceting: The fine count for specific terms. This is achieved by 
replacing the normal search-for-each-requested-term approach by a full faceted 
count approach, where the counts for all terms are calculated but only the 
counts for the specified terms are returned. This requires either a separation 
of counting and extraction in stock Solr code or an extension to the 
count-extract code block so that is can handle both the case where top-X terms 
are wanted and where specific terms are requested. Currently the sparse code 
use the latter approach, but the former is a much cleaner architecture. Both 
solutions does require some re-factoring and are a bit tricky to get right.

4) A further speed up for distributed faceting can be achieved by upgrading the 
sparse pools to real caches, so that they contain a mix of empty counters (to 
avoid GC) and filled counters (caching of counts). As the counting part of the 
second phase is always exactly the same as for the first phase, cache hit-rate 
should be very high with distribution. This is not currently implemented and is 
a bit tricky to get to work well.
I recommend that this improvement is postponed, in order to keep the current 
patch manageable.

The sparse principle would also work for Field Cache per Segment faceting of 
Strings, but no code has been written for this yet. I recommend that this is 
also postponed, to get things moving.

 Speed up high-cardinality facets with sparse counters
 -

 Key: SOLR-5894
 URL: https://issues.apache.org/jira/browse/SOLR-5894
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 4.7.1
Reporter: Toke Eskildsen
Priority: Minor
  Labels: faceted-search, faceting, memory, performance
 Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
 SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
 SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
 SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
 author_7M_tags_1852_logged_queries_warmed.png, 
 sparse_200docs_fc_cutoff_20140403-145412.png, 
 sparse_500docs_20140331-151918_multi.png, 
 sparse_500docs_20140331-151918_single.png, 
 sparse_5051docs_20140328-152807.png


 Field based faceting in Solr has two phases: Collecting counts for tags in 
 facets and extracting the requested tags.
 The execution time for the collecting phase is approximately linear to the 
 number of hits and the number of references from hits to tags. This phase is 
 not the focus here.
 The extraction time scales with the number of unique tags in the search 
 result, but is also heavily influenced by the total number of unique tags in 
 the facet as every counter, 0 or not, is visited by the extractor (at least 
 for count order). For fields with millions of unique tag values this means 
 10s of milliseconds added to the minimum response 

[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-08-21 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105715#comment-14105715
 ] 

Toke Eskildsen commented on SOLR-5894:
--

Forked Lucene/Solr on GitHub for better project management: 
https://github.com/tokee/lucene-solr . Experimental branches for sparse 
faceting are lucene_solr_4_8_sparse and lucene_solr_4_9_sparse and will contain 
the latest code. The patch above for 4.7.1 is a little behind the GitHub 
branches, as it does not have speed up for second phase (getting counts for 
specific terms) of distributed faceting.

 Speed up high-cardinality facets with sparse counters
 -

 Key: SOLR-5894
 URL: https://issues.apache.org/jira/browse/SOLR-5894
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 4.7.1
Reporter: Toke Eskildsen
Priority: Minor
  Labels: faceted-search, faceting, memory, performance
 Attachments: SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
 SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, SOLR-5894.patch, 
 SOLR-5894.patch, SOLR-5894.patch, SOLR-5894_test.zip, SOLR-5894_test.zip, 
 SOLR-5894_test.zip, SOLR-5894_test.zip, SOLR-5894_test.zip, 
 author_7M_tags_1852_logged_queries_warmed.png, 
 sparse_200docs_fc_cutoff_20140403-145412.png, 
 sparse_500docs_20140331-151918_multi.png, 
 sparse_500docs_20140331-151918_single.png, 
 sparse_5051docs_20140328-152807.png


 Field based faceting in Solr has two phases: Collecting counts for tags in 
 facets and extracting the requested tags.
 The execution time for the collecting phase is approximately linear to the 
 number of hits and the number of references from hits to tags. This phase is 
 not the focus here.
 The extraction time scales with the number of unique tags in the search 
 result, but is also heavily influenced by the total number of unique tags in 
 the facet as every counter, 0 or not, is visited by the extractor (at least 
 for count order). For fields with millions of unique tag values this means 
 10s of milliseconds added to the minimum response time (see 
 https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
  for a test on a corpus with 7M unique values in the facet).
 The extractor needs to visit every counter due to the current counter 
 structure being a plain int-array of size #unique_tags. Switching to a sparse 
 structure, where only the tag counters  0 are visited, makes the extraction 
 time linear to the number of unique tags in the result set.
 Unfortunately the number of unique tags in the result set is unknown at 
 collect time, so it is not possible to reliably select sparse counting vs. 
 full counting up front. Luckily there exists solutions for sparse sets that 
 has the property of switching to non-sparse-mode without a switch-penalty, 
 when the sparse-threshold is exceeded (see 
 http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
 JIRA aims to implement this functionality in Solr.
 Current status: Sparse counting is implemented for field cache faceting, both 
 single- and multi-value, with and without doc-values. Sort by count only. The 
 patch applies cleanly to Solr 4.6.1 and should integrate well with everything 
 as all functionality is unchanged. After patching, the following new 
 parameters are possible:
 * facet.sparse=true enables sparse faceting.
 * facet.sparse.mintags=1 the minimum amount of unique tags in the given 
 field for sparse faceting to be active. This is used for auto-selecting 
 whether sparse should be used or not.
 * facet.sparse.fraction=0.08 the overhead used for the sparse tracker. 
 Setting this too low means that only very small result sets are handled as 
 sparse. Setting this too high will result in a large performance penalty if 
 the result set blows the sparse tracker. Values between 0.04 and 0.1 seems to 
 work well.
 * facet.sparse.packed=true use PackecInts for counters instead of int[]. This 
 saves memory, but performance will differ. Whether performance will be better 
 or worse depends on the corpus. Experiment with it.
 * facet.sparse.cutoff=0.90 if the estimated number (based on hitcount) of 
 unique tags in the search result exceeds this fraction of the sparse tracker, 
 do not perform sparse tracking. The estimate is based on the assumption that 
 references from documents to tags are distributed randomly.
 * facet.sparse.pool.size=2 the maximum amount of sparse trackers to clear and 
 keep in memory, ready for usage. Clearing and re-using a counter is faster 
 that allocating it fresh from the heap. Setting the pool size to 0 means than 
 a new sparse counter will be allocated each time, just as standard Solr 
 faceting works.
 * 

[jira] [Commented] (SOLR-5894) Speed up high-cardinality facets with sparse counters

2014-03-21 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943092#comment-13943092
 ] 

David Smiley commented on SOLR-5894:


Awesome work Toke!

 Speed up high-cardinality facets with sparse counters
 -

 Key: SOLR-5894
 URL: https://issues.apache.org/jira/browse/SOLR-5894
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 4.6.1, 4.7
Reporter: Toke Eskildsen
Priority: Minor
 Fix For: 4.6.1

 Attachments: SOLR-5894.patch, 
 author_7M_tags_1852_logged_queries_warmed.png


 Field based faceting in Solr has two phases: Collecting counts for tags in 
 facets and extracting the requested tags.
 The execution time for the collecting phase is approximately linear to the 
 number of hits and the number of references from hits to tags. This phase is 
 not the focus here.
 The extraction time scales with the number of unique tags in the search 
 result, but is also heavily influenced by the total number of unique tags in 
 the facet as every counter, 0 or not, is visited by the extractor (at least 
 for count order). For fields with millions of unique tag values this means 
 10s of milliseconds added to the minimum response time (see 
 https://sbdevel.wordpress.com/2014/03/18/sparse-facet-counting-on-a-real-index/
  for a test on a corpus with 7M unique values in the facet).
 The extractor needs to visit every counter due to the current counter 
 structure being a plain int-array of size #unique_tags. Switching to a sparse 
 structure, where only the tag counters  0 are visited, makes the extraction 
 time linear to the number of unique tags in the result set.
 Unfortunately the number of unique tags in the result set is unknown at 
 collect time, so it is not possible to reliably select sparse counting vs. 
 full counting up front. Luckily there exists solutions for sparse sets that 
 has the property of switching to non-sparse-mode without a switch-penalty, 
 when the sparse-threshold is exceeded (see 
 http://programmingpraxis.com/2012/03/09/sparse-sets/ for an example). This 
 JIRA aims to implement this functionality in Solr (a proof of concept patch 
 will be provided shortly).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org