[
https://issues.apache.org/jira/browse/LUCENE-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527077#comment-13527077
]
Gilad Barkai commented on LUCENE-4600:
--------------------------------------
Aggregating all doc ids first also makes it easier to compute actual results
after sampling.
That is done by taking the sampling result top-(c)K and calculating their true
value over all matching documents, giving the benefit of sampling plus results
that make sense to the user (e.g. in counting, the end number would
actually be the number of matching documents for this category).
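
To make the sampling idea concrete, here is a rough sketch, not the facet module's actual code. Everything in it is made up for illustration: the class and method names, the in-memory int[][] standing in for the per-document category ordinals, and the "every step-th hit" sampling rule. It also assumes that top-(c)K means keeping somewhat more than K candidates from the sample, passed here as cK.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class SampledFacetCounts {

    // Counts category ordinals over every step-th entry of docIds.
    // docCategories[doc] is a hypothetical stand-in for the ordinals stored per document.
    static int[] countEvery(int[][] docCategories, int[] docIds, int step, int numCategories) {
        int[] counts = new int[numCategories];
        for (int i = 0; i < docIds.length; i += step) {
            for (int ord : docCategories[docIds[i]]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    // Returns up to k ordinals with the highest counts (ties broken by lower ordinal).
    static int[] topOrdinals(int[] counts, int k) {
        Integer[] ords = new Integer[counts.length];
        for (int i = 0; i < ords.length; i++) {
            ords[i] = i;
        }
        Arrays.sort(ords, (a, b) -> counts[a] != counts[b]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        int n = Math.min(k, ords.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = ords[i];
        }
        return top;
    }

    // Step 1: pick cK candidate categories from a sample of the matching docs.
    // Step 2: compute the true counts of those candidates over ALL matching docs.
    static Map<Integer, Integer> sampleThenRecount(int[][] docCategories, int[] matchingDocs,
                                                   int step, int cK, int numCategories) {
        int[] sampled = countEvery(docCategories, matchingDocs, step, numCategories);
        int[] candidates = topOrdinals(sampled, cK);

        boolean[] isCandidate = new boolean[numCategories];
        for (int ord : candidates) {
            isCandidate[ord] = true;
        }
        int[] exact = new int[numCategories];
        for (int doc : matchingDocs) {
            for (int ord : docCategories[doc]) {
                if (isCandidate[ord]) {
                    exact[ord]++;
                }
            }
        }

        // Candidates keep their sampled order, but carry exact counts.
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int ord : candidates) {
            result.put(ord, exact[ord]);
        }
        return result;
    }
}
```

The second step needs the full set of matching doc ids, which is why gathering them first makes this easy. A caller would then keep the best K of the returned candidates.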
As for aggregating 'on the fly', it has some other issues:
* It is (was?) believed that accessing the counting array during query execution
may lead to memory cache issues. The entire counting array could be accessed
for every document over and over, and it's not guaranteed it would fit into the
cache (that's the CPU's one). That might not be a problem on modern hardware.
* While the OS can cache all payload data itself, it gets difficult as the
index grows. If the OS fails to cache the file, it is (again, was?) believed
that going over the file in sequential manner once without seeks (at least by
the current thread) would make it faster.
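
For reference, the two approaches being compared can be sketched roughly as below. This is only an illustration under assumptions, not Lucene's API: the HitCollector interface, both class names, and the int[][] of per-document ordinals (standing in for the payload data) are all hypothetical. The two-pass variant touches the counting array only after collection, walking the hits in doc id order; the on-the-fly variant touches it for every hit while the query is still running.

```java
import java.util.BitSet;

public class FacetAggregationSketch {

    // Hypothetical minimal collector: receives matching doc ids, then reports counts.
    interface HitCollector {
        void collect(int doc);
        int[] counts();
    }

    // Gathers all hits in a bit set first, aggregates in a second pass at the end.
    static final class TwoPass implements HitCollector {
        private final int[][] docOrdinals;
        private final int numCategories;
        private final BitSet hits = new BitSet();

        TwoPass(int[][] docOrdinals, int numCategories) {
            this.docOrdinals = docOrdinals;
            this.numCategories = numCategories;
        }

        @Override
        public void collect(int doc) {
            hits.set(doc); // only remember the hit, no counting yet
        }

        @Override
        public int[] counts() {
            int[] counts = new int[numCategories];
            // Second pass: visit hits in increasing doc id order.
            for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
                for (int ord : docOrdinals[doc]) {
                    counts[ord]++;
                }
            }
            return counts;
        }
    }

    // Updates the counting array immediately for every collected hit.
    static final class OnTheFly implements HitCollector {
        private final int[][] docOrdinals;
        private final int[] counts;

        OnTheFly(int[][] docOrdinals, int numCategories) {
            this.docOrdinals = docOrdinals;
            this.counts = new int[numCategories];
        }

        @Override
        public void collect(int doc) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++; // counting array accessed during query execution
            }
        }

        @Override
        public int[] counts() {
            return counts.clone();
        }
    }
}
```

Both variants produce the same counts. They differ in when the counting array and the per-document data are accessed, and in the transient memory held for the hits, which is what the benchmark would have to settle.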
It's sort of becoming a religion with all those "beliefs", as some scenarios
used to make sense a few years ago. I'm not sure they still do.
Can't wait to see how some of these co-exist with the benchmark results.
If all religions could have been benchmarked... ;)
> Facets should aggregate during collection, not at the end
> ---------------------------------------------------------
>
> Key: LUCENE-4600
> URL: https://issues.apache.org/jira/browse/LUCENE-4600
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
>
> Today the facet module simply gathers all hits (as a bitset, optionally with
> a float[] to hold scores as well, if you will aggregate them) during
> collection, and then at the end when you call getFacetsResults(), it makes a
> 2nd pass over all those hits doing the actual aggregation.
> We should investigate just aggregating as we collect instead, so we don't
> have to tie up transient RAM (fairly small for the bit set but possibly big
> for the float[]).