[ 
https://issues.apache.org/jira/browse/LUCENE-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527077#comment-13527077
 ] 

Gilad Barkai commented on LUCENE-4600:
--------------------------------------

Aggregating all doc ids first also makes it easier to compute actual results 
after sampling. 
That is done by taking the top-(c)K results from the sample and calculating 
their true values over all matching documents. This gives the benefit of 
sampling together with results that make sense to the user (e.g. when 
counting, the final number is actually the number of matching documents in 
that category).
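
To make the sample-then-recount idea concrete, here is a rough sketch in Java. 
It is not the facet module's code: the class name, the method signature and the 
one-category-per-document layout (docCategory) are all assumptions made for 
illustration, and the sampling is a naive "every Nth match".

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these names come from the Lucene facet module. */
final class SampledTopK {

    /**
     * Returns up to k rows of {categoryOrdinal, exactCount}, best first.
     * docCategory[doc] holds the single category of each document (an assumed
     * simplification; real documents can carry several categories).
     */
    static int[][] topK(BitSet matches, int[] docCategory, int numCategories,
                        int sampleEvery, int k, int c) {
        // 1. Count over a sample only: every sampleEvery-th matching document.
        int[] sampled = new int[numCategories];
        int seen = 0;
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            if (seen++ % sampleEvery == 0) {
                sampled[docCategory[doc]]++;
            }
        }

        // 2. Keep the top c*K categories by sampled count as candidates.
        List<Integer> candidates = new ArrayList<>();
        for (int ord = 0; ord < numCategories; ord++) {
            if (sampled[ord] > 0) {
                candidates.add(ord);
            }
        }
        candidates.sort(Comparator.comparingInt((Integer ord) -> sampled[ord]).reversed());
        int limit = Math.min(c * k, candidates.size());
        List<Integer> finalists = new ArrayList<>(candidates.subList(0, limit));

        // 3. Recount only the candidates, this time over ALL matching documents.
        boolean[] isCandidate = new boolean[numCategories];
        for (int ord : finalists) {
            isCandidate[ord] = true;
        }
        int[] exact = new int[numCategories];
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            int ord = docCategory[doc];
            if (isCandidate[ord]) {
                exact[ord]++;
            }
        }

        // 4. The final top K is ordered by, and reports, the exact counts.
        finalists.sort(Comparator.comparingInt((Integer ord) -> exact[ord]).reversed());
        int n = Math.min(k, finalists.size());
        int[][] result = new int[n][];
        for (int i = 0; i < n; i++) {
            int ord = finalists.get(i);
            result[i] = new int[] {ord, exact[ord]};
        }
        return result;
    }
}
```

Step 3 is the reason the matching doc ids need to be kept around: the recount 
walks the full set of matches again, which is only possible if they were 
gathered before aggregation.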

As for aggregating 'on the fly', it has some other issues:
* It is (was?) believed that accessing the counting array during query 
execution may lead to memory cache issues. The entire counting array could be 
accessed for every document, over and over, and it's not guaranteed to fit 
into the cache (the CPU's cache, that is). That might not be a problem on 
modern hardware.
* While the OS can cache all the payload data itself, that gets difficult as 
the index grows. If the OS fails to cache the file, it is (again, was?) 
believed that going over the file once, sequentially and without seeks (at 
least by the current thread), would be faster.
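
For reference, the two strategies being compared look roughly like this. This 
is a simplified sketch, not the facet module's API: the class and method names 
are made up, the per-document category ordinals are assumed to already sit in 
memory as docOrdinals (in the real module they are read from payloads), and 
hits is assumed to contain each matching doc id once.

```java
import java.util.BitSet;

/** Hypothetical sketch of the two strategies; not the facet module's actual API. */
final class AggregationStrategies {

    /** Two passes: remember the matching doc ids first, then count in one sweep. */
    static int[] countAtEnd(int[] hits, int[][] docOrdinals, int numOrdinals) {
        BitSet matches = new BitSet(docOrdinals.length);
        for (int doc : hits) {
            matches.set(doc); // collection touches only the bitset
        }

        int[] counts = new int[numOrdinals];
        // The second pass visits documents in increasing doc id order.
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** On the fly: update the counting array as each hit is collected. */
    static int[] countDuringCollection(int[] hits, int[][] docOrdinals, int numOrdinals) {
        int[] counts = new int[numOrdinals];
        for (int doc : hits) {
            for (int ord : docOrdinals[doc]) {
                counts[ord]++; // counting array is touched in the middle of collection
            }
        }
        return counts;
    }
}
```

Both methods return the same counts. What differs is when the counting array 
and the per-document category data are touched, and whether the matching doc 
ids have to be held in RAM in between.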

It is sort of becoming a religion with all those "beliefs": some of these 
scenarios made sense a few years ago, and I'm not sure they still do. 
Can't wait to see how some of these square with the benchmark results.
If only all religions could be benchmarked... ;)


                
> Facets should aggregate during collection, not at the end
> ---------------------------------------------------------
>
>                 Key: LUCENE-4600
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4600
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>
> Today the facet module simply gathers all hits (as a bitset, optionally with 
> a float[] to hold scores as well, if you will aggregate them) during 
> collection, and then at the end when you call getFacetsResults(), it makes a 
> 2nd pass over all those hits doing the actual aggregation.
> We should investigate just aggregating as we collect instead, so we don't 
> have to tie up transient RAM (fairly small for the bit set but possibly big 
> for the float[]).
