[jira] [Commented] (LUCENE-5476) Facet sampling

Rob Audenaerde (JIRA) Fri, 07 Mar 2014 08:41:31 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924037#comment-13924037
 ]


Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

{quote}
...Given our test framework, randomness is not a big deal at all, since once we 
get a test failure, we can deterministically reproduce the failure (when there 
is no multi-threading)...
{quote}
Ok, this makes sense to me. 

{quote}
It looks like it hasn't changed? I mean besides the rename. So if I set 
sampleSize=100K, it's 100K whether there are 101K docs or 100M docs, right? Is 
that your intention?
{quote}
Correct, that is my intention. I prefer not to increase the {{sampleSize}} as 
the number of hits grows: bigger samples are slower, and 100K is a reasonable 
sample size anyway. Instead, I adjust the sampleRatio so that the resulting 
set of documents is (close to) the {{sampleSize}}.
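
The ratio adjustment described above can be sketched roughly as follows. This 
is only an illustration: the class and method names are hypothetical and are 
not taken from the attached patch, and it assumes the ratio is simply the 
target sample size divided by the number of hits.

```java
// Hypothetical sketch, not the code from the LUCENE-5476 patch.
class SampleRatioSketch {

    /**
     * Returns the fraction of hits to keep so that roughly sampleSize
     * documents remain, however many hits there are.
     */
    static double sampleRatio(int sampleSize, int totalHits) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        if (totalHits <= sampleSize) {
            return 1.0; // fewer hits than the target: keep everything
        }
        return (double) sampleSize / totalHits;
    }

    public static void main(String[] args) {
        // Same sampleSize for 101K hits and for 100M hits; only the ratio changes.
        System.out.println(sampleRatio(100_000, 101_000));
        System.out.println(sampleRatio(100_000, 100_000_000));
    }
}
```

With a fixed {{sampleSize}}, the expected number of sampled documents stays 
about the same for any number of hits, so the cost of counting does not grow 
with the result set.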

{quote}
I find this assert just redundant – if we always expect 5, we shouldn't assert 
that we received 5. If we say that very infrequently we might get <5 and we're 
OK with it .. what's the point of asserting that at all?
{quote}
Agreed with the <5. Asserting seems redundant, but is that not the point of 
unit tests? The trick is that the assertion should still hold if you change the 
implementation.

I will add more next week. 

Btw, is there an easy way to retrieve the total facet count for an ordinal? 
When correcting facet counts, it would be a quick win to limit the estimated 
number of documents to the actual number of documents in the index that match 
that facet. (And maybe use the distribution as well, to make better estimates.)
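
The capping idea could look roughly like the sketch below. All names are 
hypothetical, and {{totalCountInIndex}} is assumed to be available from 
somewhere; how to actually retrieve it for an ordinal is exactly the open 
question above, so it is passed in as a plain parameter here.

```java
// Hypothetical sketch, not the code from the LUCENE-5476 patch.
class CorrectedCountSketch {

    /**
     * Scales a sampled facet count back up by the sample ratio, then caps
     * the estimate at the number of index documents that match the facet.
     */
    static int correctedCount(int sampledCount, double sampleRatio, int totalCountInIndex) {
        if (sampleRatio <= 0.0 || sampleRatio > 1.0) {
            throw new IllegalArgumentException("sampleRatio must be in (0, 1]");
        }
        long estimate = Math.round(sampledCount / sampleRatio);
        // The estimate can never exceed what the index actually contains.
        return (int) Math.min(estimate, (long) totalCountInIndex);
    }

    public static void main(String[] args) {
        // Sampled count 50 at ratio 0.001 extrapolates well past 30000,
        // so the result is capped at the index total of 30000.
        System.out.println(correctedCount(50, 0.001, 30_000));
    }
}
```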

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
