[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915796#comment-13915796 ]

Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

Thanks, guys, for the feedback (also on my language skills; I need to improve my 
English ;))

{quote}
It might be good to allow passing the random seed, for repeatable results?
{quote}
Yes! This is very sensible for testing and for more 'stable' on-screen results; I 
will add it.

{quote}
Another option, which would save the 2nd pass, would be to do the sampling 
during Docs.addDoc.
{quote}
I considered sampling in 'addDoc', but I figured it would be more expensive, as we 
would then need to do a random() calculation for each hit.
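
Roughly, that variant would look something like the sketch below (this is not the 
attached patch; I am assuming FacetsCollector's protected createDocs(int maxDoc) 
hook and a Docs with addDoc(int docId) and getDocIdSet(), and the class name is 
just a placeholder):

{code:java}
import java.io.IOException;
import java.util.Random;

import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.search.DocIdSet;

// Sketch only: sampling during collection, one random draw per hit.
public class AddDocSamplingCollector extends FacetsCollector {
  private final double sampleRatio;
  private final Random random; // seeded, so results are repeatable

  public AddDocSamplingCollector(double sampleRatio, long seed) {
    this.sampleRatio = sampleRatio;
    this.random = new Random(seed);
  }

  @Override
  protected Docs createDocs(int maxDoc) {
    final Docs delegate = super.createDocs(maxDoc);
    return new Docs() {
      @Override
      public void addDoc(int docId) throws IOException {
        // this is the extra random() call per hit mentioned above
        if (random.nextDouble() <= sampleRatio) {
          delegate.addDoc(docId);
        }
      }

      @Override
      public DocIdSet getDocIdSet() {
        return delegate.getDocIdSet();
      }
    };
  }
}
{code}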

{quote}
I think SamplingFC.createDocs should return a declared SampledDocs (see later) 
instead of anonymous class
{quote}
I also considered this. It is far better for clarity's sake, but it also costs a 
copy of the original. I will try some approaches and make sure the sampling is only 
done once. 
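
What I have in mind is something like the declared class below (again only a 
sketch with placeholder details; it would sit nested inside the collector so it 
can see the protected Docs type, and it assumes the 4.x FixedBitSet, which is 
itself a DocIdSet). Collection stays untouched; the sampling happens once, lazily, 
in getDocIdSet(), and the FixedBitSet is the extra copy I mentioned:

{code:java}
// Declared instead of anonymous; uses java.util.Random, DocIdSet,
// DocIdSetIterator and FixedBitSet from the enclosing collector's imports.
static final class SampledDocs extends Docs {
  private final Docs delegate;   // the unsampled Docs from createDocs()
  private final int maxDoc;
  private final double sampleRatio;
  private final Random random;   // seeded, for repeatable results
  private DocIdSet sampled;      // cached, so sampling is only done once

  SampledDocs(Docs delegate, int maxDoc, double sampleRatio, Random random) {
    this.delegate = delegate;
    this.maxDoc = maxDoc;
    this.sampleRatio = sampleRatio;
    this.random = random;
  }

  @Override
  public void addDoc(int docId) throws IOException {
    delegate.addDoc(docId); // collection itself is unchanged
  }

  @Override
  public DocIdSet getDocIdSet() {
    if (sampled == null) {
      try {
        FixedBitSet bits = new FixedBitSet(maxDoc); // the extra copy
        DocIdSetIterator it = delegate.getDocIdSet().iterator();
        for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS;
             doc = it.nextDoc()) {
          if (random.nextDouble() <= sampleRatio) {
            bits.set(doc);
          }
        }
        sampled = bits;
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
    return sampled;
  }
}
{code}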

{quote}
I like that this impl samples per-segment as it allows to tune the sample on a 
per-segment basis. E.g. small segments (as in NRT) probably don't need to be 
sampled at all. If we allow passing different parameters such as sampleRatio, 
min/maxSampleSize, we could tune sampling per-segment.
{quote}
This was more or less by accident, but it does indeed seem useful. All segments 
need the same sampling ratio though, or else it would be really hard to correct the 
counts afterwards. (Or am I missing something here?)
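
Just to make that correction step concrete (illustrative numbers only):

{code:java}
// With a single global ratio, the correction is one multiplication applied to
// each facet value's count after all segments have been merged; with different
// per-segment ratios the counts would have to be corrected per segment, before
// merging, which is why a single ratio seems simpler to me.
double sampleRatio = 0.01;  // example: keep ~1% of the hits
int sampledCount = 123;     // count computed on the sample
int correctedCount = (int) Math.round(sampledCount / sampleRatio); // ~12300
{code}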

{quote}
Maybe wrap all the parameters in a SamplingConfig?
{quote}
Yes, very useful, and it keeps the collector's API stable when parameters are added 
or changed.
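
Something along these lines, perhaps (the class and parameter names are only 
placeholders, not from the patch):

{code:java}
// Hypothetical SamplingConfig, only to show the kind of parameters discussed
// in this issue.
public final class SamplingConfig {
  private final double sampleRatio; // fraction of hits to keep
  private final int minSampleSize;  // below this many hits, don't sample at all
  private final int maxSampleSize;  // upper bound on sampled hits
  private final long seed;          // for repeatable results

  public SamplingConfig(double sampleRatio, int minSampleSize,
                        int maxSampleSize, long seed) {
    this.sampleRatio = sampleRatio;
    this.minSampleSize = minSampleSize;
    this.maxSampleSize = maxSampleSize;
    this.seed = seed;
  }

  public double getSampleRatio() { return sampleRatio; }
  public int getMinSampleSize() { return minSampleSize; }
  public int getMaxSampleSize() { return maxSampleSize; }
  public long getSeed() { return seed; }
}
{code}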

{quote}
The old implementation let you specify different parameters such as sample 
size, minimum number of documents to evaluate, maximum number of documents to 
evaluate etc
{quote}

The old-style sampling indeed had a fixed sample size, which I found very useful. 
However, I have not yet found a way to implement this, as I do not know the total 
number of results when I start faceting, so I cannot determine the samplingRatio. 
I could of course count all results first, but that also impacts performance, as I 
would need two passes. I will give it some more thought, but maybe you have an 
idea on how to accomplish this in a better way?
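
To spell the dependency out (illustrative numbers only):

{code:java}
// A fixed sample size only turns into a ratio once the total hit count is
// known, and that count is only available after a first, counting-only pass.
int totalHits = 1000000;       // would come from the first pass
int fixedSampleSize = 10000;   // the fixed size the old implementation offered
double sampleRatio = Math.min(1.0, (double) fixedSampleSize / totalHits); // 0.01
{code}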
 

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



