[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915796#comment-13915796 ]
Rob Audenaerde commented on LUCENE-5476:
----------------------------------------

Thanks guys for the feedback (also on my language skills, I need to improve my English ;))

{quote}
It might be good to allow passing the random seed, for repeatable results?
{quote}
Yes! This is very sensible for testing and for more 'stable' screen results, and I will add this.

{quote}
Another option, which would save the 2nd pass, would be to do the sampling during Docs.addDoc.
{quote}
I considered sampling on the 'addDocument', but I figured it would be more expensive, as we would then need to do a random() calculation for each hit.

{quote}
I think SamplingFC.createDocs should return a declared SampledDocs (see later) instead of anonymous class
{quote}
I also considered this. It is far better for clarity's sake, but it also costs a copy of the original. I will try some approaches and will make sure the sampling is only done once.

{quote}
I like that this impl samples per-segment as it allows to tune the sample on a per-segment basis. E.g. small segments (as in NRT) probably don't need to be sampled at all. If we allow passing different parameters such as sampleRatio, min/maxSampleSize, we could tune sampling per-segment.
{quote}
This was more or less by accident, but it indeed seems useful. All segments need the same sampling ratio though, else it would be really hard to correct the counts afterwards. (Or am I missing something here?)

{quote}
Maybe wrap all the parameters in a SamplingConfig?
{quote}
Yes. Very useful, and it makes the API more stable.

{quote}
The old implementation let you specify different parameters such as sample size, minimum number of documents to evaluate, maximum number of documents to evaluate etc
{quote}
The old style sampling indeed had a fixed sample size, which I found very useful. However, I have not yet found a way to implement this, as I do not know the total number of results when I start faceting, so I cannot determine the samplingRatio. I could of course first count all results, but that also impacts performance, as I would need two passes. I will give it some more thought, but maybe you have an idea on how to accomplish this in a better way?

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents)
> counting facets is rather expensive, as all the hits are collected and
> processed.
> Sampling greatly reduced this and thus provided a nice speedup. Could it be
> brought back?
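For illustration only, the seeded, ratio-based sampling and the count correction discussed in the comment could be sketched roughly as follows. This is not the attached SamplingFacetsCollector and it does not use any Lucene API; the class name `SeededSamplingSketch` and the methods `sample` and `correctCount` are hypothetical. It assumes the matching doc ids are already available as an `int[]`, and it draws one random number per hit, which is exactly the per-hit cost mentioned in the comment.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of seeded sampling; names are assumptions, not Lucene API. */
final class SeededSamplingSketch {

    /**
     * Keeps each doc id independently with probability {@code ratio}.
     * The Random is created from {@code seed}, so the same seed and the
     * same input always produce the same sample (repeatable results).
     */
    static int[] sample(int[] docIds, double ratio, long seed) {
        Random random = new Random(seed);
        int[] kept = new int[docIds.length];
        int size = 0;
        for (int docId : docIds) {
            // One random draw per hit: simple, but this is the cost
            // the comment worries about for large result sets.
            if (random.nextDouble() < ratio) {
                kept[size++] = docId;
            }
        }
        return Arrays.copyOf(kept, size);
    }

    /**
     * Scales a facet count measured on the sample back to an estimate
     * for the full result set. This only works if every segment was
     * sampled with the same ratio.
     */
    static long correctCount(long sampledCount, double ratio) {
        return Math.round(sampledCount / ratio);
    }
}
```

With a ratio of 0.1, a facet value counted 50 times in the sample would be reported as roughly 500. Because the correction is a single division by the ratio, mixing different ratios across segments would require correcting each segment's counts separately, which is the difficulty raised in the comment.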