[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922234#comment-13922234 ]
Shai Erera commented on LUCENE-5476:
------------------------------------

Thanks Rob. A few comments:

* I don't think we need both totalHits and segmentHits. It may not look expensive, but those ++ operations add up over millions of hits. Instead, I think we should stick with the original per-segment totalHits and have RandomSamplingFC do a first pass to sum up the global totalHits.
** With that, I think no changes are needed to FacetsCollector?

About RandomSamplingFacetsCollector:

* I think we should fix the last "Note" in the class jdoc as follows: "Note: if you use a counting {@link Facets} implementation, you can fix the sampled counts by calling...".
* Also, instead of correctFacetCounts I think we should call it amortizeFacetCounts or something like that, since we do not implement here the exact facet-counting method that existed before.
* I see that you removed sampleRatio, and the ratio is now computed as threshold/totalDocs, but I think that's wrong: if threshold=10 and totalHits=1000, I'll still get only 10 documents, and that is not what threshold means.
** I think we should have minSampleSize, below which we don't sample at all (that's the _threshold_).
** sampleRatio (e.g. 1%) is used only if totalHits > minSampleSize, and even then we make sure that we sample at least minSampleSize documents.
** If we add maxSampleSize as well, we can take that into account too, but it's OK if we don't do this in this issue.
* createSample seems to be memory-less -- i.e. it doesn't carry the bin residue over to the next segment. I'm not sure it's critical, but it might affect the total sample size. If you feel like getting closer to the optimum, do you want to fix it? Otherwise, can you please drop a TODO?
* Also, do you want to test using WAH8DocIdSet instead of FixedBitSet for the sampled docs? If not, could you please drop a TODO that we could use a more efficient bitset implementation, since the sampled set is a sparse vector?
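The minSampleSize/sampleRatio scheme described above could look roughly like this (just a sketch to illustrate the proposal; computeSampleSize and its parameter names are hypothetical, not the patch's API):

```java
public class SampleSizeSketch {

  // Proposed scheme: below minSampleSize we don't sample at all;
  // otherwise we take sampleRatio of the hits, but never fewer
  // than minSampleSize documents.
  static int computeSampleSize(int totalHits, int minSampleSize, double sampleRatio) {
    if (totalHits <= minSampleSize) {
      return totalHits; // too few hits -- keep them all, no sampling
    }
    return Math.max(minSampleSize, (int) (totalHits * sampleRatio));
  }

  public static void main(String[] args) {
    // minSampleSize=10, ratio=1%: 1000 hits -> floor of 10 sampled docs
    System.out.println(computeSampleSize(1000, 10, 0.01));
    // 1,000,000 hits -> 1% = 10,000 sampled docs
    System.out.println(computeSampleSize(1000000, 10, 0.01));
    // 5 hits, below the threshold -> all 5 kept
    System.out.println(computeSampleSize(5, 10, 0.01));
  }
}
```

This is what distinguishes the proposal from the threshold/totalDocs ratio in the current patch: the threshold acts as a floor on the sample size, not as the sample size itself.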
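The bin-residue point can be illustrated with a small sketch (hypothetical names, not the patch's createSample): if sampling takes one doc per bin of binSize hits, a memory-less sampler resets its bin at every segment boundary and loses the partial bins, while carrying the residue forward keeps the overall sample size on target.

```java
public class BinResidueSketch {

  // Sample one doc out of every binSize hits, carrying the leftover
  // (residue) from one segment into the next so segment boundaries
  // don't reset the bin and shrink the total sample.
  static int sampleWithResidue(int[] segmentHitCounts, int binSize) {
    int sampled = 0;
    int residue = 0; // hits left over from the previous segment's last bin
    for (int hits : segmentHitCounts) {
      int total = residue + hits;
      sampled += total / binSize; // one sample per complete bin
      residue = total % binSize;  // carry the partial bin forward
    }
    return sampled;
  }

  // Memory-less variant: each segment starts a fresh bin, so the
  // partial bin at the end of every segment is simply discarded.
  static int sampleMemoryless(int[] segmentHitCounts, int binSize) {
    int sampled = 0;
    for (int hits : segmentHitCounts) {
      sampled += hits / binSize;
    }
    return sampled;
  }

  public static void main(String[] args) {
    int[] segments = {150, 150, 150, 150}; // 600 hits across 4 segments
    // binSize=100: carrying residue yields 6 samples, memory-less only 4
    System.out.println(sampleWithResidue(segments, 100));
    System.out.println(sampleMemoryless(segments, 100));
  }
}
```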
About the test:

* Could you remove the 's' from "collectors" in the test name?
* Could you make numDocs a random number -- something like atLeast(8000)?
* I don't mean to nitpick, but if you obtain an NRT reader, there's no need to commit() :)
* Make the two collector instances take 100/10% of numDocs when you fix that.
* Maybe use random.nextInt(10) for the facets instead of alternating sequentially?
* I don't understand how you know that numChildren=5 when you ask for the 10 top children. Isn't it possible that with some random seed the number of children will be different?
** In fact, I think the random collectors should be initialized with a random seed that depends on the test? Currently they aren't, and so they always use 0xdeadbeef.
* You have some sops left at the end of the test.

> Facet sampling
> --------------
>
>         Key: LUCENE-5476
>         URL: https://issues.apache.org/jira/browse/LUCENE-5476
>     Project: Lucene - Core
>  Issue Type: Improvement
>    Reporter: Rob Audenaerde
> Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch,
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
> With LUCENE-5339, facet sampling disappeared.
> When trying to display facet counts on large datasets (>10M documents),
> counting facets is rather expensive, as all the hits are collected and
> processed. Sampling greatly reduced this cost and thus provided a nice
> speedup. Could it be brought back?

--
This message was sent by Atlassian JIRA (v6.2#6252)