[ https://issues.apache.org/jira/browse/LUCENE-5476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922234#comment-13922234 ]

Shai Erera commented on LUCENE-5476:
------------------------------------

Thanks Rob. A few comments:

* I don't think we need both totalHits and segmentHits. I know it may not look 
expensive, but those ++ add up over millions of hits. Instead, I think we 
should stick w/ the original per-segment totalHits and have RandomSamplingFC 
do a first pass to sum up the global totalHits.
** With that, I think no changes are needed to FacetsCollector?
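That first pass can be sketched like this (a minimal illustration with hypothetical names; in the real collector the counts would come from FacetsCollector's per-segment MatchingDocs):

```java
import java.util.List;

/** Minimal sketch (hypothetical names): sum per-segment totalHits once, up front. */
public class GlobalHits {
  /** One pass over the per-segment counts instead of a second ++ per collected hit. */
  static long sum(List<Integer> perSegmentTotalHits) {
    long total = 0;
    for (int hits : perSegmentTotalHits) {
      total += hits;
    }
    return total;
  }
}
```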

About RandomSamplingFacetsCollector:

* I think we should fix the class jdoc's last "Note" as follows: "Note: if you 
use a counting {@link Facets} implementation, you can fix the sampled counts 
by calling...".

* Also, I think we should call it amortizeFacetCounts (or something like that) 
instead of correctFacetCounts, since we do not implement here the exact facet 
counting method that existed before.

* I see that you removed sampleRatio, and the ratio is now computed as 
threshold/totalDocs, but I think that's wrong: if threshold=10 and 
totalHits=1000, I'll still get only 10 documents, and that is not what 
threshold means.
** I think we should have minSampleSize, below which we don't sample at all 
(that's the _threshold_).
** sampleRatio (e.g. 1%) is used only if totalHits > minSampleSize, and even 
then we make sure to sample at least minSampleSize documents.
** If we add maxSampleSize as well, we can take it into account too, but it's 
OK if we don't do that in this issue.
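To make the proposal concrete, here is a hedged sketch of the sampling decision described above (method and parameter names are hypothetical, not from the patch):

```java
/** Hypothetical sketch of the proposed sample-size logic. */
public class SampleSize {
  /**
   * Below minSampleSize we don't sample at all; otherwise we take
   * sampleRatio of the hits, but never fewer than minSampleSize.
   */
  static int effectiveSampleSize(int totalHits, int minSampleSize, double sampleRatio) {
    if (totalHits <= minSampleSize) {
      return totalHits; // under the threshold: keep all hits, no sampling
    }
    return Math.max(minSampleSize, (int) (sampleRatio * totalHits));
  }
}
```

With e.g. minSampleSize=100 and sampleRatio=0.01, a result set of 50 hits is kept whole, 1,000 hits still yields a 100-doc sample, and 1,000,000 hits yields 10,000.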

* createSample seems to be memory-less -- i.e. it doesn't carry the bin 
residue over to the next segment. I'm not sure it's critical, but it might 
affect the total sample size. If you feel like getting closer to the optimum, 
do you want to fix it? Otherwise, can you please drop a TODO?
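The carry-over idea can be illustrated with a small self-contained sketch (hypothetical class; it just streams doc IDs per segment and keeps the first doc of each bin, holding the in-bin position across segment boundaries instead of resetting it):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: bin sampling whose residue survives segment boundaries. */
public class BinSampler {
  private final int binSize;
  private int posInBin = 0; // carried across segments instead of reset per segment

  BinSampler(int binSize) {
    this.binSize = binSize;
  }

  /** Keeps the first doc of every bin; call once per segment, in order. */
  List<Integer> sampleSegment(int[] segmentDocs) {
    List<Integer> sampled = new ArrayList<>();
    for (int doc : segmentDocs) {
      if (posInBin == 0) {
        sampled.add(doc);
      }
      posInBin = (posInBin + 1) % binSize;
    }
    return sampled;
  }
}
```

Because posInBin is not reset, a segment whose length is not a multiple of binSize no longer inflates the next segment's sample, keeping the total sample size close to totalHits/binSize.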

* Also, do you want to test using WAH8DocIdSet instead of FixedBitSet for the 
sampled docs? If not, could you please drop a TODO that we could use a more 
efficient bitset impl, since the sample is a sparse vector?

About the test:

* Could you remove the 's' from collectors in the test name?
* Could you make numDocs a random number -- something like atLeast(8000)?
* I don't mean to nitpick, but if you obtain an NRT reader, there's no need to 
commit() :)
* Make the two collector instances take 100% and 10% of numDocs when you fix 
it.
* Maybe use random.nextInt(10) for the facets instead of alternating 
sequentially?
* I don't understand how you know that numChildren=5 when you ask for the 10 
top children. Isn't it possible that w/ some random seed the number of 
children will be different?
** In fact, I think the random collectors should be initialized w/ a random 
seed that depends on the test? Currently they aren't, and so they always use 
0xdeadbeef.
* You have some sops (leftover System.out.println calls) at the end of the 
test.

> Facet sampling
> --------------
>
>                 Key: LUCENE-5476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5476
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Rob Audenaerde
>         Attachments: LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> LUCENE-5476.patch, LUCENE-5476.patch, LUCENE-5476.patch, 
> SamplingComparison_SamplingFacetsCollector.java, SamplingFacetsCollector.java
>
>
> With LUCENE-5339 facet sampling disappeared. 
> When trying to display facet counts on large datasets (>10M documents) 
> counting facets is rather expensive, as all the hits are collected and 
> processed. 
> Sampling greatly reduced this and thus provided a nice speedup. Could it be 
> brought back?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
