[ https://issues.apache.org/jira/browse/LUCENE-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527418#comment-13527418 ]
Michael McCandless commented on LUCENE-4600:
--------------------------------------------

{quote}
bq. Net/net I think we should offer an easy-to-use DV-backed facets impl...

If only DV could handle multi-values. Can they handle a single byte[]? Because essentially that's what the facets API needs today - it stores everything in the payload, which is byte[].
{quote}

They can handle byte[], so I think we should just offer that.

bq. Having a multi-val DV could benefit us by e.g. not needing to write an iterator on the payload to get the category ordinals ...

Right, though in the special (common?) case where a given facet field is single-valued, like the Date facets I added to luceneutil / nightlybench (see the graph here: http://people.apache.org/~mikemccand/lucenebench/TermDateFacets.html -- only 3 data points so far!), we could also use DV's int fields and let it encode the single ord (eg with packed ints) and then aggregate up the taxonomy after aggregation of the leaf ords is done. I'm playing with a prototype patch for this ...

bq. Do I understand correctly that the caching Collector is reusable? Otherwise I don't see how the CachedBytes help.

No no: this is all just a hack (the CachedBytes / static cache). We should somehow cleanly switch to DV ... it wasn't clear to me how to do that ...

bq. Hmmm, what if you used the in-mem Codec, for loading just this term's posting list into RAM? Do you think that you would gain the same?

Maybe! Have to test ...

bq. If you want to make this a class that can be reused by other scenarios, then a few tips that can enable that:

I do! If ... making it fully generic doesn't hurt perf much. The decode chain (w/ separate reInit called per doc) seems heavyish ...

bq. Instead of referencing CatListParams.DEFAULT_TERM, you can pull the CLP from FacetSearchParams.getFacetIndexingParams().getCLP(new CP()).getTerm().

Ahh ok. I'll fix that.

bq. Also, you can obtain the right IntDecoder from the CLP for decoding the ordinals. That would remove the hard dependency on VInt+gap, and allow e.g. to use a PackedInts decoder.

OK I'll try that.

{quote}
Not sure that we should, but this class supports only one CLP. I think it's ok to leave it like that, and get the CLP.term() at ctor, but then we must be able to cache the bytes at the reader level. That way, if an app uses multiple CLPs, it can initialize multiple such Collectors.

I think it's ok to rely on the top Query to not call us for deleted docs, and therefore pass liveDocs=null. If a Query wants to iterate on deleted docs, we should count facets for them too.
{quote}

OK good.

bq. Maybe you should take the IntArrayAllocator from the outside? That class can be initialized by the app once to e.g. use maxArrays=10 (e.g. if it expects max 10 queries in parallel), and then the int[] are reused whenever possible. The way the patch is now, if you reuse that Collector, you can only reuse one array.

Ahh I'll do that. Separately I was wondering if we should sometimes do aggregation backed by an int[] hashmap, and have it "upgrade" to a non-sparse array only once the number collected got too large (rough sketch below). Not sure it's THAT important since it would only serve to keep fast queries fast but would make slow queries a bit slower...

bq. In setNextReader you sync on the cache only in case someone executes a search w/ an ExecutorService?

That's another point where caching at the Codec/AtomicReader level would be better, right? Also for multiple threads running at once ... but it's all a hack anyway ...
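To make the sparse-to-dense idea above concrete, here is a minimal, self-contained sketch. UpgradingOrdCounter is a hypothetical class, not part of the patch or the facet module's IntArrayAllocator; it assumes non-negative ordinals and an app-chosen threshold for switching to a dense int[maxOrd]:

{code:java}
import java.util.Arrays;

// Hypothetical sketch: counts start in a small open-addressed ord->count
// table and spill into a plain int[maxOrd] once enough distinct ordinals
// have been seen. Not the facet module's API.
class UpgradingOrdCounter {
  private final int maxOrd;            // number of ordinals in the taxonomy
  private final int upgradeThreshold;  // distinct ords after which we go dense

  // sparse state: open addressing with linear probing; -1 marks an empty slot
  private int[] sparseOrds;
  private int[] sparseCounts;
  private int distinct;

  // dense state, allocated lazily on upgrade
  private int[] dense;

  UpgradingOrdCounter(int maxOrd, int upgradeThreshold) {
    this.maxOrd = maxOrd;
    this.upgradeThreshold = upgradeThreshold;
    // power-of-two capacity, kept at most half full before we upgrade
    int cap = Integer.highestOneBit(Math.max(16, upgradeThreshold * 4));
    sparseOrds = new int[cap];
    sparseCounts = new int[cap];
    Arrays.fill(sparseOrds, -1);
  }

  void increment(int ord) {
    if (dense != null) {
      dense[ord]++;
      return;
    }
    int mask = sparseOrds.length - 1;
    int slot = ord & mask;
    while (true) {
      if (sparseOrds[slot] == ord) {
        sparseCounts[slot]++;
        return;
      }
      if (sparseOrds[slot] == -1) {
        sparseOrds[slot] = ord;
        sparseCounts[slot] = 1;
        if (++distinct >= upgradeThreshold) {
          upgrade();
        }
        return;
      }
      slot = (slot + 1) & mask;
    }
  }

  private void upgrade() {
    // copy the sparse counts into a non-sparse array; further increments
    // are a single array bump
    dense = new int[maxOrd];
    for (int i = 0; i < sparseOrds.length; i++) {
      if (sparseOrds[i] != -1) {
        dense[sparseOrds[i]] = sparseCounts[i];
      }
    }
    sparseOrds = null;
    sparseCounts = null;
  }

  int count(int ord) {
    if (dense != null) {
      return dense[ord];
    }
    int mask = sparseOrds.length - 1;
    int slot = ord & mask;
    while (sparseOrds[slot] != -1) {
      if (sparseOrds[slot] == ord) {
        return sparseCounts[slot];
      }
      slot = (slot + 1) & mask;
    }
    return 0;
  }
}
{code}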
{quote}
Why is acceptDocsOutOfOrder false? Is it because of how the cache works? Because facet counting is not limited to in-order only. For the non-caching one that's true, because we can only advance on the fulltree posting. But if the posting is entirely in RAM, we can random access it?
{quote}

Oh good point -- the DV/cache collectors can accept out of order. I'll fix.

{quote}
I wonder if we can write a good single Collector, and optimize the caching stuff through the Reader, or DV. Collectors in Lucene are usually not reusable? At least, I haven't seen such a pattern. The current FacetsCollector isn't reusable (b/c of the bitset and potential scores array). So I'm worried users might be confused and won't benefit the most from that Collector, b/c they won't reuse it ... On the other hand, saying that we have a FacetsIndexReader (composite) which per configuration initializes the right FacetAtomicReader would be more consumable by apps.
{quote}

I think we should have two new collectors here? One keeps using payloads but operates per segment and aggregates on the fly (if, on making it generic again, we still see gains). The other stores the byte[] in DV. But somehow we have to make "send the byte[] to DV not payloads at index time" easy ... I'm not sure how :)

{quote}
About the results, just to clarify: in both runs the 'QPS base' refers to current facet counting and 'QPS comp' refers to the two new collectors respectively?
{quote}

Right: base = current trunk, comp = the two new collectors.

bq. I'm surprised that the int[][][] didn't perform much better, since you don't need to do the decoding for every document, for every query. But then, perhaps it's because the RAM size is so large, and we pay a lot swapping in/out from CPU cache ...

This also surprised me, but I suspect it's the per-doc pointer dereferencing that's costing us. I saw the same problem with DirectPostingsFormat ... This also ties up tons of extra RAM (pointer = 4 or 8 bytes; int[] object overhead maybe 8 bytes?). I bet if we made a single int[], and did our own addressing (eg another int[] that maps docID to its address), then that would be faster than byte[] via cache/DV (sketch below).

bq. Also, note that you wrote specialized code for decoding the payload, vs. using an API to do that (e.g. PackedInts / IntDecoder). I wonder how that would compare to the base collection, i.e. would we still see the big difference between int[][][] and the byte[] caching.

Yeah good question. I'll separately test the specialized decode to see how much it's helping....

bq. Mike, if we do the Reader/DV caching approach, that could benefit post-collection performance too, right? Is it possible that you hack the current FacetsCollector to do the aggregation over CachedBytes and then compare the difference?

Right! DV vs payloads is decoupled from during- vs post-collection aggregation. I'll open a separate issue to allow byte[] DV backing for facets....

{quote}
Because your first results show that during-collection is not that much faster than post-collection, I am just wondering if it'll be the same when we cache the bytes outside the collector entirely. If so, I think it should push us to do this caching outside, because we've already identified cases where post-collection is needed (e.g. sampling) too.
{quote}

Definitely.

{quote}
Overall though, great work Mike! We must get this code in. It's clear that it can potentially gain a lot for some scenarios ...
{quote}

Thanks! I want to see that graph jump :)
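A minimal sketch of the single-int[] addressing scheme mentioned above: one flat ords array for the whole segment plus an offsets array indexed by docID, so counting a hit is a bounds lookup followed by sequential reads. FlatOrdCache is hypothetical and only illustrates the layout; the constructor building it from a per-doc int[][] is a stand-in for whatever fills it while decoding the payload/DV bytes:

{code:java}
// Hypothetical sketch, not the patch's int[][][] cache.
class FlatOrdCache {
  final int[] offsets;  // offsets[doc] .. offsets[doc+1] is doc's slice in ords
  final int[] ords;     // all ordinals of the segment, concatenated by docID

  FlatOrdCache(int[][] perDocOrds) {
    int maxDoc = perDocOrds.length;
    offsets = new int[maxDoc + 1];
    for (int doc = 0; doc < maxDoc; doc++) {
      int len = perDocOrds[doc] == null ? 0 : perDocOrds[doc].length;
      offsets[doc + 1] = offsets[doc] + len;
    }
    ords = new int[offsets[maxDoc]];
    for (int doc = 0; doc < maxDoc; doc++) {
      if (perDocOrds[doc] != null) {
        System.arraycopy(perDocOrds[doc], 0, ords, offsets[doc], perDocOrds[doc].length);
      }
    }
  }

  // Aggregate counts for one hit: no per-doc object header or pointer chase.
  void countOrds(int doc, int[] counts) {
    for (int i = offsets[doc], end = offsets[doc + 1]; i < end; i++) {
      counts[ords[i]]++;
    }
  }
}
{code}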
> Explore facets aggregation during documents collection
> ------------------------------------------------------
>
>                 Key: LUCENE-4600
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4600
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-4600.patch, LUCENE-4600.patch
>
>
> Today the facet module simply gathers all hits (as a bitset, optionally with
> a float[] to hold scores as well, if you will aggregate them) during
> collection, and then at the end when you call getFacetsResults(), it makes a
> 2nd pass over all those hits doing the actual aggregation.
> We should investigate just aggregating as we collect instead, so we don't
> have to tie up transient RAM (fairly small for the bit set but possibly big
> for the float[]).
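To illustrate the contrast in the issue description, a rough, Lucene-agnostic sketch of the two aggregation strategies. FacetOrds, TwoPassCounter, and StreamingCounter are hypothetical stand-ins for the facet module's classes, not its actual API:

{code:java}
import java.util.BitSet;

// Stand-in for whatever supplies a document's category ordinals
// (payload bytes, DV bytes, or an in-RAM cache).
interface FacetOrds {
  int[] ordsForDoc(int doc);
}

// Post-collection (today): remember the hits, aggregate in a 2nd pass.
class TwoPassCounter {
  private final BitSet hits = new BitSet();

  void collect(int doc) {
    hits.set(doc);  // transient per-hit state kept alive until aggregation
  }

  int[] aggregate(FacetOrds source, int maxOrd) {
    int[] counts = new int[maxOrd];
    for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
      for (int ord : source.ordsForDoc(doc)) {
        counts[ord]++;
      }
    }
    return counts;
  }
}

// During-collection (this issue): aggregate as each hit arrives, no bitset.
class StreamingCounter {
  private final FacetOrds source;
  private final int[] counts;

  StreamingCounter(FacetOrds source, int maxOrd) {
    this.source = source;
    this.counts = new int[maxOrd];
  }

  void collect(int doc) {
    for (int ord : source.ordsForDoc(doc)) {
      counts[ord]++;
    }
  }

  int[] counts() {
    return counts;
  }
}
{code}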