[
https://issues.apache.org/jira/browse/LUCENE-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527353#comment-13527353
]
Shai Erera commented on LUCENE-4600:
------------------------------------
bq. Net/net I think we should offer an easy-to-use DV-backed facets impl...
If only DV could handle multi-values. Can it handle a single byte[]? Because
essentially that's what the facets API needs today - it stores everything in
the payload, which is a byte[]. Having a multi-valued DV could benefit us by
e.g. not needing to write an iterator over the payload to get the category
ordinals ...
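Just to illustrate what I mean by "a single byte[]" - a rough sketch only; the
field name and the use of StraightBytesDocValuesField are my assumptions, not
necessarily what a DV-backed impl would end up doing:
{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StraightBytesDocValuesField;
import org.apache.lucene.util.BytesRef;

// Sketch: index the already-encoded category ordinals (the same byte[] that
// today goes into the payload) as one binary doc-values value per document.
class FacetsDVSketch {
  static void addEncodedOrdinals(Document doc, byte[] encodedOrdinals) {
    // encodedOrdinals is whatever the CLP's IntEncoder produced (e.g. dgap+VInt);
    // "$facets" is a made-up field name.
    doc.add(new StraightBytesDocValuesField("$facets", new BytesRef(encodedOrdinals)));
  }
}
{code}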
The patch looks very good. A few comments/questions:
* Do I understand correctly that the caching Collector is reusable? Otherwise I
don't see how the CachedBytes helps.
** Preferably, if we had an AtomicReader which caches these bytes, you wouldn't
need to reuse the Collector at all? (see the sketch right after these bullets)
** Hmmm, what if you used the in-mem Codec to load just this term's posting
list into RAM? Do you think you would gain the same?
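To make the reader-level caching idea concrete, here's a rough sketch
(CategoryBytesCache is a made-up class; the point is only that the cache is
keyed by the segment's core cache key, so it doesn't matter whether the
Collector instance is reused):
{code}
import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.AtomicReader;

// Sketch only: per-segment category bytes, cached by core cache key so that
// any Collector instance (reused or not) benefits from the cache.
class CategoryBytesCache {
  // weak keys, so entries go away when a segment's core reader is closed
  private final Map<Object, byte[][]> cache =
      Collections.synchronizedMap(new WeakHashMap<Object, byte[][]>());

  byte[][] get(AtomicReader reader) throws IOException {
    Object key = reader.getCoreCacheKey();
    byte[][] bytes = cache.get(key);
    if (bytes == null) {
      bytes = load(reader); // read the category term's postings/payloads once
      cache.put(key, bytes);
    }
    return bytes;
  }

  private byte[][] load(AtomicReader reader) throws IOException {
    // hypothetical: iterate the category posting list and copy each payload
    // into bytes[docID]; the actual loading code is omitted here
    return new byte[reader.maxDoc()][];
  }
}
{code}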
If you want to make this a class that can be reused in other scenarios, a few
tips that can enable that:
* Instead of referencing CatListParams.DEFAULT_TERM, you can pull the term from
FacetSearchParams.getFacetIndexingParams().getCLP(new CP()).getTerm().
* Also, you can obtain the right IntDecoder from the CLP for decoding the
ordinals. That would remove the hard dependency on VInt+gap, and allow e.g. a
PackedInts decoder to be plugged in (see the sketch after this list).
* Not sure that we should change it, but this class supports only one CLP. I
think it's ok to leave it like that, and take the CLP's term at the ctor, but
then we must be able to cache the bytes at the reader level. That way, if an
app uses multiple CLPs, it can initialize multiple such Collectors.
* I think it's ok to rely on the top Query to not call us for deleted docs, and
therefore pass liveDocs=null. If a Query wants to iterate on deleted docs, we
should count facets for them too.
* Maybe you should take the IntArrayAllocator from the outside? That class can
be initialized by the app once to e.g. use maxArrays=10 (e.g. if it expects at
most 10 queries in parallel), and then the int[] arrays are reused whenever
possible. The way the patch is now, if you reuse that Collector, you can only
reuse one array.
* In setNextReader you sync on the cache only in case someone executes a search
w/ an ExecutorService? That's another point where caching at the
Codec/AtomicReader level would be better, right?
* Why does acceptsDocsOutOfOrder() return false? Is it because of how the cache
works? Because facet counting is not limited to in-order collection.
** For the non-caching one that's true, because we can only advance() on the
fulltree posting. But if the posting is entirely in RAM, we can random-access
it?
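To put a few of the points above together, roughly the kind of ctor and
acceptsDocsOutOfOrder() I have in mind - a sketch only, the class name is made
up, the package names are the current 4.x facet module ones, and the
decoder/allocator calls are left as comments since their exact API may change:
{code}
import java.io.IOException;
import org.apache.lucene.facet.index.params.CategoryListParams;
import org.apache.lucene.facet.search.IntArrayAllocator;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.encoding.IntDecoder;

// Sketch only: take the CLP, its matching IntDecoder and the IntArrayAllocator
// from the caller instead of hard-coding DEFAULT_TERM / VInt+gap / one array.
class CountingFacetsCollectorSketch extends Collector {
  private final Term clTerm;            // the CLP's category-list term
  private final IntDecoder decoder;     // matches the CLP's encoder
  private final IntArrayAllocator allocator;
  private final boolean bytesInRam;     // true when the category bytes are cached

  CountingFacetsCollectorSketch(CategoryListParams clp, IntDecoder decoder,
      IntArrayAllocator allocator, boolean bytesInRam) {
    this.clTerm = clp.getTerm();
    this.decoder = decoder;
    this.allocator = allocator;
    this.bytesInRam = bytesInRam;
  }

  @Override
  public void setScorer(Scorer scorer) {
    // counting only - scores are not needed
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    // look up (or load once) this segment's bytes for clTerm, ideally through
    // a reader-level cache as sketched earlier
  }

  @Override
  public void collect(int doc) throws IOException {
    // decode this doc's ordinals with 'decoder' and increment counts in an
    // int[] obtained from 'allocator' (exact calls omitted)
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    // with the bytes fully in RAM any doc can be random-accessed, so
    // out-of-order collection is fine; only the advance()-based variant
    // really needs in-order docs
    return bytesInRam;
  }
}
{code}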
I wonder if we can write a single good Collector, and optimize the caching
stuff through the Reader, or DV. Collectors in Lucene are usually not reusable?
At least, I haven't seen such a pattern. The current FacetsCollector isn't
reusable (b/c of the bitset and potential scores array). So I'm worried users
might be confused and won't get the most out of that Collector, b/c they won't
reuse it ...
On the other hand, a FacetsIndexReader (composite) which, per configuration,
initializes the right FacetAtomicReader would be more consumable by apps.
About the results, just to clarify -- in both runs 'QPS base' refers to the
current facet counting and 'QPS comp' refers to the two new collectors
respectively? I'm surprised that the int[][][] didn't perform much better,
since you don't need to do the decoding for every document, for every query.
But then, perhaps it's because the RAM footprint is so large that we pay a lot
for swapping in/out of the CPU cache ...
Also, note that you wrote specialized code for decoding the payload, vs. using
an API to do that (e.g. PackedInts / IntDecoder). I wonder how that would
compare to the base collection, i.e. would we still see the big difference
between the int[][][] and the byte[] caching.
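For reference, this is the kind of specialized decode loop I mean - just a
sketch of dgap + VInt over one document's byte[]; it assumes a standard VInt
byte layout (the module's VInt8 variant lays the bytes out a bit differently):
{code}
import org.apache.lucene.util.BytesRef;

// Sketch only: decode gap+VInt encoded ordinals from one document's bytes.
final class DGapVIntDecodeSketch {
  /** Decodes into ords (assumed large enough) and returns the ordinal count. */
  static int decode(BytesRef buf, int[] ords) {
    int upto = buf.offset;
    final int end = buf.offset + buf.length;
    int ord = 0;                     // ordinals are delta (gap) encoded
    int count = 0;
    while (upto < end) {
      int value = 0;
      int shift = 0;
      byte b;
      do {                           // VInt: 7 bits per byte, high bit = "more"
        b = buf.bytes[upto++];
        value |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      ord += value;                  // undo the gap
      ords[count++] = ord;
    }
    return count;
  }
}
{code}
Comparing a loop like that against a PackedInts-based decode over the same
cached byte[] would tell us how much of the int[][][] vs byte[] difference is
really just decode cost.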
Overall though, great work, Mike!
We must get this code in. It's clear that it can potentially gain a lot for
some scenarios ...
> Facets should aggregate during collection, not at the end
> ---------------------------------------------------------
>
> Key: LUCENE-4600
> URL: https://issues.apache.org/jira/browse/LUCENE-4600
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Attachments: LUCENE-4600.patch, LUCENE-4600.patch
>
>
> Today the facet module simply gathers all hits (as a bitset, optionally with
> a float[] to hold scores as well, if you will aggregate them) during
> collection, and then at the end when you call getFacetsResults(), it makes a
> 2nd pass over all those hits doing the actual aggregation.
> We should investigate just aggregating as we collect instead, so we don't
> have to tie up transient RAM (fairly small for the bit set but possibly big
> for the float[]).