[ https://issues.apache.org/jira/browse/LUCENE-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527418#comment-13527418 ]
Michael McCandless commented on LUCENE-4600:
--------------------------------------------

{quote}
bq. Net/net I think we should offer an easy-to-use DV-backed facets impl...

If only DV could handle multi-values. Can they handle a single byte[]? Because essentially that's what the facets API needs today - it stores everything in the payload, which is byte[].
{quote}

They can handle byte[], so I think we should just offer that.

bq. Having a multi-val DV could benefit us by e.g. not needing to write an iterator on the payload to get the category ordinals ...

Right, though in the special (common?) case where a given facet field is single-valued, like the Date facets I added to luceneutil / nightlybench (see the graph here: http://people.apache.org/~mikemccand/lucenebench/TermDateFacets.html -- only 3 data points so far!), we could also use DV's int fields and let it encode the single ord (eg with packed ints) and then aggregate up the taxonomy after aggregation of the leaf ords is done. I'm playing with a prototype patch for this ...

bq. Do I understand correctly that the caching Collector is reusable? Otherwise I don't see how the CachedBytes help.

No no: this is all just a hack (the CachedBytes / static cache). We should somehow cleanly switch to DV ... it wasn't clear to me how to do that ...

bq. Hmmm, what if you used the in-mem Codec, for loading just this term's posting list into RAM? Do you think that you would gain the same?

Maybe! Have to test ...

bq. If you want to make this a class that can be reused by other scenarios, then a few tips that can enable that:

I do! If ... making it fully generic doesn't hurt perf much. The decode chain (w/ separate reInit called per doc) seems heavyish ...

bq. Instead of referencing CatListParams.DEFAULT_TERM, you can pull the CLP from FacetSearchParams.getFacetIndexingParams().getCLP(new CP()).getTerm().

Ahh ok. I'll fix that.

bq. Also, you can obtain the right IntDecoder from the CLP for decoding the ordinals. That would remove the hard dependency on VInt+gap, and allow e.g. to use a PackedInts decoder.

OK I'll try that.

{quote}
Not sure that we should, but this class supports only one CLP. I think it's ok to leave it like that, and get the CLP.term() at ctor, but then we must be able to cache the bytes at the reader level. That way, if an app uses multiple CLPs, it can initialize multiple such Collectors.

I think it's ok to rely on the top Query to not call us for deleted docs, and therefore pass liveDocs=null. If a Query wants to iterate on deleted docs, we should count facets for them too.
{quote}

OK good.

bq. Maybe you should take the IntArrayAllocator from the outside? That class can be initialized by the app once to e.g. use maxArrays=10 (e.g. if it expects max 10 queries in parallel), and then the int[] are reused whenever possible. The way the patch is now, if you reuse that Collector, you can only reuse one array.

Ahh I'll do that. Separately I was wondering if we should sometimes do aggregation backed by an int[] hashmap, and have it "upgrade" to a non-sparse array only once the number collected got too large (rough sketch below). Not sure it's THAT important since it would only serve to keep fast queries fast but would make slow queries a bit slower...

bq. In setNextReader you sync on the cache only in case someone executes a search w/ an ExecutorService?

That's another point where caching at the Codec/AtomicReader level would be better, right? Also for multiple threads running at once ... but it's all a hack anyway ...
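To make the sparse-to-dense idea above concrete, here is a minimal, self-contained sketch. UpgradingOrdCounter is a hypothetical class, not part of the patch or the facet module's IntArrayAllocator; it assumes non-negative ordinals and an app-chosen threshold for switching to a dense int[maxOrd]:

{code:java}
import java.util.Arrays;

// Hypothetical sketch: counts start in a small open-addressed ord->count
// table and spill into a plain int[maxOrd] once enough distinct ordinals
// have been seen. Not the facet module's API.
class UpgradingOrdCounter {
  private final int maxOrd;            // number of ordinals in the taxonomy
  private final int upgradeThreshold;  // distinct ords after which we go dense

  // sparse state: open addressing with linear probing; -1 marks an empty slot
  private int[] sparseOrds;
  private int[] sparseCounts;
  private int distinct;

  // dense state, allocated lazily on upgrade
  private int[] dense;

  UpgradingOrdCounter(int maxOrd, int upgradeThreshold) {
    this.maxOrd = maxOrd;
    this.upgradeThreshold = upgradeThreshold;
    // power-of-two capacity, kept at most half full before we upgrade
    int cap = Integer.highestOneBit(Math.max(16, upgradeThreshold * 4));
    sparseOrds = new int[cap];
    sparseCounts = new int[cap];
    Arrays.fill(sparseOrds, -1);
  }

  void increment(int ord) {
    if (dense != null) {
      dense[ord]++;
      return;
    }
    int mask = sparseOrds.length - 1;
    int slot = ord & mask;
    while (true) {
      if (sparseOrds[slot] == ord) {
        sparseCounts[slot]++;
        return;
      }
      if (sparseOrds[slot] == -1) {
        sparseOrds[slot] = ord;
        sparseCounts[slot] = 1;
        if (++distinct >= upgradeThreshold) {
          upgrade();
        }
        return;
      }
      slot = (slot + 1) & mask;
    }
  }

  private void upgrade() {
    // copy the sparse counts into a non-sparse array; further increments
    // are a single array bump
    dense = new int[maxOrd];
    for (int i = 0; i < sparseOrds.length; i++) {
      if (sparseOrds[i] != -1) {
        dense[sparseOrds[i]] = sparseCounts[i];
      }
    }
    sparseOrds = null;
    sparseCounts = null;
  }

  int count(int ord) {
    if (dense != null) {
      return dense[ord];
    }
    int mask = sparseOrds.length - 1;
    int slot = ord & mask;
    while (sparseOrds[slot] != -1) {
      if (sparseOrds[slot] == ord) {
        return sparseCounts[slot];
      }
      slot = (slot + 1) & mask;
    }
    return 0;
  }
}
{code}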
{quote}
Why is acceptDocsOutOfOrder false? Is it because of how the cache works? Because facet counting is not limited to in-order only. For the non-caching one that's true, because we can only advance on the fulltree posting. But if the posting is entirely in RAM, we can random access it?
{quote}

Oh good point -- the DV/cache collectors can accept out of order. I'll fix.

{quote}
I wonder if we can write a good single Collector, and optimize the caching stuff through the Reader, or DV. Collectors in Lucene are usually not reusable? At least, I haven't seen such a pattern. The current FacetsCollector isn't reusable (b/c of the bitset and potential scores array). So I'm worried users might be confused and won't benefit the most from that Collector, b/c they won't reuse it ... On the other hand, saying that we have a FacetsIndexReader (composite) which per configuration initializes the right FacetAtomicReader would be more consumable by apps.
{quote}

I think we should have two new collectors here? One keeps using payloads but operates per segment and aggregates on the fly (if, on making it generic again, we still see gains). The other stores the byte[] in DV. But somehow we have to make "send the byte[] to DV not payloads at index time" easy ... I'm not sure how :)

{quote}
About the results, just to clarify: in both runs the 'QPS base' refers to current facet counting and 'QPS comp' refers to the two new collectors respectively?
{quote}

Right: base = current trunk, comp = the two new collectors.

bq. I'm surprised that the int[][][] didn't perform much better, since you don't need to do the decoding for every document, for every query. But then, perhaps it's because the RAM size is so large, and we pay a lot swapping in/out from CPU cache ...

This also surprised me, but I suspect it's the per-doc pointer dereferencing that's costing us. I saw the same problem with DirectPostingsFormat ... This also ties up tons of extra RAM (pointer = 4 or 8 bytes; int[] object overhead maybe 8 bytes?). I bet if we made a single int[], and did our own addressing (eg another int[] that maps docID to its address), then that would be faster than byte[] via cache/DV (sketch below).

bq. Also, note that you wrote specialized code for decoding the payload, vs. using an API to do that (e.g. PackedInts / IntDecoder). I wonder how that would compare to the base collection, i.e. would we still see the big difference between int[][][] and the byte[] caching.

Yeah good question. I'll separately test the specialized decode to see how much it's helping....

bq. Mike, if we do the Reader/DV caching approach, that could benefit post-collection performance too, right? Is it possible that you hack the current FacetsCollector to do the aggregation over CachedBytes and then compare the difference?

Right! DV vs payloads is decoupled from during- vs post-collection aggregation. I'll open a separate issue to allow byte[] DV backing for facets....

{quote}
Because your first results show that during-collection is not that much faster than post-collection, I am just wondering if it'll be the same when we cache the bytes outside the collector entirely. If so, I think it should push us to do this caching outside, because we've already identified cases where post-collection is needed (e.g. sampling) too.
{quote}

Definitely.

{quote}
Overall though, great work Mike! We must get this code in. It's clear that it can potentially gain a lot for some scenarios ...
{quote}

Thanks! I want to see that graph jump :)
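A minimal sketch of the single-int[] addressing scheme mentioned above: one flat ords array for the whole segment plus an offsets array indexed by docID, so counting a hit is a bounds lookup followed by sequential reads. FlatOrdCache is hypothetical and only illustrates the layout; the constructor building it from a per-doc int[][] is a stand-in for whatever fills it while decoding the payload/DV bytes:

{code:java}
// Hypothetical sketch, not the patch's int[][][] cache.
class FlatOrdCache {
  final int[] offsets;  // offsets[doc] .. offsets[doc+1] is doc's slice in ords
  final int[] ords;     // all ordinals of the segment, concatenated by docID

  FlatOrdCache(int[][] perDocOrds) {
    int maxDoc = perDocOrds.length;
    offsets = new int[maxDoc + 1];
    for (int doc = 0; doc < maxDoc; doc++) {
      int len = perDocOrds[doc] == null ? 0 : perDocOrds[doc].length;
      offsets[doc + 1] = offsets[doc] + len;
    }
    ords = new int[offsets[maxDoc]];
    for (int doc = 0; doc < maxDoc; doc++) {
      if (perDocOrds[doc] != null) {
        System.arraycopy(perDocOrds[doc], 0, ords, offsets[doc], perDocOrds[doc].length);
      }
    }
  }

  // Aggregate counts for one hit: no per-doc object header or pointer chase.
  void countOrds(int doc, int[] counts) {
    for (int i = offsets[doc], end = offsets[doc + 1]; i < end; i++) {
      counts[ords[i]]++;
    }
  }
}
{code}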
> Explore facets aggregation during documents collection
> ------------------------------------------------------
>
>                 Key: LUCENE-4600
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4600
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-4600.patch, LUCENE-4600.patch
>
>
> Today the facet module simply gathers all hits (as a bitset, optionally with
> a float[] to hold scores as well, if you will aggregate them) during
> collection, and then at the end when you call getFacetsResults(), it makes a
> 2nd pass over all those hits doing the actual aggregation.
> We should investigate just aggregating as we collect instead, so we don't
> have to tie up transient RAM (fairly small for the bit set but possibly big
> for the float[]).
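To illustrate the contrast in the issue description, a rough, Lucene-agnostic sketch of the two aggregation strategies. FacetOrds, TwoPassCounter, and StreamingCounter are hypothetical stand-ins for the facet module's classes, not its actual API:

{code:java}
import java.util.BitSet;

// Stand-in for whatever supplies a document's category ordinals
// (payload bytes, DV bytes, or an in-RAM cache).
interface FacetOrds {
  int[] ordsForDoc(int doc);
}

// Post-collection (today): remember the hits, aggregate in a 2nd pass.
class TwoPassCounter {
  private final BitSet hits = new BitSet();

  void collect(int doc) {
    hits.set(doc);  // transient per-hit state kept alive until aggregation
  }

  int[] aggregate(FacetOrds source, int maxOrd) {
    int[] counts = new int[maxOrd];
    for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
      for (int ord : source.ordsForDoc(doc)) {
        counts[ord]++;
      }
    }
    return counts;
  }
}

// During-collection (this issue): aggregate as each hit arrives, no bitset.
class StreamingCounter {
  private final FacetOrds source;
  private final int[] counts;

  StreamingCounter(FacetOrds source, int maxOrd) {
    this.source = source;
    this.counts = new int[maxOrd];
  }

  void collect(int doc) {
    for (int ord : source.ordsForDoc(doc)) {
      counts[ord]++;
    }
  }

  int[] counts() {
    return counts;
  }
}
{code}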