[ https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824833#comment-13824833 ]
Shai Erera commented on LUCENE-5339:
------------------------------------

{quote}
I'm not sure how this can work, since in order to write the ords we need to see all FacetFields? Ie, at what point would we compile all the FacetFields into the BDV field?
{quote}

I was thinking when Doc.indexableFields() is called?

{quote}
Hmm, we could open that up, but ... I think that's "too late"? You can't easily know which dim ords to add back at that point. I added a TODO.
{quote}

Maybe that's the wrong extension point, but what I had in mind is something similar to what FacetFields does today -- it adds the categories to the TaxoIndex and receives their ordinal. Then it calls a CategoryListBuilder which asks for the parent of an ordinal until it hits ROOT (depending on OrdPolicy of course). I mentioned dedupAndEncode because I thought it does something like that (i.e. that you've inlined CategoryListBuilder in FacetIW). If it's not, then whatever method that does that ... and if there is none, let's wrap it in an overridable method?

bq. With the simplified APIs this user could just make a custom facet method?

He already does that with the current APIs too (a special FacetsAccumulator). I don't know if it's easier/harder/the same to do w/ the new API. I guess it won't be harder.

{quote}
You're right, the ords cache filling will be another place that "bakes in" the decoding. So, I agree: if we can find a clean way to abstract the encoding/source then let's pursue that.
{quote}

As I said, let's divide that into two problems: API and optimization. For API, we can stick w/ CategoryListIterator and implement both a DGapVIntBinaryDVIterator as well as OrdinalsCacheIterator. That way, FacetsSomething (do we have a name yet? Is it just Facets?) can use a CLI if they don't care where the ordinals come from. For optimization, we do a FastFacetCounts which inlines dgap+vint and reads from BDV, and we can also do a CachedOrdsFacetCounts which inlines the interaction with OrdinalsCache.
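To make the dgap+vint part concrete, here is a rough sketch of what such an encoding does: sort the ordinals, store the gap to the previous ordinal, and write each gap as a variable-length int. This is only an illustration, not the facet module's actual encoder. The class and method names (DGapVIntSketch, encode, decode) are hypothetical, and the low-bits-first byte order is an assumption made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration of delta-gap + variable-length-int encoding.
// Not the facet module's real encoder; the byte layout is an assumption.
class DGapVIntSketch {

    /** Encodes sorted, non-negative ordinals as gaps, 7 payload bits per byte. */
    static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int gap = ord - prev; // delta to the previous ordinal
            prev = ord;
            // Emit the low 7 bits first; the high bit marks "more bytes follow".
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    /** Decodes the bytes produced by encode() back into the sorted ordinals. */
    static int[] decode(byte[] bytes) {
        int[] ords = new int[bytes.length]; // upper bound: one ordinal per byte
        int count = 0;
        int prev = 0;
        int pos = 0;
        while (pos < bytes.length) {
            int gap = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += gap; // undo the delta
            ords[count++] = prev;
        }
        return Arrays.copyOf(ords, count);
    }
}
```

A FastFacetCounts-style class would inline the decode loop and bump a counts array per ordinal instead of materializing an int[] first, which is the optimization being discussed here.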
Actually, if we provide these two, we can skip the third FacetCounts (uses CLI), as it will be for demo purposes only given current encoding. If anyone changes the encoding, he can write a FacetCounts. Also, we can always add it later ...

The rest of the Facets (i.e. non-counts) should IMO at this point use the CLI abstraction. If anyone wants to optimize a SumValueSourceFacets, he can do so however he wants. But the CLI is the abstraction I'm thinking -- it only has two methods: setNextReader and getOrdinals(int doc).

{quote}
I think this is a precarious balance. If a little code dup can greatly simplify the APIs, then that's the better tradeoff.
{quote}

In general I agree. It then becomes what's considered little vs a lot of code dup. I think that dgap+vint + rollup is not little (put together), as well as making the decision to rollup. But at this point I don't mind .. let's force code dup, and then simplify if users are angry.

{quote}
I agree we need better abstraction here... the 3 int we require per unique facet label is costly. But, I don't think we need to force SSDVFacets to use this abstraction?
{quote}

Well, at first what I had in mind is a Taxonomy interface (read-only) with two implementations: TaxoIndex and SortedSet. Especially since we're talking about supporting full hierarchies w/ SortedSet. I think it will be cool if a TaxoFacetCounts just takes a Taxonomy and we don't need to duplicate the code. But then I realized it's not just how the taxonomy is managed, but also where the ords are pulled from (BDV vs SSDV).

We basically could have a SortedSetTaxonomy and SortedSetCLI to support any 'general' counting, but then as you pointed out somewhere, a SortedSetCLI may not be able to optimize by counting in seg-ord space and re-map afterwards. So at this point I'm not sure we should pursue the implementation of a SortedSet Taxonomy and CLI. Would be nice to know the APIs allow that (even if it means you lose some performance, e.g. always count in global ord-space), should anyone want to do that. Let's drop it for now.

> Simplify the facet module APIs
> ------------------------------
>
>                 Key: LUCENE-5339
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5339
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5339.patch, LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder).  I think we can do better.
>
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
>
> The big changes are:
>   * Instead of *FacetRequest/Params/Result, you directly instantiate
>     the classes that do facet counting (currently TaxonomyFacetCounts,
>     RangeFacetCounts or SortedSetDVFacetCounts), passing in the
>     SimpleFacetsCollector, and then you interact with those classes to
>     pull labels + values (topN under a path, sparse, specific labels).
>   * At index time, no more FacetIndexingParams/CategoryListParams;
>     instead, you make a new SimpleFacetFields and pass it the field it
>     should store facets + drill downs under.  If you want more than
>     one CLI you create more than one instance of SimpleFacetFields.
>   * I added a simple schema, where you state which dimensions are
>     hierarchical or multi-valued.  From this we decide how to index
>     the ordinals (no more OrdinalPolicy).
>
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
>
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
>
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc.  This is just a start ...

--
This message was sent by Atlassian JIRA (v6.1#6144)