[ https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13821142#comment-13821142 ]
Gilad Barkai commented on LUCENE-5339:
--------------------------------------

Mike, the idea of simplifying the API sounds great, but is it really that complicated now? Facet's {{Accumulator}} is similar to Lucene's {{Collector}}, the {{Aggregator}} is a sort of {{Scorer}}, and a {{FacetRequest}} is a sort of {{Query}}. In fact, the model the facets were designed after was Lucene's. The optional {{IndexingParams}} came before {{IndexWriterConfig}}, but the two can be said to be similar as well. More low-level objects such as {{CategoryListParams}} are not a must, and the user may never know about them (and btw, they are similar to {{Codecs}}).

I reviewed the patch (mostly the taxonomy-related part) and I think that even without associations, supporting counts only is a bit narrow. Especially with large counts (say many thousands), the count doesn't say much because of the "long tail" problem: when there's a large result set, all the categories get high hit counts. And just as scoring by counting the number of query terms each document matches doesn't always make much sense (and I think all scoring functions do things a lot smarter), using counts for facets may at times yield irrelevant results.

We found that for large result sets, aggregating Lucene's score (rather than {{+1}}), or even score^2, yields better results for the user. Arbitrary expressions which are corpus specific (with or without associations) also change the facets' usability dramatically. That's partially why the code was built to allow different "aggregation" techniques, letting associations, numeric values, etc. feed into the value for each category.

As for the new API, it may be useful to have a single "interface", so that all facet implementations could be switched easily, allowing users to experiment with the different implementations without writing a lot of code.
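To make the aggregation point above a bit more concrete, here is a minimal sketch in plain Java. It is not the facet module's API and not the patch's: {{Hit}}, {{FacetAggregator}}, {{aggregate}} and the sample categories are all hypothetical names made up for illustration. It only shows how one small interface could let a caller swap counting ({{+1}}) for a score sum or score^2, and how the ranking can flip when one high-scoring hit competes with several low-scoring ones.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FacetAggregationSketch {

    /** Hypothetical stand-in for one matching document: its score and the categories it carries. */
    public static final class Hit {
        final float score;
        final List<String> categories;

        public Hit(float score, List<String> categories) {
            this.score = score;
            this.categories = categories;
        }
    }

    /** Hypothetical single interface: how much one hit contributes to each of its categories. */
    public interface FacetAggregator {
        double contribution(Hit hit);
    }

    // Three interchangeable strategies behind the same interface.
    public static final FacetAggregator COUNT = hit -> 1.0;
    public static final FacetAggregator SCORE_SUM = hit -> (double) hit.score;
    public static final FacetAggregator SCORE_SQUARED = hit -> (double) hit.score * hit.score;

    /** Sums the chosen contribution of every hit into each category that hit belongs to. */
    public static Map<String, Double> aggregate(List<Hit> hits, FacetAggregator aggregator) {
        Map<String, Double> values = new LinkedHashMap<>();
        for (Hit hit : hits) {
            double weight = aggregator.contribution(hit);
            for (String category : hit.categories) {
                values.merge(category, weight, Double::sum);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up data: one strong match for Lisa, three weak matches for Bob.
        List<Hit> hits = List.of(
                new Hit(2.0f, List.of("Author/Lisa")),
                new Hit(0.5f, List.of("Author/Bob")),
                new Hit(0.5f, List.of("Author/Bob")),
                new Hit(0.5f, List.of("Author/Bob")));

        System.out.println("count:   " + aggregate(hits, COUNT));         // Bob ranks first
        System.out.println("score:   " + aggregate(hits, SCORE_SUM));     // Lisa ranks first
        System.out.println("score^2: " + aggregate(hits, SCORE_SQUARED)); // Lisa leads by more
    }
}
```

With this toy data, counting puts Bob ahead (3 vs. 1), while summing scores puts Lisa ahead (2.0 vs. 1.5), which is the kind of difference described above. Real implementations would of course work on ordinals and segment readers rather than strings and lists.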

Bottom line, I'm all for simplifying the API, but the current cost seems too great, and I'm not sure the benefits are proportional :)

> Simplify the facet module APIs
> ------------------------------
>
>                 Key: LUCENE-5339
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5339
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder).  I think we can do better.
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
> The big changes are:
>   * Instead of *FacetRequest/Params/Result, you directly instantiate
>     the classes that do facet counting (currently TaxonomyFacetCounts,
>     RangeFacetCounts or SortedSetDVFacetCounts), passing in the
>     SimpleFacetsCollector, and then you interact with those classes to
>     pull labels + values (topN under a path, sparse, specific labels).
>   * At index time, no more FacetIndexingParams/CategoryListParams;
>     instead, you make a new SimpleFacetFields and pass it the field it
>     should store facets + drill downs under.  If you want more than
>     one CLI you create more than one instance of SimpleFacetFields.
>   * I added a simple schema, where you state which dimensions are
>     hierarchical or multi-valued.  From this we decide how to index
>     the ordinals (no more OrdinalPolicy).
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc.  This is just a start ...

--
This message was sent by Atlassian JIRA
(v6.1#6144)