[ https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822360#comment-13822360 ]
Michael McCandless commented on LUCENE-5339: -------------------------------------------- bq. You have a TODO file which seems to have been 'svn added' - are you aware of it? I put a nocommit. bq. Maybe we should do this work in a branch and avoid the .simple package? I'll start a branch, but on the branch I'd like to keep working on .simple for now while we bang out the API changes. If things start to crystallize then we can start cutting everything else over? bq. Can FT.FieldType provide a ctor taking these arguments and then DEFAULT_FIELD_TYPE pass (false,false)? Or, init those two members to false. bq. Maybe instead of setHierarchical + setMultiValued you do setDimType(DimType)? Then you could avoid the synchronization? I put a nocommit. bq. I wonder if FieldTypes won't confuse users w/ Field.FieldType, so maybe you should name it DimType or something? And the upper class FacetsConfig? Good, I renamed both. bq. Not sure, but I think that SimpleFacetFields adds the dimension's ordinal if the dim is both hierarchical and multi-valued? That's a step back from the default ALL_BUT_DIM that we have today. I think we might want to have a requiresDimValue or something, because I do think the dimension's value (count) is most often unneeded, and it's a waste to encode its ordinal? It does, and I think that's OK? Yes, it's one extra ord it indexes, but that will be a minor perf hit since it's a multi-valued and hierarchical. bq. Constants isn't documented, not sure if 'precommit' will like that, but in general I think the constants should have jdocs. Maybe put a nocommit? I put nocommits. Sadly, precommit won't fail (lucene/facet is "missing" in document-lint target in build.xml, because we never fixed all its javadocs). We should separately fix that. bq. TaxonomyFacetCounts.getAllDims() – If a dimension is not hierarchical, I think SimpleFacetResult.count == 0? In that case, sorting by its count is useless? I think it's also relevant for SortedSet? I THINK these will work correctly (we sum the dim value as we visit all children), but I was missing the "corrected" dim count in the hierarchical + MV case so I added that. Also, I put nocommits to test this. bq. LabelAndValue has a Number as the value rather than double. I see this class is used only in the final step (labeling), but what's wrong w/ the previous 'double' primitive? Is it to avoid casting or something? I think it's crazy when I ask for counts that I get a double back :) That's why I switched it to a Number. bq. CategoryListIterator I appreciate the usefulness of this API, but rather that adding in into "simple" from the get-go, I'd like to build out the different facet methods and understand if it's actually useful / worth the additional abstraction. For example, I'm not sure it would work very well with SSDV, since we first count in seg-ord space an then convert to global-ord space only when combining counts across segments (this gave better performance). I mean, yes, it would "work" in that the abstraction would be correct, but we'd be paying a performance penalty. bq. E.g. that's how we were able to test fast cutting over facets to DV from Payload, that's a nice abstraction for letting you load the values from a cache, and so forth. I think doing such future tests with the simple APIs will still be easy; I don't think we should open up abstractions for the encoding/decoding today. bq. FacetsAggregator I added another facet method, TaxonomyFacetSumValueSource for value source aggregation, in the next patch, to explore this need... bq. But I don't see this convenience getting away (you'll just pass a List<FacetsAggregator> and then pull the requested values later on). True but ... we need some base class / interface that all these *Facets.java implement ... I haven't done that yet (it's a TODO). bq. What I like less about it is that it folds in the logic coded by FacetsAccumulator and FacetResultsHandler Maybe we can move some of these methods into the base class? I'm not sure though... since for the TaxonomyFacetSumValueSource it's float[] values and for the *Count it's int[] counts. At the end of the day, the code that does the facet counting, the rollup, pulling the topK, is in fact a small amount of code; I think replicating bits of this code for the different methods is the lesser evil than adding so many abstractions that the module is hard to approach by new devs/users. bq. Has variants that return a full sub-tree (TopKInEachNodeHandler) That handler is such an exotic use case ... and the app can just recurse itself, calling TaxoFacetCounts.getTopChildren? bq. FacetArrays We avoid the need to factor this out, by simply enumerating the facet impls directly. E.g. we (or a user) can make a TaxoFacetAvgScore that allocates its own int[] and float[]... bq. I do think this class has some value today, e.g. in CategoryListParams (which you nuked), the app doesn't need to know about the field under which facets are indexed at all, only if it cares to change it or index multiple category lists. The facet field is now optional (defaults to $facets = Constants.DEFAULT_FACET_FIELD); the TestXXX.testBasic now look quite simple. bq. let's put the configuration parameters under it (e.g. DimType and drillDownDelimiter)? I added nocommits; I think it's a good idea to move delim char inside here; we'd need to enforce that all fields sharing the same indexed field have the same delim char, but I think that's fine. What do you mean by DimType? Taxo vs SSDV? That's interesting to put in the DimConfig... it'd mean we (almost?) could pick the right accum/agg/results impl at search time. > Simplify the facet module APIs > ------------------------------ > > Key: LUCENE-5339 > URL: https://issues.apache.org/jira/browse/LUCENE-5339 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet > Reporter: Michael McCandless > Assignee: Michael McCandless > Attachments: LUCENE-5339.patch, LUCENE-5339.patch > > > I'd like to explore simplifications to the facet module's APIs: I > think the current APIs are complex, and the addition of a new feature > (sparse faceting, LUCENE-5333) threatens to add even more classes > (e.g., FacetRequestBuilder). I think we can do better. > So, I've been prototyping some drastic changes; this is very > early/exploratory and I'm not sure where it'll wind up but I think the > new approach shows promise. > The big changes are: > * Instead of *FacetRequest/Params/Result, you directly instantiate > the classes that do facet counting (currently TaxonomyFacetCounts, > RangeFacetCounts or SortedSetDVFacetCounts), passing in the > SimpleFacetsCollector, and then you interact with those classes to > pull labels + values (topN under a path, sparse, specific labels). > * At index time, no more FacetIndexingParams/CategoryListParams; > instead, you make a new SimpleFacetFields and pass it the field it > should store facets + drill downs under. If you want more than > one CLI you create more than one instance of SimpleFacetFields. > * I added a simple schema, where you state which dimensions are > hierarchical or multi-valued. From this we decide how to index > the ordinals (no more OrdinalPolicy). > Sparse faceting is just another method (getAllDims), on both taxonomy > & ssdv facet classes. > I haven't created a common base class / interface for all of the > search-time facet classes, but I think this may be possible/clean, and > perhaps useful for drill sideways. > All the new classes are under oal.facet.simple.*. > Lots of things that don't work yet: drill sideways, complements, > associations, sampling, partitions, etc. This is just a start ... -- This message was sent by Atlassian JIRA (v6.1#6144) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org