[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs

Michael McCandless (JIRA) Thu, 14 Nov 2013 04:12:37 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822361#comment-13822361
 ]


Michael McCandless commented on LUCENE-5339:
--------------------------------------------

{quote}
Facet's Accumulator is similar to Lucene's Collector, the Aggregator is sort of 
a Scorer, and a FacetRequest is a sort of Query.
Actually the model after which the facets were designed was Lucene's.
The optional IndexingParams came before the IndexWriterConfig but these can be 
said to be similar as well.
{quote}

I appreciate those analogies but I think the two cases are very
different: I think faceting is (ought to be) far simpler than
searching.

bq. More low-level objects such as the CategoryListParams are not a must, and 
the user may never know about them (and btw, they are similar to Codecs).

Likewise, I don't think we need to expose "codec like control" /
pluggability over facet ords encoding at this point.

bq. I reviewed the patch (mostly the taxonomy related part) and I think that 
even without associations, counts only is a bit narrow.

I added ValueSource aggregation in the next patch, but not
associations; I think associations can come later (it's just another
index time and search time impl).

{quote}
Especially with large counts (say many thousands), the count doesn't say much 
because of the "long tail" problem.
When there's a large result set, all the categories will get high hit counts. 
And just as scoring by counting the number of query terms each document matches 
doesn't always make much sense (and I think all scoring functions do things a 
lot smarter), using counts for facets may at times yield irrelevant results.

We found that for large result sets, aggregating Lucene's score (rather 
than +1), or even score^2, yields better results for the user. Also, arbitrary 
corpus-specific expressions (with or without associations) change the facets' 
usability dramatically. That's partially why the code was built to allow 
different "aggregation" techniques, allowing associations, numeric values, 
etc. into each value for each category.
{quote}

I agree.

Do you think ValueSource faceting is sufficient for such apps?  Or do
they "typically" use associations?  Aren't associations only really
required in the multi-valued facet field case?
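
To make the count-versus-score point from the quote concrete, here is a
minimal sketch in plain Java. It is not the patch's API and not Lucene
code: the class FacetAggregationSketch, its Hit type, the labels and the
scores are all made up for illustration. It only shows how swapping the
per-hit weight (+1, score, score^2) can change which label comes out on top.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hypothetical illustration only: none of these names are Lucene classes.
final class FacetAggregationSketch {

    /** One matching document: its relevance score and the facet labels it carries. */
    static final class Hit {
        final double score;
        final List<String> labels;

        Hit(double score, String... labels) {
            this.score = score;
            this.labels = Arrays.asList(labels);
        }
    }

    /**
     * Adds weight(score) to every label of every hit.
     * s -> 1.0 gives plain counts, s -> s sums scores, s -> s * s sums squared scores.
     */
    static Map<String, Double> aggregate(List<Hit> hits, DoubleUnaryOperator weight) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Hit hit : hits) {
            double w = weight.applyAsDouble(hit.score);
            for (String label : hit.labels) {
                totals.merge(label, w, Double::sum);
            }
        }
        return totals;
    }

    /** Returns the label with the largest total, or null when there are no labels. */
    static String top(Map<String, Double> totals) {
        String best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            if (e.getValue() > bestValue) {
                best = e.getKey();
                bestValue = e.getValue();
            }
        }
        return best;
    }

    /** Made-up data: one highly relevant hit for "lisa", three weak hits for "bob". */
    static List<Hit> sampleHits() {
        return Arrays.asList(
            new Hit(4.0, "Author/lisa"),
            new Hit(0.5, "Author/bob"),
            new Hit(0.5, "Author/bob"),
            new Hit(0.5, "Author/bob"));
    }

    public static void main(String[] args) {
        List<Hit> hits = sampleHits();
        System.out.println("count:   " + top(aggregate(hits, s -> 1.0)));
        System.out.println("score:   " + top(aggregate(hits, s -> s)));
        System.out.println("score^2: " + top(aggregate(hits, s -> s * s)));
    }
}
```

With this sample data, plain counting ranks "Author/bob" first (3 hits
versus 1), while summing the score or score^2 ranks "Author/lisa" first
(4.0 versus 1.5, and 16.0 versus 0.75).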

bq. As for the new API, it may be useful if there would be a single "interface" 
- so all facets implementations could be switched easily, allowing users to 
experiment with the different implementations without writing a lot of code.

Yeah I think so too ... it's on the TODO list.  Especially, if the
FacetsConfig knows the facet method used by a given field, then we
could (almost) produce the right impl at search time.
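
A rough sketch of that idea, in plain Java with stand-in types: the names
FacetMethodDispatchSketch, Method, Counter and Config are hypothetical and
do not come from the patch. Only the three implementation names returned
as strings (TaxonomyFacetCounts, SortedSetDVFacetCounts, RangeFacetCounts)
are taken from the issue description. It assumes the config records one
facet method per dimension, and it fails loudly for a dimension it has
never been told about.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: these types are stand-ins, not the patch's API.
final class FacetMethodDispatchSketch {

    /** The three counting approaches named in the issue description. */
    enum Method { TAXONOMY, SORTED_SET_DV, RANGE }

    /** Stand-in for a shared search-time interface. */
    interface Counter {
        String implName();
    }

    /** Stand-in for a config that remembers which method each dimension uses. */
    static final class Config {
        private final Map<String, Method> methodByDim = new HashMap<>();

        Config setMethod(String dim, Method method) {
            methodByDim.put(dim, method);
            return this;
        }

        Method methodFor(String dim) {
            Method method = methodByDim.get(dim);
            if (method == null) {
                throw new IllegalArgumentException(
                    "no facet method configured for dimension: " + dim);
            }
            return method;
        }
    }

    /** Picks the counter for a dimension from the config alone. */
    static Counter counterFor(Config config, String dim) {
        switch (config.methodFor(dim)) {
            case TAXONOMY:
                return () -> "TaxonomyFacetCounts";
            case SORTED_SET_DV:
                return () -> "SortedSetDVFacetCounts";
            case RANGE:
                return () -> "RangeFacetCounts";
            default:
                throw new AssertionError("unhandled method");
        }
    }

    public static void main(String[] args) {
        Config config = new Config()
            .setMethod("Author", Method.TAXONOMY)
            .setMethod("Price", Method.RANGE);
        System.out.println(counterFor(config, "Author").implName());
        System.out.println(counterFor(config, "Price").implName());
    }
}
```

The caller asks only for a dimension; switching a dimension from one
method to another would then be a one-line config change rather than a
change to the search-time code.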


> Simplify the facet module APIs
> ------------------------------
>
>                 Key: LUCENE-5339
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5339
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5339.patch, LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder).  I think we can do better.
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
> The big changes are:
>   * Instead of *FacetRequest/Params/Result, you directly instantiate
>     the classes that do facet counting (currently TaxonomyFacetCounts,
>     RangeFacetCounts or SortedSetDVFacetCounts), passing in the
>     SimpleFacetsCollector, and then you interact with those classes to
>     pull labels + values (topN under a path, sparse, specific labels).
>   * At index time, no more FacetIndexingParams/CategoryListParams;
>     instead, you make a new SimpleFacetFields and pass it the field it
>     should store facets + drill downs under.  If you want more than
>     one CLI you create more than one instance of SimpleFacetFields.
>   * I added a simple schema, where you state which dimensions are
>     hierarchical or multi-valued.  From this we decide how to index
>     the ordinals (no more OrdinalPolicy).
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc.  This is just a start ...



