[jira] [Commented] (LUCENE-5339) Simplify the facet module APIs

Shai Erera (JIRA) Sun, 17 Nov 2013 05:31:35 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-5339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13824833#comment-13824833
 ]


Shai Erera commented on LUCENE-5339:
------------------------------------

{quote}
I'm not sure how this can work, since in order to write the ords we
need to see all FacetFields? Ie, at what point would we compile all
the FacetFields into the BDV field?
{quote}

I was thinking when Doc.indexableFields() is called?

{quote}
Hmm, we could open that up, but ... I think that's "too late"? You
can't easily know which dim ords to add back at that point. I added a
TODO.
{quote}

Maybe that's the wrong extension point, but what I had in mind is something 
similar to what FacetFields does today -- it adds the categories to the 
TaxoIndex and receives their ordinals. Then it calls a CategoryListBuilder which 
asks for the parent of each ordinal until it hits ROOT (depending on OrdPolicy of 
course). I mentioned dedupAndEncode because I thought it does something like 
that (i.e. that you've inlined CategoryListBuilder in FacetIW). If it doesn't, 
then whatever method does that ... and if there is none, let's wrap it in 
an overridable method?
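
To make the parent walk described above concrete, here is a minimal sketch. It 
is not the FacetFields or CategoryListBuilder code: the class name, the 
parents array (parents[ord] is the parent of ord) and the includeRoot flag 
standing in for an ordinal policy are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ParentWalkSketch {
    // Assumption for this sketch: the taxonomy root has ordinal 0.
    static final int ROOT = 0;

    // Collects the ordinal and its ancestors by following parents[] up to ROOT.
    // includeRoot is a hypothetical stand-in for an ordinal policy decision.
    static List<Integer> ordinalWithParents(int ordinal, int[] parents, boolean includeRoot) {
        List<Integer> out = new ArrayList<>();
        int ord = ordinal;
        while (ord > ROOT) {
            out.add(ord);
            ord = parents[ord];
        }
        if (includeRoot) {
            out.add(ROOT);
        }
        return out;
    }
}
```

An overridable hook could wrap exactly this step, so that a subclass decides 
which ancestor ordinals are written for each category.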

bq. With the simplified APIs this user could just make a custom facet method?

He already does that with the current APIs too (a special FacetsAccumulator). I 
don't know if it's easier/harder/the same to do w/ the new API. I guess it 
won't be harder.

{quote}
You're right, the ords cache filling will be another place that "bakes
in" the decoding. So, I agree: if we can find a clean way to abstract
the encoding/source then let's pursue that.
{quote}

As I said, let's divide that into two problems: API and optimization. For API, 
we can stick w/ CategoryListIterator and implement both a 
DGapVIntBinaryDVIterator as well as OrdinalsCacheIterator. That way, 
FacetsSomething (do we have a name yet? Is it just Facets?) can use a CLI if 
they don't care where the ordinals come from.

For optimization, we do a FastFacetCounts which inlines dgap+vint and reads 
from BDV, and we can also do a CachedOrdsFacetCounts which inlines the 
interaction with OrdinalsCache. Actually, if we provide these two, we can skip 
the third FacetCounts (uses CLI), as it will be for demo purposes only given 
current encoding. If anyone changes the encoding, he can write a FacetCounts. 
Also, we can always add it later ...
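
For readers unfamiliar with dgap+vint, the following is a rough sketch of the 
idea: sorted ordinals are stored as gaps from the previous ordinal, and each 
gap is written as a variable-length integer. The class name is made up, and 
the byte order used here (low-order 7 bits first) is an assumption; the facet 
module's actual encoder may lay the bytes out differently.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class DGapVIntSketch {
    // Encodes sorted, non-negative ordinals as deltas, each delta as a vint.
    static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int delta = ord - prev;
            prev = ord;
            // Emit 7 bits per byte; the high bit marks "more bytes follow".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    // Reverses encode(): reads vints and accumulates the deltas.
    static int[] decode(byte[] bytes) {
        int[] ords = new int[bytes.length]; // at most one ordinal per byte
        int count = 0;
        int prev = 0;
        int i = 0;
        while (i < bytes.length) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[i++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ords[count++] = prev;
        }
        return Arrays.copyOf(ords, count);
    }
}
```

A counting class that inlines this would run the decode loop directly and 
increment counts[prev] instead of filling an intermediate array.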

The rest of the Facets (i.e. non-counts) should IMO at this point use the CLI 
abstraction. If anyone wants to optimize a SumValueSourceFacets, he can do so 
however he wants. But the CLI is the abstraction I'm thinking of -- it only has 
two methods: setNextReader and getOrdinals(int doc).
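
As a sketch of why a two-method abstraction is enough for generic counting, 
here is a simplified stand-in. The OrdinalsSource interface below is 
hypothetical and uses plain ints and arrays rather than Lucene's reader 
context and IntsRef types; CachedOrdinals plays the role of an ordinals-cache 
backed implementation, and count() does not care where the ordinals come from.

```java
class CliSketch {
    // Hypothetical, simplified stand-in for CategoryListIterator.
    interface OrdinalsSource {
        // Positions the source on a segment; returns false if it has no ordinals there.
        boolean setNextReader(int segment);

        // Returns the ordinals of a segment-local doc.
        int[] getOrdinals(int doc);
    }

    // In-memory implementation, indexed as [segment][doc][ordinals].
    static final class CachedOrdinals implements OrdinalsSource {
        private final int[][][] ordsBySegment;
        private int[][] current;

        CachedOrdinals(int[][][] ordsBySegment) {
            this.ordsBySegment = ordsBySegment;
        }

        @Override
        public boolean setNextReader(int segment) {
            if (segment < 0 || segment >= ordsBySegment.length) {
                return false;
            }
            current = ordsBySegment[segment];
            return true;
        }

        @Override
        public int[] getOrdinals(int doc) {
            return current[doc];
        }
    }

    // Generic counting over any OrdinalsSource; hitsBySegment[seg] lists matching docs.
    static int[] count(OrdinalsSource source, int[][] hitsBySegment, int numOrds) {
        int[] counts = new int[numOrds];
        for (int seg = 0; seg < hitsBySegment.length; seg++) {
            if (!source.setNextReader(seg)) {
                continue;
            }
            for (int doc : hitsBySegment[seg]) {
                for (int ord : source.getOrdinals(doc)) {
                    counts[ord]++;
                }
            }
        }
        return counts;
    }
}
```

A BDV-backed implementation would decode the ordinals inside getOrdinals 
instead of reading them from an array; count() would stay unchanged.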

{quote}
I think this is a precarious balance. If a little code dup can
greatly simplify the APIs, then that's the better tradeoff.
{quote}

In general I agree. It then becomes what's considered little vs a lot of code 
dup. I think that dgap+vint + rollup is not little (put together), as well as 
making the decision to rollup. But at this point I don't mind .. let's force 
code dup, and then simplify if users are angry.

{quote}
I agree we need better abstraction here... the 3 int we require per
unique facet label is costly. But, I don't think we need to force
SSDVFacets to use this abstraction?
{quote}

Well, at first what I had in mind is a Taxonomy interface (read-only) with two 
implementations: TaxoIndex and SortedSet. Especially since we're talking about 
supporting full hierarchies w/ SortedSet. I think it will be cool if a 
TaxoFacetCounts just takes a Taxonomy and we don't need to duplicate the code. 
But then I realized it's not just how the taxonomy is managed, but also where 
the ords are pulled from (BDV vs SSDV).

We basically could have a SortedSetTaxonomy and SortedSetCLI to support any 
'general' counting, but then as you pointed out somewhere, a SortedSetCLI may 
not be able to optimize by counting in seg-ord space and re-map afterwards. So 
at this point I'm not sure we should pursue the implementation of a SortedSet 
Taxonomy and CLI. Would be nice to know the APIs allow that (even if it means 
you lose some performance, e.g. always count in global ord-space), should 
anyone want to do that. Let's drop it for now.
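
To illustrate the trade-off mentioned above, this sketch contrasts counting in 
segment-ord space and remapping once per segment with remapping every ordinal 
to the global space as it is read. The class, the segToGlobal mapping arrays 
and the data layout are assumptions for illustration, not the SortedSet facet 
code.

```java
class SegOrdRemapSketch {
    // ords is indexed as [segment][doc][segment-local ordinals];
    // segToGlobal[seg][segOrd] is the matching global ordinal.

    // Counts in segment-ord space, then remaps each non-zero count once.
    static int[] countPerSegmentThenRemap(int[][][] ords, int[][] segToGlobal, int numGlobalOrds) {
        int[] global = new int[numGlobalOrds];
        for (int seg = 0; seg < ords.length; seg++) {
            int[] segCounts = new int[segToGlobal[seg].length];
            for (int[] docOrds : ords[seg]) {
                for (int segOrd : docOrds) {
                    segCounts[segOrd]++;
                }
            }
            for (int segOrd = 0; segOrd < segCounts.length; segOrd++) {
                if (segCounts[segOrd] != 0) {
                    global[segToGlobal[seg][segOrd]] += segCounts[segOrd];
                }
            }
        }
        return global;
    }

    // Remaps every ordinal as it is read: one lookup per (doc, ordinal) pair.
    static int[] countInGlobalSpace(int[][][] ords, int[][] segToGlobal, int numGlobalOrds) {
        int[] global = new int[numGlobalOrds];
        for (int seg = 0; seg < ords.length; seg++) {
            for (int[] docOrds : ords[seg]) {
                for (int segOrd : docOrds) {
                    global[segToGlobal[seg][segOrd]]++;
                }
            }
        }
        return global;
    }
}
```

Both methods produce the same counts; the first does one mapping lookup per 
distinct segment ordinal, while the second does one per ordinal occurrence, 
which is the cost a general iterator-based approach would likely pay.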

> Simplify the facet module APIs
> ------------------------------
>
>                 Key: LUCENE-5339
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5339
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5339.patch, LUCENE-5339.patch
>
>
> I'd like to explore simplifications to the facet module's APIs: I
> think the current APIs are complex, and the addition of a new feature
> (sparse faceting, LUCENE-5333) threatens to add even more classes
> (e.g., FacetRequestBuilder).  I think we can do better.
> So, I've been prototyping some drastic changes; this is very
> early/exploratory and I'm not sure where it'll wind up but I think the
> new approach shows promise.
> The big changes are:
>   * Instead of *FacetRequest/Params/Result, you directly instantiate
>     the classes that do facet counting (currently TaxonomyFacetCounts,
>     RangeFacetCounts or SortedSetDVFacetCounts), passing in the
>     SimpleFacetsCollector, and then you interact with those classes to
>     pull labels + values (topN under a path, sparse, specific labels).
>   * At index time, no more FacetIndexingParams/CategoryListParams;
>     instead, you make a new SimpleFacetFields and pass it the field it
>     should store facets + drill downs under.  If you want more than
>     one CLI you create more than one instance of SimpleFacetFields.
>   * I added a simple schema, where you state which dimensions are
>     hierarchical or multi-valued.  From this we decide how to index
>     the ordinals (no more OrdinalPolicy).
> Sparse faceting is just another method (getAllDims), on both taxonomy
> & ssdv facet classes.
> I haven't created a common base class / interface for all of the
> search-time facet classes, but I think this may be possible/clean, and
> perhaps useful for drill sideways.
> All the new classes are under oal.facet.simple.*.
> Lots of things that don't work yet: drill sideways, complements,
> associations, sampling, partitions, etc.  This is just a start ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)

