Shai Erera created LUCENE-4619:
----------------------------------

             Summary: Create a specialized path for facets counting
                 Key: LUCENE-4619
                 URL: https://issues.apache.org/jira/browse/LUCENE-4619
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/facet
            Reporter: Shai Erera


Mike and I have been discussing this on several issues (LUCENE-4600,
LUCENE-4602) and on GTalk. It looks like the current API abstractions may be
responsible for some of the performance loss that we see, compared to
specialized code.

During our discussion, we decided to target one specific use case, facet
counting, and to work on it top to bottom while reusing as much code as
possible. Specifically, we'd like to implement a FacetsCollector/Accumulator
which does only counting (i.e. respects only CountFacetRequest), with no
sampling, partitions or complements. The API allows us to do so very cleanly,
and in the context of this issue, we'd like to do the following:

* Implement a FacetsField which takes a TaxonomyWriter, FacetIndexingParams and
CategoryPath (List, Iterable, whatever) and adds the needed information to both
the taxonomy index and the search index.
** That API is similar in nature to CategoryDocumentBuilder, only easier to 
consume -- it's just another field that you add to the Document.
** We'll have two extensions of it: PayloadFacetsField and
DocValuesFacetsField, so that we can benchmark the two approaches. We believe
that one of them will eventually be eliminated, leaving us with just one
(hopefully the DV one).

* Implement a FacetsAccumulator/Collector which takes a bunch of
CountFacetRequests and returns the top counts.
** Aggregation is done during collection, rather than afterwards. Note that we
have LUCENE-4600 open for exploring that. Either we finish this exploration
here, or do it there. Just FYI that the issue exists.
** Reuses the CategoryListIterator, IntDecoder and Aggregator code. I'll open a
separate issue to explore making that API work in bulk, and then we can decide
if this specialized Collector should use those abstractions, or be really
optimized for the facet counting case.
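
To make the "count during collection" idea concrete, here is a minimal,
Lucene-free sketch. It is not the proposed implementation: the class
CountingFacetsSketch, its OrdCount helper and the method names collect/topK
are all hypothetical, and each document's category ordinals are assumed to be
already decoded into an int[]. It only shows the shape of the hot loop (one
array increment per ordinal) and a top-K selection over the candidate ordinals
of a single request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical, Lucene-free sketch of count-only facet aggregation. */
final class CountingFacetsSketch {

    /** Ordinal/count pair returned to the caller. */
    static final class OrdCount {
        final int ordinal;
        final int count;

        OrdCount(int ordinal, int count) {
            this.ordinal = ordinal;
            this.count = count;
        }
    }

    /** One slot per taxonomy ordinal (assumes ordinals are dense, 0..size-1). */
    private final int[] counts;

    CountingFacetsSketch(int taxonomySize) {
        this.counts = new int[taxonomySize];
    }

    /** Called once per matching document, so aggregation happens during collection. */
    void collect(int[] docOrdinals) {
        for (int ord : docOrdinals) {
            counts[ord]++;
        }
    }

    /** Returns up to k candidates with the highest non-zero counts, best first. */
    List<OrdCount> topK(int[] candidateOrdinals, int k) {
        // Min-heap on count; on equal counts the larger ordinal is evicted first.
        PriorityQueue<OrdCount> heap = new PriorityQueue<>((a, b) ->
                a.count != b.count
                        ? Integer.compare(a.count, b.count)
                        : Integer.compare(b.ordinal, a.ordinal));
        for (int ord : candidateOrdinals) {
            if (counts[ord] == 0) {
                continue; // never matched, not worth reporting
            }
            heap.add(new OrdCount(ord, counts[ord]));
            if (heap.size() > k) {
                heap.poll(); // drop the current weakest entry
            }
        }
        // The heap yields weakest first, so prepend to get best-first order.
        List<OrdCount> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(0, heap.poll());
        }
        return result;
    }
}
```

A caller would invoke collect() for every hit and then call topK() once per
request, passing the ordinals of the requested dimension's children.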

* At the moment, this path will assume that a document holds multiple 
dimensions, but only one value from each (i.e. no Author/Shai, Author/Mike for 
a document), and therefore use OrdPolicy.NO_PARENTS.
** Later, we'd like to explore how to have this specialized path handle the 
ALL_PARENTS case too, as it shouldn't be so hard to do.
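
One way to see why the single-value-per-dimension assumption matters: if only
leaf ordinals are counted, parent counts can be derived afterwards by adding
each ordinal's count to its parent. The sketch below is an illustration under
stated assumptions, not the planned code. ParentRollupSketch and rollUp are
hypothetical names, the parents array uses -1 for the root, and every parent
is assumed to have a smaller ordinal than its children. If a document carried
both Author/Shai and Author/Mike, this roll-up would count that document twice
under Author.

```java
/** Hypothetical sketch: derive parent counts from leaf-only counts. */
final class ParentRollupSketch {

    /**
     * parents[ord] is the parent ordinal of ord, or -1 for the root.
     * Assumes every parent has a smaller ordinal than all of its children,
     * so a single pass from the highest ordinal down sees children first.
     */
    static int[] rollUp(int[] leafCounts, int[] parents) {
        int[] counts = leafCounts.clone(); // leave the input untouched
        for (int ord = counts.length - 1; ord > 0; ord--) {
            int parent = parents[ord];
            if (parent >= 0) {
                counts[parent] += counts[ord];
            }
        }
        return counts;
    }
}
```

For example, with ordinals 0=root, 1=Author, 2=Author/Shai, 3=Author/Mike,
4=Date, 5=Date/2012, the parents array is {-1, 0, 1, 1, 0, 4}.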

