[jira] [Commented] (LUCENE-4769) Add a CountingFacetsAggregator which reads ordinals from a cache

Shai Erera (JIRA) Mon, 11 Feb 2013 20:37:15 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576367#comment-13576367
 ]


Shai Erera commented on LUCENE-4769:
------------------------------------

I didn't propose that we add a DV format, I was saying that if there was one, 
then a DirectFacets format would make sense, b/c the app wouldn't need to write 
special code to work with it ... it would just return the ints more efficiently.

And we're abusing DV now, just like we abused payloads before, so nothing has 
changed :).

I did propose on another issue (forgot where, maybe the migration layer issue?) 
to develop a FacetsCodec, but you were against it. Perhaps after you worked on 
DV 2.0 you now think it makes more sense? It will solve a slew of problems I 
think.

This FacetsCodec is mimicked today by CategoryListIterator, which exposes that 
getInts API. But Mike and I saw that the DV abstraction (getBytes) + CLI 
(getInts) hurts performance, therefore the *fast* aggregators / collectors 
sidestep the CLI abstraction and use only DV. On LUCENE-4764, Mike sidesteps 
the DV abstraction too, which results in more duplicated code. I'm all for 
those specializations, but the code becomes harder to maintain. I just think of 
all the places we'd need to change if someone finds a better encoding than 
gap+vint :). 
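
For readers who haven't looked at the encoding: below is a minimal, 
self-contained sketch of what "gap+vint" means, i.e. store each sorted ordinal 
as the delta to the previous one and write each delta as a variable-length int 
(7 bits per byte, high bit = "more bytes follow"). The class and method names 
are made up for illustration, and the byte order is not necessarily the layout 
the facet module's encoders use; the point is only that every specialized 
aggregator which inlines this decode loop has to change if the encoding does.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical illustration class, not part of Lucene.
public class GapVIntSketch {

    /** Encodes sorted, non-negative ordinals as gaps, each gap written as a VInt. */
    public static byte[] encode(int[] sortedOrdinals) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrdinals) {
            int gap = ord - prev;               // delta to the previous ordinal
            prev = ord;
            while ((gap & ~0x7F) != 0) {        // more than 7 bits left to write
                out.write((gap & 0x7F) | 0x80); // low 7 bits + continuation flag
                gap >>>= 7;
            }
            out.write(gap);                     // last byte, continuation flag clear
        }
        return out.toByteArray();
    }

    /** Decodes the bytes produced by encode() back into absolute ordinals. */
    public static int[] decode(byte[] bytes) {
        int[] ords = new int[bytes.length];     // upper bound: >= 1 byte per ordinal
        int count = 0, prev = 0, i = 0;
        while (i < bytes.length) {
            int b = bytes[i++];
            int gap = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[i++];
                gap |= (b & 0x7F) << shift;
            }
            prev += gap;                        // undo the delta
            ords[count++] = prev;
        }
        return Arrays.copyOf(ords, count);
    }

    public static void main(String[] args) {
        int[] ords = {3, 7, 8, 300};
        byte[] encoded = encode(ords);
        System.out.println(encoded.length + " bytes -> " + Arrays.toString(decode(encoded)));
    }
}
```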

Plus, the specialization doesn't serve the different facet features. I.e. if 
I'm interested in fast sum-score, I need to write a specialized one. If I'm 
interested in fast sum-association, I need to write one. Just to be clear, I'm 
not complaining and I think it makes sense for expert apps to write some 
specialized code. What I am saying is that if we could make the abstractions 
FAST, then we'd lower the bar of when apps would need to do that ...

So far, our latest optimizations pertain only to the counting case. It is the 
common case and I think it's important that we did that. Perhaps the rest of 
the API changes improved the other cases too, but it's clear that if we 
want to really speed them up, we should specialize them.

Maybe if we had a FacetsCodec, with a CategoryListFormat (an extension to 
Codec, private to Facets), then all facet features would benefit from 
LUCENE-4764 and this issue out of the box, because that format would expose 
what facets need: a getInts API. And if we make this one a Codec and FastDV a 
Codec, then we force the app to declare a special facets Codec anyway, so at 
least from that aspect, we won't require more ...
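
A rough sketch of the idea, under loud assumptions: none of the types below 
exist in Lucene; CategoryOrdinals only stands in for whatever a 
CategoryListFormat reader might expose, and the in-memory implementation 
replaces a real on-disk format. What it tries to show is that once there is a 
single getInts-style entry point, counting and sum-score are both written 
against it and neither needs to know how the ordinals are encoded.

```java
// Hypothetical illustration class, not part of Lucene.
public class GetIntsSketch {

    /** Hypothetical per-document ordinals reader (the "getInts API"). */
    public interface CategoryOrdinals {
        /** Returns the category ordinals of the given document. */
        int[] getInts(int docID);
    }

    /** Toy implementation backed by a plain int[][]; a real format would decode from disk. */
    public static CategoryOrdinals inMemory(int[][] perDoc) {
        return docID -> perDoc[docID];
    }

    /** Counting: one increment per (matching doc, ordinal) pair. */
    public static int[] countOrdinals(CategoryOrdinals source, int[] docs, int taxonomySize) {
        int[] counts = new int[taxonomySize];
        for (int doc : docs) {
            for (int ord : source.getInts(doc)) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** Sum-score: same traversal, but adds the doc's score; scores[i] belongs to docs[i]. */
    public static float[] sumScores(CategoryOrdinals source, int[] docs, float[] scores,
                                    int taxonomySize) {
        float[] sums = new float[taxonomySize];
        for (int i = 0; i < docs.length; i++) {
            for (int ord : source.getInts(docs[i])) {
                sums[ord] += scores[i];
            }
        }
        return sums;
    }
}
```

Swapping the encoding would then mean swapping the CategoryOrdinals 
implementation only; whether such an abstraction can be made as fast as the 
hand-specialized code is exactly the open question.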

And if we do a FacetsCodec w/ CategoryListFormat, then at first it can continue 
to abuse DV, but then in the future we can explore a different format to manage 
the per-document categories (and support category associations). Maybe even a 
way to manage the taxonomy in the main index, in its own data structure ...

Perhaps these two issues show the usefulness of having such a Codec?
                
> Add a CountingFacetsAggregator which reads ordinals from a cache
> ----------------------------------------------------------------
>
>                 Key: LUCENE-4769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4769
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-4769.patch
>
>
> Mike wrote a prototype of a FacetsCollector which reads ordinals from a 
> CachedInts structure on LUCENE-4609. I ported it to the new facets API, as a 
> FacetsAggregator. I think we should offer users the means to use such a 
> cache, even if it consumes more RAM. Mike's tests show that this cache consumed 
> 2x more RAM than if the DocValues were loaded into memory in their raw form. 
> Also, a PackedInts version of such a cache took almost the same amount of RAM 
> as a straight int[], but the gains were minor.
> I will post the patch shortly.
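
As an illustration of the kind of cache the issue summary above describes, 
here is a minimal sketch that keeps all ordinals decoded in one flat int[] 
with a per-document offsets array, so counting never touches the encoded 
bytes. The layout and all names are assumptions made for this example; it is 
not the CachedInts structure from LUCENE-4609 nor the attached patch.

```java
// Hypothetical illustration class, not part of Lucene.
public class CachedOrdinalsSketch {
    private final int[] offsets;   // offsets[doc] .. offsets[doc + 1] delimit the doc's ordinals
    private final int[] ordinals;  // all ordinals of all docs, already decoded

    /** Builds the cache once from already-decoded per-document ordinals. */
    public CachedOrdinalsSketch(int[][] perDoc) {
        offsets = new int[perDoc.length + 1];
        int total = 0;
        for (int doc = 0; doc < perDoc.length; doc++) {
            offsets[doc] = total;
            total += perDoc[doc].length;
        }
        offsets[perDoc.length] = total;
        ordinals = new int[total];
        for (int doc = 0; doc < perDoc.length; doc++) {
            System.arraycopy(perDoc[doc], 0, ordinals, offsets[doc], perDoc[doc].length);
        }
    }

    /** Counts ordinals over the matching docs with plain array reads, no decoding. */
    public int[] count(int[] matchingDocs, int taxonomySize) {
        int[] counts = new int[taxonomySize];
        for (int doc : matchingDocs) {
            for (int i = offsets[doc], end = offsets[doc + 1]; i < end; i++) {
                counts[ordinals[i]]++;
            }
        }
        return counts;
    }
}
```

The RAM cost is visible in the layout: a full 4-byte int per ordinal plus one 
offset per document, versus the compressed bytes the DocValues hold.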

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

