[jira] [Commented] (SOLR-7296) Reconcile facetting implementations

Shai Erera (JIRA) Wed, 25 Mar 2015 21:14:51 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381327#comment-14381327
 ]


Shai Erera commented on SOLR-7296:
----------------------------------

bq. I'm wondering if it makes sense to consolidate all of the available 
implementations into the Lucene API.

I'm really glad you brought it up, because I was about to do the same :). A long 
time ago I started looking at how to reconcile Solr faceting with Lucene's, 
as I feel it would be good if Lucene had a rich faceting module, including 
all the goodness of Solr's, and if Solr could reuse that module, adding sugar 
APIs (such as JSON request/responses, caches, schema etc.). This has always been 
my view on many components, except maybe the distribution pieces, which may 
not have a place of their own in Lucene (although some might disagree).

But I'll admit that the FacetComponent looked scary and complicated, and I 
didn't have much time (at the time) to dig deep into it and attempt a 
refactoring, so I gave up. If there are reconciliation attempts now, I definitely 
think we should revisit this.

bq. My main concern is the use of a central taxonomy service as it seems to 
collide with SolrCloud. I do not know if it is possible to avoid this service 
and what the cost would be.

Well first, the module offers two code paths -- with and without a sidecar 
taxonomy index. The original module came with the sidecar taxonomy index, and 
that path has been used a lot more than the other one and is therefore also 
richer in functionality. The second path uses DocValues only (SortedSet), which 
is what I believe Solr's DocValues faceting uses.

Our benchmarks (I can look up the results in one of the many refactoring issues 
[~mikemccand] and I worked on) show that the taxonomy path performs something 
like 20% better than DocValues, since it doesn't need to map facet ordinals 
across segments (i.e. in seg1 "Author/John Doe" received ordinal=3 and in seg2 
ordinal=89). It also has a smaller memory footprint, although that's a bit 
debatable since TaxonomyReader maintains a cache...
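
To make the per-segment ordinal mapping concrete, here is a minimal sketch in plain Java. It uses no Lucene classes, and the class and method names (GlobalOrdinalSketch, globalLabels, ordinalMap, mergeCounts) are hypothetical names chosen for illustration. It assumes each segment numbers its own sorted labels independently, so per-segment counts have to be remapped onto a merged ordinal space before they can be summed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustration only: not Lucene's implementation, all names are hypothetical.
class GlobalOrdinalSketch {

    // Merge the label lists of all segments into one sorted, de-duplicated list.
    // The position of a label in this list is its "global" ordinal.
    static List<String> globalLabels(List<List<String>> segments) {
        TreeSet<String> merged = new TreeSet<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        return new ArrayList<>(merged);
    }

    // For one segment, build the array segmentOrdinal -> globalOrdinal.
    static int[] ordinalMap(List<String> segmentLabels, List<String> global) {
        int[] map = new int[segmentLabels.size()];
        for (int ord = 0; ord < map.length; ord++) {
            map[ord] = global.indexOf(segmentLabels.get(ord));
        }
        return map;
    }

    // Sum per-segment counts into global counts by going through the maps.
    static int[] mergeCounts(List<List<String>> segments, List<int[]> perSegmentCounts) {
        List<String> global = globalLabels(segments);
        int[] total = new int[global.size()];
        for (int s = 0; s < segments.size(); s++) {
            int[] map = ordinalMap(segments.get(s), global);
            int[] counts = perSegmentCounts.get(s);
            for (int ord = 0; ord < counts.length; ord++) {
                total[map[ord]] += counts[ord];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // "Author/John Doe" has ordinal 1 in both segments here, but ordinal 2 globally.
        List<List<String>> segments = List.of(
            List.of("Author/Jane Roe", "Author/John Doe"),
            List.of("Author/Alice", "Author/John Doe", "Author/Zed"));
        List<int[]> counts = List.of(new int[] {2, 5}, new int[] {1, 3, 4});
        int[] total = mergeCounts(segments, counts);
        List<String> global = globalLabels(segments);
        for (int i = 0; i < total.length; i++) {
            System.out.println(global.get(i) + " = " + total[i]);
        }
    }
}
```

With a taxonomy index the same label gets the same integer in every segment, so the remapping step in mergeCounts is not needed.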

A third comment is about range faceting, which uses NumericDocValues rather 
than SortedSet or the taxonomy, so at least there the user no longer has to 
choose.

There are some features that exist in the taxonomy path and not in the 
DocValues path for two main reasons:

(1) We started the new path using DocValues with counting only, in order to 
explore the DV path, to get APIs straightened out and to get feedback from 
users.

(2) The taxonomy path allows you to associate an arbitrary byte[] with a facet 
value, e.g. Topic/Apache Solr (0.87). If this facet was generated by a 
classification component, you can associate the confidence score with the 
facet, which you can later factor in when you "weigh" facet values. I no longer 
use the word "counting" because associations let you compute other functions 
over the facet values, not just a count.
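
The difference between counting and weighing can be sketched as follows. This is plain Java with hypothetical names (WeightedFacetSketch, Association), not the facet module's API; it assumes each matching document contributes one facet label plus a float confidence, and aggregates the same hits once as a count and once as a sum of confidences.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: hypothetical names, not the Lucene facet module API.
class WeightedFacetSketch {

    // One facet occurrence on one matching document, with its associated confidence.
    static final class Association {
        final String facet;
        final float confidence;

        Association(String facet, float confidence) {
            this.facet = facet;
            this.confidence = confidence;
        }
    }

    // Plain counting: every occurrence contributes 1.
    static Map<String, Integer> count(List<Association> hits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Association a : hits) {
            counts.merge(a.facet, 1, Integer::sum);
        }
        return counts;
    }

    // Weighing: every occurrence contributes its associated confidence.
    static Map<String, Float> sumConfidence(List<Association> hits) {
        Map<String, Float> weights = new LinkedHashMap<>();
        for (Association a : hits) {
            weights.merge(a.facet, a.confidence, Float::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<Association> hits = List.of(
            new Association("Topic/Apache Solr", 0.87f),
            new Association("Topic/Apache Solr", 0.10f),
            new Association("Topic/Lucene", 0.95f));
        System.out.println("counts:  " + count(hits));
        System.out.println("weights: " + sumConfidence(hits));
    }
}
```

In this made-up data "Topic/Apache Solr" wins on count (2 vs. 1) while the two facets are nearly tied on summed confidence, which is the kind of distinction associations make possible.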

The DocValues path didn't make it possible, because we don't have a DV type 
that can handle that. SortedSet is almost perfect, as it computes the string 
ordinals, but it doesn't allow you to associate a byte[]. Perhaps we should 
explore that. The only DV type that allows you to do that today is Binary, but 
then you don't get the ordinals (efficient encoding and lookups), unless you 
use the taxonomy index. The taxonomy can be viewed as a hierarchical 
Map<String,Integer> in that regard, and that's how the taxonomy path works -- 
it assigns an integer ID to every facet value, and in the search index we use 
BinaryDV to encode the <facetID, byte[]> pairs.
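
As a rough illustration of the "hierarchical Map<String,Integer>" view and of packing <facetID, byte[]> pairs into a binary value, here is a minimal plain-Java sketch. The class TaxonomySketch, its methods, and the byte layout (4-byte ordinal, 4-byte length, payload) are assumptions made up for this example; the real taxonomy index and its on-disk encoding differ.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: hypothetical names and byte layout, not Lucene's format.
class TaxonomySketch {

    private final Map<String, Integer> ordinals = new LinkedHashMap<>();

    // Returns the ordinal for a path such as "Topic/Apache Solr". Missing
    // ancestors are added first, so a parent always has a smaller ordinal.
    int addCategory(String path) {
        Integer existing = ordinals.get(path);
        if (existing != null) {
            return existing;
        }
        int slash = path.lastIndexOf('/');
        if (slash > 0) {
            addCategory(path.substring(0, slash));
        }
        int ordinal = ordinals.size();
        ordinals.put(path, ordinal);
        return ordinal;
    }

    // Pack one <ordinal, payload> pair: 4-byte ordinal, 4-byte length, payload bytes.
    static byte[] encode(int ordinal, byte[] payload) {
        return ByteBuffer.allocate(8 + payload.length)
            .putInt(ordinal)
            .putInt(payload.length)
            .put(payload)
            .array();
    }

    // Read back the ordinal stored in the first 4 bytes.
    static int decodeOrdinal(byte[] encoded) {
        return ByteBuffer.wrap(encoded).getInt();
    }

    // Read back a payload that is assumed to hold a single 4-byte float.
    static float decodeFloatPayload(byte[] encoded) {
        ByteBuffer buffer = ByteBuffer.wrap(encoded);
        buffer.getInt(); // skip ordinal
        buffer.getInt(); // skip payload length
        return buffer.getFloat();
    }

    public static void main(String[] args) {
        TaxonomySketch taxonomy = new TaxonomySketch();
        int ordinal = taxonomy.addCategory("Topic/Apache Solr");
        byte[] confidence = ByteBuffer.allocate(4).putFloat(0.87f).array();
        byte[] encoded = encode(ordinal, confidence);
        System.out.println(decodeOrdinal(encoded) + " -> " + decodeFloatPayload(encoded));
    }
}
```

The point of the sketch is the split of responsibilities: the map lives outside the search index and hands out stable integers, while the search index only stores those integers next to the opaque payload.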

I started to explore ways to do both, i.e. create a taxonomy inside the search 
index, but I never had time to complete it. The idea was to use SortedSetDV and 
its ordinals on one hand, but still encode facets in a BinaryDV to allow you to 
associate values and compute functions. It didn't look simple and I still felt 
(and feel) that we should offer another type of DV which does both. Something 
like a CompositeDV which allows you to use SortedSet and Numeric/Binary DVs 
together.

The taxonomy path also includes hierarchical support. Mike and I talked about 
adding it to the DV path as well, IIRC using similar techniques as what Solr 
does -- on indexing {{Category/Computer Science/Apache/Lucene}}, we would index 
{{1/Category}}, {{2/Category/Computer Science}}... but we wanted to avoid the 
encoding of Category over and over again (since the prefix now contains the 
level). This is one of the things that I think can be added to the DV path 
(progress, not perfection :)), but I guess we didn't have time and nobody else 
thought it was important enough to contribute a patch.
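
The level-prefixed encoding described above can be sketched in a few lines of plain Java. The class and method names (HierarchyTokensSketch, levelTokens) are hypothetical, and the numbering starts at 1 only because the example above does; this is not taken from Solr's code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical helper, not Solr's or Lucene's implementation.
class HierarchyTokensSketch {

    // Expand "A/B/C" into one token per level: "1/A", "2/A/B", "3/A/B/C".
    static List<String> levelTokens(String path) {
        String[] parts = path.split("/");
        List<String> tokens = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (int level = 1; level <= parts.length; level++) {
            if (level > 1) {
                prefix.append('/');
            }
            prefix.append(parts[level - 1]);
            tokens.add(level + "/" + prefix);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : levelTokens("Category/Computer Science/Apache/Lucene")) {
            System.out.println(token);
        }
    }
}
```

Every token repeats the full prefix, so "Category" is stored once per level; that repetition is the cost mentioned above.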

bq. IIRC it requires a sidecar index, which is probably its main negative.

I never really understood that, I'll admit. Doesn't Solr already do that with 
other components such as spelling and suggester? Don't they carry sidecar files 
such as their dictionaries? From what I know, AnalyzingInfixSuggester builds 
its own sidecar Lucene index -- why is that not a negative for Solr, but a 
persistent sidecar Map<String,Integer> is?

Lucene's replication module (which, as a side note, I also think should be 
reconciled w/ Solr's replication) even handles replicating a taxonomy index 
together with a search index. I assume Solr does something to replicate 
suggester dictionaries when a node peer-syncs?

-------

To conclude, I truly believe it would be beneficial to reconcile Solr faceting 
with Lucene's. Terms faceting can be added to Lucene as-is: it won't collide 
with anything, and I believe it can just live under its own 
o.a.l.facet.terms package. DocValues faceting can be integrated with the 
current DV faceting. Range already exists - we should add all of Solr's 
goodness to it. Hierarchical DV faceting can be added just like Solr does it 
today (I admit I don't know how it does it today though...) and we can improve 
later. 

And perhaps people should stop worrying about the sidecar taxonomy index, as Solr 
already carries sidecars today ;).

If you guys want to tag team on it, I'll gladly help. I know the Lucene side of 
faceting (and I'm not stuck on it - I'm open to changes!), I need someone who 
knows the Solr side. Even if that someone knows it at a shallow level, he's 
already an expert compared to me :).

> Reconcile facetting implementations
> -----------------------------------
>
>                 Key: SOLR-7296
>                 URL: https://issues.apache.org/jira/browse/SOLR-7296
>             Project: Solr
>          Issue Type: Task
>          Components: faceting
>            Reporter: Steve Molloy
>
> SOLR-7214 introduced a new way of controlling faceting, the umbrella 
> SOLR-6348 brings a lot of improvements in facet functionality, namely around 
> pivots. Both make a lot of sense from a user perspective, but currently have 
> completely different implementations. With the analytics components, this 
> makes 3 implementations of the same logic, which are bound to behave 
> differently as time goes by. We should reconcile all implementations to ease 
> maintenance and offer consistent behaviour no matter how parameters are 
> passed to the API.



