[
https://issues.apache.org/jira/browse/SOLR-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381327#comment-14381327
]
Shai Erera commented on SOLR-7296:
----------------------------------
bq. I'm wondering if it make sense to consolidate all of the available
implementations into the Lucene API.
I'm really glad you brought it up, because I was about to do the same :). A long
time ago I started looking at how to reconcile Solr faceting with Lucene's, as
I feel it would be good if Lucene had a rich faceting module, including all the
goodness of Solr's, and if Solr could reuse that module, adding sugar APIs
(such as JSON request/responses, caches, schema etc.). This has always been my
view on many components, except maybe the distribution pieces, which may not
have a place of their own in Lucene (although some might disagree).
But I'll admit that the FacetComponent looked scary and complicated, and I
didn't have much time (at the time) to dig deep into it and attempt a
refactoring, so I gave up. If there are reconciliation attempts now, I
definitely think we should revisit this.
bq. My main concern is the use of a central taxonomy service as it seems to
collide with SolrCloud. I do not know if it is possible to avoid this service
and what the cost would be.
Well first, the module offers two code paths -- with and without a sidecar
taxonomy index. The original module came with the sidecar taxonomy index and
it's been used a lot more than the other path, and is therefore also richer in
functionality. The second path uses DocValues only (SortedSet), which is what I
believe Solr's DocValues faceting uses.
Our benchmarks (I can look up the results in one of the many refactoring issues
[~mikemccand] and I worked on) show that the taxonomy path performs something
like 20% better than DocValues, since it doesn't need to map facet ordinals
across segments (e.g. in seg1 "Author/John Doe" received ordinal=3 and in seg2
ordinal=89). It also has a smaller memory footprint, although that's a bit
debatable since TaxonomyReader maintains a cache...
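To make the ordinal-mapping cost concrete, here is a minimal sketch in plain Python (not Lucene code). It assumes two hypothetical segments, each with its own sorted term dictionary, and shows the extra local-to-global translation that per-segment ordinals require before counts can be merged. All names and data are made up for illustration.

```python
from collections import Counter

# Two hypothetical segments; each has its own sorted term dictionary,
# so the same facet value gets a different ordinal in each segment.
seg1_terms = ["Author/Jane Roe", "Author/John Doe"]
seg2_terms = ["Author/Alice", "Author/Bob", "Author/John Doe"]

# Per-document ordinals, local to each segment.
seg1_docs = [[1], [0, 1]]
seg2_docs = [[2], [0], [2, 1]]


def build_global_ords(segments):
    """Merge per-segment term lists into one sorted global dictionary and
    return (global_terms, one local->global ordinal list per segment)."""
    global_terms = sorted(set(t for terms in segments for t in terms))
    index = {term: i for i, term in enumerate(global_terms)}
    return global_terms, [[index[t] for t in terms] for terms in segments]


def count_facets(seg_docs, ord_maps, global_terms):
    """Count facet values across segments, translating every local ordinal
    to its global ordinal first (the step a shared taxonomy avoids)."""
    counts = Counter()
    for docs, ord_map in zip(seg_docs, ord_maps):
        for doc_ords in docs:
            for local_ord in doc_ords:
                counts[ord_map[local_ord]] += 1
    return {global_terms[o]: c for o, c in counts.items()}


global_terms, ord_maps = build_global_ords([seg1_terms, seg2_terms])
facet_counts = count_facets([seg1_docs, seg2_docs], ord_maps, global_terms)
print(facet_counts)
```

With a single taxonomy, "Author/John Doe" would carry the same integer in every segment, so the `ord_map` lookup in the inner loop would not be needed.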
A third comment is about range faceting, which uses NumericDocValues, and not
SortedSet/Taxonomy anymore, so at least there the user doesn't have to choose
anymore.
There are some features that exist in the taxonomy path and not in the
DocValues path for two main reasons:
(1) We started the new path using DocValues with counting only, in order to
explore the DV path, to get APIs straightened out and to get feedback from
users.
(2) The taxonomy path allows you to associate an arbitrary byte[] with a facet
value, e.g. Topic/Apache Solr (0.87). If this facet was generated by a
classification component, you can associate the confidence score with the
facet, which you can later factor in when you "weigh" facet values. I no longer
use the word "counting" because this path allows you to compute multiple
functions over the associated values.
The DocValues path didn't make it possible, because we don't have a DV type
that can handle that. SortedSet is almost perfect, as it computes the string
ordinals, but it doesn't allow you to associate a byte[]. Perhaps we should
explore that. The only DV type that allows you to do that today is Binary, but
then you don't get the ordinals (efficient encoding and lookups), unless you
use the taxonomy index. The taxonomy can be viewed as a hierarchical
Map<String,Integer> in that regard, and that's how the taxonomy path works --
it assigns an integer ID to every facet value, and in the search index we use
BinaryDV to encode the <facetID, byte[]> pairs.
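As a rough illustration of the idea (not Lucene's actual classes or byte layout), the sketch below models the taxonomy as a path-to-integer map and packs <facetID, value> pairs into one binary blob per document. `TinyTaxonomy`, `encode`, `decode` and the 4-byte int plus 4-byte float layout are all assumptions made for this example.

```python
import struct
from collections import defaultdict


class TinyTaxonomy:
    """Hypothetical stand-in for a taxonomy: maps each facet path
    (and every ancestor path) to an integer ordinal."""

    def __init__(self):
        self.ordinals = {}
        self.paths = []

    def add(self, path):
        parts = path.split("/")
        # Register every ancestor so the hierarchy stays navigable.
        for depth in range(1, len(parts) + 1):
            prefix = "/".join(parts[:depth])
            if prefix not in self.ordinals:
                self.ordinals[prefix] = len(self.paths)
                self.paths.append(prefix)
        return self.ordinals[path]


def encode(pairs):
    """Pack (ordinal, confidence) pairs into one bytes value per document:
    a 4-byte big-endian int followed by a 4-byte big-endian float."""
    return b"".join(struct.pack(">if", o, w) for o, w in pairs)


def decode(blob):
    """Inverse of encode: yield (ordinal, confidence) tuples."""
    return [struct.unpack_from(">if", blob, off) for off in range(0, len(blob), 8)]


taxo = TinyTaxonomy()
solr = taxo.add("Topic/Apache Solr")
lucene = taxo.add("Topic/Apache Lucene")

# One binary value per document, as a BinaryDV field would hold.
docs = [encode([(solr, 0.87)]), encode([(solr, 0.5), (lucene, 0.25)])]

# "Weigh" facet values by summing confidences instead of counting hits.
weights = defaultdict(float)
for blob in docs:
    for ordinal, confidence in decode(blob):
        weights[taxo.paths[ordinal]] += confidence
```

Summing is only one possible function; the same decoded pairs could feed a max, an average, or any other aggregation.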
I started to explore ways to do both, i.e. create a taxonomy inside the search
index, but I never had time to complete it. The idea was to use SortedSetDV and
its ordinals on one hand, but still encode facets in a BinaryDV to allow you to
associate values and compute functions. It didn't look simple and I still felt
(and feel) that we should offer another type of DV which does both. Something
like a CompositeDV which allows you to use SortedSet and Numeric/Binary DVs
together.
The taxonomy path also includes hierarchical support. Mike and I talked about
adding it to the DV path as well, IIRC using similar techniques as what Solr
does -- on indexing {{Category/Computer Science/Apache/Lucene}}, we would index
{{1/Category}}, {{2/Category/Computer Science}}... but we wanted to avoid the
encoding of Category over and over again (since the prefix now contains the
level). This is one of the things that I think can be added to the DV path
(progress, not perfection :)), but I guess we didn't have time and nobody else
thought it was important enough to contribute a patch.
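The depth-prefixed encoding described above can be sketched as follows. This is a plain-Python illustration that follows the numbering used in this comment (depth starting at 1); the helper names are hypothetical and this is not a claim about how Solr implements it.

```python
def depth_prefixed_terms(path, sep="/"):
    """Expand a facet path into one indexed term per level, each prefixed
    with its depth, e.g. 'A/B' -> ['1/A', '2/A/B']."""
    parts = path.split(sep)
    return [
        f"{depth}{sep}{sep.join(parts[:depth])}"
        for depth in range(1, len(parts) + 1)
    ]


def children_prefix(path, sep="/"):
    """Term prefix that matches exactly the direct children of `path`."""
    depth = len(path.split(sep))
    return f"{depth + 1}{sep}{path}{sep}"


terms = depth_prefixed_terms("Category/Computer Science/Apache/Lucene")
for term in terms:
    print(term)

# Drill down one level below 'Category/Computer Science'.
prefix = children_prefix("Category/Computer Science")
children = [t for t in terms if t.startswith(prefix)]
```

Note how "Category" is repeated in every generated term; that repetition is the redundancy mentioned above.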
bq. IIRC it requires a sidecar index, which is probably its main negative.
I never really understood that, I'll admit. Doesn't Solr already do that with
other components such as spelling and suggester? Don't they carry sidecar files
such as their dictionaries? From what I know, AnalyzingInfixSuggester builds
its own sidecar Lucene index -- why is that not a negative for Solr, but a
persistent sidecar Map<String,Integer> is?
Lucene's replication module (which, as a side note, I also think should be
reconciled w/ Solr's replication) even handles replicating a taxonomy index
together with a search index. I assume Solr does something to replicate
suggester dictionaries when a node peer-syncs?
-------
To conclude, I truly believe it would be beneficial to reconcile Solr faceting
with Lucene's. Terms faceting can be added to Lucene just like that; it
won't collide with anything, and I believe it can just be under its own
o.a.l.facet.terms package. DocValues faceting can be integrated with the
current DV faceting. Range already exists - we should add all of Solr's
goodness to it. Hierarchical DV faceting can be added just like Solr does it
today (I admit I don't know how it does it today though...) and we can improve
later.
And perhaps people should stop worrying about the sidecar taxonomy index, as Solr
already carries sidecars today ;).
If you guys want to tag team on it, I'll gladly help. I know the Lucene side of
faceting (and I'm not stuck on it - I'm open to changes!), I need someone who
knows the Solr side. Even if that someone knows it at a shallow level, they're
already an expert compared to me :).
> Reconcile facetting implementations
> -----------------------------------
>
> Key: SOLR-7296
> URL: https://issues.apache.org/jira/browse/SOLR-7296
> Project: Solr
> Issue Type: Task
> Components: faceting
> Reporter: Steve Molloy
>
> SOLR-7214 introduced a new way of controlling faceting; the umbrella
> SOLR-6348 brings a lot of improvements in facet functionality, namely around
> pivots. Both make a lot of sense from a user perspective, but currently have
> completely different implementations. With the analytics components, this
> makes 3 implementations of the same logic, which are bound to behave
> differently as time goes by. We should reconcile all implementations to ease
> maintenance and offer consistent behaviour no matter how parameters are
> passed to the API.