Hi Matt,

That's very interesting, thanks for the response!
Did you have any issues with Taxonomy indexing performance, or try to optimize it somehow?
Also, did you run into any problems maintaining a sidecar index, or have any experience building a distributed system around it with sharding/rebalancing?

--
Regards,
Alex

On Wed, Apr 28, 2021 at 11:18 AM Matt Davis <kryptonics...@gmail.com> wrote:

> Alex,
>
> With our Lucene-based implementation of Zulia
> (https://github.com/zuliaio/zuliasearch) we have gone back and forth. We
> started with Taxonomy, switched, and then switched back to Taxonomy. In
> our experience the Taxonomy-based approach is more scalable and
> performant. We do large searches (sometimes returning millions of
> results) with about 20 facets being run, including some high-cardinality
> facets. A small-dataset version of the tool, backed by Zulia, that we
> released for COVID can be found here:
> https://icite.od.nih.gov/covid19/search/#search:searchId=6089a5b7218c6902d422e907
> If you click on the facet tab you can see how we use facets. I believe the
> use case might largely drive the choice.
>
> Thanks,
> Matt
>
> On Wed, Apr 28, 2021 at 1:26 PM Alexander Lukyanchikov <
> alexanderlukyanchi...@gmail.com> wrote:
>
> > Hello everyone,
> >
> > We are trying to choose between the Taxonomy and
> > SortedSetDocValuesFacetField implementations for faceted search, and
> > based on available information and our quick tests, the differences are
> > the following:
> >
> > - Taxonomy is faster at query time (on our test workload, the difference
> >   is sometimes higher than the documented 25%). Also, SortedSet adds
> >   latency to an NRT refresh.
> > - Taxonomy is slower at index time and, unlike the SortedSet
> >   implementation, it does not scale as well with more than 4 threads
> >   (a lot of contention at the DirectoryTaxonomyWriter#addCategory() and
> >   UTF8TaxonomyWriterCache.get() synchronized blocks).
> > - SortedSet does not support hierarchical queries.
> > - SortedSet does not require a sidecar index.
> > - Tie-break differences for labels with the same count.
> >
> > Am I missing something, or is that everything we should take into
> > account as of today?
> >
> > I know that Solr and ES use their own faceting for historical reasons,
> > but are there any other large Lucene-based products that have chosen one
> > implementation over the other? Do we know why?
> > Any insight on lesser-known trade-offs and production experience is
> > greatly appreciated!
> >
> > --
> > Thank you,
> > Alex
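
To make the hierarchical-query and index-time contention points from the list above a bit more concrete, here is a minimal toy sketch in plain Java. It is not Lucene's implementation and uses no Lucene classes: `ToyTaxonomy`, `parentOf`, and `rollUpCounts` are hypothetical names, and `addCategory` only loosely mirrors the idea of the method mentioned in the thread. It assumes a category is a path of labels, that every path prefix gets its own integer ordinal, and that a single lock guards the label-to-ordinal map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a taxonomy-style category store. NOT Lucene's implementation. */
class ToyTaxonomy {
    // Maps a full path such as "/Genre/Fiction" to its ordinal.
    // Toy assumption: labels never contain '/'.
    private final Map<String, Integer> ordinals = new HashMap<>();
    // parents.get(ord) is the ordinal of ord's parent; the root has parent -1.
    private final List<Integer> parents = new ArrayList<>();

    ToyTaxonomy() {
        ordinals.put("", 0); // ordinal 0 is the root
        parents.add(-1);
    }

    /**
     * Returns the ordinal for a path, adding it and any missing ancestors.
     * One lock guards the whole map, so concurrent indexing threads
     * serialize here; this is the kind of contention described above.
     */
    synchronized int addCategory(String... path) {
        int parent = 0;
        StringBuilder key = new StringBuilder();
        for (String component : path) {
            key.append('/').append(component);
            Integer ord = ordinals.get(key.toString());
            if (ord == null) {
                ord = parents.size(); // next free ordinal
                ordinals.put(key.toString(), ord);
                parents.add(parent);
            }
            parent = ord;
        }
        return parent;
    }

    synchronized int parentOf(int ordinal) {
        return parents.get(ordinal);
    }

    synchronized int size() {
        return parents.size();
    }

    /** Counts one leaf ordinal per document and rolls each count up to its ancestors. */
    int[] rollUpCounts(int[] leafOrdinalPerDoc) {
        int[] counts = new int[size()];
        for (int leaf : leafOrdinalPerDoc) {
            for (int ord = leaf; ord > 0; ord = parentOf(ord)) {
                counts[ord]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        ToyTaxonomy taxo = new ToyTaxonomy();
        int fiction = taxo.addCategory("Genre", "Fiction");
        int poetry = taxo.addCategory("Genre", "Poetry");
        // Three documents: two tagged Fiction, one tagged Poetry.
        int[] counts = taxo.rollUpCounts(new int[] {fiction, fiction, poetry});
        System.out.println(Arrays.toString(counts));
    }
}
```

In this toy model the parent array is what makes a hierarchical drill-down cheap: a count for "Genre" falls out of the counts for its children. The same shared map is also why every indexing thread has to pass through one lock, and a separate store of this kind is roughly what the "sidecar index" in the list refers to. A flat per-segment ordinal scheme avoids both, at the cost of the parent links.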