Interesting, Alex. So for your "merge" case, are you suggesting you would have a different taxonomy index for each segment and would need to merge those? I could be completely mistaken (I'm not nearly as familiar with the indexing side of things), but I thought Lucene maintains one single taxonomy index regardless of how many shards there are. It should be append-only: new ordinals are created when a category is first seen, and they then stay stable through merges. Or am I misunderstanding your use-case and you're actually doing some shard management on top of what Lucene is doing?
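To make the ordinal behavior concrete, here is a toy Python sketch. It is not Lucene code, and `ToyTaxonomy`, `add_category` and `add_taxonomy` are made-up names for illustration only. It assumes a taxonomy is just an append-only map from category path to ordinal in which a parent always receives its ordinal before its children. Under that assumption, merging a second taxonomy amounts to replaying its paths in ordinal order and recording an old-to-new ordinal map:

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]  # e.g. ("Author", "Lisa"); the empty tuple is the root


class ToyTaxonomy:
    """Append-only map from category path to ordinal (toy model, not Lucene's API)."""

    def __init__(self) -> None:
        self.ordinals: Dict[Path, int] = {(): 0}  # ordinal 0 is the root
        self.paths: List[Path] = [()]
        self.parents: List[int] = [-1]  # the root has no parent

    def add_category(self, path: Path) -> int:
        """Return the ordinal for path, creating it (and any ancestors) if unseen."""
        existing = self.ordinals.get(path)
        if existing is not None:
            return existing  # already known: the ordinal never changes
        parent = self.add_category(path[:-1])  # ancestors get ordinals first
        ordinal = len(self.paths)
        self.ordinals[path] = ordinal
        self.paths.append(path)
        self.parents.append(parent)
        return ordinal

    def add_taxonomy(self, other: "ToyTaxonomy") -> List[int]:
        """Merge other into self; return a map from other's ordinals to ours."""
        # Walking other.paths in ordinal order visits each parent before its children.
        return [self.add_category(path) for path in other.paths]


a = ToyTaxonomy()
a.add_category(("Author", "Lisa"))   # Author -> 1, Author/Lisa -> 2
b = ToyTaxonomy()
b.add_category(("Year", "2021"))     # Year -> 1, Year/2021 -> 2
b.add_category(("Author", "Bob"))    # Author -> 3, Author/Bob -> 4
ordinal_map = a.add_taxonomy(b)      # b's ordinals translated into a's
```

In this toy model the ordinals already present in `a` are untouched by the merge, and documents indexed against `b` would need their stored ordinals rewritten through `ordinal_map`. On the Lucene side, I believe `DirectoryTaxonomyWriter#addTaxonomy` (which takes an `OrdinalMap`) and `TaxonomyMergeUtils` are the places to look for this kind of merge, though I have not used them for shard rebalancing myself.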
Cheers,
-Greg

On Thu, Apr 29, 2021 at 2:48 PM Alexander Lukyanchikov <alexanderlukyanchi...@gmail.com> wrote:
>
> Hi Greg, Matt,
> Thank you for the responses, it's very helpful and great to hear that
> Taxonomy is successfully used for large scale products!
>
> Our biggest concern with it right now is future complications related to
> index split and merge, which we are most likely going to use to implement
> sharding and rebalancing. While split should not be too complicated
> (pre-split taxonomy works for the divided parts, and we can build an
> optimized taxonomy without unused categories in background for each new
> index), the merge seems to be challenging and probably involves tricky
> logic to translate ordinals from respective taxonomies, also taking into
> account parent-child order guarantees for hierarchical categories.
>
> I wonder if anyone has implemented something similar, or has any thoughts
> or ideas about that?
>
> --
> Regards,
> Alex
>
>
> On Thu, Apr 29, 2021 at 6:08 AM Greg Miller <gsmil...@gmail.com> wrote:
>
> > Hi Alex-
> >
> > Amazon's product search engine is built on top of Lucene, which is a
> > fairly large-scale application (w.r.t. index size, traffic and
> > use-case complexity). We have found taxonomy-based faceting to work
> > well for us generally, and haven't needed to do much to optimize
> > beyond what's already there. As you can imagine, with Amazon's catalog
> > being quite broad, we have a large number of unique facets available
> > for customers to use, which means a single facet-field storing all
> > dimensions can have high cardinality (as is the case by default with
> > taxonomy facets). This is an area where we have experimented a little
> > bit (e.g., "sharding" facets into separate fields to lower cardinality
> > of counting at query-time), but we tend to find Lucene works well
> > "as-is" for the most part in this space. The last bit I'll mention
> > here is that, for fields that are numeric and low-cardinality in
> > nature, LUCENE-7927
> > (https://issues.apache.org/jira/browse/LUCENE-7927) added the ability
> > to count these cases a bit more efficiently than trying to apply a
> > taxonomy-based approach.
> >
> > Happy faceting!
> >
> > Cheers,
> > -Greg
> >
> > On Thu, Apr 29, 2021 at 5:09 AM Matt Davis <kryptonics...@gmail.com> wrote:
> > >
> > > Alex,
> > >
> > > We did consider trying to optimize Taxonomy indexing performance but we
> > > never really got around to it. The sidecar index is annoying to deal with
> > > and we have had occasional issues with it. Zulia has sharding implemented.
> > > The main issue here is not the taxonomy but rather just getting exact
> > > counts with returning all facet values. We chose to implement a method
> > > similar to Elasticsearch (
> > > https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_per_bucket_document_count_error
> > > ).
> > > For replication we plan to use the native index replication built into
> > > Lucene. The framework is currently there for routing queries and such, but
> > > the actual copying of the index has not been implemented yet, so I can't
> > > speak to that. Hope this helps some.
> > >
> > > Thanks,
> > > Matt
> > >
> > >
> > > On Wed, Apr 28, 2021 at 5:48 PM Alexander Lukyanchikov <alexanderlukyanchi...@gmail.com> wrote:
> > >
> > > > Hi Matt,
> > > > It's very interesting, thanks for the response! Did you have any issues
> > > > with Taxonomy indexing performance, or maybe tried to optimize it somehow?
> > > > Also, any problems maintaining a sidecar index or experience building a
> > > > distributed system around it with sharding/rebalancing?
> > > >
> > > > --
> > > > Regards,
> > > > Alex
> > > >
> > > >
> > > > On Wed, Apr 28, 2021 at 11:18 AM Matt Davis <kryptonics...@gmail.com> wrote:
> > > >
> > > > > Alex,
> > > > >
> > > > > With our Lucene-based implementation of Zulia (
> > > > > https://github.com/zuliaio/zuliasearch) we have gone back and forth. We
> > > > > started with Taxonomy and switched and then switched back to Taxonomy. In
> > > > > our experience the Taxonomy-based approach is more scalable and
> > > > > performant. We do large searches (sometimes returning millions of
> > > > > results) with about 20 facets being run, including some high-cardinality
> > > > > facets. A small-dataset version of the tool that is backed by Zulia, which
> > > > > we released for COVID, can be found here (
> > > > > https://icite.od.nih.gov/covid19/search/#search:searchId=6089a5b7218c6902d422e907
> > > > > ).
> > > > > If you click on the facet tab you can see how we use facets. I believe the
> > > > > use case might largely drive the choice.
> > > > >
> > > > > Thanks,
> > > > > Matt
> > > > >
> > > > > On Wed, Apr 28, 2021 at 1:26 PM Alexander Lukyanchikov <alexanderlukyanchi...@gmail.com> wrote:
> > > > >
> > > > > > Hello everyone,
> > > > > >
> > > > > > We are trying to choose between Taxonomy and SortedSetDocValuesFacetField
> > > > > > implementations for faceted search, and based on available information and
> > > > > > our quick tests, the difference is the following -
> > > > > >
> > > > > > - Taxonomy is faster at query time (on our test workload, the difference
> > > > > > sometimes is higher than documented 25%). Also SortedSet adds latency to an
> > > > > > NRT refresh.
> > > > > > - Taxonomy is slower at index time, and unlike SortedSet implementation, it
> > > > > > does not scale as well with more than 4 threads (a lot of contention at
> > > > > > DirectoryTaxonomyWriter#addCategory() and UTF8TaxonomyWriterCache.get()
> > > > > > synchronized blocks)
> > > > > > - SortedSet does not support hierarchical queries
> > > > > > - SortedSet does not require a sidecar index
> > > > > > - Tie-break differences for labels with the same count
> > > > > >
> > > > > > Am I missing something, or that’s everything we should take into account as
> > > > > > of today?
> > > > > >
> > > > > > I know that Solr and ES use their own faceting for historical reasons, but
> > > > > > are there any other large Lucene-based products, which have chosen one
> > > > > > implementation over another? Do we know why?
> > > > > > Any insight on less known trade-offs and production experience is greatly
> > > > > > appreciated!
> > > > > >
> > > > > > --
> > > > > > Thank you,
> > > > > > Alex
> > > > > >
> > > > >
> > > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org