Interesting, Alex. So for your "merge" case, are you suggesting you
would have a different taxonomy index for each segment and would need
to merge those? I could be completely mistaken (I'm not nearly as
familiar with the indexing side of things), but I thought Lucene
maintains one single taxonomy index regardless of how many shards
there are. It should be append-only: a new ordinal is created the first
time a category is seen, and it then stays stable through merges. Or am I
misunderstanding your use-case and you're actually doing some shard
management on top of what Lucene is doing?
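
To make the two ideas above concrete (append-only ordinals that stay
stable, and the ordinal translation a merge of two separate taxonomies
would need), here is a minimal sketch in Java. It is a toy in-memory
model under stated assumptions, not Lucene's taxonomy code: the class
ToyTaxonomy, its methods, and the '/'-separated path encoding are all
hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only taxonomy (hypothetical; NOT Lucene's DirectoryTaxonomyWriter). */
class ToyTaxonomy {
    private final Map<String, Integer> ordinals = new HashMap<>(); // path -> ordinal
    private final List<String> paths = new ArrayList<>();          // ordinal -> path
    private final List<Integer> parents = new ArrayList<>();       // ordinal -> parent ordinal

    ToyTaxonomy() {
        // Ordinal 0 is the root (empty path); it has no parent.
        ordinals.put("", 0);
        paths.add("");
        parents.add(-1);
    }

    /** Returns the existing ordinal, or appends a new one after adding all ancestors. */
    int addCategory(String path) {
        Integer existing = ordinals.get(path);
        if (existing != null) {
            return existing; // already seen: the ordinal never changes
        }
        int slash = path.lastIndexOf('/');
        int parent = addCategory(slash < 0 ? "" : path.substring(0, slash));
        int ord = paths.size(); // append-only: next free ordinal
        ordinals.put(path, ord);
        paths.add(path);
        parents.add(parent);
        return ord;
    }

    int parentOf(int ord) {
        return parents.get(ord);
    }

    int size() {
        return paths.size();
    }

    /**
     * Adds every category of {@code other} to this taxonomy.
     * Returns a map where map[i] is the ordinal here of other's ordinal i.
     */
    int[] merge(ToyTaxonomy other) {
        int[] map = new int[other.size()];
        for (int ord = 0; ord < other.size(); ord++) {
            map[ord] = addCategory(other.paths.get(ord));
        }
        return map;
    }

    public static void main(String[] args) {
        ToyTaxonomy a = new ToyTaxonomy();
        a.addCategory("Author/Bob");  // Author -> 1, Author/Bob -> 2

        ToyTaxonomy b = new ToyTaxonomy();
        b.addCategory("Year/2021");   // Year -> 1, Year/2021 -> 2
        b.addCategory("Author/Bob");  // Author -> 3, Author/Bob -> 4

        int[] map = a.merge(b);
        System.out.println(Arrays.toString(map)); // prints [0, 3, 4, 1, 2]
    }
}
```

In this toy, documents indexed against taxonomy b would need each stored
ordinal rewritten through the returned map, while ordinals already
handed out by a are untouched. Because ancestors are always added before
a new category, every parent ordinal is smaller than its child's, which
is the parent-child ordering concern raised in the quoted message below.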

Cheers,
-Greg

On Thu, Apr 29, 2021 at 2:48 PM Alexander Lukyanchikov
<alexanderlukyanchi...@gmail.com> wrote:
>
> Hi Greg, Matt,
> Thank you for the responses, it's very helpful and great to hear that
> Taxonomy is successfully used for large scale products!
>
> Our biggest concern with it right now is future complications related to
> index split and merge, which we are most likely going to use to implement
> sharding and rebalancing. While split should not be too complicated
> (pre-split taxonomy works for the divided parts, and we can build an
> optimized taxonomy without unused categories in background for each new
> index), the merge seems to be challenging and probably involves tricky
> logic to translate ordinals from respective taxonomies, also taking into
> account parent-child order guarantees for hierarchical categories.
>
> I wonder if anyone has implemented something similar, or has any thoughts
> or ideas about that?
>
> --
> Regards,
> Alex
>
>
> On Thu, Apr 29, 2021 at 6:08 AM Greg Miller <gsmil...@gmail.com> wrote:
>
> > Hi Alex-
> >
> > Amazon's product search engine is built on top of Lucene, which is a
> > fairly large-scale application (w.r.t. index size, traffic, and
> > use-case complexity). We have found taxonomy-based faceting to work
> > well for us generally, and haven't needed to do much to optimize
> > beyond what's already there. As you can imagine, with Amazon's catalog
> > being quite broad, we have a large number of unique facets available
> > for customers to use, which means a single facet-field storing all
> > dimensions can have high cardinality (as is the case by default with
> > taxonomy facets). This is an area where we have experimented a little
> > bit (e.g., "sharding" facets into separate fields to lower cardinality
> > of counting at query-time), but we tend to find Lucene works well
> > "as-is" for the most part in this space. The last bit I'll mention
> > here is that, for fields that are numeric and low-cardinality in
> > nature, LUCENE-7927
> > (https://issues.apache.org/jira/browse/LUCENE-7927) added the ability
> > to count these cases a bit more efficiently than trying to apply a
> > taxonomy-based approach.
> >
> > Happy faceting!
> >
> > Cheers,
> > -Greg
> >
> > On Thu, Apr 29, 2021 at 5:09 AM Matt Davis <kryptonics...@gmail.com>
> > wrote:
> > >
> > > Alex,
> > >
> > > We did consider trying to optimize Taxonomy indexing performance but we
> > > never really got around to it.  The sidecar index is annoying to deal
> > > with and we have had occasional issues with it.  Zulia has sharding
> > > implemented.  The main issue here is not the taxonomy but rather just
> > > getting exact counts without returning all facet values.  We chose to
> > > implement a method similar to Elasticsearch (
> > > https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_per_bucket_document_count_error
> > > ).
> > > For replication we plan to use the index replication built into
> > > Lucene.  The framework is currently there for routing queries and such
> > > but the actual copying of the index has not been implemented yet so I
> > > can't speak to that.  Hope this helps some.
> > >
> > > Thanks,
> > > Matt
> > >
> > >
> > > On Wed, Apr 28, 2021 at 5:48 PM Alexander Lukyanchikov <
> > > alexanderlukyanchi...@gmail.com> wrote:
> > >
> > > > Hi Matt,
> > > > It's very interesting, thanks for the response! Did you have any issues
> > > > with Taxonomy indexing performance, or maybe tried to optimize it
> > > > somehow?
> > > > Also, any problems maintaining a sidecar index or experience building a
> > > > distributed system around it with sharding/rebalancing?
> > > >
> > > > --
> > > > Regards,
> > > > Alex
> > > >
> > > >
> > > > On Wed, Apr 28, 2021 at 11:18 AM Matt Davis <kryptonics...@gmail.com>
> > > > wrote:
> > > >
> > > > > Alex,
> > > > >
> > > > > With our Lucene-based implementation of Zulia (
> > > > > https://github.com/zuliaio/zuliasearch) we have gone back and
> > > > > forth.  We started with Taxonomy and switched and then switched
> > > > > back to Taxonomy.  In our experience the Taxonomy-based approach is
> > > > > more scalable and performant.  We do large searches (sometimes
> > > > > returning millions of results) with about 20 facets being run,
> > > > > some of them high-cardinality.  A small-dataset version of the
> > > > > tool, backed by Zulia, that we released for COVID can be found here (
> > > > > https://icite.od.nih.gov/covid19/search/#search:searchId=6089a5b7218c6902d422e907
> > > > > ).
> > > > > If you click on the facet tab you can see how we use facets.  I
> > > > > believe the use case might largely drive the choice.
> > > > >
> > > > > Thanks,
> > > > > Matt
> > > > >
> > > > > On Wed, Apr 28, 2021 at 1:26 PM Alexander Lukyanchikov <
> > > > > alexanderlukyanchi...@gmail.com> wrote:
> > > > >
> > > > > > Hello everyone,
> > > > > >
> > > > > > > We are trying to choose between Taxonomy and
> > > > > > > SortedSetDocValuesFacetField implementations for faceted
> > > > > > > search, and based on available information and our quick
> > > > > > > tests, the difference is the following -
> > > > > >
> > > > > > > - Taxonomy is faster at query time (on our test workload, the
> > > > > > > difference sometimes is higher than the documented 25%). Also
> > > > > > > SortedSet adds latency to an NRT refresh.
> > > > > > > - Taxonomy is slower at index time, and unlike the SortedSet
> > > > > > > implementation, it does not scale as well with more than 4
> > > > > > > threads (a lot of contention at
> > > > > > > DirectoryTaxonomyWriter#addCategory() and
> > > > > > > UTF8TaxonomyWriterCache.get() synchronized blocks)
> > > > > > - SortedSet does not support hierarchical queries
> > > > > > - SortedSet does not require a sidecar index
> > > > > > - Tie-break differences for labels with the same count
> > > > > >
> > > > > > > Am I missing something, or is that everything we should take
> > > > > > > into account as of today?
> > > > > >
> > > > > > > I know that Solr and ES use their own faceting for historical
> > > > > > > reasons, but are there any other large Lucene-based products,
> > > > > > > which have chosen one implementation over another? Do we know
> > > > > > > why?
> > > > > > > Any insight on less known trade-offs and production experience
> > > > > > > is greatly appreciated!
> > > > > >
> > > > > > --
> > > > > > Thank you,
> > > > > > Alex
> > > > > >
> > > > >
> > > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
