Hi all,

CouchDB’s _changes feed has always featured a single endpoint per DB that delivers a firehose of update events. The sharding model in 2.x/3.x meant that internally each replica of a shard had its own _changes feed, and in fact we used those individual feeds to maintain secondary indexes. If you wanted higher indexing throughput, you added more shards to the database. Simple.
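For reference, the external interface in question is just the per-database _changes endpoint. A quick illustration of consuming today’s firehose with Python’s requests library (the host, credentials, and the "db" name are placeholders):

    import json
    import requests

    resp = requests.get(
        "http://localhost:5984/db/_changes",
        params={"feed": "continuous", "since": "0"},
        auth=("admin", "password"),
        stream=True,
    )
    for line in resp.iter_lines():
        if line:  # the continuous feed emits one JSON object per line
            change = json.loads(line)
            print(change["seq"], change["id"])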
The current implementation of _changes in FoundationDB uses a single, totally-ordered range of keys. While this is a straightforward model, it has some downsides. A high-throughput database introduces a hotspot in the range-partitioned FoundationDB cluster, and there’s no natural mechanism for parallel processing of the changes. The producer/consumer asymmetry here makes it very easy to define a view that can never keep up with the incoming write load.

I think we should look at sharding each _changes index into a set of individual subspaces. It would help balance writes across multiple key ranges, and it would provide a natural way to scale the view maintenance work across multiple processes. We could introduce a new external API to allow consumers to access the individual shard feeds directly. The existing interface would be maintained for backwards compatibility, using essentially the same logic that we have today for merging view responses from multiple shards.

Some additional thoughts:

- Each entry would still be indexed by a globally unique and totally-ordered sequence number, so a consumer that needed to order entries across all shards could still do so (there’s a toy merge along these lines in the P.S. below).

- We could consider a few different strategies for assigning updates to shards. A natural one would be to use some form of consistent hashing to ensure that updates to the same document (or the same partition) always land in the same shard; see the P.S. below for a rough sketch. This appears to be the default behavior for both Kafka and Pulsar when publishing to partitioned topics:

  https://kafka.apache.org/documentation/#intro_concepts_and_terms
  https://pulsar.apache.org/docs/en/concepts-messaging/#routing-modes

- We’ve recently had some discussions about the importance of being able to query a view that observes a consistent snapshot of a DB as it existed at some point in time. Parallelizing the index builds introduces a bit of extra complexity here, but it seems manageable, and it probably pushes us to be more concrete about the specific commit points where we can provide that guarantee. I’ll omit the details for now, as they get subtle quickly and would distract from the main point of this thread.

- I’m not sure how I feel about asking users to select a shard count here. I guess it’s probably inevitable. The good news is that we should be able to scale shard counts up and down dynamically without any sort of data rebalancing, provided we document that changing the shard count will cause a re-mapping of partition keys to shards.

- I took a look through the codebase and I think this could be a fairly compact patch. We really only consume the changes feed in two locations (one for the external API and one for the view engine).

I think this makes a lot of sense, but I’m looking forward to hearing other points of view.

Cheers,
Adam
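P.S. To make the shard-routing idea a little more concrete, here’s a rough sketch in Python. Everything in it is illustrative only: the key layout, the names, and the choice of a plain hash-modulo-shard-count scheme (rather than a true consistent-hashing ring) are placeholders, not a proposed implementation.

    import hashlib

    N_SHARDS = 8  # hypothetical, operator-selected shard count

    def shard_for(partition_key: str, n_shards: int = N_SHARDS) -> int:
        """Route a document (or partition) to a shard by hashing its key.

        Updates to the same key always land in the same shard as long as
        n_shards stays fixed; changing n_shards re-maps keys to shards,
        which is the caveat noted above.
        """
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % n_shards

    def changes_key(partition_key: str, seq: int) -> tuple:
        """Hypothetical key layout: one _changes subspace per shard, with
        entries within each subspace ordered by the global sequence."""
        return ("changes", shard_for(partition_key), seq)

    # An update to doc "user:42" at global sequence 1001 might land under:
    print(changes_key("user:42", 1001))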
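And the backwards-compatible side of the same sketch: because every entry still carries a globally unique, totally-ordered sequence, the merged per-database feed can be rebuilt from the per-shard feeds with an ordinary k-way merge, much as we merge view responses from multiple shards today. Again, the data shapes here are made up for illustration.

    import heapq
    from typing import Iterable, Iterator, Tuple

    # A shard feed is modeled as an iterator of (seq, doc_id) pairs that is
    # already ordered by seq within its own shard.
    ShardFeed = Iterable[Tuple[int, str]]

    def merged_changes(shard_feeds: Iterable[ShardFeed]) -> Iterator[Tuple[int, str]]:
        """Merge per-shard feeds into a single feed ordered by global seq."""
        return heapq.merge(*shard_feeds)

    # Toy example with three shard feeds:
    feeds = [
        [(1, "doc-a"), (5, "doc-d")],
        [(2, "doc-b"), (4, "doc-a")],
        [(3, "doc-c")],
    ]
    for seq, doc_id in merged_changes(feeds):
        print(seq, doc_id)  # 1 doc-a, 2 doc-b, 3 doc-c, 4 doc-a, 5 doc-d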