On 2020-05-13 10:07 a.m., Robert Samuel Newson wrote:
Hi,

Yes, I think this would be a good addition for 3.0. I think we didn't add it 
before because of concerns about accidental misuse (attempting to replicate with 
it but forgetting a range, etc)?

This was definitely a concern, but it makes me wonder: are we also potentially discussing per-shard _changes feeds? The work seems roughly analogous.

Whatever the reasons, I think exposing the per-partition _changes feed exactly 
as you've described will be useful. We should state explicitly in the 
accompanying docs that the replicator does not use this endpoint (though, of 
course, it might be enhanced to do so in a future release).

 From 4.0 onward, there's a discussion elsewhere on whether any of the 
_partition endpoints continue to exist (leaning towards keeping them just to 
avoid unnecessary upgrade pain?), so a note in that thread would be good too. 
It does seem odd to enhance an endpoint in 3.0 only to remove it entirely in 
4.0. The reasons for removing _partition are compelling, however, as the 
motivating (internal) reason for introducing _partition is gone.

B.

On 12 May 2020, at 22:59, Adam Kocoloski <kocol...@apache.org> wrote:

Hi all,

When we introduced partitioned databases in 3.0 we declined to add a 
partition-specific _changes endpoint, because we didn’t have a prebuilt index 
that could support it. It sounds like the lack of that endpoint is a bit of a 
drag. I wanted to start this thread to consider adding it.

Note: this isn’t a fully-formed proposal coming from my team with a plan to 
staff the development of it. Just a discussion :)

In the simplest case, a _changes feed could be implemented by scanning the 
by_seq index of the shard that hosts the named partition. We already get some 
efficiencies here: we don’t need to touch any of the other shards of the 
database, and we have enough information in the by_seq btree to filter out 
documents from other partitions without actually retrieving them from disk, so 
we can push the filter down quite nicely without a lot of extra processing. 
It’s just a very cheap binary prefix pattern match on the docid.
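The pushed-down filter can be sketched in a few lines. This is a minimal, illustrative model, not CouchDB internals: it assumes the partitioned-database docid convention where the partition key is the prefix before the first colon, and uses made-up by_seq entries.

```python
# Minimal sketch of the pushed-down partition filter described above.
# Assumes the partitioned-database docid format "partition:docid";
# the by_seq entries below are illustrative, not real shard data.

def partition_of(doc_id):
    """Return the partition key, i.e. the docid prefix before the first ':'."""
    return doc_id.split(":", 1)[0]

def partition_changes(by_seq_entries, partition, since=0):
    """Scan a shard's by_seq entries, emitting only docs in `partition`.

    The check is a cheap string-prefix match on the docid, so no
    document body needs to be fetched for non-matching entries.
    """
    prefix = partition + ":"
    for seq, doc_id in by_seq_entries:
        if seq > since and doc_id.startswith(prefix):
            yield seq, doc_id

# Illustrative by_seq index for one shard hosting several partitions.
by_seq = [
    (1, "sensor-a:reading-001"),
    (2, "sensor-b:reading-001"),
    (3, "sensor-a:reading-002"),
    (4, "sensor-c:reading-001"),
]

print(list(partition_changes(by_seq, "sensor-a")))
# [(1, 'sensor-a:reading-001'), (3, 'sensor-a:reading-002')]
```

The point is that the filter only ever looks at the (seq, docid) pairs already present in the by_seq btree.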

Most consumers of the _changes feed work incrementally, and we can support that 
here as well. It’s not like we need to do a full table scan on every 
incremental request.

If the shard is hosting so many partitions that this filter is becoming a 
bottleneck, resharding (also new in 3.0) is probably a good option. Partitioned 
databases are particularly amenable to increasing the shard count. Global 
indexes on the database become more expensive to query, but those ought to be a 
smaller percentage of queries in this data model.

Finally, if the overhead of filtering out non-matching partitions is just too 
high, we could support the use of user-created indexes, e.g. by having a user 
create a Mango index on _local_seq. If such an index exists, our “query 
planner” uses it for the partitioned _changes feed. If not, resort to the scan 
on the shard’s by_seq index as above.
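For concreteness, the escape valve might look like an ordinary Mango index definition. To be clear, indexing _local_seq via Mango is exactly what's being proposed here, not something that exists today, and the index name is made up:

```python
import json

# Hypothetical _index request body for the "escape valve" above.
# A Mango index on _local_seq is the proposal under discussion, not a
# current feature; this body would be POSTed to /{db}/_index.
index_def = {
    "index": {"fields": ["_local_seq"]},
    "name": "by-local-seq",  # illustrative name
    "type": "json",
}

print(json.dumps(index_def))
```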

I’d like to do some basic benchmarking, but I have a feeling the by_seq scan will 
work quite well in the majority of cases, and the user-defined index is a good 
“escape valve” if we need it. WDYT?

Adam
