Nick,

Thanks for thinking this through; it's certainly subtle, and it's far from 
clear what a "good" solution looks like :(

I have a couple of thoughts: first about the guarantees we currently offer, 
and then about whether there is an opportunity to improve our API by offering 
a single guarantee across all request types rather than bifurcating 
guarantees.

---

The first point is that, by my reasoning, CouchDB 2.x doesn't actually offer 
a point-in-time guarantee of the following sort currently. I read this as you 
saying Couch does offer this guarantee; apologies if I'm misreading:

> Document the API behavior change that it may
> present a view of the data that is never a point-in-time[4] snapshot of the
> DB.
...
> [4] For example they have a constraint that documents "a" and "z"
> cannot both be in the database at the same time. But when iterating
> it's possible that "a" was there at the start. Then by the end, "a"
> was removed and "z" added, so both "a" and "z" would appear in the
> emitted stream. Note that FoundationDB has APIs which exhibit the same
> "relaxed" constraints:
> https://apple.github.io/foundationdb/api-python.html#module-fdb.locality

I don't believe we offer this guarantee because different database shards will 
respond to the scatter-gather inherent to a single global query type request at 
different times. This means that, given the following sequence of events:

(1) The shard containing "a" may start returning at time N.
(2) "a" may be deleted at N+1, but (1) will still be streaming from time N.
(3) "z" may be written to a second shard at time N+2.
(4) That second shard may not start returning until time N+3.

By my reasoning, "a" and "z" could thus appear in the same result set in 
current CouchDB, even if they never actually appear in the primary data at the 
same time (regardless of latency of shard replicas coming into agreement), 
voiding [4].
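To make the timing concrete, here's a toy model of that sequence (plain 
Python, not CouchDB internals; `shard_snapshot` and the timeline are 
illustrative only) showing how two shards snapshotting at different times can 
merge into a result that no single instant ever contained:

```python
def shard_snapshot(shard_docs, history, t):
    """Docs visible on a shard whose snapshot was taken at time t."""
    docs = set()
    for when, op, doc in history:
        if when <= t and doc in shard_docs:
            if op == "put":
                docs.add(doc)
            else:
                docs.discard(doc)
    return docs

# Timeline: "a" exists at t=0, is deleted at t=1; "z" is written at t=2.
history = [(0, "put", "a"), (1, "del", "a"), (2, "put", "z")]

# The shard holding "a" starts streaming at t=0; the shard holding "z"
# doesn't start returning until t=3.
result = sorted(shard_snapshot({"a"}, history, t=0) |
                shard_snapshot({"z"}, history, t=3))
# result contains both "a" and "z", yet no single instant had both.
```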

By my reckoning, you have point-in-time across a query request when you are 
working with a single shard, meaning we do have point in time for two scenarios:

- Partitioned queries.
- Q=1 databases.

That said, this guarantee still concerns the point in time of a single 
shard's replica rather than all replicas, meaning that further requests may 
produce different results if the replicas are not in agreement. This can 
perhaps be mitigated by using stable=true.

I _think_ the reasoning here is correct, but I'd welcome corrections to my 
understanding!

---

Our current behaviour seems extremely subtle and, I'd argue, unexpected. It is 
hard to reason about when you really need a particular guarantee.

Is there an opportunity to clarify behaviour here, such that we really _do_ 
guarantee point-in-time within _any_ single request? We'd do this by 
leveraging FoundationDB's transaction isolation semantics, and as such could 
only offer it within the 5s transaction timeout currently in place. The 
request boundary offers a clear-cut, user-visible line. This would obviously 
need to cover reads/writes of single docs and so on, and would probably need 
further work w.r.t. bulk docs etc.

This restriction may naturally loosen as FoundationDB improves and the 5s 
timeout may be increased.
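As a rough sketch of the idea (plain Python standing in for the FoundationDB 
transaction; `stream_request` and `TXN_LIMIT` are illustrative names, not a 
proposed API): every row in a response comes from one snapshot, and the 
stream ends cleanly when the transaction budget runs out.

```python
import time

TXN_LIMIT = 5.0  # FoundationDB's current transaction time limit, in seconds

def stream_request(snapshot, start_key=None, txn_limit=TXN_LIMIT):
    """Serve one request entirely from a single snapshot, so every row
    reflects one point in time; stop cleanly at the transaction deadline."""
    deadline = time.monotonic() + txn_limit
    rows = []
    for key in sorted(k for k in snapshot if start_key is None or k > start_key):
        if time.monotonic() >= deadline:
            # Ran out of transaction budget: report where we stopped.
            return {"rows": rows, "complete": False,
                    "bookmark": rows[-1][0] if rows else start_key}
        rows.append((key, snapshot[key]))
    return {"rows": rows, "complete": True, "bookmark": None}
```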

In this approach, my preference would be to add a closing line to the result 
stream containing both a bookmark (based perhaps on the FoundationDB key 
rather than the index key itself, to avoid problems with skip/limit?) and a 
complete/not-complete boolean, to enable clients to avoid the extra HTTP 
round-trip for completed result sets that Nick mentions.
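Hypothetically, a client loop against such a response shape might look like 
this (the `fetch_page` stand-in and its keys are invented for illustration; 
the point is that the complete flag lets the loop stop without a final empty 
request):

```python
def fetch_page(bookmark=None, page_size=2):
    """Stand-in for one HTTP request: returns up to page_size rows plus
    the proposed closing metadata (bookmark + complete boolean)."""
    data = ["a", "b", "c", "d", "e"]  # illustrative keys only
    start = 0 if bookmark is None else data.index(bookmark) + 1
    page = data[start:start + page_size]
    done = start + page_size >= len(data)
    return {"rows": page,
            "bookmark": page[-1] if page else bookmark,
            "complete": done}

def fetch_all():
    """Paginate until the server says the result set is complete."""
    rows, bookmark = [], None
    while True:
        resp = fetch_page(bookmark)
        rows.extend(resp["rows"])
        if resp["complete"]:  # no extra round-trip needed to confirm the end
            return rows
        bookmark = resp["bookmark"]
```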

---

For option (F), I feel that the "it sometimes works and sometimes doesn't" 
effect of checking the update-seq to see if we can continue streaming will be a 
confusing experience. I also find something similar with option (A) where a 
single request covers potentially many points in time and so feels hard to 
reason about, although it's a bit less subtle than today.

Footnote [2] seems a major problem for the single-transaction approach, 
however, and as Nick says, it is hard to pick "good" maximums for skip -- 
perhaps users simply need to avoid these parameters in the new system, given 
its behaviour? There's a definite "against the grain" aspect to them.
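For illustration, footnote [2]'s skip-by-iteration behaviour can be modelled 
like so (illustrative Python, with `budget` standing in for the number of 
keys a transaction can touch before timing out): a large skip can burn the 
whole transaction window without emitting a single row.

```python
def skip_by_iteration(keys, skip, budget):
    """Model of skip implemented by iterating over keys, as FDB's lack of
    efficient offsets forces; budget = keys we may touch before timeout."""
    touched = 0
    emitted = []
    for key in keys:
        touched += 1
        if touched > budget:
            # Transaction expired before (or while) emitting rows.
            return emitted, "transaction_too_old"
        if touched > skip:
            emitted.append(key)
    return emitted, "ok"
```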

-- 
Mike.

On Wed, 19 Feb 2020, at 22:39, Nick Vatamaniuc wrote:
> Hello everyone,
> 
> I'd like to discuss the shape and behavior of streaming APIs for CouchDB 4.x
> 
> By "streaming APIs" I mean APIs which stream data row by row as it gets
> read from the database. These are the endpoints I was thinking of:
> 
>  _all_docs, _all_dbs, _dbs_info  and query results
> 
> I want to focus on what happens when FoundationDB transactions
> time-out after 5 seconds. Currently, all those APIs except _changes[1]
> feeds, will crash or freeze. The reason is that the
> transaction_too_old error at the end of 5 seconds is retry-able by
> default, so the request handlers run again and end up shoving the
> whole request down the socket again, headers and all, which is
> obviously broken and not what we want.
> 
> There are a few alternatives discussed in the couchdb-dev channel. I'll
> present some behaviors but feel free to add more. Some ideas might
> have been discounted on the IRC discussion already but I'll present
> them anyway in case it sparks further conversation:
> 
> A) Do what _changes[1] feeds do. Start a new transaction and continue
> streaming the data from the next key after the last emitted in the
> previous transaction. Document the API behavior change that it may
> present a view of the data that is never a point-in-time[4] snapshot of the
> DB.
> 
>  - Keeps the API shape the same as CouchDB <4.0. Client libraries
> don't have to change to continue using these CouchDB 4.0 endpoints
>  - This is the easiest to implement since it would re-use the
> implementation for _changes feed (an extra option passed to the fold
> function).
>  - Breaks API behavior if users relied on having a point-in-time[4]
> snapshot view of the data.
> 
> B) Simply end the stream. Let the users pass a `?transaction=true`
> param which indicates they are aware the stream may end early and so
> would have to paginate from the last emitted key with a skip=1. This
> will keep the request bodies the same as current CouchDB. However, if
> the users got all the data one request, they will end up wasting
> another request to see if there is more data available. If they didn't
> get any data they might have a too large of a skip value (see [2]) so
> would have to guess different values for start/end keys. Or impose max
> limit for the `skip` parameter.
> 
> C) End the stream and add a final metadata row like a "transaction":
> "timeout" at the end. That will let the user know to keep paginating
> from the last key onward. This won't work for `_all_dbs` and
> `_dbs_info`[3] Maybe let those two endpoints behave like _changes
> feeds and only use this for views and _all_docs? If we like this
> choice, let's think what happens for those as I couldn't come up with
> anything decent there.
> 
> D) Same as C but to solve the issue with skips[2], emit a bookmark
> "key" of where the iteration stopped and the current "skip" and
> "limit" params, which would keep decreasing. Then user would pass
> those in "start_key=..." in the next request along with the limit and
> skip params. So something like "continuation":{"skip":599, "limit":5,
> "key":"..."}. This has the same issue with array results for
> `_all_dbs` and `_dbs_info`[3].
> 
> E) Enforce low `limit` and `skip` parameters. Enforce maximum values
> there such that response time is likely to fit in one transaction.
> This could be tricky as different runtime environments will have
> different characteristics. Also, if the timeout happens there isn't a
> nice way to send an HTTP error since we already sent the 200
> response. The downside is that this might break how some users use the
> API, if, say, they are using large skips and limits already. Perhaps here
> we do both B and D, such that if users want transactional behavior,
> they specify that `transaction=true` param and only then we enforce
> low limit and skip maximums.
> 
> F) At least for `_all_docs` it seems providing a point-in-time
> snapshot view doesn't necessarily need to be tied to transaction
> boundaries. We could check the update sequence of the database at the
> start of the next transaction and if it hasn't changed we can continue
> emitting a consistent view. This can apply to C and D and would just
> determine when the stream ends. If there are no writes happening to
> the db, this could potentially stream all the data just like option A
> would do. Not entirely sure if this would work for views.
> 
> So what do we think? I can see different combinations of options here,
> maybe even different for each API point. For example `_all_dbs`,
> `_dbs_info` are always A, and `_all_docs` and views default to A but
> have parameters to do F, etc.
> 
> Cheers,
> -Nick
> 
> Some footnotes:
> 
> [1] The _changes feed is the only one that works currently. It behaves as
> per RFC 
> https://github.com/apache/couchdb-documentation/blob/master/rfcs/003-fdb-seq-index.md#access-patterns.
> That is, we continue streaming the data by resetting the transaction
> object and restarting from the last emitted key (db sequence in this
> case). However, because the transaction restarts, if a document is
> updated while the streaming takes place, it may appear in the _changes
> feed twice. That's a behavior difference from CouchDB < 4.0 and we'd
> have to document it, since previously we presented this point-in-time
> snapshot of the database from when we started streaming.
> 
> [2] Our streaming APIs have both skips and limits. Since FDB doesn't
> currently support efficient offsets for key selectors
> (https://apple.github.io/foundationdb/known-limitations.html#dont-use-key-selectors-for-paging)
> we implemented skip by iterating over the data. This means that a skip
> of say 100000 could keep timing out the transaction without yielding
> any data.
> 
> [3] _all_dbs and _dbs_info return a JSON array so they don't have an
> obvious place to insert a last metadata row.
> 
> [4] For example they have a constraint that documents "a" and "z"
> cannot both be in the database at the same time. But when iterating
> it's possible that "a" was there at the start. Then by the end, "a"
> was removed and "z" added, so both "a" and "z" would appear in the
> emitted stream. Note that FoundationDB has APIs which exhibit the same
> "relaxed" constraints:
> https://apple.github.io/foundationdb/api-python.html#module-fdb.locality
>
