jcoglan opened a new issue #3939:
URL: https://github.com/apache/couchdb/issues/3939
We have observed a cluster exhibiting a race condition in how `PUT /db/doc` with `Content-Type: multipart/related` transfers attachment data to backend nodes. If two such requests happen concurrently, it is possible for the second request to hit a `receive` timeout that blocks the request from responding for 5 minutes.

This issue was observed on replication requests with `new_edits=false`, and part of the code path for this request differentiates between `replicated_changes` and `interactive_edit`. However, we don't believe this problem is specific to replication requests.

We do not believe any data is corrupted by this interaction, but this issue highlights some paths that are worth checking for data races.
## Relevant code
The following represents the tree of function calls that are relevant to this issue; this trace references CouchDB v3.2.1. All-caps names indicate a process boundary; the `HTTPD`, `PARSER` and `FABRIC` processes execute on the coordinator node, while `WRITER` executes on the backend nodes holding the DB's shard files.
- (`HTTPD`)
  - [`chttpd_db:db_doc_req(#httpd{method='PUT', user_ctx=Ctx}=Req, Db, DocId)`][101]
    - [`couch_doc:doc_from_multi_part_stream`][102]
      - [`couch_httpd_multipart:decode_multipart_stream`][103]
        - `spawn_monitor` (`PARSER`)
          - [`couch_httpd:parse_multipart_request`][104]
            - [`couch_httpd_multipart:mp_parse_doc`][105]
            - [`couch_httpd_multipart:mp_parse_atts`][106]
              - [`couch_httpd_multipart:maybe_send_data`][107]
    - [`chttpd_db:send_updated_doc`][108]
      - [`chttpd_db:update_doc`][109]
        - `spawn_monitor` (`FABRIC`)
          - [`fabric:update_doc`][110]
            - [`fabric:update_docs`][111]
              - [`fabric_doc_update:go`][112]
                - [`fabric_doc_update:handle_message`][113]
                - [`fabric_rpc:update_docs`][114] (`WRITER`; via `rexi:cast`)
                  - [`fabric_rpc:make_att_readers`][115]
                    - [`fabric_rpc:make_att_reader`][116]
                  - [`couch_db:update_docs(Db, Docs0, Options, replicated_changes)`][117]
                    - [`couch_db:doc_flush_atts`][118]
                      - [`couch_att:flush`][119]
                        - [`couch_att:flush_data(Db, Fun, Att) when is_function(Fun)`][120]
                          - [`couch_att:write_streamed_attachment`][121]
                            - [`couch_att:read_next_chunk`][122]
[101]: https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1055-L1069
[102]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_doc.erl#L461-L488
[103]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L26-L55
[104]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd.erl#L1058-L1069
[105]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L58-L77
[106]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L79-L122
[107]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L129-L191
[108]: https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1315-L1331
[109]: https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1343-L1385
[110]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric.erl#L275-L291
[111]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric.erl#L296-L310
[112]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_doc_update.erl#L21-L54
[113]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_doc_update.erl#L56-L102
[114]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L267-L280
[115]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L570-L576
[116]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L578-L596
[117]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_db.erl#L1186-L1201
[118]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_db.erl#L1388-L1389
[119]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L532-L533
[120]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L540-L573
[121]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L604-L615
[122]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L617-L620
## Cluster setup
The cluster in question consists of 6 nodes spread over 2 data centres, with 3 nodes in each location. The database shard targeted by the `PUT` request is replicated on 3 nodes, not all of which are in the same data centre. Ping latency between nodes in the same data centre is under 0.2ms, while between nodes in different data centres it's around 8ms.
## Observations
By instrumenting the above code paths with more logging, we have observed the cluster executing the following sequence of events. These events happen over two requests, both of which are `PUT /db/doc` for the same doc ID and revision and have attachments encoded as `multipart/related`.
- Request 1 is received by `HTTPD` on the coordinator node, which we'll call node `A`. This hits the [`multipart/related`][201] clause; `W` is 2 and `num_mp_writers` is set to 3. The shard targeted by the write is stored on 3 other nodes, which we'll call `B`, `C` and `D`.
- This calls `couch_doc:doc_from_multi_part_stream`, which calls `couch_httpd_multipart:decode_multipart_stream`, which [starts the `PARSER` process][202]. This process uses a closure as a state machine to invoke `mp_parse_doc`, `mp_parse_atts` and `maybe_send_data`. `HTTPD` [sends `get_doc_bytes`][203] to `PARSER`, which [replies with `doc_bytes`][204]. This causes the attachments to be [converted into `{follows, Parser, Ref}` tuples][205].
- Meanwhile, `HTTPD` [starts the `FABRIC` process][206], which calls through to `fabric_doc_update:go`, which uses `rexi:cast` to forward the incoming doc to [`fabric_rpc:update_docs`][207] in the `WRITER` process. This [converts `{follows, Parser, Ref}` into a `fun/0`][208], which is ultimately used in [`flush_data`][209] via [`write_streamed_attachment`][210] and [`read_next_chunk`][211].
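As a sketch of this mechanism, the conversion can be pictured roughly as follows. This is hypothetical code, not the actual implementation (which is `fabric_rpc:make_att_reader/1`, linked above), and the message shapes are illustrative:

```erlang
%% Hypothetical sketch of how a {follows, Parser, Ref} attachment stub
%% becomes a fun/0 on the WRITER node (compare fabric_rpc:make_att_reader/1;
%% the message shapes here are illustrative, not the actual wire format).
make_att_reader({follows, Parser, Ref}) ->
    fun() ->
        %% ask the coordinator's PARSER process for the next chunk
        Parser ! {get_bytes, Ref, self()},
        receive
            {bytes, Ref, Bytes} ->
                %% ack to FABRIC on the coordinator that a chunk arrived
                rexi:reply(attachment_chunk_received),
                Bytes
        end
    end.
```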
- This `fun/0` [sends `get_bytes` to `PARSER`][212], which can handle it in several places: [1][213], [2][214], [3][215]. (There is a lot of code duplication between `mp_parse_atts` and `maybe_send_data`, including prioritising `get_bytes` over other messages.) `PARSER` [replies with `bytes`][216]. The code around this reply manages a buffer of attachment data; once a chunk has been read by all the `WRITER` procs, it is discarded. The parser expects 3 `WRITER` processes to request the data because `num_mp_writers` is set to 3.
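The chunk accounting described here can be sketched as follows; the names and record are invented, and the real logic is spread through `maybe_send_data`:

```erlang
%% Illustrative sketch of the PARSER's chunk accounting; names and the
%% record are invented, the real logic lives in couch_httpd_multipart.
%% The buffered chunk is only discarded once all `num_writers` writers
%% have fetched it.
-record(mp, {chunk, waiting = [], have_read = [], num_writers}).

reply_to_waiting(#mp{chunk = Chunk, waiting = Waiting,
                     have_read = HaveRead,
                     num_writers = NumWriters} = State) ->
    %% send the buffered chunk to every writer currently asking for it
    [From ! {bytes, Ref, Chunk} || {From, Ref} <- Waiting],
    HaveRead1 = lists:usort(HaveRead ++ [From || {From, _Ref} <- Waiting]),
    case length(HaveRead1) of
        NumWriters ->
            %% every expected writer has read this chunk; drop it
            State#mp{chunk = undefined, waiting = [], have_read = []};
        _ ->
            %% keep buffering until the slower writers send get_bytes
            State#mp{waiting = [], have_read = HaveRead1}
    end.
```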
- When `WRITER` on node `B`, `C` or `D` receives `bytes`, it [sends `attachment_chunk_received` to `FABRIC`][217] on node `A`, which [handles it][218] in `fabric_doc_update`.
- We observe one of the nodes, say node `B`, doing several rounds of requesting data from `PARSER` and acking it with `attachment_chunk_received` before the other two nodes `C` and `D` request anything. `PARSER` behaves correctly, tracking which writers are waiting for data and how far through the stream each writer has got; it does not prematurely drop chunks as we [previously observed in the "middleman" attachment handler][229]. It appears `B` is consuming the attachment much faster than `C` and `D`.
- While `C` and `D` are still fetching the attachment, we observe an [`ok` message][219] from node `B` as it has finished reading the attachment and completed the write. So we have had one ack, but we need one more to achieve quorum.
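The quorum accounting on the coordinator reduces to roughly this shape (hypothetical names; compare `fabric_doc_update:handle_message`, linked above):

```erlang
%% Illustrative sketch of quorum counting in FABRIC (invented names;
%% compare fabric_doc_update:handle_message/3). With W = 2, the
%% coordinator can respond once two workers have replied `ok`, even
%% while a third writer is still streaming the attachment.
-record(acc, {w, oks = []}).

handle_ok(Worker, #acc{w = W, oks = Oks} = Acc) ->
    Oks1 = [Worker | Oks],
    case length(Oks1) >= W of
        true  -> {stop, Acc#acc{oks = Oks1}};  %% quorum reached, reply now
        false -> {ok, Acc#acc{oks = Oks1}}     %% keep waiting for workers
    end.
```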
- At this point an identical request, Request 2, is received by another `HTTPD` process, also running on node `A`. Everything proceeds as above, spawning `PARSER` and `FABRIC` processes. The `PARSER` again sees that `num_mp_writers` is 3 and so expects 3 processes to fetch the data before it drops the buffered chunks.
- In request 2, the first message that `FABRIC` receives is an `ok` from node `B`. We observe `PARSER` receiving `get_bytes` from nodes `C` and `D`, and those nodes send `attachment_chunk_received` to `FABRIC`. However, node `B` never requests anything from `PARSER`, and no other signal is received by `PARSER` that would tell it that one of the writers already has the attachment data and doesn't need to fetch it. No writer process is started on `B`, so there's not even a `DOWN` signal that `PARSER` could detect.
- Because of this, `PARSER` ends up waiting for a `get_bytes` message that will never arrive, and eventually hits an `after 300000` timeout in its `receive` blocks in [`mp_parse_atts`][220] or [`maybe_send_data`][221].
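Reduced to its essentials, the wait that times out has this shape (a sketch, not the actual code; the real `receive` blocks are linked above):

```erlang
%% Sketch of the PARSER's wait for writers (illustrative only; the real
%% receive blocks are in mp_parse_atts and maybe_send_data). With
%% NumWriters = 3 but only two writers ever sending get_bytes, the loop
%% blocks on the third iteration until the 5-minute timeout fires.
wait_for_writers(0, _Chunk) ->
    done;
wait_for_writers(NumWriters, Chunk) ->
    receive
        {get_bytes, Ref, From} ->
            From ! {bytes, Ref, Chunk},
            wait_for_writers(NumWriters - 1, Chunk)
    after 300000 ->
        %% no get_bytes arrived within 5 minutes; give up
        exit(mp_abort)
    end.
```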
- Up in `chttpd_db`, [`send_updated_doc` returns][222] quickly; request 2 still achieves quorum from the rapid `ok` from node `B` and an eventual ack from one of the other nodes. However, we then [call `WaitFun()`][223], which [waits for `PARSER` to exit][224]. This blocks request 2 from returning for 5 minutes, even though the write has completed correctly.
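The blocking behaviour can be pictured as follows (hypothetical code; the real wait is in `couch_doc:doc_from_multi_part_stream`, linked above):

```erlang
%% Illustrative sketch of WaitFun (invented code; compare
%% couch_doc:doc_from_multi_part_stream). Even after quorum is reached
%% and the response could be sent, HTTPD still blocks here until the
%% PARSER process exits -- in the failure mode, a full 5 minutes later.
make_wait_fun(ParserPid) ->
    ParserRef = erlang:monitor(process, ParserPid),
    fun() ->
        receive
            {'DOWN', ParserRef, process, ParserPid, _Reason} -> ok
        end
    end.
```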
[201]: https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1055
[202]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L29-L37
[203]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L38
[204]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L73
[205]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_doc.erl#L481
[206]: https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1344-L1356
[207]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L267-L280
[208]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L578-L596
[209]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L540-L573
[210]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L604-L615
[211]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L617-L620
[212]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L588
[213]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L101
[214]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L130
[215]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L184
[216]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L139
[217]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L590-L592
[218]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_doc_update.erl#L70
[219]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_doc_update.erl#L72
[220]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L119
[221]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L187
[222]: https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1062
[223]: https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1063
[224]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_doc.erl#L483-L485
[225]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L578
[226]: https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L597
[227]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L568-L572
[228]: https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L544-L567
[229]: https://github.com/apache/couchdb/pull/3715
## Comments
This situation is made more likely by the layout of the cluster; node `B` is in the same data centre as node `A` and so has much lower latency than nodes `C` and `D`. It is therefore likely to complete the write much more quickly than the other nodes, leaving a window where request 1 has one ack but doesn't yet have quorum when request 2 starts.

Request 2 gets into this state because request 1 has partially succeeded; one of the nodes has completed the write but we don't have quorum yet. It is worth finding out what would happen in request 2 if request 1 had had _two_ nodes complete the write when request 2 starts; even though this is enough for quorum, `WaitFun()` causes the request not to return until all 3 nodes have read the attachment or timed out. But two acks are enough for quorum, so it might put the doc in a different state such that request 2 doesn't even try to write anything or forward the attachment data.
For replication requests, this would not result in data loss; it's more of an efficiency concern. Replication requests are allowed to add revs anywhere in the tree, so the second request should just be a no-op. For interactive edits, the first request will change what the leaves of the rev tree are and might change the winning rev, so it could cause request 2 to be rejected -- again, we should examine what effect partial success of request 1 has here.
Finally, we note that we currently have at least two methods for proxying attachment data to backend nodes: the [`follows` format][225] from `multipart/related` requests, and the [`fabric_attachment_receiver` format][226] with the "middleman" process. There are also two different ways of invoking the intermediate funs depending on whether the attachment length is [known][227] or [unknown][228]. There's a lot of duplicate code for dealing with buffering the attachment data to multiple backends, and we recently discovered and [fixed a bug in the "middleman" implementation][229], also discovered on this cluster. It may be worth designing a single implementation for this buffer system.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]