jcoglan opened a new issue #3939:
URL: https://github.com/apache/couchdb/issues/3939


   We have observed a cluster exhibiting a race condition in how `PUT /db/doc` 
with
   `Content-Type: multipart/related` transfers attachment data to backend 
nodes. If
   two such requests happen concurrently, it is possible for the second request 
to
   hit a `receive` timeout that blocks the request from responding for 5 
minutes.
   
   This issue was observed on replication requests with `new_edits=false`, and
   part of the code path for such requests differentiates between
   `replicated_changes` and `interactive_edit`. However, we don't believe this
   problem is specific to replication requests.
   
   We do not believe any data is corrupted by this interaction, but this issue
   highlights some paths that are worth checking for data races.
   
   
   ## Relevant code
   
   The following represents the tree of function calls that are relevant to this
   issue; this trace references CouchDB v3.2.1. All-caps names indicate a 
process
   boundary; the `HTTPD`, `PARSER` and `FABRIC` processes execute on the
   coordinator node, while `WRITER` executes on the backend nodes holding the 
DB’s
   shard files.
   
   - (`HTTPD`)
     - [`chttpd_db:db_doc_req(#httpd{method='PUT', user_ctx=Ctx}=Req, Db,
       DocId)`][101]
       - [`couch_doc:doc_from_multi_part_stream`][102]
         - [`couch_httpd_multipart:decode_multipart_stream`][103]
           - `spawn_monitor` (`PARSER`)
             - [`couch_httpd:parse_multipart_request`][104]
             - [`couch_httpd_multipart:mp_parse_doc`][105]
             - [`couch_httpd_multipart:mp_parse_atts`][106]
             - [`couch_httpd_multipart:maybe_send_data`][107]
       - [`chttpd_db:send_updated_doc`][108]
         - [`chttpd_db:update_doc`][109]
           - `spawn_monitor` (`FABRIC`)
             - [`fabric:update_doc`][110]
               - [`fabric:update_docs`][111]
                 - [`fabric_doc_update:go`][112]
                   - [`fabric_doc_update:handle_message`][113]
                   - [`fabric_rpc:update_docs`][114] (`WRITER`; via `rexi:cast`)
                     - [`fabric_rpc:make_att_readers`][115]
                       - [`fabric_rpc:make_att_reader`][116]
                     - [`couch_db:update_docs(Db, Docs0, Options,
                       replicated_changes)`][117]
                       - [`couch_db:doc_flush_atts`][118]
                         - [`couch_att:flush`][119]
                           - [`couch_att:flush_data(Db, Fun, Att) when
                             is_function(Fun)`][120]
                             - [`couch_att:write_streamed_attachment`][121]
                               - [`couch_att:read_next_chunk`][122]
   
   [101]: 
https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1055-L1069
   [102]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_doc.erl#L461-L488
   [103]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L26-L55
   [104]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd.erl#L1058-L1069
   [105]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L58-L77
   [106]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L79-L122
   [107]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L129-L191
   [108]: 
https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1315-L1331
   [109]: 
https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1343-L1385
   [110]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric.erl#L275-L291
   [111]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric.erl#L296-L310
   [112]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_doc_update.erl#L21-L54
   [113]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_doc_update.erl#L56-L102
   [114]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L267-L280
   [115]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L570-L576
   [116]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L578-L596
   [117]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_db.erl#L1186-L1201
   [118]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_db.erl#L1388-L1389
   [119]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L532-L533
   [120]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L540-L573
   [121]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L604-L615
   [122]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L617-L620
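
   For orientation, here is a condensed sketch of the reader fun that
   [`fabric_rpc:make_att_reader`][116] builds for a `{follows, Parser, Ref}`
   attachment tuple. It is simplified from the linked source (the real code also
   monitors the parser and handles several error cases), but it shows the
   `get_bytes`/`bytes` exchange and the `attachment_chunk_received` ack that the
   observations below refer to:

   ```erlang
   %% Condensed sketch of fabric_rpc:make_att_reader/1 for a `follows' tuple
   %% (simplified; see the linked source for the real clauses). The returned fun
   %% runs in the WRITER process on a backend node and is called once per chunk
   %% from couch_att:write_streamed_attachment (via read_next_chunk).
   make_att_reader({follows, Parser, Ref}) ->
       fun() ->
           %% Ask the PARSER process on the coordinator for the next chunk.
           Parser ! {get_bytes, Ref, self()},
           receive
               {bytes, Ref, Bytes} ->
                   %% Ack the chunk to the coordinator's FABRIC process...
                   rexi:reply(attachment_chunk_received),
                   %% ...and hand it to the attachment stream writer.
                   Bytes
           end
       end.
   ```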
   
   
   ## Cluster setup
   
   The cluster in question consists of 6 nodes spread over 2 data centres, with 
3
   nodes in each location. The database shard targeted by the `PUT` request is
   replicated on 3 nodes, not all of which are in the same data centre.
   
   Ping latency between nodes in the same data centre is under 0.2ms, while
   between nodes in different data centres it is around 8ms.
   
   
   ## Observations
   
   By instrumenting the above code paths with more logging, we have observed the
   cluster executing the following sequence of events. These events happen over 
two
   requests, both of which are `PUT /db/doc` for the same doc ID and revision
   and have attachments encoded as `multipart/related`.
   
   - Request 1 is received by `HTTPD` on the coordinator node, which we'll call
     node `A`. This hits the [`multipart/related`][201] clause; `W` is 2 and
     `num_mp_writers` is set to 3. The shard targeted by the write is stored on 
3
     other nodes which we'll call `B`, `C` and `D`.
   
   - This calls `couch_doc:doc_from_multi_part_stream`, which calls
     `couch_httpd_multipart:decode_multipart_stream`, which [starts the `PARSER`
     process][202]. This process uses a closure as a state machine to invoke
      `mp_parse_doc`, `mp_parse_atts` and `maybe_send_data`. `HTTPD` [sends
     `get_doc_bytes`][203] to `PARSER`, which [replies with `doc_bytes`][204]. 
This
     causes the attachments to be [converted into `{follows, Parser, Ref}`
     tuples][205].
   
   - Meanwhile, `HTTPD` [starts the `FABRIC` process][206], which calls through 
to
     `fabric_doc_update:go`, which uses `rexi:cast` to forward the incoming doc 
to
     [`fabric_rpc:update_docs`][207] in the `WRITER` process. This [converts
     `{follows, Parser, Ref}` into a `fun/0`][208], which is ultimately used in
     [`flush_data`][209] via [`write_streamed_attachment`][210] and
     [`read_next_chunk`][211].
   
   - This `fun/0` [sends `get_bytes` to `PARSER`][212], which can handle it in
     several places: [1][213], [2][214], [3][215]. (There is a lot of code
     duplication between `mp_parse_atts` and `maybe_send_data` that includes
     prioritising `get_bytes` over other messages.) `PARSER` [replies with
     `bytes`][216]. The code around this reply manages a buffer of attachment 
data;
     once a chunk has been read by all the `WRITER` procs, it is discarded. The
     parser expects 3 `WRITER` processes to request the data because
     `num_mp_writers` is set to 3.
   
   - When `WRITER` on nodes `B`, `C` or `D` receives `bytes` it [sends
     `attachment_chunk_received` to `FABRIC`][217] on node `A`, which [handles
     it][218] in `fabric_doc_update`.
   
   - We observe one of the nodes, say node `B`, doing several rounds of 
requesting
     data from `PARSER` and acking it with `attachment_chunk_received` before 
the
     other two nodes `C` and `D` request anything. `PARSER` behaves correctly,
     tracking which writers are waiting for data and how far through the stream
     each writer has got; it does not prematurely drop chunks as we [previously
     observed in the "middleman" attachment handler][229]. It appears `B` is
     consuming the attachment much faster than `C` and `D`.
   
    - While `C` and `D` are still fetching the attachment, we observe an [`ok`
      message][219] from node `B` once it has finished reading the attachment and
      completed the write. We now have one ack but need one more to achieve
      quorum.
   
    - At this point, an identical request, request 2, is received by another
      `HTTPD`
     process, also running on node `A`. Everything proceeds as above, spawning
     `PARSER` and `FABRIC` processes. The `PARSER` again sees `num_mp_writers` 
is 3
     and so expects 3 processes to fetch the data before it drops the buffered
     chunks.
   
   - In request 2, the first message that `FABRIC` receives is an `ok` from node
     `B`. We observe `PARSER` receiving `get_bytes` from nodes `C` and `D`, and
     those nodes send `attachment_chunk_received` to `FABRIC`. However, node `B`
     never requests anything from `PARSER`, and no other signal is received by
     `PARSER` that would tell it that one of the writers already has the 
attachment
     data and doesn't need to fetch it. No writer process is started on `B` so
     there's not even a `DOWN` signal that `PARSER` could detect.
   
    - Because of this, `PARSER` waits for a `get_bytes` message that will never
      arrive, and eventually hits an `after 300000` timeout in its `receive`
      blocks in [`mp_parse_atts`][220] or [`maybe_send_data`][221]. (A simplified
      sketch of this wait loop follows this list.)
   
    - Up in `chttpd_db`, [`send_updated_doc` returns][222] quickly; request 2
      still achieves quorum from the rapid `ok` from node `B` and an eventual ack
      from one of the other nodes. However, we then [call `WaitFun()`][223],
      which [waits for `PARSER` to exit][224]. This blocks request 2 from
      returning for 5 minutes, even though the write has completed correctly.
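
   To make the failure mode concrete, here is a deliberately simplified model of
   the `PARSER` wait loop referred to above. It is not the actual
   `mp_parse_atts`/`maybe_send_data` code (which also tracks per-writer offsets
   and buffers more than one chunk), but the shape of the wait is the same: the
   parser only moves on once `num_mp_writers` distinct writers have fetched the
   current chunk, or a monitored writer has died, and otherwise it sits in a
   `receive` with a 5-minute timeout.

   ```erlang
   %% Simplified model (not the actual CouchDB code) of how the PARSER waits
   %% for all expected writers to fetch a chunk before dropping it.
   %%   Seen       - ordset of writer pids that have fetched this chunk
   %%   NumWriters - the num_mp_writers value (3 in the observations above)
   wait_for_writers(Seen, NumWriters, Ref, Chunk) ->
       case ordsets:size(Seen) >= NumWriters of
           true ->
               %% Every expected writer has the chunk; it can be dropped.
               done;
           false ->
               receive
                   {get_bytes, Ref, From} ->
                       From ! {bytes, Ref, Chunk},
                       wait_for_writers(ordsets:add_element(From, Seen),
                                        NumWriters, Ref, Chunk);
                   {'DOWN', _MRef, process, _Pid, _Reason} ->
                       %% A monitored writer died: expect one consumer fewer.
                       wait_for_writers(Seen, NumWriters - 1, Ref, Chunk)
               after 300000 ->
                   %% Request 2 ends up here: node B already has the attachment,
                   %% never sends get_bytes, and never starts a writer process
                   %% that could produce a 'DOWN' signal.
                   exit(mp_parse_timeout)
               end
       end.
   ```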
   
   
   [201]: 
https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1055
   [202]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L29-L37
   [203]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L38
   [204]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L73
   [205]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_doc.erl#L481
   [206]: 
https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1344-L1356
   [207]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L267-L280
   [208]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L578-L596
   [209]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L540-L573
   [210]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L604-L615
   [211]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L617-L620
   [212]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L588
   [213]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L101
   [214]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L130
   [215]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L184
   [216]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L139
   [217]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L590-L592
   [218]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_doc_update.erl#L70
   [219]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_doc_update.erl#L72
   [220]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L119
   [221]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_httpd_multipart.erl#L187
   [222]: 
https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1062
   [223]: 
https://github.com/apache/couchdb/blob/3.2.1/src/chttpd/src/chttpd_db.erl#L1063
   [224]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_doc.erl#L483-L485
   [225]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L578
   [226]: 
https://github.com/apache/couchdb/blob/3.2.1/src/fabric/src/fabric_rpc.erl#L597
   [227]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L568-L572
   [228]: 
https://github.com/apache/couchdb/blob/3.2.1/src/couch/src/couch_att.erl#L544-L567
   [229]: https://github.com/apache/couchdb/pull/3715
   
   
   ## Comments
   
   This situation is made more likely by the layout of the cluster: node `B` is
   in the same data centre as node `A` and so has much lower latency than nodes
   `C` and `D`. It is therefore likely to complete the write much sooner than the
   other nodes, leaving a window in which request 1 has one ack but not yet
   quorum when request 2 starts.
   
   Request 2 gets into this state because request 1 has partially succeeded: one
   of the nodes has completed the write but quorum has not been reached yet. It
   is worth finding out what would happen in request 2 if request 1 had had
   _two_ nodes complete the write by the time request 2 starts. Even though two
   acks is enough for quorum, `WaitFun()` still causes request 1 not to return
   until all 3 nodes have read the attachment or timed out; on the other hand,
   having quorum might put the doc in a different state, such that request 2
   doesn't even try to write anything or forward the attachment data.
   
   For replication requests, this would not result in data loss; it's more of an
   efficiency concern. Replication requests are allowed to add revs anywhere in
   the tree, so the second request should just be a no-op. For interactive edits,
   the first request will change what the leaves of the rev tree are and might
   change the winning rev, and so could cause request 2 to be rejected -- again
   we should examine what effect partial success of request 1 has here.
   
   Finally, we note that we currently have at least two methods for proxying
   attachment data to backend nodes: the [`follows` format][225] from
   `multipart/related` requests, and the [`fabric_attachment_receiver`
   format][226] with the "middleman" process. There are also two different ways
   of invoking the intermediate funs depending on whether the attachment length
   is [known][227] or [unknown][228]. There's a lot of duplicated code for
   buffering attachment data for multiple backends, and we recently [fixed a bug
   in the "middleman" implementation][229] that was also discovered on this
   cluster. It may be worth designing a single implementation for this buffer
   system.
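
   As a rough illustration of the direction we have in mind, a single buffer
   process could own the "serve each chunk to N writers, then drop it" logic
   behind a small API that both code paths call into. Everything below is
   hypothetical (module name, function names, message shapes); it is a sketch of
   the interface rather than a concrete proposal, and it omits end-of-stream and
   error handling:

   ```erlang
   %% Hypothetical unified chunk buffer (all names illustrative only). One
   %% instance per multipart request; the parser pushes chunks in, each backend
   %% writer pulls them out by offset, and a chunk is dropped once every
   %% remaining expected writer has read it.
   -module(att_chunk_buffer).
   -behaviour(gen_server).

   -export([start_link/1, put_chunk/2, get_chunk/2, writer_done/1]).
   -export([init/1, handle_call/3, handle_cast/2]).

   -record(st, {
       expected,       %% writers still expected to read each chunk
       next = 0,       %% index the next put_chunk will receive
       chunks = #{},   %% Index => {Bytes, ReadsSoFar}
       waiting = #{}   %% Index => [gen_server callers parked until it exists]
   }).

   start_link(NumWriters) ->
       gen_server:start_link(?MODULE, NumWriters, []).

   %% Parser side: append the next chunk of attachment data.
   put_chunk(Buf, Bytes) -> gen_server:call(Buf, {put_chunk, Bytes}).

   %% Writer side: read the chunk at Offset, blocking until it has been written.
   get_chunk(Buf, Offset) -> gen_server:call(Buf, {get_chunk, Offset}, infinity).

   %% Writer side: this writer needs no (more) data. This is the explicit signal
   %% whose absence causes the timeout described in this issue.
   writer_done(Buf) -> gen_server:cast(Buf, writer_done).

   init(NumWriters) -> {ok, #st{expected = NumWriters}}.

   handle_call({put_chunk, Bytes}, _From, #st{next = N} = St0) ->
       %% Serve any writers that were already waiting for this offset.
       Parked = maps:get(N, St0#st.waiting, []),
       lists:foreach(fun(W) -> gen_server:reply(W, Bytes) end, Parked),
       St1 = St0#st{next = N + 1,
                    chunks = maps:put(N, {Bytes, length(Parked)}, St0#st.chunks),
                    waiting = maps:remove(N, St0#st.waiting)},
       {reply, ok, maybe_drop(N, St1)};
   handle_call({get_chunk, Offset}, From, #st{next = N} = St) when Offset >= N ->
       %% Not written yet: park the caller until put_chunk reaches this offset.
       Waiting = maps:update_with(Offset, fun(L) -> [From | L] end, [From],
                                  St#st.waiting),
       {noreply, St#st{waiting = Waiting}};
   handle_call({get_chunk, Offset}, _From, St0) ->
       {Bytes, Reads} = maps:get(Offset, St0#st.chunks),
       St1 = St0#st{chunks = maps:put(Offset, {Bytes, Reads + 1}, St0#st.chunks)},
       {reply, Bytes, maybe_drop(Offset, St1)}.

   handle_cast(writer_done, #st{expected = E} = St) ->
       {noreply, St#st{expected = E - 1}}.

   %% Drop a chunk once every remaining expected writer has read it.
   maybe_drop(Idx, #st{expected = E, chunks = Chunks} = St) ->
       case maps:get(Idx, Chunks, undefined) of
           {_Bytes, Reads} when Reads >= E ->
               St#st{chunks = maps:remove(Idx, Chunks)};
           _ ->
               St
       end.
   ```

   With something like this, both reader-fun variants would reduce to a
   `get_chunk/2` call, and a writer that already has the attachment on disk
   could call `writer_done/1` instead of silently never asking for data.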

