jcoglan opened a new issue, #4870: URL: https://github.com/apache/couchdb/issues/4870
We (@neighboorhoodie, @janl, myself) have observed some incorrect state conditions in `ibrowse` that lead to incorrect parsing of some HTTP responses. This is most likely to occur when a connection has been interrupted or a response not fully consumed, and it can lead to an `ibrowse` process retaining headers from previous responses and therefore incorrectly parsing later ones. In particular it can lead to non-chunked responses being parsed as chunk ones, which leads to an error where [`hexlist_to_integer`](https://github.com/cmullaparthi/ibrowse/blob/v4.4.2/src/ibrowse_http_client.erl#L2008-L2014) and [`to_ascii`](https://github.com/cmullaparthi/ibrowse/blob/v4.4.2/src/ibrowse_http_client.erl#L2016-L2037) are called with bad input. ## Description We observed this when the following error showed up in customer logs: [error] 2023-11-20T00:14:06.509562Z nonode@nohost <0.3929.5131> -------- gen_server <0.3929.5131> terminated with reason: no function clause matching ibrowse_http_client:to_ascii(114)(line:1794) at ibrowse_http_client:hexlist_to_integer/3(line:1790) <= ibrowse_http_client:parse_11_response/2(line:1218) <= ibrowse_http_client:handle_sock_data/2(line:320) <= gen_server:try_dispatch/4(line:616) <= gen_server:handle_msg/6(line:686) <= proc_lib:init_p_do_apply/3(line:247) `ibrowse_http_client:to_ascii` is being called with the integer `114`, which denotes the letter `r` in ASCII, which is not a valid hex digit, so the call fails. The `gen_server` error includes the last message received by the process: {tcp,#Port<0.17503260>, <<"HTTP/1.1 500 Internal Server Error\r\n Cache-Control: must-revalidate\r\n Content-Length: 65\r\n Content-Type: application/json\r\n Date: Mon, 20 Nov 2023 00:14:06 GMT\r\n Server: CouchDB/2.3.1 (Erlang OTP/20)\r\n X-Couch-Request-ID: 5cc908776b\r\n X-Couch-Stack-Hash: 3377714225\r\n X-CouchDB-Body-Time: 0\r\n \r\n {\"error\":\"badmatch\",\"reason\":\"{'EXIT',noproc}\",\"ref\":3377714225}\n">>} The error stacktrace passes through a [clause of `parse_11_response`](https://github.com/cmullaparthi/ibrowse/blob/v4.4.2/src/ibrowse_http_client.erl#L1392-L1417) that handles chunked encoding, but the above response message does not have a `Transfer-Encoding` header. This is being invoked because it's reacting to the `gen_server` _state_, which in this case included the following header list: [ {"Content-Type","multipart/mixed; boundary=\"7f6cd44f44a5d757879834cf6e8530f7\""}, {"Date","Mon, 20 Nov 2023 00:13:34 GMT"}, {"Server","CouchDB/2.3.1 (Erlang OTP/20)"}, {"Transfer-Encoding","chunked"} ] (The full state object is quite large, but includes some other values indicating chunked encoding. We can provide the full state with client info redacted if needed.) These headers do not match those of the incoming message. An HTTP client is expected to handle streamed data, and so needs to keep some state to record headers it's seen so far. However the incoming message is a complete response, not a chunk of a stream, and so it should be parsed from a clean starting state. Our client has reported that they're seeing a lot of partial/interrupted responses, and our best theory for why this is happening is that an `ibrowse_http_client` instance is partially processing a response, which is interrupted, and the process is then reused for another request without being cleaned up first. Two main scenarios seem likely causes for this: - A code path that retries requests: a client makes one request, whose response is partially received, and the client then retries the request using the same `ibrowse_http_client` process, leading to confused state. For example, [`couch_replicator_api_wrap:open_doc_revs`](https://github.com/apache/couchdb/blob/3.3.2/src/couch_replicator/src/couch_replicator_api_wrap.erl#L279-L381) performs retries, and the above headers indicating a chunked multipart response are consistent with fetching a doc with `open_revs=all`. The `gen_server` state also indicates this is the kind of request being made. - Reuse of a client from the pool without resetting its state. The [`release`](https://github.com/apache/couchdb/blob/3.3.2/src/couch_replicator/src/couch_replicator_connection.erl#L172-L184) handler in `couch_replicator_connection` demonitors the worker and sets its `mref` to `undefined`, but otherwise sends no signal to the worker to reset its state. This could lead to a client being reused when it still contains headers from a previous response. ## Steps to Reproduce This is tricky to reproduce mechanically because it requires getting responses interrupted in order to trigger the bad state. We have only observed its effects in customer logs and then diagnosed possible causes by code review. ## Expected Behaviour When reusing an `ibrowse` process either immediately or by releasing it back to the pool, its state should be reset so that it does not conflate the state from multiple responses. If this cannot be done then the connection should not be reused. ## Your Environment * CouchDB version used: observed in production on 2.3.1, but code review makes us believe this is also possible in 3.3.2 * Browser name and version: N/A * Operating system and version: Ubuntu 20.04 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
