chewbranca opened a new issue, #4076:
URL: https://github.com/apache/couchdb/issues/4076
There are a number of things going on here, but the high level summary is that
opening a database handle can return `{error, all_dbs_active}`, and
unfortunately a number of internal subsystems ignore this error and do not
account for it. We sometimes get away with it because the caller crashes within
a properly supervised hierarchy, which then gets restarted without major
incident. However, we're not always that lucky: there are scenarios where
spawned processes that are not properly supervised crash and only partially
propagate the failure, leaving the system in an unknown, impaired state with
undefined behavior. Even worse, we don't have a quick and easy way of
determining all the places this _might_ happen.
I'll give an example of a confirmed scenario where this bug induces a wedged
system state that requires a full reboot of CouchDB to fix. Then I'll point out
a handful of other areas where we're not handling `all_dbs_active` properly.
First, though, I want to talk about one of the primary sources of this issue:
the discrepancy within `couch_util:with_db` [1] where errors in opening the db
handle are propagated with `throw` rather than being returned as an
`{error, Reason}` tuple [2], while nearly all callers expect the error tuple to
be returned. There are many callers of `couch_util:with_db` in the codebase,
and very few expect or handle a thrown error. There are also places where the
caller checks the return value looking for potential errors, but that only
covers errors from the callback function passed to `couch_util:with_db`, not
errors from loading the database handle itself.
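To make the mismatch concrete, here is a minimal sketch (simplified, not the actual CouchDB code; `do_something/1` and `handle_error/1` are hypothetical stand-ins) of the two shapes in play, plus one possible wrapper that turns the throw back into a return value:
```erlang
%% Minimal sketch, simplified from the shape of couch_util:with_db/2 [1].
%% On open failure the error escapes via throw/1 rather than being returned:
with_db(DbName, Fun) ->
    case couch_db:open_int(DbName, []) of
        {ok, Db} ->
            try Fun(Db) after couch_db:close(Db) end;
        Error ->
            throw(Error)   % e.g. {error, all_dbs_active} bypasses the caller's case
    end.

%% Typical caller expectation: the {error, _} clause is effectively dead code,
%% because the throw never reaches this case expression.
caller(DbName) ->
    case with_db(DbName, fun(Db) -> do_something(Db) end) of
        {ok, Result} -> Result;
        {error, Reason} -> handle_error(Reason)   % never matched on open failure
    end.

%% One hypothetical option: catch the throw and hand the tuple back.
with_db_safe(DbName, Fun) ->
    try
        with_db(DbName, Fun)
    catch
        throw:{error, Reason} -> {error, Reason}
    end.
```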
So let's look at one of these scenarios. When a `smoosh_channel.erl`
gen_server process starts to compact a view, one of the first things it does is
fetch a handle to the index state process [3], by way of
`couch_index_server:get_index/3`. If there is not already a spawned index
process for that database view, the `smoosh_channel` process makes a blocking
`gen_server:call` into `couch_index_server` to get the index [4], with an
`infinity` timeout. In general this is not good: we should avoid making a
`gen_server:call` from within a `gen_server` to prevent the process from
getting wedged, which is exactly what happened in this scenario, although it's
still not the main bug. Moving on, the `couch_index_server` gen_server handles
the `get_index` call [5] and does a few things: it `spawn_link`s a new process
to invoke `couch_index_server:new_index/2`, and it returns `noreply` in the
context of the `gen_server:call` from the `smoosh_channel` process, in theory
to reply once the index has been opened. The problem is that the process
spawned to run `new_index` calls `couch_index:start_link/2` [6], which uses
`couch_util:with_db` in an unsafe manner in [7], expecting errors to propagate
by return value rather than bypassing the stack with `throw`.
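To paraphrase the shape of that interaction (a simplified sketch; function names and argument shapes are approximations of the code linked in [4] and [5], not exact copies):
```erlang
%% smoosh_channel side: blocks with no timeout until couch_index_server replies.
get_index(Module, IdxState) ->
    gen_server:call(couch_index_server, {get_index, Module, IdxState}, infinity).

%% couch_index_server side: spawn a linked worker and defer the reply.
handle_call({get_index, Module, IdxState}, From, State) ->
    spawn_link(fun() ->
        %% new_index/2 eventually calls couch_index:start_link/2 [6], which in
        %% turn uses couch_util:with_db [7].
        new_index({Module, IdxState}, From)
    end),
    %% No reply yet; the caller stays parked until the worker reports back.
    {noreply, State}.
```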
This means that in [7] the assignment `Resp = couch_util:with_db(DbName, fun(Db) ->`
never actually completes, because the `throw({error, all_dbs_active})` is not
caught. As a result, `Resp` is never checked, `proc_lib:init_ack({error,
all_dbs_active})` is never triggered, and the error also fails to propagate
into the `couch_index_server:new_index` logic that handles unexpected errors
and returns them back to the parent `couch_index_server` gen_server process,
since the error is thrown rather than returned in [8]. Now, the
`proc_lib:start_link` documentation [9] states the following:
> If the process crashes before it has called init_ack/1,2, Ret = {error, Reason} will be returned if the calling process traps exits.
So perhaps the intent here was that the `{error, all_dbs_active}` thrown by
way of `couch_util:with_db` would be propagated up in the return value so it
could be caught by the worker process for `new_index` in [8]. However, that
worker is spawned by `couch_index_server` in [5], and while `couch_index_server`
does indeed `trap_exit`, the spawned worker does not, and neither does the
`couch_index` process it starts. Therefore, the throw crashes the spawned
`new_index` worker. Now, in [5] the `couch_index_server` gen_server uses a
`spawn_link` to instantiate this worker, so it is linked, but a reference to
the pid is not saved, nor added to the `by_pid` table for looking it up later.
The exit is therefore delivered to the `handle_info` callback in [10], but
because the pid is not in the `by_pid` table and the exit `Reason` is not
`normal`, we hit the fallback clause that silently ignores the error in [11].
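One possible fix at this particular spot (a sketch only; `Mod`, the callback body, and the state shapes are placeholders, not the actual `couch_index` code) is to catch the throw so that `Resp` actually carries the error tuple and `proc_lib:init_ack/1` gets called with it:
```erlang
%% Sketch of a possible change around [7]; simplified, not a drop-in patch.
init({Mod, IdxState}) ->
    DbName = Mod:get(db_name, IdxState),
    Resp =
        try
            couch_util:with_db(DbName, fun(Db) ->
                %% ... existing index initialization against Db (placeholder) ...
                Mod:open(Db, IdxState)
            end)
        catch
            %% Turn the escaped throw back into a value, e.g. {error, all_dbs_active}.
            throw:Error -> Error
        end,
    case Resp of
        {ok, NewIdxState} ->
            proc_lib:init_ack({ok, self()}),
            %% ... continue into the normal gen_server loop ...
            gen_server:enter_loop(?MODULE, [], NewIdxState);
        Other ->
            %% The parent's proc_lib:start_link now actually receives the error.
            proc_lib:init_ack(Other)
    end.
```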
At this point we've attempted to open the index but failed to do so, yet we've
also failed to propagate the error properly and have ignored the process exit.
Unfortunately, we still have the `smoosh_channel` caller blocked by way of the
`noreply` in [5], and because it made a `gen_server:call` with a timeout of
`infinity`, it will be blocked permanently: the worker process expected to
return the data is long since dead and will never reply. Therefore the
`smoosh_channel` will also never recover, and will be completely unresponsive
as it sits blocked awaiting the `gen_server:call` reply. Even worse, in [5] a
waiters entry was established for opening this index, so future callers trying
to open the same index will see the waiters list and also get stuck there
indefinitely.
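For completeness, here is a defensive sketch of what `couch_index_server` could do instead (hypothetical: the real module keeps its state in ets tables such as `by_pid`, and the `#st{openers = ...}` map below is a stand-in): remember which pid is opening which index, and turn an abnormal exit into replies rather than letting it fall through to the catch-all clause in [11].
```erlang
%% Hypothetical sketch; names and state shape are approximations.
handle_call({get_index, Key}, From, #st{openers = Openers} = State) ->
    Pid = spawn_link(fun() -> new_index(Key) end),
    %% Track the opener so its exit can be correlated with the waiting caller(s).
    {noreply, State#st{openers = maps:put(Pid, {Key, [From]}, Openers)}}.

handle_info({'EXIT', Pid, Reason}, #st{openers = Openers} = State) ->
    case maps:take(Pid, Openers) of
        {{_Key, Waiters}, Openers1} when Reason =/= normal ->
            %% Opener died before delivering the index: unblock every parked call.
            [gen_server:reply(W, {error, Reason}) || W <- Waiters],
            {noreply, State#st{openers = Openers1}};
        {{_Key, _Waiters}, Openers1} ->
            {noreply, State#st{openers = Openers1}};
        error ->
            couch_log:warning("unexpected EXIT from ~p: ~p", [Pid, Reason]),
            {noreply, State}
    end.
```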
This results in all view requests for that shard/view/node triple getting
blocked on opening an index that will never open. It also manifests as the
`/_system` endpoint throwing timeout errors, because the blocked and
unresponsive `smoosh_channel` trips the response time thresholds used when
generating the `/_system` response.
You could conceivably fix this live by way of remsh, but at this point the
only real recourse is a full CouchDB reboot, especially since you do not know
what other areas of the system have wound themselves into undefined behavior.
We should definitely try to track down the potential places where we encounter
things like this. It's unfortunate that OTP doesn't have a `gen_worker`
construct allowing short lived processes to be spawned and properly managed.
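In lieu of such a construct, the usual approximation (a generic sketch, not tied to any particular CouchDB module) is a `simple_one_for_one` supervisor plus a monitor from the caller, so a short-lived task can neither be orphaned nor fail silently:
```erlang
%% Generic sketch of a poor man's "gen_worker": short-lived tasks run under a
%% simple_one_for_one supervisor and are awaited with a monitor, so a crash
%% surfaces to the caller as a value instead of disappearing.
-module(short_task_sup).
-behaviour(supervisor).
-export([start_link/0, run/1, start_task/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    Child = #{
        id => short_task,
        start => {?MODULE, start_task, []},
        %% temporary: one-shot tasks; the caller, not the supervisor, reacts to failure.
        restart => temporary,
        type => worker
    },
    {ok, {#{strategy => simple_one_for_one, intensity => 5, period => 10}, [Child]}}.

start_task(Fun) ->
    {ok, spawn_link(Fun)}.

%% Run Fun as a supervised task and wait for it to finish.
run(Fun) ->
    {ok, Pid} = supervisor:start_child(?MODULE, [Fun]),
    MRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MRef, process, Pid, normal} -> ok;
        {'DOWN', MRef, process, Pid, Reason} -> {error, Reason}
    end.
```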
On the front of finding places where we fail to account for the `{error,
all_dbs_active}` return value, I'm happy to say Dialyzer does a pretty solid
job of tracking various instances down! I know Dialyzer output has been
challenging to parse on earlier versions of Erlang, but I'm very happy to
report we get some solid output from it now.
As a quick proof of concept, I made the following change to `couch_server`:
```diff
diff --git a/src/couch/src/couch_server.erl b/src/couch/src/couch_server.erl
index 3c72e3357..72bd51375 100644
--- a/src/couch/src/couch_server.erl
+++ b/src/couch/src/couch_server.erl
@@ -90,6 +90,9 @@ get_spidermonkey_version() ->
     list_to_binary(?COUCHDB_SPIDERMONKEY_VERSION).
 sup_start_link(N) ->
     gen_server:start_link({local, couch_server(N)}, couch_server, [N], []).
+-spec open(any(), [any()]) ->
+    {ok, any()} |
+    {error, all_dbs_active}.
 open(DbName, Options) ->
     try
         validate_open_or_create(DbName, Options),
```
and with that update alone, Dialyzer found the following callers failing to
handle `{error, all_dbs_active}` (note that there are other error return types
here too; I'm just focused on `all_dbs_active` right now):
```
src/couch_db_split.erl:0: in couch_db_split:split/3[57-77]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/couch_db_split.erl:0: in couch_db_split:copy_local_docs/3[78-111]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/couch_db_split.erl:0: in couch_db_split:cleanup_target/2[112-126]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_open_non_existent/1[32-38]:
  Guard test {'error', 'all_dbs_active'} | {'ok', _} =:= __X :: {'not_found', 'no_db_file'} can never succeed
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_open_create/1[39-45]:
  Guard test {'error', 'all_dbs_active'} | {'ok', _} =:= __X :: {'not_found', 'no_db_file'} can never succeed
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_open_when_exists/1[46-52]:
  Guard test {'error', 'all_dbs_active'} | {'ok', _} =:= __X :: {'not_found', 'no_db_file'} can never succeed
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_terminate/1[53-59]:
  Guard test {'error', 'all_dbs_active'} | {'ok', _} =:= __X :: {'not_found', 'no_db_file'} can never succeed
src/dreyfus_rpc.erl:0: in dreyfus_rpc:get_or_create_db/2[104-113]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'} | {'ok', _}
src/mem3_rep.erl:0: in mem3_rep:go/1[106-132]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'}
src/mem3_rep.erl:0: in mem3_rep:with_src_db/2[711-723]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/mem3_reshard_job.erl:0: in mem3_reshard_job:wait_source_close_impl/1[499-516]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:ensure_exists/1[294-305]:
  The pattern 'file_exists' can never match the type {'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:get_or_create_db/2[548-564]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:get_or_create_db_int/2[565-582]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:get_shard_props/1[589-610]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/hastings_rpc.erl:0: in hastings_rpc:get_or_create_db/2[111-120]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'} | {'ok', _}
```
This is not an exhaustive list, but it's an excellent start, and it makes me
very optimistic about further Dialyzer enhancements.
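As an illustration of what the caller-side fixes might look like (a generic sketch, not a patch for any specific module flagged above; the right behavior, retrying versus surfacing the error, needs to be decided per call site):
```erlang
%% Hypothetical caller-side handling: the flagged functions only match
%% {ok, _} / {not_found, _}, so an explicit clause avoids a case_clause crash.
open_or_create(DbName, Options) ->
    case couch_server:open(DbName, Options) of
        {ok, Db} ->
            {ok, Db};
        {not_found, no_db_file} ->
            couch_server:create(DbName, Options);
        {error, all_dbs_active} ->
            %% All db handles are in use: surface a retriable error to the caller.
            {error, all_dbs_active}
    end.
```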
I've also searched through logs and found that the `all_dbs_active` errors
manifest in most of the different service endpoints and result in runtime
errors and crashes. I need to sanitize the logs and I'm low on time today, so
I'll circle back around on that later, but a quick summary is that I've found
`all_dbs_active` crashes in the following subsystems: map views, reduce views,
search queries, internal replication, security doc acquisition, doc updates,
view compaction (and most likely database compaction), the couch_index
supervisor hierarchy, all_docs queries, changes queries, and get revs.
[1] https://github.com/apache/couchdb/blob/main/src/couch/src/couch_util.erl#L558-L575
[2] https://github.com/apache/couchdb/blob/main/src/couch/src/couch_util.erl#L566-L567
[3] https://github.com/apache/couchdb/blob/main/src/smoosh/src/smoosh_channel.erl#L476-L478
[4] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L123-L124
[5] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L156-L159
[6] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L243
[7] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index.erl#L76-L105
[8] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L249-L252
[9] https://www.erlang.org/doc/man/proc_lib.html#start_link-5
[10] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L197-L211
[11] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L208-L209
[12] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L160-L162