chewbranca opened a new issue, #4076:
URL: https://github.com/apache/couchdb/issues/4076
There are a number of things going on here, but the high level summary is that
opening a database handle can return `{error, all_dbs_active}`, and
unfortunately a number of internal subsystems ignore this error and do not
account for it. We sometimes get away with it because the caller crashes within
a properly supervised hierarchy, which then gets restarted without major
incident. However, we're not always that lucky: there are scenarios where
spawned processes that are not properly supervised crash and only partially
propagate the failure, leaving the system in an unknown, impaired state with
undefined behavior. Even worse, we don't have a quick and easy way of
determining all the places this _might_ happen.
I'll give an example of a confirmed scenario where this bug induces a wedged
system state that requires a full reboot of CouchDB to fix. Then I'll point out
a handful of other areas where we're not handling `all_dbs_active` properly.
First, though, I want to talk about one of the primary sources of this issue:
the discrepancy within `couch_util:with_db` [1] where errors in opening the db
handle are propagated with `throw` rather than being returned as an
`{error, Reason}` tuple [2], while nearly all callers expect the error tuple to
be returned. There are many callers of `couch_util:with_db` in the codebase,
and very few expect or handle a thrown error. There are also places where the
caller checks the return value looking for potential errors, but that only
covers errors from the callback function passed to `couch_util:with_db`, not
errors from loading the database handle itself.
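To make the mismatch concrete, here is a minimal sketch (simplified, not the actual CouchDB code; `do_something/1` and `handle_error/1` are hypothetical stand-ins) of the two shapes in play, plus one possible wrapper that turns the throw back into a return value:
```erlang
%% Minimal sketch, simplified from the shape of couch_util:with_db/2 [1].
%% On open failure the error escapes via throw/1 rather than being returned:
with_db(DbName, Fun) ->
    case couch_db:open_int(DbName, []) of
        {ok, Db} ->
            try Fun(Db) after couch_db:close(Db) end;
        Error ->
            throw(Error)   % e.g. {error, all_dbs_active} bypasses the caller's case
    end.

%% Typical caller expectation: the {error, _} clause is effectively dead code,
%% because the throw never reaches this case expression.
caller(DbName) ->
    case with_db(DbName, fun(Db) -> do_something(Db) end) of
        {ok, Result} -> Result;
        {error, Reason} -> handle_error(Reason)   % never matched on open failure
    end.

%% One hypothetical option: catch the throw and hand the tuple back.
with_db_safe(DbName, Fun) ->
    try
        with_db(DbName, Fun)
    catch
        throw:{error, Reason} -> {error, Reason}
    end.
```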
So let's look at one of these scenarios. When a `smoosh_channel.erl`
gen_server process starts to compact a view, one of the first things it does is
fetch a handle to the index state process [3], by way of
`couch_index_server:get_index/3`. If there is not already a spawned index
process for that database view, the `smoosh_channel` process makes a blocking
`gen_server:call` into `couch_index_server` to get the index [4], with an
`infinity` timeout. In general this is not good: we should avoid making a
`gen_server:call` from within a `gen_server` to prevent the process from
getting wedged, which is exactly what happened in this scenario, although it's
still not the main bug. Moving on, the `couch_index_server` gen_server handles
the `get_index` call [5] and does a few things: it `spawn_link`s a new process
to invoke `couch_index_server:new_index/2`, and it returns `noreply` in the
context of the `gen_server:call` from the `smoosh_channel` process, in theory
to reply once the index has been opened. The problem is that the process
spawned to run `new_index` calls `couch_index:start_link/2` [6], which uses
`couch_util:with_db` in an unsafe manner in [7], expecting errors to propagate
by return value rather than bypassing the stack with `throw`.
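To paraphrase the shape of that interaction (a simplified sketch; function names and argument shapes are approximations of the code linked in [4] and [5], not exact copies):
```erlang
%% smoosh_channel side: blocks with no timeout until couch_index_server replies.
get_index(Module, IdxState) ->
    gen_server:call(couch_index_server, {get_index, Module, IdxState}, infinity).

%% couch_index_server side: spawn a linked worker and defer the reply.
handle_call({get_index, Module, IdxState}, From, State) ->
    spawn_link(fun() ->
        %% new_index/2 eventually calls couch_index:start_link/2 [6], which in
        %% turn uses couch_util:with_db [7].
        new_index({Module, IdxState}, From)
    end),
    %% No reply yet; the caller stays parked until the worker reports back.
    {noreply, State}.
```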
This means that in [7] the assignment `Resp = couch_util:with_db(DbName, fun(Db) ->`
never actually completes, because the `throw({error, all_dbs_active})` is not
caught. As a result, `Resp` is never checked, `proc_lib:init_ack({error,
all_dbs_active})` is never triggered, and the error also fails to propagate
into the `couch_index_server:new_index` logic that handles unexpected errors
and returns them back to the parent `couch_index_server` gen_server process,
since the error is thrown rather than returned in [8]. Now, the
`proc_lib:start_link` documentation [9] states the following:
> If the process crashes before it has called init_ack/1,2, Ret = {error, Reason} will be returned if the calling process traps exits.
So perhaps the intent here was that the `{error, all_dbs_active}` thrown by
way of `couch_util:with_db` would be propagated up in the return value so it
could be caught by the worker process for `new_index` in [8]. However, that
worker is spawned by `couch_index_server` in [5], and while `couch_index_server`
does indeed `trap_exit`, the spawned worker does not, and neither does the
`couch_index` process it starts. Therefore, the throw crashes the spawned
`new_index` worker. Now, in [5] the `couch_index_server` gen_server uses a
`spawn_link` to instantiate this worker, so it is linked, but a reference to
the pid is not saved, nor added to the `by_pid` table for looking it up later.
The exit is therefore delivered to the `handle_info` callback in [10], but
because the pid is not in the `by_pid` table and the exit `Reason` is not
`normal`, we hit the fallback clause that silently ignores the error in [11].
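One possible fix at this particular spot (a sketch only; `Mod`, the callback body, and the state shapes are placeholders, not the actual `couch_index` code) is to catch the throw so that `Resp` actually carries the error tuple and `proc_lib:init_ack/1` gets called with it:
```erlang
%% Sketch of a possible change around [7]; simplified, not a drop-in patch.
init({Mod, IdxState}) ->
    DbName = Mod:get(db_name, IdxState),
    Resp =
        try
            couch_util:with_db(DbName, fun(Db) ->
                %% ... existing index initialization against Db (placeholder) ...
                Mod:open(Db, IdxState)
            end)
        catch
            %% Turn the escaped throw back into a value, e.g. {error, all_dbs_active}.
            throw:Error -> Error
        end,
    case Resp of
        {ok, NewIdxState} ->
            proc_lib:init_ack({ok, self()}),
            %% ... continue into the normal gen_server loop ...
            gen_server:enter_loop(?MODULE, [], NewIdxState);
        Other ->
            %% The parent's proc_lib:start_link now actually receives the error.
            proc_lib:init_ack(Other)
    end.
```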
At this point we've attempted to open the index but failed to do so, yet we've
also failed to propagate the error properly and have ignored the process exit.
Unfortunately, we still have the `smoosh_channel` caller blocked by way of the
`noreply` in [5], and because it made a `gen_server:call` with a timeout of
`infinity`, it will be blocked permanently: the worker process expected to
return the data is long since dead and will never reply. Therefore the
`smoosh_channel` will also never recover, and will be completely unresponsive
as it sits blocked awaiting the `gen_server:call` reply. Even worse, in [5] a
waiters entry was established for opening this index, so future callers trying
to open the same index will see the waiters list and also get stuck there
indefinitely.
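For completeness, here is a defensive sketch of what `couch_index_server` could do instead (hypothetical: the real module keeps its state in ets tables such as `by_pid`, and the `#st{openers = ...}` map below is a stand-in): remember which pid is opening which index, and turn an abnormal exit into replies rather than letting it fall through to the catch-all clause in [11].
```erlang
%% Hypothetical sketch; names and state shape are approximations.
handle_call({get_index, Key}, From, #st{openers = Openers} = State) ->
    Pid = spawn_link(fun() -> new_index(Key) end),
    %% Track the opener so its exit can be correlated with the waiting caller(s).
    {noreply, State#st{openers = maps:put(Pid, {Key, [From]}, Openers)}}.

handle_info({'EXIT', Pid, Reason}, #st{openers = Openers} = State) ->
    case maps:take(Pid, Openers) of
        {{_Key, Waiters}, Openers1} when Reason =/= normal ->
            %% Opener died before delivering the index: unblock every parked call.
            [gen_server:reply(W, {error, Reason}) || W <- Waiters],
            {noreply, State#st{openers = Openers1}};
        {{_Key, _Waiters}, Openers1} ->
            {noreply, State#st{openers = Openers1}};
        error ->
            couch_log:warning("unexpected EXIT from ~p: ~p", [Pid, Reason]),
            {noreply, State}
    end.
```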
This results in all view requests for that shard/view/node triple getting
blocked on opening an index that will never open. It also manifests as the
`/_system` endpoint throwing timeout errors, because the blocked and
unresponsive `smoosh_channel` trips the response time thresholds used when
generating the `/_system` response.
You could conceivably fix this live by way of remsh, but at this point the
only real recourse is a full CouchDB reboot, especially since you do not know
what other areas of the system have wound themselves into undefined behavior.
We should definitely try to track down the potential places where we encounter
things like this. It's unfortunate that OTP doesn't have a `gen_worker`
construct allowing short lived processes to be spawned and properly managed.
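In lieu of such a construct, the usual approximation (a generic sketch, not tied to any particular CouchDB module) is a `simple_one_for_one` supervisor plus a monitor from the caller, so a short-lived task can neither be orphaned nor fail silently:
```erlang
%% Generic sketch of a poor man's "gen_worker": short-lived tasks run under a
%% simple_one_for_one supervisor and are awaited with a monitor, so a crash
%% surfaces to the caller as a value instead of disappearing.
-module(short_task_sup).
-behaviour(supervisor).
-export([start_link/0, run/1, start_task/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    Child = #{
        id => short_task,
        start => {?MODULE, start_task, []},
        %% temporary: one-shot tasks; the caller, not the supervisor, reacts to failure.
        restart => temporary,
        type => worker
    },
    {ok, {#{strategy => simple_one_for_one, intensity => 5, period => 10}, [Child]}}.

start_task(Fun) ->
    {ok, spawn_link(Fun)}.

%% Run Fun as a supervised task and wait for it to finish.
run(Fun) ->
    {ok, Pid} = supervisor:start_child(?MODULE, [Fun]),
    MRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MRef, process, Pid, normal} -> ok;
        {'DOWN', MRef, process, Pid, Reason} -> {error, Reason}
    end.
```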
On the front of finding places where we fail to account for the `{error,
all_dbs_active}` return value, I'm happy to say Dialyzer does a pretty solid
job of tracking various instances down! I know Dialyzer output has been
challenging to parse on earlier versions of Erlang, but I'm very happy to
report we get some solid output from it now.
As a quick proof of concept, I made the following change to `couch_server`:
```diff
diff --git a/src/couch/src/couch_server.erl b/src/couch/src/couch_server.erl
index 3c72e3357..72bd51375 100644
--- a/src/couch/src/couch_server.erl
+++ b/src/couch/src/couch_server.erl
@@ -90,6 +90,9 @@ get_spidermonkey_version() ->
     list_to_binary(?COUCHDB_SPIDERMONKEY_VERSION).
 sup_start_link(N) ->
     gen_server:start_link({local, couch_server(N)}, couch_server, [N], []).
+-spec open(any(), [any()]) ->
+    {ok, any()} |
+    {error, all_dbs_active}.
 open(DbName, Options) ->
     try
         validate_open_or_create(DbName, Options),
```
and with that update alone, Dialyzer found the following callers failing to
handle `{error, all_dbs_active}` (note that there are other error return types
here too; I'm just focused on `all_dbs_active` right now):
```
src/couch_db_split.erl:0: in couch_db_split:split/3[57-77]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/couch_db_split.erl:0: in couch_db_split:copy_local_docs/3[78-111]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/couch_db_split.erl:0: in couch_db_split:cleanup_target/2[112-126]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_open_non_existent/1[32-38]:
  Guard test {'error', 'all_dbs_active'} | {'ok', _} =:= __X :: {'not_found', 'no_db_file'} can never succeed
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_open_create/1[39-45]:
  Guard test {'error', 'all_dbs_active'} | {'ok', _} =:= __X :: {'not_found', 'no_db_file'} can never succeed
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_open_when_exists/1[46-52]:
  Guard test {'error', 'all_dbs_active'} | {'ok', _} =:= __X :: {'not_found', 'no_db_file'} can never succeed
src/cpse_test_open_close_delete.erl:0: in cpse_test_open_close_delete:cpse_terminate/1[53-59]:
  Guard test {'error', 'all_dbs_active'} | {'ok', _} =:= __X :: {'not_found', 'no_db_file'} can never succeed
src/dreyfus_rpc.erl:0: in dreyfus_rpc:get_or_create_db/2[104-113]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'} | {'ok', _}
src/mem3_rep.erl:0: in mem3_rep:go/1[106-132]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'}
src/mem3_rep.erl:0: in mem3_rep:with_src_db/2[711-723]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/mem3_reshard_job.erl:0: in mem3_reshard_job:wait_source_close_impl/1[499-516]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:ensure_exists/1[294-305]:
  The pattern 'file_exists' can never match the type {'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:get_or_create_db/2[548-564]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:get_or_create_db_int/2[565-582]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'}
src/mem3_util.erl:0: in mem3_util:get_shard_props/1[589-610]:
  The pattern {'not_found', _} can never match the type {'error', 'all_dbs_active'}
src/hastings_rpc.erl:0: in hastings_rpc:get_or_create_db/2[111-120]:
  The pattern {'not_found', 'no_db_file'} can never match the type {'error', 'all_dbs_active'} | {'ok', _}
```
This is not an exhaustive list, but it's an excellent start, and it makes me
very optimistic about further Dialyzer enhancements.
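As an illustration of what the caller-side fixes might look like (a generic sketch, not a patch for any specific module flagged above; the right behavior, retrying versus surfacing the error, needs to be decided per call site):
```erlang
%% Hypothetical caller-side handling: the flagged functions only match
%% {ok, _} / {not_found, _}, so an explicit clause avoids a case_clause crash.
open_or_create(DbName, Options) ->
    case couch_server:open(DbName, Options) of
        {ok, Db} ->
            {ok, Db};
        {not_found, no_db_file} ->
            couch_server:create(DbName, Options);
        {error, all_dbs_active} ->
            %% All db handles are in use: surface a retriable error to the caller.
            {error, all_dbs_active}
    end.
```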
I've also searched through logs and found that the `all_dbs_active` errors
manifest in most of the different service endpoints and result in runtime
errors and crashes. I need to sanitize the logs and I'm low on time today, so
I'll circle back around on that later, but a quick summary is that I've found
`all_dbs_active` crashes in the following subsystems: map views, reduce views,
search queries, internal replication, security doc acquisition, doc updates,
view compaction (and most likely database compaction), the couch_index
supervisor hierarchy, all_docs queries, changes queries, and get revs.
[1] https://github.com/apache/couchdb/blob/main/src/couch/src/couch_util.erl#L558-L575
[2] https://github.com/apache/couchdb/blob/main/src/couch/src/couch_util.erl#L566-L567
[3] https://github.com/apache/couchdb/blob/main/src/smoosh/src/smoosh_channel.erl#L476-L478
[4] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L123-L124
[5] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L156-L159
[6] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L243
[7] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index.erl#L76-L105
[8] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L249-L252
[9] https://www.erlang.org/doc/man/proc_lib.html#start_link-5
[10] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L197-L211
[11] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L208-L209
[12] https://github.com/apache/couchdb/blob/main/src/couch_index/src/couch_index_server.erl#L160-L162