## Overview
There are two separate bugs here being fixed. The first was in
`couch_server:terminate/2`. If there was an active open_async process when
couch_server died it would throw a function clause when it tried to shutdown
the database that wasn't yet open. We just filter out any `#entry{}` record to
avoid the issue. This a simple bug but it prevented couch_server from properly
generating a SASL report that ended up covering up the second more complex race
condition.
The second bug has to deal with a race between deletions and opens as well as
the state of couch_server's mailbox. There's a very specific set of operations
that have to happen in a particular order to trigger this bug:
1. An in-flight open or creation request that will ultimately succeed. The
`'$gen_call'` message must be in couch_server's message queue before Step 2
finishes.
2. A deletion request
3. A second open or creation request that is enqueued before couch_server
processes the `'$gen_call'` message from Step 1.
4. The couch_db_updater pid's `'EXIT'` signal is delayed until after the
open_result from Step 3 is delivered to couch_server
The underlying issue here is that the deletion request clears the `couch_dbs`
ets table state without removing the possible corresponding state in
couch_server's message queue. As things progress couch_server ends up mistaking
the `open_result` message from Step 1 as corresponding to the open_async
process spawned in Step 3. Currently this results in the client from Step 3
getting an invalid response from couch_server, and then more importantly,
couch_server dies while attempting to process the real `open_result` message
because of the state in the `couch_dbs` ets table being incorrect (it would
throw a function_clause error in a list comprehension because `#entry.waiters`
was undefined).
The fix was just to ensure that the opener pid in the `#entry{}` record matches
the pid of the caller. Anything that doesn't match is ignored since the
deletion already cleaned up the server state. Although we do kill the
couch_db_updater pid for extra security, Though technically its duplicating
work that the deletion request already handled (via killing the `open_async`
pid which is linked to it).
## Testing recommendations
The second of three commits adds a failing test case that is fixed by the third
commit.
`make eunit`
## Checklist
- [x] Code is written and works correctly;
- [x] Changes are covered by tests;
- [x] No documentation to change
[ Full content available at: https://github.com/apache/couchdb/pull/1596 ]
This message was relayed via gitbox.apache.org for [email protected]