## Overview

There are two separate bugs here being fixed. The first was in 
`couch_server:terminate/2`. If there was an active open_async process when 
couch_server died it would throw a function clause when it tried to shutdown 
the database that wasn't yet open. We just filter out any `#entry{}` record to 
avoid the issue. This a simple bug but it prevented couch_server from properly 
generating a SASL report that ended up covering up the second more complex race 
condition.

The second bug has to deal with a race between deletions and opens as well as 
the state of couch_server's mailbox. There's a very specific set of operations 
that have to happen in a particular order to trigger this bug:

1. An in-flight open or creation request that will ultimately succeed. The 
`'$gen_call'` message must be in couch_server's message queue before Step 2 
finishes.
2. A deletion request
3. A second open or creation request that is enqueued before couch_server 
processes the `'$gen_call'` message from Step 1.
4. The couch_db_updater pid's `'EXIT'` signal is delayed until after the 
open_result from Step 3 is delivered to couch_server

The underlying issue here is that the deletion request clears the `couch_dbs` 
ets table state without removing the possible corresponding state in 
couch_server's message queue. As things progress couch_server ends up mistaking 
the `open_result` message from Step 1 as corresponding to the open_async 
process spawned in Step 3. Currently this results in the client from Step 3 
getting an invalid response from couch_server, and then more importantly, 
couch_server dies while attempting to process the real `open_result` message 
because of the state in the `couch_dbs` ets table being incorrect (it would 
throw a function_clause error in a list comprehension because `#entry.waiters` 
was undefined).

The fix was just to ensure that the opener pid in the `#entry{}` record matches 
the pid of the caller. Anything that doesn't match is ignored since the 
deletion already cleaned up the server state. Although we do kill the 
couch_db_updater pid for extra security, Though technically its duplicating 
work that the deletion request already handled (via killing the `open_async` 
pid which is linked to it).

## Testing recommendations

The second of three commits adds a failing test case that is fixed by the third 
commit.

`make eunit`

## Checklist

- [x] Code is written and works correctly;
- [x] Changes are covered by tests;
- [x] No documentation to change


[ Full content available at: https://github.com/apache/couchdb/pull/1596 ]
This message was relayed via gitbox.apache.org for [email protected]

Reply via email to