On Oct 27, 2009, at 7:49 PM, eric casteleijn wrote:
We have the following setup:
Two near-identical public-facing Django servers communicate with one
CouchDB server. The CouchDB server is OAuth authenticated, and people
can access it directly (well, through an Apache proxy) if they have
the tokens to do so. New users are signed up through these Django
servers, which then add the user and their tokens to CouchDB
(the user through a POST to _users and the tokens through PUTs to
_config).
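
For readers unfamiliar with the _config writes described above, here is
a minimal Python sketch of the two PUTs per token. It only builds the
requests and does not send them. The function names, the base URL and
the sample token values are assumptions made for illustration; the
section names and the token-as-key layout are taken from the error
messages quoted below. OAuth signing of the request itself is left out.

```python
import json
import urllib.request
from urllib.parse import quote


def build_config_put(base_url, section, key, value):
    """Build (but do not send) a PUT /_config/<section>/<key> request."""
    url = "{}/_config/{}/{}".format(
        base_url.rstrip("/"), quote(section, safe=""), quote(key, safe="")
    )
    # The request body is the new value encoded as a JSON string.
    body = json.dumps(value).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_token_requests(base_url, token, token_secret, username):
    """Return the two config writes needed to register one token."""
    return [
        # Maps the token to the user it belongs to.
        build_config_put(base_url, "oauth_token_users", token, username),
        # Maps the same token to its secret.
        build_config_put(base_url, "oauth_token_secrets", token, token_secret),
    ]
```

Each request would be sent with urllib.request.urlopen(); every one of
them ends up as a separate set call on the couch_config process, which
is where the timeouts below originate.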
We see this failing a lot, now to the point where we think it fails
all the time (since all those systems have separate logs not all of
which we have access to, this is not trivial to piece together.)
The errors the API servers get back all look like these (the lines
starting with '(500'):
'2009-10-27 22:35:15,357 ERROR UbuntuOne.couch: failed to add
***** = 40693 to section [oauth_token_users] of local.ini:
(500, (u'timeout', u'{gen_server,call,\n [couch_config,
\n {set,"oauth_token_users","*****","40693",true}]}'))'
'2009-10-27 22:35:20,399 ERROR UbuntuOne.couch: failed to add
***** = ***** to section [oauth_token_secrets] of local.ini:
(500, (u'timeout', u'{gen_server,call,\n [couch_config,
\n {set,"oauth_token_secrets","*****",\n "*****",
\n true}]}'))'
Corresponding errors in the couchdb.log look like:
My theory was that these writes to _config fail because the
local.ini is somehow corrupted, but I can't access that file
directly (since it has users' secrets) or copy it to my machine to
test this theory, and helping someone who is allowed to see it look
for anything weird is like searching for the proverbial needle in
the haystack: we have lots of users, and users can have multiple
tokens. Add to that the fact that you cannot ever delete a line from
the .ini file (DELETEs against keys in _config just empty the value
and leave a line like 'foo = \n')!
When I spoke to Jan on the channel, he suggested that the gen_server
message inbox may be overflowing, causing the gen_server to time out.
Could that be, under high load, and how can we solve this? Can we
increase the size of this inbox, or can we possibly have multiple
processes handling the access? Whether it's high load or corruption
or something else again, right now it looks like NO new tokens can
be added, and hence no new users can use our system. In short: HALP!
Hi Eric,
I think we all know the long-term solution is to store oauth
information in a DB instead of the config file. Barring that, in the
short term some steps that can be taken to avoid these errors include:
1) Extending or disabling the couch_config gen_server timeout. The
default is 5000 milliseconds. This is a one-line patch.
2) Writing to the .ini file asynchronously. The in-memory
configuration state can sustain update rates that are orders (plural)
of magnitude larger than the update rate for the .ini file itself.
With a bit of work you could cook it so that you still didn't respond
to the PUT /_config/... request until the update was actually written
to the file, while at the same time freeing the config server to
handle more requests.
In each case the response times for PUT /_config/... may become
uncomfortably long, but at least you won't be serving 500s from couch.
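
To make option 2 concrete, here is a rough Python sketch of the idea.
It is not CouchDB code. ConfigServer, handle_put and write_line are
hypothetical names, and write_line stands in for whatever slow routine
appends to the .ini file. The in-memory state is updated at once, the
file write is queued for a background writer, and the request handler
waits on an event before responding.

```python
import queue
import threading


class ConfigServer:
    """Toy stand-in for a config server with a slow on-disk writer."""

    def __init__(self, write_line):
        self._state = {}
        self._lock = threading.Lock()
        self._pending = queue.Queue()
        self._write_line = write_line  # hypothetical slow file-append function
        threading.Thread(target=self._writer_loop, daemon=True).start()

    def set(self, section, key, value):
        """Update memory immediately; return an Event set after the file write."""
        with self._lock:
            self._state[(section, key)] = value
        done = threading.Event()
        self._pending.put((section, key, value, done))
        return done

    def get(self, section, key):
        with self._lock:
            return self._state.get((section, key))

    def _writer_loop(self):
        # Drain queued writes one at a time, in arrival order.
        while True:
            section, key, value, done = self._pending.get()
            self._write_line("[{}] {} = {}".format(section, key, value))
            done.set()


def handle_put(server, section, key, value, timeout=None):
    """Respond only once the update is on disk; timeout=None waits forever."""
    done = server.set(section, key, value)
    return 200 if done.wait(timeout) else 500
```

Passing timeout=None corresponds to disabling the timeout as in option
1, and a larger number corresponds to extending it. A real version
would also need to handle a failed file write, which this sketch
ignores.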
Best, Adam