We have the following setup:
Two near-identical public-facing Django servers communicate with one
CouchDB server. The CouchDB server is OAuth-authenticated, and people can
access it directly (well, through an Apache proxy) if they have the
tokens to do so. New users sign up through these Django servers,
which then add the user and their tokens to CouchDB (the user
through a POST to _users and the tokens through PUTs to _config).
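To make the token step concrete, here is a minimal sketch of the two _config writes made per token. It assumes the usual PUT /_config/<section>/<key> form with a JSON string as the body; the function name token_config_requests is hypothetical, and this is not the code that runs on the Django servers. It only builds the requests and sends nothing.

```python
import json
from urllib.parse import quote


def token_config_requests(token, secret, user_id):
    """Build the (method, path, body) triples for registering one OAuth token.

    Hypothetical helper: the section names come from the log lines in this
    message, everything else is an assumption made for illustration.
    """
    key = quote(token, safe="")  # the token becomes part of the URL path
    return [
        # token -> user, stored in [oauth_token_users]
        ("PUT", "/_config/oauth_token_users/" + key, json.dumps(str(user_id))),
        # token -> secret, stored in [oauth_token_secrets]
        ("PUT", "/_config/oauth_token_secrets/" + key, json.dumps(secret)),
    ]
```

Each triple corresponds to one of the writes reported as failed in the errors quoted in this message.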
We see this failing a lot, now to the point where we think it fails all
the time (since all those systems have separate logs, not all of which we
have access to, this is not trivial to piece together).
The errors the API servers get back all look like these (see the lines
starting with '(500'):
'2009-10-27 22:35:15,357 ERROR UbuntuOne.couch: failed to add ***** =
40693 to section [oauth_token_users] of local.ini:
(500, (u'timeout', u'{gen_server,call,\n [couch_config,\n
{set,"oauth_token_users","*****","40693",true}]}'))'
'2009-10-27 22:35:20,399 ERROR UbuntuOne.couch: failed to add ***** =
***** to section [oauth_token_secrets] of local.ini:
(500, (u'timeout', u'{gen_server,call,\n [couch_config,\n
{set,"oauth_token_secrets","*****",\n
"*****",\n true}]}'))'
Corresponding errors in the couchdb.log look like:
My theory was that these writes to _config fail because the local.ini is
somehow corrupted, but I can't access that file directly (since it has
users' secrets) or copy it to my machine to test this theory, and
helping someone who is allowed to see it look for anything weird is like
searching for the proverbial needle in the haystack: we have lots of
users, and users can have multiple tokens. Add to that the fact that you
cannot ever delete a line from the .ini file (DELETEs against keys in
_config just empty the value and leave a line like 'foo = \n')!
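One way someone with access could narrow the search without exposing any secrets is to summarize the file instead of reading it. The sketch below is hypothetical (summarize_ini is not an existing tool) and assumes the file uses plain "[section]" headers, "key = value" entries and ";" comments. It reports only per-section counts and the line numbers of lines it cannot parse, never the keys or values themselves.

```python
import re

SECTION = re.compile(r"^\[([^\]]+)\]\s*$")
ENTRY = re.compile(r"^([^=]+?)\s*=\s*(.*)$")


def summarize_ini(lines):
    """Count set, emptied and unparseable lines per section, echoing no values."""
    section = None
    report = {}
    for number, raw in enumerate(lines, start=1):
        stripped = raw.strip()
        if not stripped or stripped.startswith(";"):
            continue  # blank line or comment
        header = SECTION.match(stripped)
        if header:
            section = header.group(1)
            report.setdefault(section, {"set": 0, "empty": 0, "odd": []})
            continue
        stats = report.setdefault(section, {"set": 0, "empty": 0, "odd": []})
        entry = ENTRY.match(stripped)
        if entry is None:
            stats["odd"].append(number)  # neither a header nor key = value
        elif entry.group(2) == "":
            stats["empty"] += 1  # e.g. what a DELETE leaves behind
        else:
            stats["set"] += 1
    return report
```

It could be run as summarize_ini(open(path)) by whoever is allowed to read local.ini; only the returned counts and line numbers would need to be shared.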
When I spoke to Jan on the channel, he suggested that the gen_server's
message inbox may be overflowing, causing the gen_server call to time out.
Could that happen under high load, and if so, how can we solve it? Can we
increase the size of this inbox, or can we possibly have multiple
processes handling the access? Whether it's high load or corruption or
something else again, right now it looks like NO new tokens can be
added, and hence no new users can use our system. In short: HALP!
--
- eric casteleijn
https://launchpad.net/~thisfred
http://www.canonical.com