Today our production CouchDB suddenly crashed and refused to restart (the scrubbed end of the logfile is included). After I asked on #couchdb, Jan kindly helped me understand that this was due to running into the default limit on the number of Erlang processes (32K and a bit), and suggested increasing that limit.

While I have no problem doing that if it solves the problem, I am still left with some questions that I hope you all can help me answer.

I'll first sketch the set-up of our system very globally, and tell you what we were doing at the time that might have had an impact:

We have a single server node, with a lot of clients replicating to and from databases it contains, and a web interface reading from and writing to it. The server node contains somewhere between 200K and 300K databases.

I was, rather naively, collecting some stats from the server by running something like the following pseudocode:

all_dbs = GET "/_all_dbs"
for db in all_dbs:
    sleep for a little bit
    GET "/" + db

to get the number of databases and the total number of documents in all databases. After some minutes of running this, the crash occurred. This might not be related to us running the script, but that would be one hell of a coincidence, as we haven't seen this particular error before. Also, after I aborted the script, CouchDB seemed to slowly recover.
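
For anyone who wants to try this, the pseudocode above can be sketched in Python roughly as follows. This is a minimal sketch, not the actual script: the function names, the `pause` default and the base URL are assumptions made for illustration, authentication is left out, and it assumes the database info document carries a `doc_count` field. Database names such as `u/053/2a4/101325/notes` contain slashes, so they are percent-encoded as `%2F` in the request path, as in the log below.

```python
import json
import time
from urllib.parse import quote
from urllib.request import urlopen


def http_get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urlopen(url) as resp:
        return json.load(resp)


def collect_stats(base_url, get_json=http_get_json, pause=0.1):
    """Return (number of databases, total documents over all databases).

    base_url, get_json and pause are hypothetical parameters; the
    original script is only described as pseudocode.
    """
    all_dbs = get_json(base_url + "/_all_dbs")
    total_docs = 0
    for db in all_dbs:
        time.sleep(pause)  # "sleep for a little bit"
        # Encode slashes in the database name as %2F.
        info = get_json(base_url + "/" + quote(db, safe=""))
        total_docs += info["doc_count"]
    return len(all_dbs), total_docs
```

It would be called as something like `collect_stats("http://localhost:5984")`, where the host and port are placeholders. Note that every iteration makes the server open one more database, which is the access pattern in question here.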

My questions in declining order of ulcerinducingness:

1. Can anyone explain why the above would cause CouchDB to run out of processes?

2. Are there more scenarios like this that can cause such crashes?

3. Is there a way to monitor the number of processes in use? (Ideally I would like to have it in _stats, although I don't know how feasible that is.)

4. If this is not an outright bug in CouchDB, is there any way to degrade a little more gracefully in cases like this? It reminds me of the errors CouchDB throws when running out of file descriptors. I would prefer things to slow down and maybe time out on connections, rather than cause crashes or (in the case of file descriptors) return server errors.

--
- eric casteleijn
https://launchpad.net/~thisfred
http://www.canonical.com
[Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.23291.1877>] 'HEAD' /u%2F053%2F2a4%2F101325%2Fnotes {1,1}
Headers: [{'Accept',"application/json"},
          {'Accept-Encoding',"compress, gzip"},
          {'Authorization',"OAuth realm=\"\", oauth_nonce=\"76187806\", oauth_timestamp=\"1257868744\", oauth_consumer_key=\"ubuntuone\", oauth_signature_method=\"HMAC-SHA1\", oauth_version=\"1.0\", oauth_token=\"*****\", oauth_signature=\"*****\""},
          {'Connection',"Keep-Alive"},
          {'Host',"couchdb.one.ubuntu.com"},
          {'User-Agent',"couchdb-python 0.6"},
          {'Via',"1.1 couchdb.one.ubuntu.com"},
          {'X-Forwarded-For',"147.102.133.18"},
          {"X-Forwarded-Host","couchdb.one.ubuntu.com"},
          {"X-Forwarded-Server","couchdb.one.ubuntu.com"},
          {"X-Forwarded-Ssl","on"}]

[Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.23291.1877>] OAuth Params: [{"oauth_nonce","76187806"},
               {"oauth_timestamp","1257868744"},
               {"oauth_consumer_key","ubuntuone"},
               {"oauth_signature_method","HMAC-SHA1"},
               {"oauth_version","1.0"},
               {"oauth_token","*****"},
               {"oauth_signature","*****"}]

[Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.23291.1877>] request_group {Pid, Seq} {<0.1222.1877>,105002}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [emulator] Too many processes



[Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.11658.1881>] 'GET' /u%2F043%2F3d6%2F128956%2Fcontacts {1,1}
Headers: [{'Accept-Encoding',"identity"},
          {'Authorization',"Basic *****"},
          {'Content-Type',"application/json"},
          {'Host',"*****:9030"},
          {'User-Agent',"couchdb minimal http interface"}]

[Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.11658.1881>] OAuth Params: []

[Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.22836.1877>] 'GET' /u%2Fad1%2F288%2F12708%2Fnotes/_local%2F06b93d053bca1fb7402800b657ba29bd {1,
                                                                                1}
Headers: [{'Accept',"application/json"},
          {'Accept-Encoding',"gzip"},
          {'Authorization',"OAuth oauth_signature=\"*****\", oauth_token=\"*****\", oauth_version=\"1.0\", oauth_nonce=\"*****\", oauth_timestamp=\"1257868739\", oauth_signature_method=\"HMAC-SHA1\", oauth_consumer_key=\"ubuntuone\""},
          {'Connection',"Keep-Alive"},
          {'Host',"couchdb.one.ubuntu.com:443"},
          {'User-Agent',"CouchDB/0.10.0"},
          {'Via',"1.1 couchdb.one.ubuntu.com"},
          {'X-Forwarded-For',"*****"},
          {"X-Forwarded-Host","couchdb.one.ubuntu.com:443"},
          {"X-Forwarded-Server","couchdb.one.ubuntu.com"},
          {"X-Forwarded-Ssl","on"}]

[Tue, 10 Nov 2009 15:58:59 GMT] [debug] [<0.22836.1877>] OAuth Params: [{"oauth_signature","*****"},
               {"oauth_token","*****"},
               {"oauth_version","1.0"},
               {"oauth_nonce","*****"},
               {"oauth_timestamp","1257868739"},
               {"oauth_signature_method","HMAC-SHA1"},
               {"oauth_consumer_key","ubuntuone"}]

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.785.1877>] ** Generic server couch_server terminating
** Last message in was {open,<<"u/053/2a4/101325/notes">>,
                             [{user_ctx,{user_ctx,<<"101325">>,[]}}]}
** When Server state == {server,"/srv/couchdb/database",
                            {re_pattern,0,0,
                                <<69,82,67,80,124,0,0,0,16,0,0,0,1,0,0,0,0,0,
                                  0,0,0,0,0,0,48,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                                  0,0,0,0,0,0,0,0,0,93,0,72,25,77,0,0,0,0,0,0,
                                  0,0,0,0,0,0,254,255,255,7,0,0,0,0,0,0,0,0,0,
                                  0,0,0,0,0,0,0,77,0,0,0,0,16,171,255,3,0,0,0,
                                  128,254,255,255,7,0,0,0,0,0,0,0,0,0,0,0,0,0,
                                  0,0,0,69,26,84,0,72,0>>},
                            10000,5426,"Tue, 10 Nov 2009 15:57:02 GMT"}
** Reason for termination ==
** {system_limit,[{erlang,spawn_opt,
                          [proc_lib,init_p,
                           [couch_server,
                            [couch_primary_services,couch_server_sup,<0.1.0>],
                            gen,init_it,
                            [gen_server,<0.785.1877>,<0.785.1877>,couch_db,
                             {<<"u/053/2a4/101325/notes">>,
                              "/srv/couchdb/database/u/053/2a4/101325/notes.couch",
                              <0.29700.1881>,
                              [{user_ctx,{user_ctx,<<"101325">>,[]}}]},
                             []]],
                           [link]]},
                  {proc_lib,start_link,5},
                  {couch_db,start_link,3},
                  {couch_server,handle_call,3},
                  {gen_server,handle_msg,5},
                  {proc_lib,init_p_do_apply,3}]}


[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
    {<0.780.1877>,supervisor_report,
     [{supervisor,{local,couch_primary_services}},
      {errorContext,child_terminated},
      {reason,
          {system_limit,
              [{erlang,spawn_opt,
                   [proc_lib,init_p,
                    [couch_server,
                     [couch_primary_services,couch_server_sup,<0.1.0>],
                     gen,init_it,
                     [gen_server,<0.785.1877>,<0.785.1877>,couch_db,
                      {<<"u/053/2a4/101325/notes">>,
                       "/srv/couchdb/database/u/053/2a4/101325/notes.couch",
                       <0.29700.1881>,
                       [{user_ctx,{user_ctx,<<"101325">>,[]}}]},
                      []]],
                    [link]]},
               {proc_lib,start_link,5},
               {couch_db,start_link,3},
               {couch_server,handle_call,3},
               {gen_server,handle_msg,5},
               {proc_lib,init_p_do_apply,3}]}},
      {offender,
          [{pid,<0.785.1877>},
           {name,couch_server},
           {mfa,{couch_server,sup_start_link,[]}},
           {restart_type,permanent},
           {shutdown,brutal_kill},
           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,start_error},
                {reason,{already_started,<0.785.1877>}},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}

[Tue, 10 Nov 2009 15:58:59 GMT] [error] [<0.780.1877>] {error_report,<0.24.0>,
              {<0.780.1877>,supervisor_report,
               [{supervisor,{local,couch_primary_services}},
                {errorContext,shutdown},
                {reason,reached_max_restart_intensity},
                {offender,[{pid,<0.785.1877>},
                           {name,couch_server},
                           {mfa,{couch_server,sup_start_link,[]}},
                           {restart_type,permanent},
                           {shutdown,brutal_kill},
                           {child_type,supervisor}]}]}}
