Hi Toby,

Can you try raising pb_backlog to 128 in your Riak app.config on each node? Those disconnect errors are likely left over from the stampede of connections from the CS connection pool on startup. For one reason or another the resets don't come through, and the hanging disconnected socket isn't discovered until the next send attempt.
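For reference, the change would look roughly like the fragment below. This is a sketch, not a complete file: it assumes pb_backlog lives in the riak_api section of app.config, which is where I'd expect it on recent releases, so check where your existing protocol buffers settings sit before editing. The nodes will most likely need a restart to pick up the new value.

```erlang
%% app.config (fragment). Only the relevant section is shown.
%% Assumption: pb_backlog belongs under riak_api in your Riak version.
{riak_api, [
    %% Listen backlog for the protocol buffers port, raised to 128
    %% so the burst of pool connections at CS startup can queue.
    {pb_backlog, 128}
    %% Keep your existing riak_api entries (PB listener address/port, ...)
    %% in this list, separated by commas.
]},
```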
-Andrew

On Wed, Sep 18, 2013 at 10:31 PM, Toby Corkindale <[email protected]> wrote:

> On 19/09/13 11:17, Luke Bakken wrote:
>
>> The "error: disconnected" message is a good clue. If you can provide
>> log files that may point to the cause.
>
> See below for logs from riak and riak cs, for a roughly five minute window
> during which I'd fiddled with the load balancer and DNS to try and isolate
> this server from other requests, and only send mine to it.
>
> Unfortunately I really don't see much in there apart from the disconnected
> messages. I did note that there were warnings about system hitting high
> watermarks for memory at one point though. That's a bit odd, as the machine
> wasn't near max capacity at any point - it's only 8GB, but 'free' was
> reporting that most was being used just for buffers and cache at the time;
> as follows:
>
> ie.
>              total       used       free     shared    buffers     cached
> Mem:       8176412    3803404    4373008          0      34112    3052340
> -/+ buffers/cache:     716952    7459460
> Swap:      3903484          0    3903484
>
>
> ======= erlang.log (riak cs) =========
>
> ===== Thu Sep 19 11:58:46 EST 2013
> 11:58:46.287 [info] alarm_handler: {set,{system_memory_high_watermark,[]}}
> 12:00:47.864 [info] webmachine_log_handler: closing log file: "/var/log/riak-cs/access.log"
> 12:00:47.865 [info] opening log file: "/var/log/riak-cs/access.log.2013_09_19_02"
> 12:00:50.081 [error] Retrieval of user record for s3 failed. Reason: disconnected
> 12:00:51.013 [error] Retrieval of user record for s3 failed. Reason: disconnected
> 12:00:51.406 [error] Retrieval of user record for s3 failed. Reason: disconnected
> 12:01:49.320 [error] Retrieval of user record for s3 failed. Reason: disconnected
> 12:02:04.662 [error] Retrieval of user record for s3 failed. Reason: disconnected
> 12:03:37.330 [error] Retrieval of user record for s3 failed. Reason: disconnected
> 12:06:24.345 [error] Retrieval of user record for s3 failed. Reason: disconnected
> 12:06:41.670 [error] Retrieval of user record for s3 failed. Reason: disconnected
> 12:06:46.554 [info] Finished garbage collection: 0 seconds, 0 batch_count, 0 batch_skips, 0 manif_count, 0 block_count
> 12:11:46.300 [info] alarm_handler: {clear,system_memory_high_watermark}
> 12:16:46.305 [info] alarm_handler: {set,{system_memory_high_watermark,[]}}
> 12:18:46.307 [info] alarm_handler: {clear,system_memory_high_watermark}
> ========================================
>
>
> ===== console.log (riak cs) ===================
> 2013-09-19 12:00:47.865 [info] <0.94.0> opening log file: "/var/log/riak-cs/access.log.2013_09_19_02"
> 2013-09-19 12:00:50.081 [error] <0.7092.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:00:51.013 [error] <0.10406.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:00:51.406 [error] <0.11179.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:01:49.320 [error] <0.11197.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:02:04.662 [error] <0.9667.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:03:37.330 [error] <0.11343.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:06:24.345 [error] <0.11589.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:06:41.670 [error] <0.11215.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:06:46.554 [info] <0.274.0>@riak_cs_gc_d:fetching_next_fileset:162 Finished garbage collection: 0 seconds, 0 batch_count, 0 batch_skips, 0 manif_count, 0 block_count
> 2013-09-19 12:11:46.300 [info] <0.41.0> alarm_handler: {clear,system_memory_high_watermark}
> 2013-09-19 12:16:46.305 [info] <0.41.0> alarm_handler: {set,{system_memory_high_watermark,[]}}
> 2013-09-19 12:18:46.307 [info] <0.41.0> alarm_handler: {clear,system_memory_high_watermark}
> ==============================================
>
>
> ======= error.log (riak cs) ================
> 2013-09-19 12:00:50.081 [error] <0.7092.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:00:51.013 [error] <0.10406.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:00:51.406 [error] <0.11179.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:01:49.320 [error] <0.11197.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:02:04.662 [error] <0.9667.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:03:37.330 [error] <0.11343.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:06:24.345 [error] <0.11589.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> 2013-09-19 12:06:41.670 [error] <0.11215.0>@riak_cs_wm_common:maybe_create_user:223 Retrieval of user record for s3 failed. Reason: disconnected
> ====================================
>
>
> ========= erlang.log (riak) ==========
> ===== ALIVE Thu Sep 19 12:01:04 EST 2013
>
> ===== ALIVE Thu Sep 19 12:16:04 EST 2013
> ======================================
>
>
> =========== console.log (riak) =========
> 2013-09-19 12:03:05.471 [info] <0.146.0>@riak_core_gossip:log_membership_changes:372 'riak@mel-storage04.strategicdata.internal' joined cluster with status 'joining'
> 2013-09-19 12:03:21.048 [info] <0.146.0>@riak_core_gossip:log_membership_changes:378 'riak@mel-storage04.strategicdata.internal' changed from 'joining' to 'valid'
> 2013-09-19 12:03:26.235 [info] <0.182.0>@riak_core_handoff_manager:handle_info:286 An outbound handoff of partition riak_kv_vnode 274031556999544297163190906134303066185487351808 was terminated for reason: {shutdown,max_concurrency}
>
> ^^ very similar messages to that repeat a few hundred times ^^
>
> ==========================================
>
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
