Hi Dave,

Hmm, it's not really the same issue.

[error] <0.13320.0>@riak_cs_get_fsm:waiting_chunks:311 riak_cs_get_fsm: Cannot get S3 <<"independent-print-limited">> <<"independent/independent/2014-07-22/cover/cover.ppm">> block# {<<94,144,214,192,123,131,68,132,142,55,30,108,189,81,242,106>>,0}: {error,notfound}

We have this issue, and we have 32 GB of RAM on each node.

I disabled AAE. When we put a new file it is available immediately, but after a few hours something goes wrong and we get errors. I will try your suggestion, but let me know if you have another idea.

Regards,

Charles

On 22/07/2014 13:17, Dave Finster wrote:
Hi Charles

We have a slightly different issue to yours, in that the majority of our requests succeed and only the odd one fails - or is that what you are observing on your cluster?

I've been talking about the issue off-list with Luke and Kelly. Luke took a look at some of our debug logs for Riak and suspects that we had over-committed the resources of our cluster. I've modified the configuration of our cluster as per his recommendations:

{multi_backend, [
   {be_default, riak_kv_eleveldb_backend, [
       {max_open_files, 14},
       {cache_size, 4194304},
       {data_root, "/var/db/riak/leveldb"}
   ]},
   {be_blocks, riak_kv_bitcask_backend, [
       {data_root, "/var/db/riak/bitcask"}
   ]}
]},

{anti_entropy, {off, []}},

Our environment is a 4-node cluster with 4GB of RAM each, running on SmartOS (from Joyent). I applied these changes, and while things improved I still encountered the odd failure. I then also deactivated n_val_1_get_requests, and since then I haven't been able to reproduce the issues I was encountering previously.

Thanks,
Dave

On 22 Jul 2014, at 9:01 pm, Charles Bijon <[email protected]> wrote:

Hi,

We have the same issue here, but with 45 Riak/Riak CS nodes in production. Do you have any idea how to correct it?

Regards,

Charles


On 17/07/2014 23:21, Dave Finster wrote:
Hi Kelly

1.4.5 - Riak CS
1.4.8 - Riak
Anti Entropy is on (all nodes)

With n_val_1_get_requests deactivated I can still trigger the issue (though less often); however, a different error has now cropped up:

2014-07-17 21:15:38 =ERROR REPORT====
webmachine error: path="/buckets/<bucket name>/objects/bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f%2F847c340cfe2f44028d6fd5606f696796%2FAttachment-1.png" {exit,{{{{case_clause,{error,timeout}},[{riak_cs_manifest_fsm,handle_get_manifests,1,[{file,"src/riak_cs_manifest_fsm.erl"},{line,265}]},{riak_cs_manifest_fsm,waiting_command,3,[{file,"src/riak_cs_manifest_fsm.erl"},{line,201}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,494}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},{gen_fsm,sync_send_event,[<0.1383.0>,get_manifests,infinity]}},{gen_fsm,sync_send_event,[<0.1382.0>,get_manifest,infinity]}},[{gen_fsm,sync_send_event,3,[{file,"gen_fsm.erl"},{line,214}]},{riak_cs_wm_utils,ensure_doc,2,[{file,"src/riak_cs_wm_utils.erl"},{line,236}]},{riak_cs_wm_object,authorize,2,[{file,"src/riak_cs_wm_object.erl"},{line,64}]},{riak_cs_wm_common,authorize,2,[{file,"src/riak_cs_wm_common.erl"},{line,396}]},{riak_cs_wm_common,forbidden,2,[{file,"src/riak_cs_wm_common.erl"},{line,182}]},{webmachine_resource,resource_call,3,[{file,"src/webmachine_resource.erl"},{line,186}]},{webmachine_resource,do,3,[{file,"src/webmachine_resource.erl"},{line,142}]},{webmachine_decision_core,resource_call,1,[{file,"src/webmachine_decision_core.erl"},{line,48}]}]}
[{gen_fsm,sync_send_event,3,[{file,"gen_fsm.erl"},{line,214}]},{riak_cs_wm_utils,ensure_doc,2,[{file,"src/riak_cs_wm_utils.erl"},{line,236}]},{riak_cs_wm_object,authorize,2,[{file,"src/riak_cs_wm_object.erl"},{line,64}]},{riak_cs_wm_common,authorize,2,[{file,"src/riak_cs_wm_common.erl"},{line,396}]},{riak_cs_wm_common,forbidden,2,[{file,"src/riak_cs_wm_common.erl"},{line,182}]},{webmachine_resource,resource_call,3,[{file,"src/webmachine_resource.erl"},{line,186}]},{webmachine_resource,do,3,[{file,"src/webmachine_resource.erl"},{line,142}]},{webmachine_decision_core,resource_call,1,[{file,"src/webmachine_decision_core.erl"},{line,48}]}]

Thanks,
Dave

On 18 Jul 2014, at 2:19 am, Kelly McLaughlin <[email protected]> wrote:

Dave,

Can you tell me what versions of Riak and Riak CS you have installed? Do you have AAE enabled or disabled? It's tough to come up with an explanation without more information, but I would try setting n_val_1_get_requests to false and see if you continue to experience the problem. My guess is that will resolve the issue, but let me know what happens.
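For anyone following along, a minimal sketch of where that flag would go, assuming the Riak CS 1.4.x app.config layout (the surrounding settings are placeholders, not a complete config):

```erlang
%% /etc/riak-cs/app.config (Riak CS 1.4.x layout assumed)
{riak_cs, [
    %% Disable the n_val=1 fast path for block GETs so reads
    %% go through the normal quorum path
    {n_val_1_get_requests, false}
    %% ...rest of your existing riak_cs settings...
]}
```

Riak CS would need a restart for the change to take effect.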

Kelly

On July 17, 2014 at 1:00:19 AM, Dave Finster ([email protected]) wrote:

Hi Everyone

I've spent a bit of time trying to debug this one and I'm not sure where to go from here. The use case that appears to cause this breakage is a web page that links to 8 x 10MB images and attempts to fetch them simultaneously.

Occasionally, one or two of the images will just fail to load, while other times they all work fine. I've tracked it down to the crash below. It isn't always the same image. To make the problem more repeatable, I forced our load balancer into only using a single Riak-CS node, so it gets hit with all the requests. We are using HAProxy out the front and are running SmartOS 64-bit images across the board.

arekinath helped me look into it and one thought was that I was hit by the AAE bug prior to 1.4.8, but even clearing the AAE data made no difference. The n_val on the buckets is 3 and it's a 4-node cluster; each of the 4 nodes runs both Riak and Riak-CS. I also have pb_backlog turned up to 256, n_val_1_get_requests set to true and fold_objects_for_list_keys set to true. 'ring-status' shows that the whole ring is reachable.
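For reference, the settings mentioned above would look roughly like this across the two app.config files (a sketch only; exact section placement is assumed from the 1.4.x defaults):

```erlang
%% /etc/riak/app.config -- protocol buffers listen backlog
{riak_api, [
    {pb_backlog, 256}
]},

%% /etc/riak-cs/app.config -- Riak CS read/listing flags
{riak_cs, [
    {n_val_1_get_requests, true},
    {fold_objects_for_list_keys, true}
]}
```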

Any idea on how to diagnose this one further?

2014-07-17 06:38:54 =CRASH REPORT====
crasher:
initial call: mochiweb_acceptor:init/3
pid: <0.26119.1>
registered_name: []
exception exit: {{normal,{gen_fsm,sync_send_event,[<0.27617.1>,get_next_chunk,infinity]}},[{gen_fsm,sync_send_event,3,[{file,"gen_fsm.erl"},{line,214}]},{riak_cs_wm_utils,streaming_get,4,[{file,"src/riak_cs_wm_utils.erl"},{line,272}]},{webmachine_decision_core,'-make_encoder_stream/3-fun-0-',3,[{file,"src/webmachine_decision_core.erl"},{line,667}]},{webmachine_request,send_stream_body_no_chunk,2,[{file,"src/webmachine_request.erl"},{line,334}]},{webmachine_request,send_response,3,[{file,"src/webmachine_request.erl"},{line,398}]},{webmachine_request,call,2,[{file,"src/webmachine_request.erl"},{line,251}]},{webmachine_decision_core,wrcall,1,[{file,"src/webmachine_decision_core.erl"},{line,42}]},{webmachine_decision_core,finish_response,3,[{file,"src/webmachine_decision_core.erl"},{line,92}]}]}
ancestors: [object_web_mochiweb,riak_cs_sup,<0.143.0>]
messages: []
links: [<0.298.0>,#Port<0.12015>]
dictionary: [{reqstate,{wm_reqstate,#Port<0.12015>,[{'content-encoding',"identity"},{'content-type',"application/octet-stream"},{resource_module,riak_cs_wm_object}],undefined,"10.4.242.1",{wm_reqdata,'GET',http,{1,1},"10.4.242.1",undefined,[],"/buckets/<the bucket name>/objects/bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f%2F847c340cfe2f44028d6fd5606f696796%2FAttachment-1.png","/buckets/<the bucket name>/objects/bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f%2F847c340cfe2f44028d6fd5606f696796%2FAttachment-1.png?Signature=U0By3mIwaRIVBHNcYhSt6r5QgPk%3D&Expires=1405580057&AWSAccessKeyId=DGTXHHWIEDF4XUBSBYVI",[{bucket,"<the bucket name>"},{object,"bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f%2F847c340cfe2f44028d6fd5606f696796%2FAttachment-1.png"}],[],"../../../..",{200,undefined},1073741824,67108864,[{"_ga","GA1.3.643660316.1404789703"}],[{"Signature","U0By3mIwaRIVBHNcYhSt6r5QgPk="},{"Expires","1405580057"},{"AWSAccessKeyId","DGTXHHWIEDF4XUBSBYVI"}],{9,{"cookie",{'Cookie',"_ga=GA1.3.643660316.1404789703"},{"accept-language",{'Accept-Language',"en-US,en;q=0.8"},{"accept-encoding",{'Accept-Encoding',"gzip,deflate,sdch"},{"accept",{'Accept',"image/webp,*/*;q=0.8"},nil,nil},nil},{"connection",{'Connection',"keep-alive"},nil,nil}},{"referer",{'Referer',"<the referrer>"},{"host",{'Host',"<our riak-cs host name>"},nil,nil},{"user-agent",{'User-Agent',"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"},nil,{"x-rcs-rewrite-path",{"x-rcs-rewrite-path","/<the bucket name>/bf15f98c-eaa1-4ff9-83ff-24c1e7e1380f/847c340cfe2f44028d6fd5606f696796/Attachment-1.png?AWSAccessKeyId=DGTXHHWIEDF4XUBSBYVI&Expires=1405580057&Signature=U0By3mIwaRIVBHNcYhSt6r5QgPk%3D"},nil,nil}}}}},not_fetched_yet,false,{3,{"content-type",{"Content-Type","application/octet-stream"},nil,{"etag",{"ETag","\"a3a32cf5d8f502d7e8d35fd8412a6878\""},nil,
trap_exit: false
status: running
heap_size: 28657
stack_size: 24
reductions: 80773
neighbours:

Thanks,
Dave Finster
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


