We have a 4-node Riak 1.0.0 cluster running in production that we want to upgrade to 1.1.2. We set up a test environment that mimics the production one as closely as we can with EC2 hosts. Our first attempt to jump from 1.0.0 -> 1.1.2 failed. We had taken the mapred_system issue and the listkeys_backpressure issue into account. We then decided to try 1.0.0 -> 1.0.3, since that would involve the mapred_system issue only, and that upgrade worked. Next we tried to upgrade 1.0.3 -> 1.1.2 and had problems similar to the first attempt. Details below.

--
# riak-admin transfers
Attempting to restart script through sudo -u riak
'[email protected]' waiting to handoff 14 partitions
--

Sometimes this would show as many as 48 transfers, always from the node that we upgraded. It would eventually show no transfers left. The upgrade from 1.0.0 -> 1.0.3 didn't do this.

We tested a link-walking query that is similar to what we run in production. On 2 of the 3 nodes still running 1.0.3 it worked fine. On the 3rd node, this happened:

Curl run on 1.0.3 node:
--
curl -v http://localhost:8098/riak/email_address/[email protected]/_,_,1
* About to connect() to localhost port 8098 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 8098 (#0)
> GET /riak/email_address/[email protected]/_,_,1 HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k
> zlib/1.2.3.3 libidn/1.15
> Host: localhost:8098
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted it blue)
< Expires: Wed, 25 Apr 2012 20:55:30 GMT
< Date: Wed, 25 Apr 2012 20:45:30 GMT
< Content-Type: text/html
< Content-Length: 2068
<
<html><head><title>500 Internal Server Error</title></head><body><h1>Internal Server Error</h1>The server encountered an error while processing this request:<br><pre>{error, {error, {badmatch, {eoi,[], [{{reduce,0}, {trace, [error], {error, [{module,riak_kv_w_reduce}, {partition,1438665674247607560106752257205091097473808596992}, {details, [{fitting, {fitting,<0.848.0>,#Ref<0.0.0.5472>, #Fun<riak_kv_mrc_pipe.3.19126064>,1}}, {name,{reduce,0}}, {module,riak_kv_w_reduce}, {arg,{rct,#Fun<riak_kv_mapreduce.reduce_set_union.2>,none}}, {output, {fitting,<0.847.0>,#Ref<0.0.0.5472>, #Fun<riak_kv_mrc_pipe.1.120571329>, #Fun<riak_kv_mrc_pipe.2.112900629>}}, {options, [{sink,{fitting,<0.119.0>,#Ref<0.0.0.5472>,sink,undefined}}, {log,sink}, {trace, {set,1,16,16,8,80,48, {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}, 
{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}]}, {q_limit,64}]}, {reason, {{badmatch,{'EXIT',noproc}}, [{riak_core_vnode_proxy,call,2}, {riak_pipe_vnode,queue_work_send,4}, {riak_pipe_vnode,queue_work_erracc,6}, {riak_kv_w_reduce,'-done/1-lc$^0/1-0-',3}, {riak_kv_w_reduce,done,1}, {riak_pipe_vnode_worker,wait_for_input,2}, {gen_fsm,handle_msg,7}, {proc_lib,init_p_do_apply,3}]}}, {state,{working,done}}]}}}]}}, [{riak_kv_mrc_pipe,collect_outputs,3}, {riak_kv_wm_link_walker,execute_segment,3}, {riak_kv_wm_link_walker,execute_query,3}, {riak_kv_wm_link_walker,to_multipart_mixed,2}, {webmachine_resource,resource_call,3}, {webmachine_resource,do,3}, {webmachine_decision_core,resource_call,1}, {webmachine_decision_core,decision,1}]}}</pre><P><HR><ADDRESS>mochiweb+webmachine web server</AD* Connection #0 to host localhost left intact
* Closing connection #0
--

After that, this appeared in the error.log on that node:

error.log on 1.0.3 node:
--
2012-04-25 20:45:30.373 [error] <0.119.0> webmachine error: path="/riak/email_address/[email protected]/_,_,1" 
{error,{error,{badmatch,{eoi,[],[{{reduce,0},{trace,[error],{error,[{module,riak_kv_w_reduce},{partition,1438665674247607560106752257205091097473808596992},{details,[{fitting,{fitting,<0.848.0>,#Ref<0.0.0.5472>,#Fun<riak_kv_mrc_pipe.3.19126064>,1}},{name,{reduce,0}},{module,riak_kv_w_reduce},{arg,{rct,#Fun<riak_kv_mapreduce.reduce_set_union.2>,none}},{output,{fitting,<0.847.0>,#Ref<0.0.0.5472>,#Fun<riak_kv_mrc_pipe.1.120571329>,#Fun<riak_kv_mrc_pipe.2.112900629>}},{options,[{sink,{fitting,<0.119.0>,#Ref<0.0.0.5472>,sink,undefined}},{log,sink},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}]},{q_limit,64}]},{reason,{{badmatch,{'EXIT',noproc}},[{riak_core_vnode_proxy,call,2},{riak_pipe_vnode,queue_work_send,4},{riak_pipe_vnode,queue_work_erracc,6},{riak_kv_w_reduce,'-done/1-lc$^0/1-0-',3},{riak_kv_w_reduce,done,1},{riak_pipe_vnode_worker,wait_for_input,2},{gen_fsm,handle_msg,7},{proc_lib,init_p_do_apply,3}]}},{state,{working,done}}]}}}]}},[{riak_kv_mrc_pipe,collect_outputs,3},{riak_kv_wm_link_walker,execute_segment,3},{riak_kv_wm_link_walker,execute_query,3},{riak_kv_wm_link_walker,to_multipart_mixed,2},{webmachine_resource,resource_call,3},{webmachine_resource,do,3},{webmachine_decision_core,resource_call,1},{webmachine_decision_core,decision,1}]}}
--

And this was in the error.log on the 1.1.2 node:

error.log on 1.1.2 node:
--
2012-04-25 20:45:30.234 [error] <0.1710.0> gen_fsm <0.1710.0> in state wait_for_input terminated with reason: no match of right hand value {'EXIT',noproc} in riak_core_vnode_proxy:call/2
2012-04-25 20:45:30.243 [error] <0.1710.0> CRASH REPORT Process <0.1710.0> with 0 neighbours crashed with reason: no match of right hand value {'EXIT',noproc} in riak_core_vnode_proxy:call/2
2012-04-25 20:45:30.245 [error] <0.331.0> Supervisor riak_pipe_vnode_worker_sup had child undefined started with {riak_pipe_vnode_worker,start_link,undefined} at <0.1710.0> exit with 
reason no match of right hand value {'EXIT',noproc} in riak_core_vnode_proxy:call/2 in context child_terminated
--

On the 1.1.2 node, running the same curl returned a 404 after a long timeout:

Curl run on the 1.1.2 node:
--
curl -v http://localhost:8098/riak/email_address/[email protected]/_,_,1
* About to connect() to localhost port 8098 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 8098 (#0)
> GET /riak/email_address/[email protected]/_,_,1 HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k
> zlib/1.2.3.3 libidn/1.15
> Host: localhost:8098
> Accept: */*
>
< HTTP/1.1 404 Object Not Found
< Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted it blue)
< Date: Wed, 25 Apr 2012 20:49:40 GMT
< Content-Type: text/html
< Content-Length: 193
<
* Connection #0 to host localhost left intact
* Closing connection #0
<HTML><HEAD><TITLE>404 Not Found</TITLE></HEAD><BODY><H1>Not Found</H1>The requested document was not found on this server.<P><HR><ADDRESS>mochiweb+webmachine web server</ADDRESS></BODY></HTML>
--

There was nothing more in any error logs after this. Fetching keys directly also fails on this node.

Riak is installed from the Debian packages. I can send you whatever other info is necessary. We tried many different combinations, but I think this one is the most correct and produced the most useful error messages.

Any help is appreciated. Thanks.
