Hi Erik / Mailing list participants --

After a fair amount of research, I believe the timeout in the crash report is
unrelated to the timeout given in the mr specification. The reduce step itself
is fine: the mr succeeds much of the time. But when the data set is too large
we often see the timeout. We've run this mr hundreds of times over different
sizes of data and found a rough size cutoff beyond which the mr starts failing
sporadically. Complicating things, other mr jobs can be running at the same
time.

After learning more of the riak code and reviewing the error message, I believe
the timeout comes from an attempt to enqueue a result from a prior reduce step
(the cmd_enqueue request in the crash report). Maybe a queue is already maxed
out, and even a spillover queue is maxed out. The "Offender" section in the
supervisor report indicates a child worker crashed and needs to be restarted,
and maybe the restart is taking too long, more than 2000 ms. I think that's
where the timeout comes in. Perhaps someone from basho could confirm (or
reject) my theory.
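
To make the theory concrete, here is a toy model in Python. It is not riak_pipe
code: the bounded queue, the function name and both timeout values are made up
for illustration. It only shows how a short per-call timeout on a full queue
can fire long before a job-level timeout of one hour is reached.

```python
import queue

JOB_TIMEOUT_S = 3600.0   # the timeout we pass in the mr spec (one hour)
CALL_TIMEOUT_S = 0.05    # hypothetical per-call timeout, scaled down for the demo


def try_enqueue(q, item, timeout_s):
    """Try to put item on a bounded queue; return False if it stays full for timeout_s."""
    try:
        q.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False


# Stand-in for a worker queue that nobody is draining (assumption for the demo).
downstream = queue.Queue(maxsize=2)

# The first two enqueues fit; the third waits CALL_TIMEOUT_S and then gives up.
results = [try_enqueue(downstream, n, CALL_TIMEOUT_S) for n in range(3)]
print(results)  # [True, True, False]
print(f"gave up after {CALL_TIMEOUT_S}s, job timeout was {JOB_TIMEOUT_S}s")
```

If something like this is happening, raising the "timeout" field in the mr spec
would not help, since the failing wait is a different, much shorter one.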

So possible solutions are adding a node or reducing the size of the data to be
mr'd. Since we want to keep up with realtime, reducing the size of each mr
means we'll run more mr jobs. I'm not certain this will improve the situation,
but it's the cheapest solution. Adding a node will add computing power but not
partitions (from my reading), so if partitions are somehow the limitation it's
a more difficult problem.
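
A minimal sketch of the "smaller mr jobs" option, in Python rather than our
actual haskell driver. It assumes the index range is inclusive at both ends and
that the results of the sub-jobs can be merged by the client afterwards; the
helper names split_range and build_job and the chunk size of 300 are
hypothetical. The job body mirrors the JSON from my original mail.

```python
import json


def split_range(start, end, chunk):
    """Yield inclusive (lo, hi) sub-ranges covering [start, end] without overlap."""
    if chunk < 1:
        raise ValueError("chunk must be >= 1")
    lo = start
    while lo <= end:
        hi = min(lo + chunk - 1, end)
        yield lo, hi
        lo = hi + 1


def build_job(lo, hi, timeout_ms=3600000):
    """Build one mr job body in the same shape as the JSON we post with curl."""
    return {
        "inputs": {"bucket": "data_bucket", "index": "minute_int",
                   "start": lo, "end": hi},
        "query": [
            {"map": {"language": "erlang", "module": "maps",
                     "function": "emitStatsFromList", "keep": False}},
            {"reduce": {"language": "erlang", "module": "reduces",
                        "function": "reduceStatsList", "keep": True}},
        ],
        "timeout": timeout_ms,
    }


# Split the original 0..900 range into smaller jobs (chunk size is a guess).
jobs = [build_job(lo, hi) for lo, hi in split_range(0, 900, 300)]
for job in jobs:
    print(json.dumps(job["inputs"]))
```

Each printed body would be posted as its own mr job. Whether several small jobs
put less pressure on the queues than one big job is exactly the part I'm unsure
about.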

If anyone has experience with this, I'd appreciate your input.

Thanks!!

On Jul 26, 2012, at 3:03 AM, Erik Hoogeveen wrote:
> Hello John,
>
> I'm not really an expert, but looking at your crash log, my first guess is
> that an error occurs in the reduce part of the map/reduce job itself.
> More specifically, I think you need to examine the meaning of this bit in
> your crash report:
>
>> {details,{fitting_details,{fitting,<11882.27640.38>,#Ref<11882.0.6211.103965>,follow,1},{prereduce,0},riak_kv_w_reduce,{rct,#Fun<reduce_inputs.reduceStatsList.2>,none},{fitting,<11882.27639.38>,#Ref<11882.0.6211.103965>,<<103,28,147,16,123,67,248,114,104,204,9,54,33,62,81,41,129,84,203,83>>,1},[{sink,{fitting,<11882.24969.38>,#Ref<11882.0.6211.103965>,sink,undefined}},{log,sink},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}],64}}]}]
>
>
> But like I said, it's just a hunch.
>
> Cheers,
> Erik Hoogeveen
>
> On 25 jul 2012, at 20:11, John Roy wrote:
>
>> Hi --
>>
>> I'm seeing an issue with timeouts for map/reduces. We're running erlang
>> files via a curl command, as part of a haskell job. In the curl data we
>> specify the timeout to be one hour (3,600,000 milliseconds; see the example
>> below). However, the job crashes (times out) after well less than an hour
>> (generally 450-1000 seconds). See the sample crash below.
>>
>> Does anyone have an idea or insight on why that might occur? I've done some
>> searching on the riak_kv code but haven't been able to trace the error
>> through it yet.
>>
>> A sample mr (similar to this one) has 3398 keys which should map 278012
>> items to be reduced.
>>
>> We're using the eleveldb back end with riak 1.1.1.
>>
>> One other note, we can be running more than one mr over the same data
>> simultaneously.
>>
>> Thanks for your help!
>>
>>
>>
>> The input to the curl command is something like this:
>> {
>>   "inputs" : {
>>     "bucket" : "data_bucket",
>>     "index" : "minute_int",
>>     "start" : 0,
>>     "end" : 900
>>   },
>>   "query" : [
>>     { "map" : { "language" : "erlang", "module" : "maps",
>>                 "function" : "emitStatsFromList", "keep" : false } },
>>     { "reduce" : { "language" : "erlang", "module" : "reduces",
>>                    "function" : "reduceStatsList", "keep" : true } }
>>   ],
>>   "timeout" : 3600000
>> }
>>
>>
>> Here's a (cleansed) crash log:
>>
>> 2012-07-24 08:01:44 =CRASH REPORT====
>> crasher:
>> initial call: riak_pipe_vnode_worker:init/1
>> pid: <0.28552.254>
>> registered_name: []
>> exception exit:
>> {timeout,{gen_server,call,[{riak_pipe_vnode_master,'[email protected]'},{return_vnode,{riak_vnode_req_v1,593735040165679310520246963290989976735222595584,{raw,#Ref<11882.0.6211.103965>,<0.28552.254>},{cmd_enqueue,{fitting,<11882.27639.38>,#Ref<11882.0.6211.103965>,<<103,28,147,16,123,67,248,114,104,204,9,54,33,62,81,41,129,84,203,83>>,1},{"dummykey",1},infinity,[{593735040165679310520246963290989976735222595584,'[email protected]'}]}}}]}}
>> in function gen_fsm:terminate/7
>> in call from proc_lib:init_p_do_apply/3
>> ancestors: [<0.348.0>,<0.347.0>,riak_core_vnode_sup,riak_core_sup,<0.89.0>]
>> messages: []
>> links: [<0.348.0>,<0.347.0>]
>> dictionary:
>> [{eunit,[{module,riak_pipe_vnode_worker},{partition,662242929415565384811044689824565743281594433536},{<0.347.0>,<0.347.0>},{details,{fitting_details,{fitting,<11882.27640.38>,#Ref<11882.0.6211.103965>,follow,1},{prereduce,0},riak_kv_w_reduce,{rct,#Fun<reduce_inputs.reduceStatsList.2>,none},{fitting,<11882.27639.38>,#Ref<11882.0.6211.103965>,<<103,28,147,16,123,67,248,114,104,204,9,54,33,62,81,41,129,84,203,83>>,1},[{sink,{fitting,<11882.24969.38>,#Ref<11882.0.6211.103965>,sink,undefined}},{log,sink},{trace,{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[error],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}],64}}]}]
>> trap_exit: false
>> status: running
>> heap_size: 317811
>> stack_size: 24
>> reductions: 14770819
>> neighbours:
>> 2012-07-24 08:01:44 =SUPERVISOR REPORT====
>> Supervisor: {<0.348.0>,riak_pipe_vnode_worker_sup}
>> Context: child_terminated
>> Reason:
>> {timeout,{gen_server,call,[{riak_pipe_vnode_master,'[email protected]'},{return_vnode,{riak_vnode_req_v1,593735040165679310520246963290989976735222595584,{raw,#Ref<11882.0.6211.103965>,<0.28552.254>},{cmd_enqueue,{fitting,<11882.27639.38>,#Ref<11882.0.6211.103965>,<<103,28,147,16,123,67,248,114,104,204,9,54,33,62,81,41,129,84,203,83>>,1},{"dummykey",1},infinity,[{593735040165679310520246963290989976735222595584,'[email protected]'}]}}}]}}
>> Offender:
>> [{pid,<0.28552.254>},{name,undefined},{mfargs,{riak_pipe_vnode_worker,start_link,undefined}},{restart_type,temporary},{shutdown,2000},{child_type,worker}]
>> _______________________________________________
>> riak-users mailing list
>> [email protected]
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>