I am running 208 MapReduce jobs in rapid-fire succession using anonymous 
JavaScript functions.  I am sending all of the jobs to a single node, 
riak01.  There are about 75,000 keys in the bucket.
Erlang: R13B04
Riak: 0.14.2

When I had my MapReduce timeout set to 120,000 ms ("timeout":120000), I was 
getting:
mapexec_error, {error,timeout}
The first timeout was written to the error log after only seven seconds.  The 
second and third were written after five seconds, and the fourth after eight 
seconds, all well short of the 120-second timeout.  The beam process never 
crashed.

So, I increased the value to 30,000,000 ("timeout":30000000), i.e., more than 
eight hours.  In the first run, all 208 MapReduce jobs completed without 
error, each one taking about 1 to 3 seconds.
The CPU usage on riak01 was about 50 percent for all 208 jobs.
Below is sample output from iostat -x:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          51.00    0.00    5.01    0.10    0.00   43.89

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz 
avgqu-sz   await  svctm  %util
hda               0.00     8.22    0.00    3.21     0.00    91.38    28.50     
0.01    2.62   2.12   0.68

In the second run, the 53rd MapReduce job was still waiting to complete after 
10 minutes.  So, there was never a timeout, and nothing was written to the 
error logs.  However, the beam process had obviously wedged: the node was 
unresponsive even though the OS process was still running.  On riak01, I 
executed the following commands:
./riak-admin status
Node is not running!
./riak ping
Node '[email protected]' not responding to pings.
./riak attach
Node is not running!

However, ps and top showed the process running.
ps output:
1003     31807  1.0  8.7 172080 132584 pts/1   Rsl+ Jun22  28:53     
/home/DMitchell/riak2/riak/rel/riak/erts-5.7.5/bin/beam -K true -A 64 -- -root 
/home/DMitchell/riak2/riak/rel/riak -progname riak -- -home /home/DMitchell -- 
-boot /home/DMitchell/riak2/riak/rel/riak/releases/0.14.2/riak -embedded 
-config /home/DMitchell/riak2/riak/rel/riak/etc/app.config -name 
[email protected] -setcookie riak -- console

top output:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31807 DMitchel  25   0  168m 129m 4360 R 99.3  8.7  30:46.08 beam

Below is sample output from iostat -x while beam was in this state:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
         100.00    0.00    0.00    0.00    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz 
avgqu-sz   await  svctm  %util
hda               0.00     1.82    0.00    0.61     0.00    19.45    32.00     
0.00    2.00   2.00   0.12

Note the 100 percent CPU usage for the beam process.  I terminated the beam 
process with: kill -s TERM 31807.  Then, I restarted riak.

There were no errors on the other two nodes, except for:
=ERROR REPORT==== 24-Jun-2011::12:29:36 ===
** Node '[email protected]' not responding **
** Removing (timedout) connection **

The MapReduce job is not that complex.  I am using a key filter.  The map 
phase looks for a "LoadRange" field and, if there is a match, emits a new 
variable (e.g., "Load1").  The reduce phase counts the matches.
{
    "inputs" : {
        "bucket" : "names-51013",
        "key_filters" : [["starts_with", "22204-1-3"]]
    },
    "query" : [{
        "map" : {
            "keep" : false,
            "language" : "javascript",
            "arg" : null,
            "source" : "function(value,keyData,arg){var data=Riak.mapValuesJson(value)[0];if(data.LoadRange&&data.LoadRange==1) return[{\"data.Load1\":1}];else if(data.LoadRange&&data.LoadRange==2) return[{\"data.Load2\":1}];else if(data.LoadRange&&data.LoadRange==3) return[{\"data.Load3\":1}];else if(data.LoadRange&&data.LoadRange==4) return[{\"data.Load4\":1}];else if(data.LoadRange&&data.LoadRange==5) return[{\"data.Load5\":1}];else if(data.LoadRange&&data.LoadRange==6) return[{\"data.Load6\":1}];else if(data.LoadRange&&data.LoadRange==7) return[{\"data.Load7\":1}];else if(data.LoadRange&&data.LoadRange==8) return[{\"data.Load8\":1}];else if(data.LoadRange&&data.LoadRange==9) return[{\"data.Load9\":1}];else if(data.LoadRange&&data.LoadRange==10) return[{\"data.Load10\":1}];else return[];}"
        }
    }, {
        "reduce" : {
            "keep" : true,
            "language" : "javascript",
            "arg" : null,
            "source" : "function(v){var s={};for(var i in v){for(var n in v[i]){if(n in s) s[n]+=v[i][n];else s[n]=v[i][n];}} return[s];}"
        }
    }],
    "timeout" : 30000000
}


The MapReduce timeouts seem to be happening at different places, e.g., during 
the map phase, during the reduce phase, and during the key-filtering phase 
(#Fun<riak_kv_mapred_json.jsonify_not_found.1>,[],[]}).

See the URL below for the complete sasl-error.log from just before a recent 
beam crash:
https://gist.github.com/1045386

Can anyone shed any light on why I am getting timeouts and crashes?

David
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
