I'm running a simple Perl script over a relatively large data set
(120GB gzipped) on EC2.
The Perl script is trivial:
use strict;
use warnings;

while (<STDIN>) {
    chomp;
    # Fields are caret-separated: docID^url^timestamp^...
    # Skip lines that don't match so $1..$3 are never stale.
    if (/(.*)\^(.*)\^(.*)\^(.*)\^(.*)/) {
        my ($docID, $url, $tS) = ($1, $2, $3);
        # Emit only records whose second field looks like a URL.
        print "$url\t$docID\t$tS\n" if $url =~ /http/;
    }
}
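The mapper can be smoke-tested locally on one input file before submitting the job; the file name here is just a placeholder:

zcat sample.gz | perl print.pl | head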
The command I use to run it is:

./hadoop jar ../contrib/hadoop-streaming.jar \
    -mapper "perl print.pl" \
    -input "input/*" \
    -output out \
    -file print.pl \
    -reducer NONE
This step works fine.
If I add -reducer "uniq -c" (or just remove -reducer NONE), every map
task fails.
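The failing invocation is the same command with the reducer swapped in:

./hadoop jar ../contrib/hadoop-streaming.jar \
    -mapper "perl print.pl" \
    -reducer "uniq -c" \
    -input "input/*" \
    -output out \
    -file print.pl

Each map task then dies with: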
java.io.IOException: MROutput/MRErrThread failed: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.Text.write(Text.java:243)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:338)
        at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:344)
        at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:76)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:189)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1777)
I tried increasing the heap size in conf/hadoop-env.sh, but nothing
helped.
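Concretely, the change was along these lines (the value shown is just one of several I tried):

# conf/hadoop-env.sh -- maximum heap for the Hadoop daemons, in MB
export HADOOP_HEAPSIZE=2000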
What should I do?
Dejan