We were getting this exact same problem in a really simple MR job, on input
produced from a known-working MR job.
It seemed to happen intermittently, and we couldn't figure out what was up.
In the end we solved the problem by increasing the number of maps (80 to
200; this is a 6-node, 12-core cluster). Apparently, QuickSort can have
problems with big chunks of pre-sorted data: too much recursion, I believe,
because lopsided pivots make the recursion depth grow with the input size
rather than with its logarithm.
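To illustrate what I mean, here is a generic sketch in plain Java (not
Hadoop's actual org.apache.hadoop.util.QuickSort; names are made up):

    public class QuickSortDepthDemo {

        // Lomuto partition with the last element as pivot: on
        // already-sorted input every element lands on one side, so the
        // splits are maximally lopsided.
        static int partition(int[] a, int lo, int hi) {
            int pivot = a[hi];
            int i = lo;
            for (int j = lo; j < hi; j++) {
                if (a[j] < pivot) {
                    int t = a[i]; a[i] = a[j]; a[j] = t;
                    i++;
                }
            }
            int t = a[i]; a[i] = a[hi]; a[hi] = t;
            return i;
        }

        // Naive version: recursion depth is O(n) on sorted input, which
        // is what overflows the stack.
        static void naiveSort(int[] a, int lo, int hi) {
            if (lo >= hi) return;
            int p = partition(a, lo, hi);
            naiveSort(a, lo, p - 1);
            naiveSort(a, p + 1, hi);
        }

        // The usual fix: recurse only into the smaller partition and
        // iterate on the larger one, bounding stack depth to O(log n).
        static void saferSort(int[] a, int lo, int hi) {
            while (lo < hi) {
                int p = partition(a, lo, hi);
                if (p - lo < hi - p) {
                    saferSort(a, lo, p - 1);
                    lo = p + 1;
                } else {
                    saferSort(a, p + 1, hi);
                    hi = p - 1;
                }
            }
        }
    }

Splitting the input across more maps presumably just keeps each in-memory
sort small enough that even a bad case stays shallow.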
This might not be what's going on in your case, and maybe your cluster is
at some other scale, but this worked for us (in a setup with Hadoop 0.17).
Good luck!
-Colin
On Mon, Jun 2, 2008 at 3:18 PM, Devaraj Das [EMAIL PROTECTED] wrote:
Hi, do you have a testcase that we can run to reproduce this? Thanks!
-----Original Message-----
From: jkupferman [mailto:[EMAIL PROTECTED]]
Sent: Monday, June 02, 2008 9:22 AM
To: core-user@hadoop.apache.org
Subject: Stack Overflow When Running Job
Hi everyone,
I have a job that keeps failing with stack overflows, and I really don't
see how that is happening. The job runs for about 20-30 minutes before one
task errors; then a few more error out and the job fails.
I am running Hadoop 0.17, and I've tried lowering these settings, to no
avail:

    io.sort.factor = 50
    io.seqfile.sorter.recordlimit = 50
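For reference, a minimal sketch of setting these programmatically, assuming
the 0.17-era JobConf API (the class name below is made up):

    import org.apache.hadoop.mapred.JobConf;

    public class JobSetupSketch {
        public static JobConf configure() {
            JobConf conf = new JobConf();
            // Number of streams merged at once while sorting spill files.
            conf.setInt("io.sort.factor", 50);
            // Record limit for the SequenceFile sorter.
            conf.setInt("io.seqfile.sorter.recordlimit", 50);
            return conf;
        }
    }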
java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:594)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:576)
    at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
    at Group.write(Group.java:68)
    at GroupPair.write(GroupPair.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:434)
    at MyMapper.map(MyMapper.java:27)
    at MyMapper.map(MyMapper.java:10)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
Caused by: java.lang.StackOverflowError
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at Group.readFields(Group.java:62)
    at GroupPair.readFields(GroupPair.java:60)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:91)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:494)
    at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
    at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
    at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
    (the above line repeated 200x)
I defined a WritableComparable called GroupPair which simply holds two
Group objects, each of which contains two integers.
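Roughly, the types look like this (a sketch reconstructed from the
description and the trace; the field names are placeholders):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    class Group implements WritableComparable<Group> {
        private int first;
        private int second;

        public void write(DataOutput out) throws IOException {
            out.writeInt(first);   // Group.write -> writeInt, as in the trace
            out.writeInt(second);
        }

        public void readFields(DataInput in) throws IOException {
            first = in.readInt();  // Group.readFields -> readInt, as in the trace
            second = in.readInt();
        }

        public int compareTo(Group o) {
            if (first != o.first) return first < o.first ? -1 : 1;
            if (second != o.second) return second < o.second ? -1 : 1;
            return 0;
        }
    }

    class GroupPair implements WritableComparable<GroupPair> {
        private Group left = new Group();
        private Group right = new Group();

        public void write(DataOutput out) throws IOException {
            left.write(out);       // GroupPair.write delegates to Group.write
            right.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            left.readFields(in);
            right.readFields(in);
        }

        public int compareTo(GroupPair o) {
            int c = left.compareTo(o.left);
            return c != 0 ? c : right.compareTo(o.right);
        }
    }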
I fail to see how QuickSort could recurse 200+ times, since with balanced
partitions that depth would require on the order of 2^200 entries, far more
than the 500 million that had been output at that point.
How is this even possible? And what can be done to fix this?