Ben Roling created MAPREDUCE-5767:
-------------------------------------
Summary: Data corruption when single value exceeds map buffer size
(io.sort.mb)
Key: MAPREDUCE-5767
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5767
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mrv1
Affects Versions: 0.20.1
Reporter: Ben Roling
There is an issue in org.apache.hadoop.mapred.MapTask in 0.20 that can cause
data corruption when the size of a single value produced by the mapper exceeds
the size of the map output buffer (roughly io.sort.mb).
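For concreteness, a hypothetical mapper along these lines should set up the conditions described below (a few ordinary records followed by one value larger than the buffer). The class name and sizes are invented for illustration, and it assumes io.sort.mb is at a 100 MB setting:
{code:java}
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical repro sketch -- names and sizes are invented, not from the
// affected job.  Writes some small records, then a single ~110 MB value,
// which is larger than a 100 MB map output buffer (io.sort.mb = 100).
public class OversizedValueMapper
    extends Mapper<LongWritable, Text, Text, BytesWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // A few ordinary records so the buffer is not empty when the big one arrives.
    for (int i = 0; i < 10; i++) {
      context.write(new Text("small-" + i), new BytesWritable(new byte[1024]));
    }
    // A single value larger than the whole sort buffer.
    byte[] big = new byte[110 * 1024 * 1024];
    context.write(new Text("big"), new BytesWritable(big));
  }
}
{code}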
I experienced this in CDH4.2.1, but am logging it here for greater visibility
in case anyone else runs across the same problem.
The issue does not exist in 0.21 and beyond due to the implementation of
MAPREDUCE-64. That JIRA significantly changes the way the map output buffering
is done and it looks like the issue has been resolved by those changes.
I expect this bug will likely be closed as Won't Fix since 0.20 is obsolete.
As stated previously, I am just logging this issue for visibility in case
anyone else is still running something based on 0.20 and encounters the same
problem.
In my situation the issue manifested as an ArrayIndexOutOfBoundsException in
the reduce phase when deserializing a key -- causing the job to fail. However,
I think the problem could manifest in a more dangerous fashion where the
affected job succeeds, but produces corrupt output. The stack trace I saw was:
2014-02-13 01:07:34,690 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.ArrayIndexOutOfBoundsException: 24
    at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:173)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:135)
    at org.apache.crunch.types.avro.SafeAvroSerialization$AvroWrapperDeserializer.deserialize(SafeAvroSerialization.java:86)
    at org.apache.crunch.types.avro.SafeAvroSerialization$AvroWrapperDeserializer.deserialize(SafeAvroSerialization.java:70)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:135)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:114)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:291)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:163)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
The problem appears to me to be in
org.apache.hadoop.mapred.MapTask.MapOutputBuffer.Buffer.write(byte[], int, int).
The sequence of events that leads up to the issue is:
* some complete records (cumulative size less than the total buffer size) are written to the buffer
* a large (over io.sort.mb) record starts writing
* the [soft buffer limit is exceeded|https://github.com/apache/hadoop-common/blob/release-0.20.1/src/mapred/org/apache/hadoop/mapred/MapTask.java#L1030] - a spill starts
* the write of the large record continues
* the buffer becomes [full|https://github.com/apache/hadoop-common/blob/release-0.20.1/src/mapred/org/apache/hadoop/mapred/MapTask.java#L1012]
* [wrap|https://github.com/apache/hadoop-common/blob/release-0.20.1/src/mapred/org/apache/hadoop/mapred/MapTask.java#L1013] evaluates to true, suggesting the buffer can be safely wrapped
* writing of the large record continues until a write occurs such that bufindex + len == bufstart exactly. When this happens, [buffull|https://github.com/apache/hadoop-common/blob/release-0.20.1/src/mapred/org/apache/hadoop/mapred/MapTask.java#L1018] evaluates to false, so the data is written to the buffer without incident
* writing of the large value continues with another call to write(), starting the corruption of the buffer. A full buffer can no longer be detected by the [buffull logic|https://github.com/apache/hadoop-common/blob/release-0.20.1/src/mapred/org/apache/hadoop/mapred/MapTask.java#L1012] that is used when bufindex >= bufstart
The key to triggering this problem is a write where bufindex + len equals
bufstart exactly.
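To make that boundary concrete, here is a small, self-contained sketch (not the real MapTask code) of how the two space checks evaluate around it. The variable names mirror the linked source; the concrete values are invented, and for simplicity it assumes the in-progress spill has finished by this point, so bufstart has advanced to bufend:
{code:java}
// Illustrative sketch only -- not the actual MapOutputBuffer.Buffer.write().
// Values are invented; assumes the earlier spill completed (bufstart == bufend).
public class BufferBoundarySketch {
  public static void main(String[] args) {
    final int bufvoid = 100;          // kvbuffer.length
    int bufstart = 70, bufend = 70;   // spill finished; both point mid-buffer
    int bufindex = 50;                // writer has already wrapped past bufvoid
    int len = 20;                     // next chunk of the large record

    // Wrapped state (bufend <= bufindex does not hold), so the L1018 check runs:
    boolean buffull = bufindex + len > bufstart;    // 50 + 20 > 70 -> false
    System.out.println("first write: buffull = " + buffull);   // write proceeds
    bufindex += len;                                // now bufindex == bufstart == 70

    // Next call: the state satisfies bufstart <= bufend <= bufindex, so only the
    // L1012 check against bufvoid runs -- it no longer protects the unspilled
    // record bytes that currently fill the rest of the buffer.
    len = 20;
    buffull = bufindex + len > bufvoid;             // 70 + 20 > 100 -> false
    System.out.println("second write: buffull = " + buffull);  // write proceeds -> corruption
  }
}
{code}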
I have titled the issue as having to do with writing large records (over
io.sort.mb), but really I think the issue *could* occur with smaller records if
the serializer generated a write of exactly the right size. For example, the
buffer could be getting close to full without having exceeded the soft limit,
and then a collect() of a new value could trigger a write() such that
bufindex + len == bufstart. The size of that write would have to be relatively
large -- greater than the free space offered by the soft limit (20% of the
buffer by default) -- making it pretty unlikely that the issue occurs this way.
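For a rough sense of the numbers, here is the arithmetic assuming the defaults io.sort.mb = 100 and io.sort.spill.percent = 0.80 (the figures are illustrative, not from a real job):
{code:java}
// Back-of-the-envelope arithmetic for the smaller-record variant, assuming
// default io.sort.mb = 100 and io.sort.spill.percent = 0.80.
public class SoftLimitHeadroom {
  public static void main(String[] args) {
    long bufvoid = 100L * 1024 * 1024;         // io.sort.mb = 100 -> 104,857,600 bytes
    long softLimit = (long) (bufvoid * 0.80);  // a spill is triggered past this point
    long headroom = bufvoid - softLimit;       // 20,971,520 bytes (20% of the buffer)
    // Below the soft limit, a single serialized write would have to exceed
    // 'headroom' AND land so that bufindex + len == bufstart exactly.
    System.out.println("headroom above soft limit: " + headroom + " bytes");
  }
}
{code}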
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)