Possible issue with BinaryData.compare(...) used in Map/Reduce
--------------------------------------------------------------
Key: AVRO-764
URL: https://issues.apache.org/jira/browse/AVRO-764
Project: Avro
Issue Type: Bug
Components: doc
Affects Versions: 1.4.1
Environment: Linux, CDH3
Reporter: Harsh J Chouraria
Assignee: Harsh J Chouraria
Priority: Critical
Fix For: 1.5.0
MapReduce's RawComparator interface exposes a call {{compare(b1, start1, length1, b2,
start2, length2)}}, which is handled for Avro by {{BinaryData.compare(b1,
start1, b2, start2, schema)}}; note that the framework-supplied length arguments are dropped.
BinaryDecoder, which the {{BinaryData.compare(b1, start1, b2, start2, schema)}}
utility uses internally, has a sub-clause that handles byte array inputs shorter
than 16 bytes.
This is the exact code I am talking about:
{code:title=BinaryDecoder.java Lines 879-893|borderStyle=solid}
private ByteArrayByteSource(byte[] data, int start, int len) {
  super();
  // make sure data is not too small, otherwise getLong may try and
  // read 10 bytes and get index out of bounds.
  if (data.length < 16 || len < 16) {
    this.data = new byte[16];
    System.arraycopy(data, start, this.data, 0, len);
    this.position = 0;
    this.max = len;
  } else {
    // use the array passed in
    this.data = data;
    this.position = start;
    this.max = start + len;
  }
}
{code}
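For context, the 16-byte padding exists because the decoder's fast path may read ahead: a variable-length long in Avro's binary encoding occupies up to 10 bytes, so a bulk {{getLong}} read near the end of a small buffer needs headroom. A toy varint reader (my own sketch, not Avro's implementation) shows the 10-byte worst case:

```java
public class VarintHeadroom {
  // Decodes an unsigned little-endian base-128 varint. A long can take
  // up to 10 bytes (ceil(64 / 7)), which is why a fast-path decoder that
  // reads 10 bytes unconditionally needs padding near the buffer end.
  static long readVarint(byte[] buf, int pos) {
    long value = 0;
    for (int shift = 0; shift < 70; shift += 7) {
      byte b = buf[pos++];
      value |= (long) (b & 0x7f) << shift;
      if ((b & 0x80) == 0) return value;
    }
    throw new IllegalArgumentException("varint too long");
  }

  public static void main(String[] args) {
    // All 64 bits set (-1L as an unsigned varint) takes the full 10 bytes.
    byte[] max = new byte[] {(byte) 0xff, (byte) 0xff, (byte) 0xff,
        (byte) 0xff, (byte) 0xff, (byte) 0xff, (byte) 0xff, (byte) 0xff,
        (byte) 0xff, 0x01};
    System.out.println(readVarint(max, 0)); // prints -1
  }
}
```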
This clause fails during {{System.arraycopy}}: the {{len}} argument passed into
the constructor is the {{length}} of the entire byte array, not the range length
supplied by the framework, which would be the right one [since the
{{BD.compare()}} call does not take it as a parameter]. {{arraycopy}} therefore
tries to copy {{data.length}} bytes starting at index {{start}}, and whenever
{{start}} is non-zero this reads past the end of the array and raises an index
out of bounds exception.
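The failure can be reproduced without Avro at all, since the arithmetic is a plain {{System.arraycopy}} bounds violation. A minimal sketch (class and method names are mine, not Avro's) of what the small-buffer branch effectively does when handed {{len = data.length}} with a non-zero {{start}}:

```java
public class ArrayCopyBug {
  // Mimics the small-buffer branch: System.arraycopy(data, start, dest, 0, len)
  // where the caller wrongly passes data.length as len instead of
  // data.length - start.
  static byte[] copyTail(byte[] data, int start, int len) {
    byte[] dest = new byte[16];
    System.arraycopy(data, start, dest, 0, len);
    return dest;
  }

  public static void main(String[] args) {
    byte[] data = new byte[10]; // a small serialized record, < 16 bytes
    int start = 5;              // framework-style non-zero offset

    // Correct call: copy only the bytes that remain after 'start'.
    copyTail(data, start, data.length - start); // in bounds

    // Buggy call: len is the whole array length, so the copy would read
    // positions start..start+len-1 = 5..14 of a 10-byte array.
    try {
      copyTail(data, start, data.length);
      System.out.println("no exception");
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("ArrayIndexOutOfBoundsException");
    }
  }
}
```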
Here is a slightly low-level test case that should fail with an
{{ArrayIndexOutOfBoundsException}} due to the bad {{System.arraycopy}} call:
{code:title=TestBinaryData.java|borderStyle=solid}
package org.apache.avro.io;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.Schema.Type;
import org.apache.avro.generic.GenericData.Record;
import org.apache.avro.generic.GenericDatumWriter;
import org.junit.Assert;
import org.junit.Test;

public class TestBinaryData {
  @Test
  public void testCompare() {
    // Prepare a schema for testing.
    Field integerField = new Field("test", Schema.create(Type.INT), null, null);
    List<Field> fields = new ArrayList<Field>();
    fields.add(integerField);
    Schema record = Schema.createRecord("test", null, null, false);
    record.setFields(fields);

    ByteArrayOutputStream b1 = new ByteArrayOutputStream(5);
    ByteArrayOutputStream b2 = new ByteArrayOutputStream(5);
    BinaryEncoder b1Enc = new BinaryEncoder(b1);
    BinaryEncoder b2Enc = new BinaryEncoder(b2);
    Record testDatum1 = new Record(record);
    testDatum1.put(0, 1);
    Record testDatum2 = new Record(record);
    testDatum2.put(0, 2);
    GenericDatumWriter<Record> gWriter = new GenericDatumWriter<Record>(record);
    int start1 = 0, start2 = 0;

    try {
      // Write two records into each buffer; the compare call then starts
      // at the second record, i.e. at a non-zero offset.
      gWriter.write(testDatum1, b1Enc);
      b1Enc.flush();
      start1 = b1.size();
      gWriter.write(testDatum1, b1Enc);
      b1Enc.flush();
      b1.close();

      gWriter.write(testDatum2, b2Enc);
      b2Enc.flush();
      start2 = b2.size();
      gWriter.write(testDatum2, b2Enc);
      b2Enc.flush();
      b2.close();

      BinaryData.compare(b1.toByteArray(), start1, b2.toByteArray(), start2,
          record);
    } catch (IOException e) {
      Assert.fail("IOException while writing records to output stream.");
    }
  }
}
{code}
A solution would be to let the length, as given by the MapReduce framework
itself, be used instead of {{bytearray.length}}. I'll attach a patch for this
soon.
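In plain-Java terms, the fix corresponds to threading the caller's range lengths all the way down to the copy, so that {{start + len}} never exceeds {{data.length}}. A minimal sketch (class and method names are mine, not the actual patch, and lexicographic byte order stands in for schema-aware decoding):

```java
public class LengthAwareCompare {
  // Stand-in for constructing a decoder source over a sub-range. The real
  // code avoids the copy for buffers >= 16 bytes; copying unconditionally
  // keeps the sketch simple. Crucially, 'len' is the caller-supplied range
  // length, so start + len stays within data.length.
  static byte[] slice(byte[] data, int start, int len) {
    byte[] copy = new byte[Math.max(len, 16)];
    System.arraycopy(data, start, copy, 0, len);
    return copy;
  }

  // A length-aware entry point in the spirit of RawComparator's signature:
  // both offsets AND lengths are passed through.
  static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    byte[] d1 = slice(b1, s1, l1);
    byte[] d2 = slice(b2, s2, l2);
    // Real code would decode per the schema; byte order stands in here.
    for (int i = 0; i < Math.min(l1, l2); i++) {
      int c = (d1[i] & 0xff) - (d2[i] & 0xff);
      if (c != 0) return c;
    }
    return l1 - l2;
  }

  public static void main(String[] args) {
    byte[] buf = {0, 0, 0, 1, 0, 0, 0, 2}; // two tiny "records" back to back
    // Non-zero offset with the correct range length: no out-of-bounds copy.
    System.out.println(compare(buf, 0, 4, buf, 4, 4) < 0 ? "b1 < b2" : "?");
  }
}
```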
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira