Ruiqi Dong created AVRO-4268:
--------------------------------
Summary: BytesWritableConverter serializes unused capacity bytes
Key: AVRO-4268
URL: https://issues.apache.org/jira/browse/AVRO-4268
Project: Apache Avro
Issue Type: Bug
Components: java
Reporter: Ruiqi Dong
*Summary*
`AvroDatumConverterFactory.BytesWritableConverter` converts Hadoop
`BytesWritable` to Avro `bytes` using `ByteBuffer.wrap(input.getBytes())`.
`BytesWritable.getBytes()` returns the backing array, whose length can be
larger than the logical value length reported by `getLength()`. The converter
therefore serializes stale or unused capacity bytes.
*Affected code*
File:
`lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroDatumConverterFactory.java`
{code:java}
@Override
public ByteBuffer convert(BytesWritable input) {
  // getBytes() returns the whole backing array, not just the first getLength() bytes.
  return ByteBuffer.wrap(input.getBytes());
}
{code}
*Reproducer*
Add this test to
`lang/java/mapred/src/test/java/org/apache/avro/hadoop/io/TestAvroDatumConverterFactory.java`
{code:java}
@Test
void convertBytesWritableRespectsLogicalLength() {
  AvroDatumConverter<BytesWritable, ByteBuffer> converter =
      mFactory.create(BytesWritable.class);

  // Backing array holds 5 bytes, but the logical length is reduced to 3.
  BytesWritable writable = new BytesWritable(new byte[] { 1, 2, 3, 4, 5 });
  writable.setSize(3);

  ByteBuffer bytes = converter.convert(writable);

  assertEquals(3, bytes.remaining());
  assertEquals(1, bytes.get(0));
  assertEquals(2, bytes.get(1));
  assertEquals(3, bytes.get(2));
}
{code}
Run:
{code:bash}
MAVEN_SKIP_RC=true \
JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.6/libexec/openjdk.jdk/Contents/Home \
PATH=/opt/homebrew/Cellar/openjdk@21/21.0.6/libexec/openjdk.jdk/Contents/Home/bin:/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin \
/opt/homebrew/bin/mvn -q -t toolchains-local.xml -pl lang/java/mapred -am \
  -Dtest=org.apache.avro.hadoop.io.TestAvroDatumConverterFactory#convertBytesWritableRespectsLogicalLength \
  -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false \
  -Dinvoker.skip=true -Drat.skip=true test
{code}
*Observed behavior*
The converted `ByteBuffer` has `remaining() == 5`.
*Expected behavior*
The converted `ByteBuffer` should have `remaining() == input.getLength()`,
which is `3` in the reproducer.
Hadoop `BytesWritable` separates capacity from logical length. Avro `bytes`
should encode only the logical value, not unused backing-array capacity. The
likely fix is `ByteBuffer.wrap(input.copyBytes())` or
`ByteBuffer.wrap(input.getBytes(), 0, input.getLength())`.
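The difference between the current behavior and the two candidate fixes can be illustrated without Hadoop on the classpath. The following is a minimal sketch, not the actual patch: the class `LogicalLengthDemo` and its method names are hypothetical, and a plain `byte[]` plus an `int length` stand in for `BytesWritable.getBytes()` and `getLength()`. `Arrays.copyOf` is used here as an assumed equivalent of what `copyBytes()` returns (a fresh array of exactly the logical length).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical demo class; a byte[] and an int stand in for
// BytesWritable.getBytes() and BytesWritable.getLength().
class LogicalLengthDemo {

  // Mirrors the current converter: exposes the entire backing array.
  static ByteBuffer wrapWholeArray(byte[] backing) {
    return ByteBuffer.wrap(backing);
  }

  // Candidate fix 1: a view limited to the logical length.
  // No copy is made, so the buffer still aliases the backing array.
  static ByteBuffer wrapLogicalRange(byte[] backing, int length) {
    return ByteBuffer.wrap(backing, 0, length);
  }

  // Candidate fix 2: copy the logical bytes into a right-sized array.
  // Costs an allocation, but later writes to the backing array are not visible.
  static ByteBuffer wrapCopy(byte[] backing, int length) {
    return ByteBuffer.wrap(Arrays.copyOf(backing, length));
  }

  public static void main(String[] args) {
    byte[] backing = { 1, 2, 3, 4, 5 };
    int length = 3; // logical length, as after setSize(3)

    System.out.println(wrapWholeArray(backing).remaining());           // 5
    System.out.println(wrapLogicalRange(backing, length).remaining()); // 3
    System.out.println(wrapCopy(backing, length).remaining());         // 3
  }
}
```

Both candidates report `remaining() == 3`. They differ in aliasing: the range-limited view avoids a copy but reflects later mutations of the reused backing array, while the copy is independent of it. Which trade-off suits the converter depends on whether callers reuse the `BytesWritable` before the `ByteBuffer` is consumed.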