Rajesh Balamohan created AVRO-3482:
--------------------------------------
Summary: DataFileReader should reuse MAGIC data read from inputstream
Key: AVRO-3482
URL: https://issues.apache.org/jira/browse/AVRO-3482
Project: Apache Avro
Issue Type: Bug
Reporter: Rajesh Balamohan
[https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
{code}
byte[] magic = new byte[MAGIC.length];
in.seek(0);
int offset = 0;
int length = magic.length;
while (length > 0) {
int bytesRead = in.read(magic, offset, length);
if (bytesRead < 0)
throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
length -= bytesRead;
offset += bytesRead;
}
// This seek will force the input stream to switch to the "random" IO policy
// on the next read in cloud connectors!
in.seek(0);
if (Arrays.equals(MAGIC, magic)) // current format
return new DataFileReader<>(in, reader);
if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
return new DataFileReader12<>(in, reader);
{code}
With cloud stores, this can be expensive, because the cloud connector (e.g. S3)
has to close and reopen the stream.
It would help to reuse the MAGIC bytes already read from the input stream and
pass them on to DataFileReader / DataFileReader12. The file could then be read
sequentially in cloud stores, which reduces the number of IO calls.
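As a rough illustration of the idea (not the actual Avro change), the sketch below reads the magic bytes once and hands them to the reader instead of seeking back to 0. The names MagicReuseSketch, SketchReader, readMagic and open are hypothetical, a plain InputStream stands in for Avro's SeekableInput, and the 1.2 magic value is assumed to be 'O', 'b', 'j', 0. It needs Java 9+ for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicReuseSketch {
    // Magic of the current container format: 'O', 'b', 'j', 1.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};
    // Assumed magic of the 1.2 format: 'O', 'b', 'j', 0.
    static final byte[] MAGIC_12 = {'O', 'b', 'j', 0};

    /** Hypothetical stand-in for DataFileReader / DataFileReader12. */
    static final class SketchReader {
        final String format;
        final byte[] magic;   // bytes already consumed by the caller
        final InputStream in; // positioned just past the magic

        SketchReader(String format, InputStream in, byte[] magic) {
            this.format = format;
            this.in = in;
            this.magic = magic;
        }

        /** Continues reading sequentially from where the magic ended. */
        byte[] readRest() throws IOException {
            return in.readAllBytes();
        }
    }

    /** Reads exactly MAGIC.length bytes from the current stream position. */
    static byte[] readMagic(InputStream in) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        int offset = 0;
        while (offset < magic.length) {
            int bytesRead = in.read(magic, offset, magic.length - offset);
            if (bytesRead < 0) {
                throw new EOFException("Unexpected EOF with "
                    + (magic.length - offset) + " bytes remaining to read");
            }
            offset += bytesRead;
        }
        return magic;
    }

    /** Picks a reader from the magic without rewinding the stream. */
    static SketchReader open(InputStream in) throws IOException {
        byte[] magic = readMagic(in);
        // No seek back to 0 here: the magic bytes are passed along instead.
        if (Arrays.equals(MAGIC, magic)) {
            return new SketchReader("current", in, magic);
        }
        if (Arrays.equals(MAGIC_12, magic)) {
            return new SketchReader("1.2", in, magic);
        }
        throw new IOException("Not a data file: unrecognized magic bytes");
    }

    public static void main(String[] args) throws IOException {
        byte[] file = {'O', 'b', 'j', 1, 'h', 'i'};
        SketchReader reader = open(new ByteArrayInputStream(file));
        System.out.println(reader.format + ", remaining bytes: "
            + reader.readRest().length);
    }
}
```

Since the stream is only ever read forward, a cloud connector would have no reason to leave its sequential read policy. A real change would also need the readers to skip re-reading the magic when parsing the header.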