Rajesh Balamohan created AVRO-3482:
--------------------------------------

             Summary: DataFileReader should reuse MAGIC data read from inputstream
                 Key: AVRO-3482
                 URL: https://issues.apache.org/jira/browse/AVRO-3482
             Project: Apache Avro
          Issue Type: Bug
            Reporter: Rajesh Balamohan


[https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]

 
{code}
    byte[] magic = new byte[MAGIC.length];
    in.seek(0);
    int offset = 0;
    int length = magic.length;
    while (length > 0) {
      int bytesRead = in.read(magic, offset, length);
      if (bytesRead < 0)
        throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");

      length -= bytesRead;
      offset += bytesRead;
    }
    // This seek will force the inputstream to switch to the "random" io
    // policy on the next read in cloud connectors!
    in.seek(0);

    if (Arrays.equals(MAGIC, magic)) // current format
      return new DataFileReader<>(in, reader);
    if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
      return new DataFileReader12<>(in, reader);

 
{code}
 

With cloud stores, this can turn out to be expensive, because the stream has
to be closed and reopened in cloud connectors (e.g., S3).

It would be helpful to reuse the MAGIC bytes already read from the input
stream and pass them on to DataFileReader / DataFileReader12. This would
ensure that the file can be read sequentially in cloud stores and would help
reduce the number of IO calls.
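
A minimal sketch of the idea, not a patch against Avro: the magic bytes are
read once from a plain InputStream and then used to pick the format, so the
stream stays positioned right after the header and never seeks back to 0. The
class name MagicReuseSketch, the helpers readMagic and detectFormat, and the
local MAGIC / MAGIC_12 constants are hypothetical stand-ins for illustration;
a real change would hand the already-read bytes to DataFileReader /
DataFileReader12 instead of returning a string.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicReuseSketch {
  // Stand-in magic values for this sketch (assumed here for illustration;
  // the real constants are defined inside Avro).
  static final byte[] MAGIC = { 'O', 'b', 'j', 1 };
  static final byte[] MAGIC_12 = { 'O', 'b', 'j', 0 };

  /** Reads exactly MAGIC.length bytes, looping over short reads, with no seek. */
  static byte[] readMagic(InputStream in) throws IOException {
    byte[] magic = new byte[MAGIC.length];
    int offset = 0;
    while (offset < magic.length) {
      int bytesRead = in.read(magic, offset, magic.length - offset);
      if (bytesRead < 0)
        throw new EOFException("Unexpected EOF with " + (magic.length - offset)
            + " bytes remaining to read");
      offset += bytesRead;
    }
    return magic;
  }

  /** Hypothetical dispatch: picks a format from bytes that were already read. */
  static String detectFormat(byte[] magic) {
    if (Arrays.equals(MAGIC, magic)) // current format
      return "current";
    if (Arrays.equals(MAGIC_12, magic)) // 1.2 format
      return "1.2";
    throw new IllegalArgumentException("Not a data file");
  }

  public static void main(String[] args) throws IOException {
    // Four magic bytes followed by one payload byte.
    InputStream in = new ByteArrayInputStream(new byte[] { 'O', 'b', 'j', 1, 42 });
    byte[] magic = readMagic(in);
    System.out.println("format: " + detectFormat(magic));
    // The stream continues sequentially from the byte after the magic.
    System.out.println("next byte: " + in.read());
  }
}
```

Because the caller keeps the bytes it has consumed, the reader that takes over
can validate them in memory and continue from the current position, which is
what keeps the access pattern sequential.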



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
