[ 
https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=755620&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-755620
 ]

ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/22 06:44
            Start Date: 12/Apr/22 06:44
    Worklog Time Spent: 10m 
      Work Description: rbalamohan commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r848034895


##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java:
##########
@@ -69,10 +69,9 @@ public static <D> FileReader<D> openReader(SeekableInput in, DatumReader<D> read
       length -= bytesRead;
       offset += bytesRead;
     }
-    in.seek(0);
 
     if (Arrays.equals(MAGIC, magic)) // current format
-      return new DataFileReader<>(in, reader);
+      return newreturn new DataFileReader<>(in, reader, magic);

Review Comment:
   Something went wrong when applying the patch to the branch uploaded for 
this PR. Let me fix it.





Issue Time Tracking
-------------------

    Worklog Id:     (was: 755620)
    Time Spent: 0.5h  (was: 20m)

> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
>                 Key: AVRO-3482
>                 URL: https://issues.apache.org/jira/browse/AVRO-3482
>             Project: Apache Avro
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance, pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>  
> {code}
> byte[] magic = new byte[MAGIC.length];
>     in.seek(0);
>     int offset = 0;
>     int length = magic.length;
>     while (length > 0) {
>       int bytesRead = in.read(magic, offset, length);
>       if (bytesRead < 0)
>         throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>       length -= bytesRead;
>       offset += bytesRead;
>     }
>     in.seek(0); // <--- This forces the input stream to switch to the "random" IO
>                 // policy on its next read in cloud connectors!
>     if (Arrays.equals(MAGIC, magic)) // current format
>       return new DataFileReader<>(in, reader);
>     if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>       return new DataFileReader12<>(in, reader);
>  
> {code}
>  
> With cloud stores, this can be expensive because the stream has to be 
> closed and reopened in cloud connectors (e.g. S3).
> It would help to reuse the MAGIC bytes already read from the input stream and 
> pass them on to DataFileReader / DataFileReader12. The file could then be 
> read sequentially in cloud stores, which reduces the number of IO calls.
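
The proposal above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: it uses a plain `java.io.InputStream` instead of Avro's `SeekableInput`, and the names `MagicReuseSketch`, `SketchReader`, `readMagic` and `open` are hypothetical. The idea is only that the bytes read for the format check are handed to the reader, which validates them and carries on from the current position rather than seeking back to 0.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch; not the real DataFileReader.
final class MagicReuseSketch {
  // Assumed to match Avro's data file magic: 'O', 'b', 'j', version 1.
  static final byte[] MAGIC = {'O', 'b', 'j', 1};

  // Reads exactly MAGIC.length bytes from the current position,
  // looping because read() may return fewer bytes than requested.
  static byte[] readMagic(InputStream in) throws IOException {
    byte[] magic = new byte[MAGIC.length];
    int offset = 0;
    int length = magic.length;
    while (length > 0) {
      int bytesRead = in.read(magic, offset, length);
      if (bytesRead < 0)
        throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
      length -= bytesRead;
      offset += bytesRead;
    }
    return magic;
  }

  // Hypothetical reader that accepts the magic bytes already consumed by the caller.
  static final class SketchReader {
    private final InputStream in;

    SketchReader(InputStream in, byte[] magic) throws IOException {
      // Validate the bytes handed over instead of seeking back and re-reading them.
      if (!Arrays.equals(MAGIC, magic))
        throw new IOException("Not a data file: bad magic");
      this.in = in; // still positioned just past the magic
    }

    // Returns the next byte after the magic, or -1 at end of stream.
    int readNextByte() throws IOException {
      return in.read();
    }
  }

  // Reads the magic once and passes it on; there is no seek back to offset 0.
  static SketchReader open(InputStream in) throws IOException {
    byte[] magic = readMagic(in);
    return new SketchReader(in, magic);
  }
}
```

Because the stream is never repositioned, a connector that tracks access patterns would see only forward reads here; whether a given cloud connector then keeps its sequential policy depends on that connector.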



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
