[ 
https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=755599&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-755599
 ]

ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Apr/22 05:04
            Start Date: 12/Apr/22 05:04
    Worklog Time Spent: 10m 
      Work Description: rbalamohan opened a new pull request, #1639:
URL: https://github.com/apache/avro/pull/1639

   DataFileReader reads the magic bytes twice, which causes seek(0) to be
   invoked twice. In cloud object stores, seeking back to 0 causes the
   stream to fall back to the "random" IO policy; the S3A connector for
   S3 is one example. This leads to suboptimal reads in object stores.
   The refactoring in this patch addresses the problem by reusing the
   MAGIC bytes that were already read.
   
   ### Jira
   https://issues.apache.org/jira/browse/AVRO-3482
   
   ### Tests
   
   - Existing test cases cover this refactoring.
   
   ### Commits
   
   
   ### Documentation
   N/A




Issue Time Tracking
-------------------

            Worklog Id:     (was: 755599)
    Remaining Estimate: 0h
            Time Spent: 10m

> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
>                 Key: AVRO-3482
>                 URL: https://issues.apache.org/jira/browse/AVRO-3482
>             Project: Apache Avro
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>  
> {code}
>     byte[] magic = new byte[MAGIC.length];
>     in.seek(0);
>     int offset = 0;
>     int length = magic.length;
>     while (length > 0) {
>       int bytesRead = in.read(magic, offset, length);
>       if (bytesRead < 0)
>         throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>       length -= bytesRead;
>       offset += bytesRead;
>     }
>     // This second seek will force the inputstream to switch to "random" io
>     // policy in next read in cloud connectors!
>     in.seek(0);
>     if (Arrays.equals(MAGIC, magic)) // current format
>       return new DataFileReader<>(in, reader);
>     if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>       return new DataFileReader12<>(in, reader);
>  
> {code}
>  
> With cloud stores, this can be expensive, because cloud connectors (e.g. S3)
> have to close and reopen the stream.
> It would be helpful to reuse the MAGIC bytes already read from the input
> stream and pass them on to DataFileReader / DataFileReader12. This would
> ensure that the file can be read sequentially in cloud stores and would
> reduce the number of IO calls.
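
The idea in the description above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the pull request: the class `MagicReuseSketch` and the helpers `readMagic` and `detectFormat` are hypothetical names, a plain `InputStream` stands in for Avro's seekable input, and the two magic constants are assumed to mirror the values used by Avro's current and 1.2 container formats. The magic bytes are read once, with no seek, and the resulting array is what would be handed on to the reader instead of rewinding the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicReuseSketch {
  // Assumed to mirror Avro's magic for the current container format.
  static final byte[] MAGIC = { 'O', 'b', 'j', 1 };
  // Assumed to mirror the magic of the legacy 1.2 format.
  static final byte[] MAGIC_12 = { 'O', 'b', 'j', 0 };

  /** Reads exactly MAGIC.length bytes from the current position; never seeks. */
  static byte[] readMagic(InputStream in) throws IOException {
    byte[] magic = new byte[MAGIC.length];
    int offset = 0;
    int length = magic.length;
    while (length > 0) {
      int bytesRead = in.read(magic, offset, length);
      if (bytesRead < 0)
        throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
      length -= bytesRead;
      offset += bytesRead;
    }
    return magic;
  }

  /** Decides the format from bytes that were already read (hypothetical helper). */
  static String detectFormat(byte[] magic) {
    if (Arrays.equals(MAGIC, magic))
      return "current";
    if (Arrays.equals(MAGIC_12, magic))
      return "1.2";
    return "unknown";
  }

  public static void main(String[] args) throws IOException {
    byte[] file = { 'O', 'b', 'j', 1, 42, 43 };
    InputStream in = new ByteArrayInputStream(file);
    byte[] magic = readMagic(in);            // single sequential read of the header magic
    System.out.println(detectFormat(magic)); // prints "current"
    System.out.println(in.read());           // prints 42: the stream continues right after the magic
  }
}
```

Because the stream is left positioned just after the magic, a reader that accepts the already-read bytes can continue sequentially without triggering a backward seek.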



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
