[jira] [Work logged] (AVRO-3482) DataFileReader should reuse MAGIC data read from inputstream

ASF GitHub Bot (Jira) Mon, 18 Apr 2022 22:40:07 -0700


     [ 
https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=758286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-758286
 ]


ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Apr/22 05:39
            Start Date: 19/Apr/22 05:39
    Worklog Time Spent: 10m 
      Work Description: opwvhk commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r852595847


##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader12.java:
##########
@@ -61,6 +61,7 @@ public DataFileReader12(SeekableInput sin, DatumReader<D> reader) throws IOExcep
     this.in = new DataFileReader.SeekableInputStream(sin);
 
     byte[] magic = new byte[4];
+    in.seek(0); // seek to 0 to read magic header
     in.read(magic);

Review Comment:
   ~If fewer bytes come back, `read()` returns a number other than 4 (which is discarded), and the rest of the array contains zeroes.~
   ~This fails the magic check later.~
   
   The mismatch @rbalamohan mentioned only occurs with network traffic, and 
then only if the first part of the network packet contains an 
application-specific header of up to 1459 bytes, which splits the first 4 
bytes of the Avro file across packets.
   
   Though unlikely, it's probably best to implement `readFully` here as well.
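
A minimal sketch of what such a `readFully` could look like, assuming a plain `java.io.InputStream`. The class name `MagicReader` and the helper itself are hypothetical and not taken from the PR. The loop keeps calling `read` until the buffer is full, so a short first read no longer leaves zeroes in the magic array.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, for illustration only (not part of Avro).
final class MagicReader {
  private MagicReader() {
  }

  /**
   * Fills buf completely from in. A single read() may return fewer bytes
   * than requested, so the loop continues until every byte has arrived.
   */
  static void readFully(InputStream in, byte[] buf) throws IOException {
    int offset = 0;
    while (offset < buf.length) {
      int bytesRead = in.read(buf, offset, buf.length - offset);
      if (bytesRead < 0) {
        // The stream ended before the buffer was filled.
        throw new EOFException(
            "Unexpected EOF with " + (buf.length - offset) + " bytes remaining to read");
      }
      offset += bytesRead;
    }
  }
}
```

With such a helper, the constructor would call `MagicReader.readFully(in, magic)` instead of `in.read(magic)`. For streams wrapped in a `java.io.DataInputStream`, the standard `readFully(byte[])` method offers the same guarantee.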





Issue Time Tracking
-------------------

    Worklog Id:     (was: 758286)
    Time Spent: 2h  (was: 1h 50m)

> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
>                 Key: AVRO-3482
>                 URL: https://issues.apache.org/jira/browse/AVRO-3482
>             Project: Apache Avro
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance, pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>  
> {code}
> byte[] magic = new byte[MAGIC.length];
>     in.seek(0);
>     int offset = 0;
>     int length = magic.length;
>     while (length > 0) {
>       int bytesRead = in.read(magic, offset, length);
>       if (bytesRead < 0)
>         throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>       length -= bytesRead;
>       offset += bytesRead;
>     }
>     in.seek(0); // <--- This forces the input stream to switch to the "random" IO policy on the next read in cloud connectors!
>     if (Arrays.equals(MAGIC, magic)) // current format
>       return new DataFileReader<>(in, reader);
>     if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>       return new DataFileReader12<>(in, reader);
>  
> {code}
>  
> With cloud stores, this can be expensive, because the stream has to be 
> closed and reopened in cloud connectors (e.g. S3).
> It would help to reuse the MAGIC bytes already read from the input stream and 
> pass them on to DataFileReader / DataFileReader12. The file can then be read 
> sequentially in cloud stores, which reduces the number of IO calls.
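
The idea in the description can be sketched roughly as follows. This is an illustration only, not the change made in the PR: the class `MagicDispatchSketch`, the nested `Reader` type, and the `Format` enum are hypothetical, and the two magic values are assumptions that should be checked against the Avro source. The point is that the magic is read once and the stream is handed on without seeking back to 0.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch; none of these names exist in Avro.
final class MagicDispatchSketch {
  // Assumed magic values ("Obj" plus a version byte); verify against Avro.
  static final byte[] MAGIC = { 'O', 'b', 'j', 1 };
  static final byte[] MAGIC_12 = { 'O', 'b', 'j', 0 };

  enum Format { CURRENT, V1_2 }

  /** Stand-in for a reader that is told the magic was already consumed. */
  static final class Reader {
    final Format format;
    final InputStream in; // positioned just past the magic bytes

    Reader(Format format, InputStream in) {
      this.format = format;
      this.in = in;
    }
  }

  static Reader open(InputStream in) throws IOException {
    byte[] magic = new byte[MAGIC.length];
    int offset = 0;
    while (offset < magic.length) {
      int bytesRead = in.read(magic, offset, magic.length - offset);
      if (bytesRead < 0) {
        throw new EOFException("Unexpected EOF while reading magic");
      }
      offset += bytesRead;
    }
    // No seek back to 0: the stream stays sequential and the reader
    // starts from the byte that follows the magic.
    if (Arrays.equals(MAGIC, magic)) {
      return new Reader(Format.CURRENT, in);
    }
    if (Arrays.equals(MAGIC_12, magic)) {
      return new Reader(Format.V1_2, in);
    }
    throw new IOException("Unrecognized magic bytes");
  }
}
```

Because the magic is never re-read, the reader's first read continues directly after it, which is what keeps the access pattern sequential.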



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
