[ 
https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=758276&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-758276
 ]

ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Apr/22 04:53
            Start Date: 19/Apr/22 04:53
    Worklog Time Spent: 10m 
      Work Description: opwvhk commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r852594859


##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java:
##########
@@ -97,18 +97,30 @@ protected DataFileStream(DatumReader<D> reader) throws 
IOException {
     this.reader = reader;
   }
 
-  /** Initialize the stream by reading from its head. */
-  void initialize(InputStream in) throws IOException {
-    this.header = new Header();
-    this.vin = DecoderFactory.get().binaryDecoder(in, vin);
+  byte[] readMagic() throws IOException {
+    if (this.vin == null) {
+      throw new IOException("InputStream is not initialized");
+    }
     byte[] magic = new byte[DataFileConstants.MAGIC.length];
     try {
       vin.readFixed(magic); // read magic
     } catch (IOException e) {
       throw new IOException("Not an Avro data file.", e);

Review Comment:
   Adding the exception cause already adds that message. We'll done.



##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java:
##########
@@ -97,18 +97,30 @@ protected DataFileStream(DatumReader<D> reader) throws 
IOException {
     this.reader = reader;
   }
 
-  /** Initialize the stream by reading from its head. */
-  void initialize(InputStream in) throws IOException {
-    this.header = new Header();
-    this.vin = DecoderFactory.get().binaryDecoder(in, vin);
+  byte[] readMagic() throws IOException {
+    if (this.vin == null) {
+      throw new IOException("InputStream is not initialized");
+    }
     byte[] magic = new byte[DataFileConstants.MAGIC.length];
     try {
       vin.readFixed(magic); // read magic
     } catch (IOException e) {
       throw new IOException("Not an Avro data file.", e);

Review Comment:
   Adding the exception cause already adds that message. Well done.





Issue Time Tracking
-------------------

    Worklog Id:     (was: 758276)
    Time Spent: 1h 40m  (was: 1.5h)

> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
>                 Key: AVRO-3482
>                 URL: https://issues.apache.org/jira/browse/AVRO-3482
>             Project: Apache Avro
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance, pull-request-available
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>  
> {code}
> byte[] magic = new byte[MAGIC.length];
>     in.seek(0);
>     int offset = 0;
>     int length = magic.length;
>     while (length > 0) {
>       int bytesRead = in.read(magic, offset, length);
>       if (bytesRead < 0)
>         throw new EOFException("Unexpected EOF with " + length + " bytes 
> remaining to read");
>       length -= bytesRead;
>       offset += bytesRead;
>     }
>     in.seek(0); <--- This will force the inputstream to switch to "random" io 
> policy in next read in cloud connectors!
>     if (Arrays.equals(MAGIC, magic)) // current format
>       return new DataFileReader<>(in, reader);
>     if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>       return new DataFileReader12<>(in, reader);
>  
> {code}
>  
> With cloud stores, this can turn out to be expensive as the stream has to be 
> closed and reopened in cloud connectors (e.g s3).
> It will be helpful to reuse the MAGIC bytes read from inputstream and pass it 
> on to DataFileReader / DataFileReader12. This will ensure that, file can be 
> read in sequential manner in cloud stores and help in reducing IO calls.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to