[jira] [Work logged] (AVRO-3482) DataFileReader should reuse MAGIC data read from inputstream

ASF GitHub Bot (Jira) Wed, 20 Apr 2022 13:30:33 -0700


     [ 
https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=759550&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759550
 ]


ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Apr/22 20:29
            Start Date: 20/Apr/22 20:29
    Worklog Time Spent: 10m 
      Work Description: steveloughran commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r854529247


##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java:
##########
@@ -97,18 +97,30 @@ protected DataFileStream(DatumReader<D> reader) throws 
IOException {
     this.reader = reader;
   }
 
-  /** Initialize the stream by reading from its head. */
-  void initialize(InputStream in) throws IOException {
-    this.header = new Header();
-    this.vin = DecoderFactory.get().binaryDecoder(in, vin);
+  byte[] readMagic() throws IOException {
+    if (this.vin == null) {
+      throw new IOException("InputStream is not initialized");
+    }
     byte[] magic = new byte[DataFileConstants.MAGIC.length];
     try {
       vin.readFixed(magic); // read magic
     } catch (IOException e) {
       throw new IOException("Not an Avro data file.", e);

Review Comment:
   > Adding the exception cause adds that message. Let's do this the same as in 
line 108.
   
   yes, if the whole exception history is printed. no if only the first error 
string gets logged, or the chain of causes gets stripped after a couple of 
wrappings (hello hive!). if the filename is known, that's handy to include too





Issue Time Tracking
-------------------

    Worklog Id:     (was: 759550)
    Time Spent: 2h 20m  (was: 2h 10m)

> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
>                 Key: AVRO-3482
>                 URL: https://issues.apache.org/jira/browse/AVRO-3482
>             Project: Apache Avro
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance, pull-request-available
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>  
> {code}
> byte[] magic = new byte[MAGIC.length];
>     in.seek(0);
>     int offset = 0;
>     int length = magic.length;
>     while (length > 0) {
>       int bytesRead = in.read(magic, offset, length);
>       if (bytesRead < 0)
>         throw new EOFException("Unexpected EOF with " + length + " bytes 
> remaining to read");
>       length -= bytesRead;
>       offset += bytesRead;
>     }
>     in.seek(0); <--- This will force the inputstream to switch to "random" io 
> policy in next read in cloud connectors!
>     if (Arrays.equals(MAGIC, magic)) // current format
>       return new DataFileReader<>(in, reader);
>     if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>       return new DataFileReader12<>(in, reader);
>  
> {code}
>  
> With cloud stores, this can turn out to be expensive as the stream has to be 
> closed and reopened in cloud connectors (e.g s3).
> It will be helpful to reuse the MAGIC bytes read from inputstream and pass it 
> on to DataFileReader / DataFileReader12. This will ensure that, file can be 
> read in sequential manner in cloud stores and help in reducing IO calls.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Work logged] (AVRO-3482) DataFileReader should reuse MAGIC data read from inputstream

Reply via email to