[ 
https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=762119&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762119
 ]

ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Apr/22 05:17
            Start Date: 26/Apr/22 05:17
    Worklog Time Spent: 10m 
      Work Description: thiru-mg commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r858275569


##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java:
##########
@@ -97,18 +97,30 @@ protected DataFileStream(DatumReader<D> reader) throws 
IOException {
     this.reader = reader;
   }
 
-  /** Initialize the stream by reading from its head. */
-  void initialize(InputStream in) throws IOException {
-    this.header = new Header();
-    this.vin = DecoderFactory.get().binaryDecoder(in, vin);
+  byte[] readMagic() throws IOException {
+    if (this.vin == null) {
+      throw new IOException("InputStream is not initialized");
+    }
     byte[] magic = new byte[DataFileConstants.MAGIC.length];
     try {
       vin.readFixed(magic); // read magic
     } catch (IOException e) {
       throw new IOException("Not an Avro data file.", e);
     }
+    return magic;
+  }
+
+  void validateMagic(byte[] magic) throws InvalidAvroMagicException {
     if (!Arrays.equals(DataFileConstants.MAGIC, magic))
       throw new InvalidAvroMagicException("Not an Avro data file.");
+  }
+
+  /** Initialize the stream by reading from its head. */
+  void initializeWithMagic(InputStream in, byte[] magic) throws IOException {

Review Comment:
   To my taste, I would have left this function named `initialize`. The reason 
is there is a pattern followed throughout to call the non-constructor 
initializers `initialize()`. But I wouldn't insist on it.



##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java:
##########
@@ -128,15 +127,20 @@ public DataFileReader(File file, DatumReader<D> reader) 
throws IOException {
    * <pre/>
    */
   public DataFileReader(SeekableInput sin, DatumReader<D> reader) throws 
IOException {
-    this(sin, reader, false);
+    this(sin, reader, false, null);
+  }
+
+  public DataFileReader(SeekableInput sin, DatumReader<D> reader, byte[] 
magic) throws IOException {

Review Comment:
   Since the magic is embedded in the file and nobody outside the class (let 
alone the package) is likely to have the magic, it is better to make this 
constructor private. If we come across a valid use case in the future, we can 
then make it public.





Issue Time Tracking
-------------------

    Worklog Id:     (was: 762119)
    Time Spent: 2.5h  (was: 2h 20m)

> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
>                 Key: AVRO-3482
>                 URL: https://issues.apache.org/jira/browse/AVRO-3482
>             Project: Apache Avro
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance, pull-request-available
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>  
> {code}
> byte[] magic = new byte[MAGIC.length];
>     in.seek(0);
>     int offset = 0;
>     int length = magic.length;
>     while (length > 0) {
>       int bytesRead = in.read(magic, offset, length);
>       if (bytesRead < 0)
>         throw new EOFException("Unexpected EOF with " + length + " bytes 
> remaining to read");
>       length -= bytesRead;
>       offset += bytesRead;
>     }
>     in.seek(0); <--- This will force the inputstream to switch to "random" io 
> policy in next read in cloud connectors!
>     if (Arrays.equals(MAGIC, magic)) // current format
>       return new DataFileReader<>(in, reader);
>     if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>       return new DataFileReader12<>(in, reader);
>  
> {code}
>  
> With cloud stores, this can turn out to be expensive as the stream has to be 
> closed and reopened in cloud connectors (e.g s3).
> It will be helpful to reuse the MAGIC bytes read from inputstream and pass it 
> on to DataFileReader / DataFileReader12. This will ensure that, file can be 
> read in sequential manner in cloud stores and help in reducing IO calls.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to