[
https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=758286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-758286
]
ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Apr/22 05:39
Start Date: 19/Apr/22 05:39
Worklog Time Spent: 10m
Work Description: opwvhk commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r852595847
##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader12.java:
##########
@@ -61,6 +61,7 @@ public DataFileReader12(SeekableInput sin, DatumReader<D> reader) throws IOExcep
this.in = new DataFileReader.SeekableInputStream(sin);
byte[] magic = new byte[4];
+ in.seek(0); // seek to 0 to read magic header
in.read(magic);
Review Comment:
~If fewer bytes come back, `read()` returns a number other than 4 (which is
discarded), and the rest of the array stays zeroed.~
~This fails the magic check later.~
The mismatch @rbalamohan mentioned occurs only with network traffic, and
then only if the first part of the network packet contains an
application-specific header of up to 1459 bytes, so that the first 4 bytes of
the Avro file are fragmented across packets.
Though unlikely, it's probably best to implement `readFully` here as well.
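
To make the suggestion concrete, here is a minimal sketch of a `readFully` style loop. It is not the Avro implementation: the class name `ReadFullySketch` and the one-byte-per-call test stream are assumptions made for illustration, and the four sample bytes are placeholders for a magic header.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class ReadFullySketch {

  /** Fills buf completely, looping because a single read() may return fewer bytes. */
  public static void readFully(InputStream in, byte[] buf) throws IOException {
    int offset = 0;
    while (offset < buf.length) {
      int n = in.read(buf, offset, buf.length - offset);
      if (n < 0) {
        throw new EOFException(
            "Unexpected EOF with " + (buf.length - offset) + " bytes remaining to read");
      }
      offset += n;
    }
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical stream that hands out at most one byte per call,
    // simulating a header fragmented across network packets.
    InputStream trickle = new ByteArrayInputStream(new byte[] {'O', 'b', 'j', 1}) {
      @Override
      public synchronized int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 1));
      }
    };
    byte[] magic = new byte[4];
    readFully(trickle, magic);
    System.out.println(Arrays.toString(magic)); // prints [79, 98, 106, 1]
  }
}
```

A bare `in.read(magic)` against the same stream would fill only the first byte and silently leave the rest zeroed, which is the failure mode discussed above.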
Issue Time Tracking
-------------------
Worklog Id: (was: 758286)
Time Spent: 2h (was: 1h 50m)
> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
> Key: AVRO-3482
> URL: https://issues.apache.org/jira/browse/AVRO-3482
> Project: Apache Avro
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance, pull-request-available
> Time Spent: 2h
> Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>
> {code}
> byte[] magic = new byte[MAGIC.length];
> in.seek(0);
> int offset = 0;
> int length = magic.length;
> while (length > 0) {
>   int bytesRead = in.read(magic, offset, length);
>   if (bytesRead < 0)
>     throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>   length -= bytesRead;
>   offset += bytesRead;
> }
> in.seek(0); // <--- This will force the inputstream to switch to "random" io
>             // policy in next read in cloud connectors!
> if (Arrays.equals(MAGIC, magic)) // current format
>   return new DataFileReader<>(in, reader);
> if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>   return new DataFileReader12<>(in, reader);
>
> {code}
>
> With cloud stores, this can be expensive because the stream has to be
> closed and reopened in cloud connectors (e.g. S3).
> It would help to reuse the MAGIC bytes already read from the input stream and
> pass them on to DataFileReader / DataFileReader12. This ensures that the file
> can be read sequentially in cloud stores and reduces the number of IO calls.
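
The proposal in the description can be sketched as follows. This is a self-contained illustration, not the Avro API: `MagicReuseSketch`, `SketchReader`, and `open` are hypothetical names, a plain `InputStream` stands in for the seekable input, and the two magic constants are placeholder values. The point is only that the factory reads the magic once and hands the stream on without seeking back to 0.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicReuseSketch {
  // Placeholder magic values, assumed for this sketch only.
  static final byte[] CURRENT_MAGIC = {'O', 'b', 'j', 1};
  static final byte[] LEGACY_MAGIC = {'O', 'b', 'j', 0};

  /** Hypothetical reader that trusts the magic already consumed by the caller. */
  public static final class SketchReader {
    public final String format;
    public final InputStream in; // positioned just past the magic bytes

    SketchReader(String format, InputStream in) {
      this.format = format;
      this.in = in;
    }
  }

  /** Reads the magic once, then dispatches without rewinding the stream. */
  public static SketchReader open(InputStream in) throws IOException {
    byte[] magic = new byte[CURRENT_MAGIC.length];
    int offset = 0;
    while (offset < magic.length) {
      int n = in.read(magic, offset, magic.length - offset);
      if (n < 0) {
        throw new EOFException(
            "Unexpected EOF with " + (magic.length - offset) + " bytes remaining to read");
      }
      offset += n;
    }
    // No seek back to 0 here: the reader continues from the current position,
    // so the access pattern stays sequential.
    if (Arrays.equals(CURRENT_MAGIC, magic)) {
      return new SketchReader("current", in);
    }
    if (Arrays.equals(LEGACY_MAGIC, magic)) {
      return new SketchReader("1.2", in);
    }
    throw new IOException("Not a recognized data file");
  }

  public static void main(String[] args) throws IOException {
    SketchReader r = open(new ByteArrayInputStream(new byte[] {'O', 'b', 'j', 1, 42}));
    System.out.println(r.format + " " + r.in.read()); // prints "current 42"
  }
}
```

In this shape the reader constructors would need to accept a stream that is already past the header, which is the change the issue asks for.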
--
This message was sent by Atlassian Jira
(v8.20.1#820001)