[ https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=762209&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762209 ]
ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------
Author: ASF GitHub Bot
Created on: 26/Apr/22 10:13
Start Date: 26/Apr/22 10:13
Worklog Time Spent: 10m
Work Description: rbalamohan commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r858536753
##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader12.java:
##########
@@ -61,6 +61,7 @@ public DataFileReader12(SeekableInput sin, DatumReader<D> reader) throws IOExcep
this.in = new DataFileReader.SeekableInputStream(sin);
byte[] magic = new byte[4];
+ in.seek(0); // seek to 0 to read magic header
in.read(magic);
Review Comment:
Since `magic` is already allocated as a 4-byte array, a partial read would
not match the magic anyway, so this shouldn't be an issue.
Issue Time Tracking
-------------------
Worklog Id: (was: 762209)
Time Spent: 2h 40m (was: 2.5h)
> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
> Key: AVRO-3482
> URL: https://issues.apache.org/jira/browse/AVRO-3482
> Project: Apache Avro
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance, pull-request-available
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>
> {code}
> byte[] magic = new byte[MAGIC.length];
> in.seek(0);
> int offset = 0;
> int length = magic.length;
> while (length > 0) {
>   int bytesRead = in.read(magic, offset, length);
>   if (bytesRead < 0)
>     throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>   length -= bytesRead;
>   offset += bytesRead;
> }
> in.seek(0); // <--- This will force the input stream to switch to the "random" IO
>             // policy on the next read in cloud connectors!
> if (Arrays.equals(MAGIC, magic)) // current format
>   return new DataFileReader<>(in, reader);
> if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>   return new DataFileReader12<>(in, reader);
>
> {code}
>
> With cloud stores, this can be expensive because the stream has to be
> closed and reopened in cloud connectors (e.g. S3).
> It would help to reuse the MAGIC bytes already read from the input stream and
> pass them on to DataFileReader / DataFileReader12. The file can then be read
> sequentially in cloud stores, which reduces the number of IO calls.
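
The idea described above could be sketched roughly as follows. This is only an illustration and not the code from the pull request: the class `MagicReuseSketch` and its methods `readMagic` and `openReader` are hypothetical, a plain `InputStream` stands in for Avro's `SeekableInput`, and the two magic values are written out as I understand them from the Avro sources, so treat them as assumptions. The magic is read once and then used to pick the format without seeking back to 0, so the stream stays sequential.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MagicReuseSketch {
  // Assumed magic values ("Obj" plus a version byte); not taken from this thread.
  static final byte[] MAGIC = { (byte) 'O', (byte) 'b', (byte) 'j', (byte) 1 };
  static final byte[] MAGIC_12 = { (byte) 'O', (byte) 'b', (byte) 'j', (byte) 0 };

  /** Reads exactly MAGIC.length bytes, looping until partial reads add up. */
  static byte[] readMagic(InputStream in) throws IOException {
    byte[] magic = new byte[MAGIC.length];
    int offset = 0;
    while (offset < magic.length) {
      int bytesRead = in.read(magic, offset, magic.length - offset);
      if (bytesRead < 0)
        throw new EOFException("Unexpected EOF with " + (magic.length - offset) + " bytes remaining to read");
      offset += bytesRead;
    }
    return magic;
  }

  /**
   * Hypothetical stand-in for the factory method: chooses the format from the
   * bytes already read. There is no seek back to 0, so the stream remains
   * positioned right after the magic header.
   */
  static String openReader(InputStream in) throws IOException {
    byte[] magic = readMagic(in);
    if (Arrays.equals(MAGIC, magic)) // current format
      return "current";
    if (Arrays.equals(MAGIC_12, magic)) // 1.2 format
      return "1.2";
    throw new IOException("Not an Avro data file");
  }

  public static void main(String[] args) throws IOException {
    InputStream in = new ByteArrayInputStream(new byte[] { (byte) 'O', (byte) 'b', (byte) 'j', (byte) 1, (byte) 42 });
    System.out.println(openReader(in)); // format chosen from the pre-read magic
    System.out.println(in.read());      // next byte after the header, read sequentially
  }
}
```

In a real change, the reader constructors would presumably accept the pre-read `magic` array and validate it instead of reading the header again; how that is wired up is up to the pull request.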
--
This message was sent by Atlassian Jira
(v8.20.7#820007)