[ https://issues.apache.org/jira/browse/AVRO-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530580#comment-17530580 ]
ASF subversion and git services commented on AVRO-3482:
-------------------------------------------------------
Commit 2839bb76d525f63e63b6652df399c375a5f11e0d in avro's branch
refs/heads/branch-1.11 from Rajesh Balamohan
[ https://gitbox.apache.org/repos/asf?p=avro.git;h=2839bb76d ]
AVRO-3482: Reuse MAGIC in DataFileReader (#1639)
DataFileReader reads the magic bytes twice, so seek(0) is invoked
twice. In cloud object stores, seeking back to 0 causes the connector
to fall back to the "random IO policy"; one example is the S3A
connector for S3. This causes suboptimal reads in object stores.
The refactoring in this patch addresses this case by reusing MAGIC.
> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
> Key: AVRO-3482
> URL: https://issues.apache.org/jira/browse/AVRO-3482
> Project: Apache Avro
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Major
> Labels: performance, pull-request-available
> Fix For: 1.11.1, 1.12.0
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>
> {code}
> byte[] magic = new byte[MAGIC.length];
> in.seek(0);
> int offset = 0;
> int length = magic.length;
> while (length > 0) {
>   int bytesRead = in.read(magic, offset, length);
>   if (bytesRead < 0)
>     throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>   length -= bytesRead;
>   offset += bytesRead;
> }
> // This seek forces the input stream to switch to the "random" IO
> // policy on the next read in cloud connectors!
> in.seek(0);
> if (Arrays.equals(MAGIC, magic)) // current format
>   return new DataFileReader<>(in, reader);
> if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>   return new DataFileReader12<>(in, reader);
>
> {code}
>
> With cloud stores, this can be expensive because the stream has to be
> closed and reopened in cloud connectors (e.g. S3).
> It would be helpful to reuse the MAGIC bytes already read from the input
> stream and pass them on to DataFileReader / DataFileReader12. This ensures
> that the file can be read sequentially in cloud stores and helps reduce
> IO calls.
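
The idea described above can be sketched roughly as follows. This is only
an illustration under stated assumptions, not the code from the patch:
MagicReuseSketch, SketchReader, readMagic and open are hypothetical names,
and a plain java.io.InputStream stands in for Avro's seekable input. The
caller reads the magic bytes once and hands them to the reader, so nothing
needs to seek back to 0.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch: none of these names are taken from the Avro patch.
class MagicReuseSketch {
  // Avro object container files start with the bytes 'O', 'b', 'j', 1.
  static final byte[] MAGIC = {'O', 'b', 'j', 1};

  /** Reads exactly MAGIC.length bytes from the stream, or throws EOFException. */
  static byte[] readMagic(InputStream in) throws IOException {
    byte[] magic = new byte[MAGIC.length];
    int offset = 0;
    int length = magic.length;
    while (length > 0) {
      int bytesRead = in.read(magic, offset, length);
      if (bytesRead < 0)
        throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
      length -= bytesRead;
      offset += bytesRead;
    }
    return magic;
  }

  /** Hypothetical reader that accepts magic bytes the caller has already consumed. */
  static final class SketchReader {
    final InputStream in; // positioned just past the magic bytes

    SketchReader(InputStream in, byte[] magic) throws IOException {
      // Validate the bytes handed in instead of seeking to 0 and re-reading them.
      if (!Arrays.equals(MAGIC, magic))
        throw new IOException("Not a data file.");
      this.in = in;
    }

    /** Returns the next byte after the magic, or -1 at end of stream. */
    int nextByte() throws IOException {
      return in.read();
    }
  }

  /** Opens a reader with one sequential read of the magic and no seek. */
  static SketchReader open(InputStream in) throws IOException {
    byte[] magic = readMagic(in);
    return new SketchReader(in, magic);
  }
}
```

Because the stream is only ever read forward in this sketch, a connector
that tracks access patterns would have no reason to leave its sequential
policy while the header is being parsed.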
--
This message was sent by Atlassian Jira
(v8.20.7#820007)