[ https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=755599&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-755599 ]
ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------
Author: ASF GitHub Bot
Created on: 12/Apr/22 05:04
Start Date: 12/Apr/22 05:04
Worklog Time Spent: 10m
Work Description: rbalamohan opened a new pull request, #1639:
URL: https://github.com/apache/avro/pull/1639
DataFileReader reads the magic bytes twice, and seek(0) is invoked
twice as a result. In cloud object store connectors, seeking back to 0
causes the stream to fall back to a "random IO policy"; the S3A
connector for S3 is one example. This leads to suboptimal reads in
object stores. The refactoring in this patch addresses this case by
reusing the MAGIC bytes that were already read.
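
A minimal sketch of the idea, not the actual patch: the magic bytes are read once and handed to the reader, so the stream is never rewound. `MagicReuseSketch`, `Reader`, `open` and `firstHeaderByte` are hypothetical names used only for illustration, and a plain `InputStream` stands in for Avro's seekable input.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class MagicReuseSketch {
  // Avro container files start with 'O', 'b', 'j', followed by version byte 1.
  static final byte[] MAGIC = {'O', 'b', 'j', 1};

  /** Reads exactly MAGIC.length bytes from the current position, looping over short reads. */
  static byte[] readMagic(InputStream in) throws IOException {
    byte[] magic = new byte[MAGIC.length];
    int offset = 0;
    while (offset < magic.length) {
      int bytesRead = in.read(magic, offset, magic.length - offset);
      if (bytesRead < 0)
        throw new EOFException("Unexpected EOF with " + (magic.length - offset) + " bytes remaining to read");
      offset += bytesRead;
    }
    return magic;
  }

  /** Hypothetical stand-in for a reader that accepts magic bytes read by the caller. */
  static final class Reader {
    final int firstHeaderByte;

    Reader(InputStream in, byte[] magic) throws IOException {
      // Validate the bytes the caller already read instead of re-reading them.
      if (!Arrays.equals(MAGIC, magic))
        throw new IOException("Not a data file");
      // The stream is positioned just past the magic, so reading continues sequentially.
      firstHeaderByte = in.read();
    }
  }

  /** Reads the magic once and passes it on; there is no seek back to 0. */
  static Reader open(InputStream in) throws IOException {
    byte[] magic = readMagic(in);
    return new Reader(in, magic);
  }

  public static void main(String[] args) throws IOException {
    byte[] file = {'O', 'b', 'j', 1, 42};
    Reader r = open(new ByteArrayInputStream(file));
    System.out.println(r.firstHeaderByte); // prints 42
  }
}
```

Because nothing seeks backwards after the first read, a connector that picks its IO policy from the access pattern would only see forward reads in this sketch.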
### Jira
https://issues.apache.org/jira/browse/AVRO-3482
### Tests
- Existing test cases cover this refactoring.
### Commits
### Documentation
N/A
Issue Time Tracking
-------------------
Worklog Id: (was: 755599)
Remaining Estimate: 0h
Time Spent: 10m
> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
> Key: AVRO-3482
> URL: https://issues.apache.org/jira/browse/AVRO-3482
> Project: Apache Avro
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance
> Time Spent: 10m
> Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>
> {code}
> byte[] magic = new byte[MAGIC.length];
> in.seek(0);
> int offset = 0;
> int length = magic.length;
> while (length > 0) {
>   int bytesRead = in.read(magic, offset, length);
>   if (bytesRead < 0)
>     throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>   length -= bytesRead;
>   offset += bytesRead;
> }
> in.seek(0); // <--- This will force the inputstream to switch to "random" io policy in next read in cloud connectors!
> if (Arrays.equals(MAGIC, magic)) // current format
>   return new DataFileReader<>(in, reader);
> if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>   return new DataFileReader12<>(in, reader);
>
> {code}
>
> With cloud stores, this can turn out to be expensive, as the stream has to be
> closed and reopened in cloud connectors (e.g. S3).
> It would be helpful to reuse the MAGIC bytes read from the inputstream and pass them
> on to DataFileReader / DataFileReader12. This would ensure that the file can be
> read sequentially in cloud stores and help reduce IO calls.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)