[
https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=762361&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762361
]
ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------
Author: ASF GitHub Bot
Created on: 26/Apr/22 14:51
Start Date: 26/Apr/22 14:51
Worklog Time Spent: 10m
Work Description: thiru-mg commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r858812769
##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java:
##########
@@ -97,18 +97,30 @@ protected DataFileStream(DatumReader<D> reader) throws
IOException {
this.reader = reader;
}
- /** Initialize the stream by reading from its head. */
- void initialize(InputStream in) throws IOException {
- this.header = new Header();
- this.vin = DecoderFactory.get().binaryDecoder(in, vin);
+ byte[] readMagic() throws IOException {
+ if (this.vin == null) {
+ throw new IOException("InputStream is not initialized");
+ }
byte[] magic = new byte[DataFileConstants.MAGIC.length];
try {
vin.readFixed(magic); // read magic
} catch (IOException e) {
throw new IOException("Not an Avro data file.", e);
}
+ return magic;
+ }
+
+ void validateMagic(byte[] magic) throws InvalidAvroMagicException {
if (!Arrays.equals(DataFileConstants.MAGIC, magic))
throw new InvalidAvroMagicException("Not an Avro data file.");
+ }
+
+ /** Initialize the stream by reading from its head. */
+ void initialize(InputStream in, byte[] magic) throws IOException {
Review Comment:
I'm sorry I should have been more complete last time. Can we make these two
`initialize` functions be `private`, too?
Issue Time Tracking
-------------------
Worklog Id: (was: 762361)
Time Spent: 3h 10m (was: 3h)
> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
> Key: AVRO-3482
> URL: https://issues.apache.org/jira/browse/AVRO-3482
> Project: Apache Avro
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance, pull-request-available
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>
> {code}
> byte[] magic = new byte[MAGIC.length];
> in.seek(0);
> int offset = 0;
> int length = magic.length;
> while (length > 0) {
> int bytesRead = in.read(magic, offset, length);
> if (bytesRead < 0)
> throw new EOFException("Unexpected EOF with " + length + " bytes
> remaining to read");
> length -= bytesRead;
> offset += bytesRead;
> }
> in.seek(0); <--- This will force the inputstream to switch to "random" io
> policy in next read in cloud connectors!
> if (Arrays.equals(MAGIC, magic)) // current format
> return new DataFileReader<>(in, reader);
> if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
> return new DataFileReader12<>(in, reader);
>
> {code}
>
> With cloud stores, this can turn out to be expensive as the stream has to be
> closed and reopened in cloud connectors (e.g s3).
> It will be helpful to reuse the MAGIC bytes read from inputstream and pass it
> on to DataFileReader / DataFileReader12. This will ensure that, file can be
> read in sequential manner in cloud stores and help in reducing IO calls.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)