[ https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=757939&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-757939 ]
ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------
Author: ASF GitHub Bot
Created on: 18/Apr/22 15:48
Start Date: 18/Apr/22 15:48
Worklog Time Spent: 10m
Work Description: steveloughran commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r852210974
##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader12.java:
##########
@@ -61,6 +61,7 @@ public DataFileReader12(SeekableInput sin, DatumReader<D> reader) throws IOExcep
     this.in = new DataFileReader.SeekableInputStream(sin);
     byte[] magic = new byte[4];
+    in.seek(0); // seek to 0 to read magic header
     in.read(magic);
Review Comment:
What if fewer bytes come back? Why not use readFully()?
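
To illustrate the concern: a single `read(byte[])` call may return fewer bytes than requested, so the magic buffer can end up only partly filled. A minimal sketch of a read-fully loop follows. The class `MagicReads` and its method are hypothetical helpers written for this note, not part of Avro, and the sketch works on a plain `java.io.InputStream` rather than Avro's own stream classes.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not an Avro class.
final class MagicReads {
  private MagicReads() {}

  /** Fills buf completely, or throws EOFException if the stream ends first. */
  static void readFully(InputStream in, byte[] buf) throws IOException {
    int offset = 0;
    while (offset < buf.length) {
      // read() may legally return fewer bytes than requested, so keep looping.
      int n = in.read(buf, offset, buf.length - offset);
      if (n < 0) {
        throw new EOFException(
            "Unexpected EOF with " + (buf.length - offset) + " bytes remaining to read");
      }
      offset += n;
    }
  }
}
```

For a plain `InputStream`, the JDK's `new DataInputStream(in).readFully(buf)` gives the same guarantee without a hand-written loop.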
##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java:
##########
@@ -97,18 +97,30 @@ protected DataFileStream(DatumReader<D> reader) throws IOException {
     this.reader = reader;
   }
-  /** Initialize the stream by reading from its head. */
-  void initialize(InputStream in) throws IOException {
-    this.header = new Header();
-    this.vin = DecoderFactory.get().binaryDecoder(in, vin);
+  byte[] readMagic() throws IOException {
+    if (this.vin == null) {
+      throw new IOException("InputStream is not initialized");
+    }
     byte[] magic = new byte[DataFileConstants.MAGIC.length];
     try {
       vin.readFixed(magic); // read magic
     } catch (IOException e) {
       throw new IOException("Not an Avro data file.", e);
Review Comment:
It's always nice to include the error string of the nested exception, as there
may be many causes of problems other than the file type.
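
One way this suggestion could look is sketched below. The class `MagicErrors` is a hypothetical stand-in written for this note: it reads from a plain `java.io.InputStream` via `DataInputStream.readFully` instead of Avro's decoder, and takes the magic length as a parameter. The sketch appends `e.toString()` rather than `e.getMessage()`, on the assumption that some exceptions (such as `EOFException`) may carry a null message.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for illustration only; not the code in the PR.
final class MagicErrors {
  private MagicErrors() {}

  /** Reads magicLength bytes; on failure, wraps the cause and surfaces its text. */
  static byte[] readMagic(InputStream in, int magicLength) throws IOException {
    byte[] magic = new byte[magicLength];
    try {
      new DataInputStream(in).readFully(magic); // throws EOFException on a short stream
    } catch (IOException e) {
      // Include the nested exception's text so the real cause is visible
      // even when only the top-level message is logged.
      throw new IOException("Not an Avro data file: " + e, e);
    }
    return magic;
  }
}
```

With this shape, a truncated file reports the underlying `java.io.EOFException` in the message, while the original exception is still attached as the cause.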
Issue Time Tracking
-------------------
Worklog Id: (was: 757939)
Time Spent: 1h (was: 50m)
> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
> Key: AVRO-3482
> URL: https://issues.apache.org/jira/browse/AVRO-3482
> Project: Apache Avro
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Priority: Major
> Labels: performance, pull-request-available
> Time Spent: 1h
> Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>
> {code}
> byte[] magic = new byte[MAGIC.length];
> in.seek(0);
> int offset = 0;
> int length = magic.length;
> while (length > 0) {
>   int bytesRead = in.read(magic, offset, length);
>   if (bytesRead < 0)
>     throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>   length -= bytesRead;
>   offset += bytesRead;
> }
> in.seek(0); // <--- This will force the inputstream to switch to "random" io policy in next read in cloud connectors!
> if (Arrays.equals(MAGIC, magic)) // current format
>   return new DataFileReader<>(in, reader);
> if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>   return new DataFileReader12<>(in, reader);
>
> {code}
>
> With cloud stores, this can be expensive because the stream has to be
> closed and reopened in cloud connectors (e.g. S3).
> It would be helpful to reuse the MAGIC bytes read from the input stream and pass
> them on to DataFileReader / DataFileReader12. This ensures that the file can be
> read sequentially in cloud stores and helps reduce IO calls.
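
The idea in the description can be sketched roughly as follows: read the magic once, decide the format from those bytes, and leave the stream positioned just past them instead of seeking back to 0. Everything here is a hypothetical illustration, not the PR's implementation: the class `SequentialOpen`, the `Format` enum, and the two magic constants are placeholders standing in for `MAGIC` and `DataFileReader12.MAGIC`, and a plain `java.io.InputStream` stands in for Avro's seekable input.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch for illustration only; names and constants are assumptions.
final class SequentialOpen {
  enum Format { CURRENT, V1_2 }

  // Placeholder values standing in for MAGIC and DataFileReader12.MAGIC.
  static final byte[] MAGIC_CURRENT = {'O', 'b', 'j', 1};
  static final byte[] MAGIC_V1_2 = {'O', 'b', 'j', 0};

  private SequentialOpen() {}

  /**
   * Reads the magic exactly once and reports the format. The stream is left
   * positioned just after the magic, so the caller continues reading forward
   * and no second seek(0) is needed.
   */
  static Format detect(InputStream in) throws IOException {
    byte[] magic = new byte[MAGIC_CURRENT.length];
    new DataInputStream(in).readFully(magic);
    if (Arrays.equals(MAGIC_CURRENT, magic)) {
      return Format.CURRENT;
    }
    if (Arrays.equals(MAGIC_V1_2, magic)) {
      return Format.V1_2;
    }
    throw new IOException("Not an Avro data file");
  }
}
```

A reader built this way would validate the already-read bytes rather than re-reading them, which keeps the access pattern strictly sequential.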
--
This message was sent by Atlassian Jira
(v8.20.1#820001)