Wanted to share this information in case anyone else runs into a similar 
problem.

Problem
——————————————
I was getting the following exception during an ORC read:
```text
Caused by: java.io.IOException: Problem opening stripe 0 footer in s3a://<snip>.
 at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:349)
 at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:878)
 at org.apache.iceberg.orc.OrcIterable.newOrcIterator(OrcIterable.java:125)
 ... 24 more
Caused by: java.io.EOFException: End of file reached before reading fully.
 at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
 at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
 at org.apache.orc.impl.RecordReaderUtils.readRanges(RecordReaderUtils.java:417)
 at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:484)
 at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:102)
 at org.apache.orc.impl.reader.StripePlanner.readData(StripePlanner.java:177)
 at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1210)
 at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1250)
 at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1293)
 at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:344)
 ... 26 more
```

For context, the following versions were in use:
* Apache Spark 3.1.2
* Apache Iceberg 0.11.1
* Apache Hadoop 3.2.0
* Apache ORC 1.7.0
* The Iceberg tables were served out of AWS S3 using the S3AFileSystem from 
`hadoop-aws`

The failure was encountered when a join was taking place on two ORC tables. It 
was reproducible, failing on the same set of files each time. However, a 
similar read of either table on its own did not result in this failure.
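For reference, a minimal sketch of the kind of query that triggered the failure. The catalog and table names below are hypothetical placeholders, not the actual tables involved:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JoinRepro {
    public static void main(String[] args) {
        // Spark 3.1.2 with the Iceberg runtime on the classpath; the
        // catalog/table names below are placeholders for illustration only.
        SparkSession spark = SparkSession.builder()
                .appName("orc-join-repro")
                .getOrCreate();

        Dataset<Row> left = spark.table("my_catalog.db.orc_table_a");
        Dataset<Row> right = spark.table("my_catalog.db.orc_table_b");

        // The join forces reads of many stripes across both tables;
        // reading either table on its own did not hit the EOFException.
        long rows = left.join(right, left.col("id").equalTo(right.col("id"))).count();
        System.out.println("joined rows: " + rows);
    }
}
```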

Solution
————————————
We validated that the ORC read planning was correct by enhancing the exception 
message to include the offset and length of the failing read. Once this was 
confirmed, we started exploring the `hadoop-aws` artifact, which contains the 
S3AFileSystem and the S3AInputStream.
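We did that by patching ORC locally; roughly the same diagnostic can be obtained with a small wrapper around the positioned read, as in this sketch. The helper name and structure here are our own illustration, not part of ORC:

```java
import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

public final class DiagnosticReads {
    private DiagnosticReads() {}

    /**
     * Positioned readFully that rethrows EOFException with the requested
     * offset and length, so a bad range (a planning bug) can be told apart
     * from a short read returned by the filesystem (a stream bug).
     */
    public static void readFullyWithContext(FSDataInputStream in, long position,
                                            byte[] buffer, int offset, int length)
            throws IOException {
        try {
            in.readFully(position, buffer, offset, length);
        } catch (EOFException e) {
            throw new EOFException("EOF reading " + length + " bytes at offset "
                    + position + " (buffer offset " + offset + "): " + e.getMessage());
        }
    }
}
```

In our case the planned offsets and lengths were valid, which pointed the investigation at the S3AInputStream rather than at ORC itself.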

We came across [HADOOP-16109][1], which documents the source of this issue 
well. After upgrading Apache Hadoop to 3.2.1, which includes the fix, the 
failure is no longer encountered.
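If you want to confirm which Hadoop version your Spark executors actually picked up (mismatched `hadoop-common` and `hadoop-aws` jars are a common source of surprises), a check like the following works; `VersionInfo` is part of `hadoop-common`:

```java
import org.apache.hadoop.util.VersionInfo;

public class HadoopVersionCheck {
    public static void main(String[] args) {
        // Prints the hadoop-common version on the classpath; it should report
        // 3.2.1 (or later) for the HADOOP-16109 fix to be present. Keep
        // hadoop-aws at the same version as hadoop-common.
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        System.out.println("Built from: " + VersionInfo.getUrl()
                + " -r " + VersionInfo.getRevision());
    }
}
```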

Hope this helps.

Regards,
Pavan

[1]: https://issues.apache.org/jira/browse/HADOOP-16109

