[ https://issues.apache.org/jira/browse/HADOOP-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran resolved HADOOP-17755.
-------------------------------------
Resolution: Duplicate
> EOF reached error reading ORC file on S3A
> -----------------------------------------
>
> Key: HADOOP-17755
> URL: https://issues.apache.org/jira/browse/HADOOP-17755
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 3.2.0
> Environment: Hadoop 3.2.0
> Reporter: Arghya Saha
> Priority: Major
>
> Hi, I am running a transformation with Spark 3.1.1 (Hadoop 3.2) on Kubernetes, reading via s3a.
> There is around 700 GB of data to read, processed by around 200 executors (5 vCores and 30 GB each).
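> For context, a minimal sketch of how such a session could be configured with the executor layout described above; this is not the actual job, and the app name and data path are placeholders:
> {code:scala}
> // Minimal sketch only: approximates the executor layout described above.
> // App name and data path are placeholders.
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("orc-s3a-transform")
>   .config("spark.executor.instances", "200")  // ~200 executors
>   .config("spark.executor.cores", "5")        // 5 vCores each
>   .config("spark.executor.memory", "30g")     // 30 GB each
>   .getOrCreate()
>
> // ~700 GB of ORC data read through the s3a connector
> val df = spark.read.orc("s3a://<bucket-with-prefix>/")
> {code}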
> The problematic stage (Scan ORC => Filter => Project) reads most of the files successfully, but fails on a few files at the end with the error below. The file named in the error is around 140 MB, and all the other files are of similar size.
> I am able to read and rewrite that specific file on its own (a sketch of such a check follows the trace below), which suggests the file itself is not corrupted.
> Let me know if further information is required.
>
> {code:java}
> java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
>     at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
>     at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.io.EOFException: End of file reached before reading fully.
>     at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
>     at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
>     at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
>     at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
>     at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
>     at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
>     at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
>     at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
>     ... 20 more
> {code}
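> For reference, a minimal sketch of the kind of single-file check mentioned above (assumes an active SparkSession, e.g. {{spark}} in spark-shell; the output path is a placeholder and these are not the exact commands used):
> {code:scala}
> // Read the single file named in the error and write it back out.
> // Assumes an active SparkSession `spark`; output path is a placeholder.
> val path = "s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc"
> val single = spark.read.orc(path)
> single.count()  // forces a full read of the file
> single.write.mode("overwrite").orc("s3a://<bucket-with-prefix>/_verify/")
> {code}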
>
>