Arghya Saha created HADOOP-17755:
------------------------------------

             Summary: EOF reached error reading ORC file on S3A
                 Key: HADOOP-17755
                 URL: https://issues.apache.org/jira/browse/HADOOP-17755
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 3.2.0
         Environment: Hadoop 3.2.0
            Reporter: Arghya Saha


Hi, I am trying to do some transformations using Spark 3.1.1 with Hadoop 3.2 on K8s, 
reading via s3a.

I have around 700 GB of data to read and around 200 executors (5 vCores and 30 GB 
memory each).
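
For context, the session setup looks roughly like the sketch below (the s3a property 
values shown are illustrative placeholders, not a confirmed configuration of this job):

{code:scala}
import org.apache.spark.sql.SparkSession

// Rough sketch of the session setup; only standard spark.hadoop.fs.s3a.* keys
// are shown, with placeholder values.
val spark = SparkSession.builder()
  .appName("orc-transform")
  // S3A filesystem for s3a:// paths (hadoop-aws must be on the classpath)
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  // Default S3A input policy; "sequential"/"random" change ranged-read behaviour
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "normal")
  .getOrCreate()

val df = spark.read.orc("s3a://<bucket-with-prefix>/")
{code}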

The problematic stage (Scan orc => Filter => Project) reads most of the files 
successfully, but fails on a few files at the end with the error below.

I am able to read and rewrite the specific file mentioned, which suggests the 
file is not corrupted.
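
The verification was roughly the following (the rewrite-test output prefix is a 
throwaway placeholder):

{code:scala}
// Read just the failing file and write it back out; both steps succeed,
// so the object itself appears readable end to end.
val one = spark.read.orc(
  "s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc")
one.count()  // forces a full scan of every stripe
one.write.mode("overwrite").orc("s3a://<bucket-with-prefix>/rewrite-test/")
{code}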

Let me know if further information is required.

 
{code:java}
java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
    at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
    at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.EOFException: End of file reached before reading fully.
    at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
    at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
    at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
    at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
    at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
    at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
    at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
    at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
    ... 20 more
{code}
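
The failing frame is S3AInputStream.readFully, which ORC calls through 
FSDataInputStream.readFully when fetching stripe disk ranges. If useful, the same 
positioned-read path can be probed outside Spark with roughly the sketch below 
(the read offset and buffer length are made-up illustrative values):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Standalone probe of the same positioned-read API that the ORC reader
// invokes (FSDataInputStream.readFully -> S3AInputStream.readFully).
val conf = new Configuration()
val path = new Path(
  "s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc")
val fs = path.getFileSystem(conf)

val status = fs.getFileStatus(path)
val in = fs.open(path)
try {
  // Illustrative range near the tail of the file; an EOFException here would
  // mean S3A returned fewer bytes than the advertised file length.
  val buf = new Array[Byte](1024)
  in.readFully(status.getLen - buf.length, buf)
} finally {
  in.close()
}
{code}

If this also fails on the affected objects, it would point at the read path rather 
than the ORC file contents, consistent with the rewrite test above.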
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
