[https://issues.apache.org/jira/browse/HADOOP-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369235#comment-17369235]

Arghya Saha commented on HADOOP-17755:
--------------------------------------

[[email protected]] Sorry, let me share the information this weekend. We are 
hitting the issue with multiple files, but each run fails at the same file. We 
are working around it by setting fs.s3a.readahead.range to the maximum of all 
our file sizes, which is around 1G for us, so it is not a good long-term fix.
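For reference, a minimal sketch of how the workaround can be applied from 
Spark (assuming properties with the spark.hadoop. prefix are forwarded into 
the Hadoop Configuration used by S3A; the 1G value simply matches our largest 
file and is not a general recommendation):

{code:java}
import org.apache.spark.sql.SparkSession;

// Workaround sketch: raise the S3A readahead range to cover the largest file.
// The "spark.hadoop." prefix copies the property into the Hadoop Configuration.
SparkSession spark = SparkSession.builder()
    .config("spark.hadoop.fs.s3a.readahead.range", "1G") // ~max file size in our data
    .getOrCreate();
{code}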

I noticed a similar issue, HADOOP-16109, on Parquet with the same workaround; 
it is resolved in 3.3.0. I am not sure whether the ORC case is fixed by the 
same change. I am unable to test Spark with Hadoop 3.3.x due to the issue below:
[https://github.com/apache/spark/pull/30135] 

> EOF reached error reading ORC file on S3A
> -----------------------------------------
>
>                 Key: HADOOP-17755
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17755
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.0
>         Environment: Hadoop 3.2.0
>            Reporter: Arghya Saha
>            Priority: Major
>
> Hi, I am trying to do some transformation using Spark 3.1.1-Hadoop 3.2 on K8s, 
> using s3a.
> I have around 700 GB of data to read and around 200 executors (5 vCores and 
> 30 GB each).
> The problematic stage (Scan orc => Filter => Project) reads most of the files 
> successfully but fails on a few files at the end with the error below. The 
> file mentioned in the error is around 140 MB, and all other files are of 
> similar size.
> I am able to read and rewrite the specific file mentioned, which suggests the 
> file is not corrupted; a sketch of that check follows.
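> A minimal sketch of the check (the input path is the file from the error; the 
> output path is illustrative):
>
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
>
> // Hypothetical sketch: read the failing file directly, then rewrite it.
> // If both steps succeed, the file itself is unlikely to be corrupted.
> SparkSession spark = SparkSession.builder().getOrCreate();
> Dataset<Row> df = spark.read().orc(
>     "s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc");
> df.write().mode("overwrite").orc("s3a://<bucket-with-prefix>/rewrite-check/"); // illustrative output path
> {code}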
> Let me know if further information is required.
>  
> {code:java}
> java.io.IOException: Error reading file: s3a://<bucket-with-prefix>/part-00001-5e22a873-82a5-4781-9eb9-473b483396bd.c000.zlib.orc
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1331)
>     at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
>     at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:96)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>     at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:177)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: java.io.EOFException: End of file reached before reading fully.
>     at org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:702)
>     at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
>     at org.apache.orc.impl.RecordReaderUtils.readDiskRanges(RecordReaderUtils.java:566)
>     at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readFileData(RecordReaderUtils.java:285)
>     at org.apache.orc.impl.RecordReaderImpl.readPartialDataStreams(RecordReaderImpl.java:1237)
>     at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1105)
>     at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1256)
>     at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1291)
>     at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1327)
>     ... 20 more
> {code}


