RajasekarSribalan opened a new issue #2098:
URL: https://github.com/apache/hudi/issues/2098


   Hi Team,
   
   We are  getting Parquet not found error while reading a Hudi table from 
Spark.
   
   What we do?
   
   We read a Hudi table from Spark(Select * from table) and do an insert and 
overwrrite on another hive table.Basically we take the snapshot of Hudi table 
and insert it in another table. Last couple of days, we are not able to read 
the entire table from Spark. We are getting bellow error.
   
   Caused by: java.io.FileNotFoundException: File does not exist: 
hdfs://baikal-prod/user/tempuser/hudi/hudipath/tbale_name/343f7bec-e29d-4b1e-a429-463c8efb09fb-0_91-23390-2236222_20200921075346.parquet
       at 
org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
       at 
org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
       at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
       at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
       at 
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
       at 
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
       at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
       at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
       at 
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
       at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
       at 
org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:297)
       at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:246)
       at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245)
       at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:203)
       at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
       at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
       at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
       at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
       at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
       at org.apache.spark.scheduler.Task.run(Task.scala:108)
       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
       at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
   
   
   Note:  Parallely we have Streaming job which populating the source hudi 
table.
   
   How to fix these kind of issues? Pls assist. If there are any Hudi metadata 
problem? It was running perfectly for these many days but not sure why it 
throws file not found error. Whether cleaner didnot work prooperly? Pls help


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to