tommy810pp opened a new issue, #6804:
URL: https://github.com/apache/hudi/issues/6804

   **Describe the problem you faced**
   We are running a Spark job on AWS Glue 3.0.
   
   Sometimes the job fails with this error:
   ```
   ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
   ```
   
   After the failure, upserting records into that partition tries to read a Parquet file that has already been cleaned up, and throws an exception:
   ```
   java.io.FileNotFoundException: No such file or directory 's3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet'
   ```
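
The name of the missing file can be decoded to find which write produced it. Hudi base file names follow the pattern `<fileId>_<writeToken>_<instantTime>.parquet`, so the instant time can be matched against the entries in the table's `.hoodie` timeline folder. The following is a minimal sketch; the helper name `parse_base_file_name` is hypothetical and is not part of Hudi.

```python
from datetime import datetime


def parse_base_file_name(path):
    """Split a Hudi base file path into file ID, write token and instant time.

    Hypothetical helper for illustration only; it assumes the
    <fileId>_<writeToken>_<instantTime>.parquet naming pattern.
    """
    name = path.rsplit("/", 1)[-1]
    if not name.endswith(".parquet"):
        raise ValueError(f"not a parquet base file: {name}")
    stem = name[: -len(".parquet")]
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected base file name: {name}")
    file_id, write_token, instant = parts
    # The first 14 digits are yyyyMMddHHmmss; any remaining digits are milliseconds.
    written_at = datetime.strptime(instant[:14], "%Y%m%d%H%M%S")
    millis = int(instant[14:]) if len(instant) > 14 else 0
    return {
        "file_id": file_id,
        "write_token": write_token,
        "instant": instant,
        "written_at": written_at,
        "millis": millis,
    }


missing = (
    "s3://datalake/datasets/table/daas_date=2022-09/"
    "726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet"
)
info = parse_base_file_name(missing)
print(info["file_id"], info["write_token"], info["instant"])
```

For the file in this report, the instant is `20220921161214958`, so the timeline entries with that prefix are the ones to inspect.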
   
   Is there any way to remove the reference to an already deleted Parquet file from the Hudi table?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. A Spark job fails with ExecutorLostFailure while upserting records into the table.
   2. Upsert records into the same partition.
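
A minimal sketch of how the upsert in the steps above might be configured from PySpark. The job's actual code is not included in this report, so the function name `build_upsert_options`, the table name, and the record key and precombine field names are hypothetical. The `SIMPLE` index type is taken from the `HoodieSimpleIndex` frames in the stacktrace, and `daas_date` is the partition column visible in the file path.

```python
def build_upsert_options(table_name, record_key_field, precombine_field,
                         partition_field="daas_date"):
    """Return Hudi datasource write options for an upsert.

    Hypothetical helper: the argument values are placeholders, not the
    settings used by the job in this report.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # The path in the error uses the daas_date=2022-09 directory layout.
        "hoodie.datasource.write.hive_style_partitioning": "true",
        # The stacktrace goes through HoodieSimpleIndex.
        "hoodie.index.type": "SIMPLE",
    }


# Placeholder values; replace with the real table and column names.
opts = build_upsert_options("table", "id", "updated_at")

# With a SparkSession and a DataFrame `df` available, the write would look like:
# df.write.format("hudi").options(**opts).mode("append").save("s3://datalake/datasets/table")
print(opts["hoodie.datasource.write.operation"])
```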
   
   **Expected behavior**
   After the broken reference is removed, Hudi no longer tries to read the deleted Parquet file and ingests the data successfully.
   
   **Environment Description**
   
   * Hudi version :
   0.11.1
   
   * Spark version :
   3.1.1
   
   * Hive version :
   Glue Data Catalog
   
   * Hadoop version :
   3.0.0
   
   * Storage (HDFS/S3/GCS..) :
   S3
   
   * Running on Docker? (yes/no) :
   AWS Glue 3.0
   
   **Additional context**
   
   
   **Stacktrace**
   
   ```
   Caused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 78 in stage 9.0 failed 4 times, most recent failure: Lost task 
78.3 in stage 9.0 (TID 6088) (10.12.32.42 executor 16): 
org.apache.hudi.exception.HoodieIOException: Failed to read from Parquet file 
s3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet
        at 
org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:181)
        at 
org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:196)
        at 
org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:147)
        at 
org.apache.hudi.io.HoodieKeyLocationFetchHandle.locations(HoodieKeyLocationFetchHandle.java:62)
        at 
org.apache.hudi.index.simple.HoodieSimpleIndex.lambda$fetchRecordLocations$33972fb4$1(HoodieSimpleIndex.java:155)
        at 
org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:117)
        at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:480)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:486)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:454)
        at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
        at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: java.io.FileNotFoundException: No such file or directory 
's3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet'
        at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:532)
        at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
        at 
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
        at 
org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:178)
        ... 20 more
   
   Driver stacktrace:
        at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2465)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2414)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2413)
        at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
        at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2413)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
        at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
        at scala.Option.foreach(Option.scala:257)
        at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2679)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2621)
        at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2610)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
        at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
        at 
org.apache.spark.rdd.PairRDDFunctions.$anonfun$countByKey$1(PairRDDFunctions.scala:366)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
        at 
org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:366)
        at 
org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:314)
        at 
org.apache.hudi.data.HoodieJavaPairRDD.countByKey(HoodieJavaPairRDD.java:104)
        at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.buildProfile(BaseSparkCommitActionExecutor.java:187)
        at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:156)
        at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:85)
        at 
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:57)
        ... 58 more
   Caused by: org.apache.hudi.exception.HoodieIOException: Failed to read from 
Parquet file 
s3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet
        at 
org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:181)
        at 
org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:196)
        at 
org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:147)
        at 
org.apache.hudi.io.HoodieKeyLocationFetchHandle.locations(HoodieKeyLocationFetchHandle.java:62)
        at 
org.apache.hudi.index.simple.HoodieSimpleIndex.lambda$fetchRecordLocations$33972fb4$1(HoodieSimpleIndex.java:155)
        at 
org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:117)
        at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:480)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:486)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:454)
        at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
        at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: java.io.FileNotFoundException: No such file or directory 
's3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet'
        at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:532)
        at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
        at 
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
        at 
org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:178)
        ... 20 more
   ```
   
   

