tommy810pp opened a new issue, #6804:
URL: https://github.com/apache/hudi/issues/6804
**Describe the problem you faced**
We are running a Spark job on AWS Glue 3.0.
Sometimes the job fails with this error:
```
ExecutorLostFailure (executor 3 exited caused by one of the running tasks)
Reason: Remote RPC client disassociated. Likely due to containers exceeding
thresholds, or network issues. Check driver logs for WARN messages.
```
After the failure, upserting records into that partition tries to read a
Parquet file that has already been cleaned up, and it throws an exception:
```
java.io.FileNotFoundException: No such file or directory
's3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet'
```
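To see which write produced the missing file, the file name itself can be split apart. Below is a minimal sketch, assuming the usual Hudi base file naming of `<fileId>_<writeToken>_<instantTime>.parquet`; the helper name `parse_base_file_name` is hypothetical and not part of Hudi.

```python
import re
from typing import NamedTuple


class BaseFileName(NamedTuple):
    file_id: str       # file group id
    write_token: str   # token of the task attempt that wrote the file
    instant_time: str  # timestamp of the instant that wrote the file


# Assumed layout: <fileId>_<writeToken>_<instantTime>.parquet
_BASE_FILE_RE = re.compile(
    r"^(?P<file_id>.+)_(?P<write_token>\d+-\d+-\d+)_(?P<instant_time>\d+)\.parquet$"
)


def parse_base_file_name(path: str) -> BaseFileName:
    """Split a base file path into file id, write token and instant time."""
    name = path.rsplit("/", 1)[-1]  # drop the bucket and partition prefix
    match = _BASE_FILE_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised base file name: {name}")
    return BaseFileName(
        match["file_id"], match["write_token"], match["instant_time"]
    )


if __name__ == "__main__":
    missing = (
        "s3://datalake/datasets/table/daas_date=2022-09/"
        "726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet"
    )
    print(parse_base_file_name(missing))
```

The `instant_time` part (`20220921161214958` here) is the value to look for on the table's timeline under `.hoodie` when checking whether that write completed or was rolled back.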
Is there any way to remove the reference to the already deleted Parquet file
from the Hudi table?
**To Reproduce**
Steps to reproduce the behavior:
1. A Spark job fails with ExecutorLostFailure while upserting records into
the table.
2. Upsert records into the same partition.
**Expected behavior**
After the broken reference is removed, Hudi does not try to read the deleted
Parquet file and ingests the data successfully.
**Environment Description**
* Hudi version : 0.11.1
* Spark version : 3.1.1
* Hive version : Glue Data Catalog
* Hadoop version : 3.0.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : AWS Glue 3.0
**Additional context**
**Stacktrace**
```
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 78 in stage 9.0 failed 4 times, most recent failure: Lost task
78.3 in stage 9.0 (TID 6088) (10.12.32.42 executor 16):
org.apache.hudi.exception.HoodieIOException: Failed to read from Parquet file
s3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet
at
org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:181)
at
org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:196)
at
org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:147)
at
org.apache.hudi.io.HoodieKeyLocationFetchHandle.locations(HoodieKeyLocationFetchHandle.java:62)
at
org.apache.hudi.index.simple.HoodieSimpleIndex.lambda$fetchRecordLocations$33972fb4$1(HoodieSimpleIndex.java:155)
at
org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:117)
at
org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:480)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:486)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:454)
at
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
at
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: No such file or directory
's3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet'
at
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:532)
at
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
at
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
at
org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:178)
... 20 more
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2465)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2414)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2413)
at
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2413)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
at scala.Option.foreach(Option.scala:257)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2679)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2621)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2610)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at
org.apache.spark.rdd.PairRDDFunctions.$anonfun$countByKey$1(PairRDDFunctions.scala:366)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at
org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:366)
at
org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:314)
at
org.apache.hudi.data.HoodieJavaPairRDD.countByKey(HoodieJavaPairRDD.java:104)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.buildProfile(BaseSparkCommitActionExecutor.java:187)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:156)
at
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.execute(BaseSparkCommitActionExecutor.java:85)
at
org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:57)
... 58 more
Caused by: org.apache.hudi.exception.HoodieIOException: Failed to read from
Parquet file
s3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet
at
org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:181)
at
org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:196)
at
org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:147)
at
org.apache.hudi.io.HoodieKeyLocationFetchHandle.locations(HoodieKeyLocationFetchHandle.java:62)
at
org.apache.hudi.index.simple.HoodieSimpleIndex.lambda$fetchRecordLocations$33972fb4$1(HoodieSimpleIndex.java:155)
at
org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:117)
at
org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:480)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:486)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:454)
at
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
at
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: No such file or directory
's3://datalake/datasets/table/daas_date=2022-09/726c988b-4ebd-4b35-9889-15cb1363d867-0_1-23-16379_20220921161214958.parquet'
at
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:532)
at
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
at
org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:300)
at
org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:178)
... 20 more
```