clp007 opened a new issue, #8681:
URL: https://github.com/apache/hudi/issues/8681

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at [email protected].
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A write job into the Hudi table failed, leaving partially written data files in GCS. I then started a new job, but it failed to roll back the data left by the failed write. As a result, the 2023-04-15 partition of this table now contains duplicate data.
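
   Not from the report itself, but one way to confirm the duplicates is to group the partition's rows on Hudi's `_hoodie_record_key` metadata column. The sketch below uses hypothetical in-memory rows in plain Python; in practice the rows would come from a Spark read of the table filtered to `event_date=2023-04-15`:

```python
from collections import Counter

# Hypothetical rows standing in for a Spark read of the affected partition.
# "_hoodie_record_key" and "_hoodie_commit_time" are Hudi's standard
# metadata columns; the values here are made up for illustration.
rows = [
    {"_hoodie_record_key": "k1", "_hoodie_commit_time": "20230415010101000"},
    {"_hoodie_record_key": "k2", "_hoodie_commit_time": "20230415010101000"},
    {"_hoodie_record_key": "k1", "_hoodie_commit_time": "20230417074321965"},
]

# Any record key that appears more than once indicates a duplicate row,
# typically one copy from the failed write and one from the retry.
counts = Counter(r["_hoodie_record_key"] for r in rows)
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)  # -> {'k1': 2}
```

   The equivalent Spark SQL check would group on `_hoodie_record_key` with `HAVING COUNT(*) > 1`.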
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. A write to the Hudi table fails, leaving partially written data files in the GCS partition.
   2. Start a new write job; it fails to roll back the data left by the failed write.
   3. Observe duplicate data in the 2023-04-15 partition.
   
   **Expected behavior**
   
   The new job should roll back the partially written data from the failed commit, leaving no duplicate records in the partition.
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   * Spark version : 3.2.3
   * Hive version :
   * Hadoop version : 3.3.0
   * Storage (HDFS/S3/GCS..) : GCS
   * Running on Docker? (yes/no) : no
   
   **Additional context**
   
   Running on Dataproc Serverless.
   
   **Stacktrace**
   
   10 14:54:16 WARN TaskSetManager: Lost task 70.0 in stage 8.0 (TID 3655) (10.128.1.34 executor 217): org.apache.hudi.exception.HoodieIOException: Failed to read from Parquet file gs://bq-events/hudi_ods/local-news-5749b/events/event_date=2023-04-15/event_name=analysis_referrer/91f0b7de-08c2-4edc-8181-5d4cc4d8b247-0_7-13-2303_20230417074321965.parquet
        at org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:182)
        at org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:197)
        at org.apache.hudi.common.util.ParquetUtils.fetchHoodieKeys(ParquetUtils.java:148)
        at org.apache.hudi.io.HoodieKeyLocationFetchHandle.locations(HoodieKeyLocationFetchHandle.java:61)
        at org.apache.hudi.index.simple.HoodieSimpleIndex.lambda$fetchRecordLocations$33972fb4$1(HoodieSimpleIndex.java:155)
        at org.apache.hudi.data.HoodieJavaRDD.lambda$flatMap$a6598fcb$1(HoodieJavaRDD.java:123)
        at org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
   Caused by: java.io.FileNotFoundException: File not found: gs://bq-events/hudi_ods/local-news-5749b/events/event_date=2023-04-15/event_name=analysis_referrer/91f0b7de-08c2-4edc-8181-5d4cc4d8b247-0_7-13-2303_20230417074321965.parquet
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:961)
        at org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:347)
        at org.apache.hudi.common.util.ParquetUtils.getHoodieKeyIterator(ParquetUtils.java:179)
        ... 20 more
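
   One reading of the trace above (my interpretation, not a confirmed diagnosis): the simple index captures a listing of base files and then reads record keys from each file, so if a rollback or cleaner deletes a partially written file between the listing and the read, the read fails with exactly this `FileNotFoundException`. A minimal stand-in for that race, using local files instead of GCS:

```python
import os
import tempfile

# Simulate a stale file listing: list a file, delete it (as a rollback
# would delete a partially written base file), then try to read it.
caught = False
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "part-0.parquet")
    open(path, "w").close()      # the partially written file
    listing = [path]             # listing captured before the rollback
    os.remove(path)              # rollback deletes the file
    try:
        with open(listing[0]):   # the stale listing now points nowhere
            pass
    except FileNotFoundError:
        caught = True            # same failure mode as the stack trace

print("read failed on stale listing:", caught)  # -> True
```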
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
