jiangxinqi1995 opened a new issue, #8276:
URL: https://github.com/apache/hudi/issues/8276

   **Describe the problem you faced**
   
   I use Flink CDC to read MySQL data and write it to S3 through Hudi. During checkpointing I often hit `org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold`. Typically a checkpoint failure occurs every 20 minutes. The job runs without problems on a local machine, but the issue appears when running on an EKS cluster.
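   
   The "expired before completing" error in the logs means a checkpoint took longer than Flink's checkpoint timeout, which is 10 minutes by default and matches the 10-minute gap between checkpoint 12 being triggered and expiring. As a possible mitigation (not a root-cause fix), the standard Flink 1.15 checkpointing options in `flink-conf.yaml` can be relaxed so that slow S3 writes or inline compaction do not immediately kill the job; the values below are illustrative, not recommendations:
   
   ```
   # Hedged sketch: standard Flink 1.15 checkpointing options, illustrative values
   execution.checkpointing.interval: 1min
   execution.checkpointing.timeout: 30min                    # default 10min; checkpoint 12 above expired after ~10min
   execution.checkpointing.tolerable-failed-checkpoints: 3   # default 0: a single expired checkpoint fails the job
   ```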
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   
   **Expected behavior**
   
   Checkpoints should complete successfully. Instead, checkpoint 12 expires before completing and the job fails with the following logs:
   
   ```
   2023-03-23 10:23:07,101 INFO  
org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Executor 
executes action [handle write metadata event for instant 20230323101927464] 
success!
   2023-03-23 10:23:07,817 INFO  
org.apache.flink.fs.s3.common.writer.S3Committer             [] - Committing 
reject/savepoint-fbea13-af1aa20f0400/_metadata with MPU ID 
rqLctlP9RnUBjWedNLI1bXhNB32evfVDwi7T1nNz8Gd9gDzFtYDRm615A5MCQivzMav.9yAJeD_Tp36Yp52oHitnmMZ6BqYmJV9G.JcDXEco.czZfXHRzGisnIuVzO._qSlcCQaQpSW.Qx.X0ex3LQ--
   2023-03-23 10:23:08,065 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 11 for job fbea139434199b0095f544fe5c15d25f (747958 bytes, 
checkpointDuration=6668 ms, finalizationTime=0 ms).
   2023-03-23 10:24:06,082 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
checkpoint 12 (type=CheckpointType{name='Checkpoint', 
sharingFilesStrategy=FORWARD_BACKWARD}) @ 1679567046082 for job 
fbea139434199b0095f544fe5c15d25f.
   2023-03-23 10:24:06,765 INFO  
org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Executor 
executes action [taking checkpoint 12] success!
   2023-03-23 10:34:06,083 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 12 
of job fbea139434199b0095f544fe5c15d25f expired before completing.
   2023-03-23 10:34:06,083 WARN  
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to 
trigger or complete checkpoint 12 for job fbea139434199b0095f544fe5c15d25f. (0 
consecutive failed attempts so far)
   org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired 
before completing.
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2143)
 [flink-dist-1.15.3.jar:1.15.3]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
[?:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
 Source) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
[?:?]
        at java.lang.Thread.run(Unknown Source) [?:?]
   2023-03-23 10:34:06,084 INFO  org.apache.flink.runtime.jobmaster.JobMaster   
              [] - Trying to recover from a global failure.
   org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable 
failure threshold.
        at 
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:206)
 ~[flink-dist-1.15.3.jar:1.15.3]
        at 
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:169)
 ~[flink-dist-1.15.3.jar:1.15.3]
        at 
org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:122)
 ~[flink-dist-1.15.3.jar:1.15.3]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2082)
 ~[flink-dist-1.15.3.jar:1.15.3]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2061)
 ~[flink-dist-1.15.3.jar:1.15.3]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:98)
 ~[flink-dist-1.15.3.jar:1.15.3]
        at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2143)
 ~[flink-dist-1.15.3.jar:1.15.3]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
~[?:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
 Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
~[?:?]
        at java.lang.Thread.run(Unknown Source) ~[?:?]
   ```
   
   
![image](https://user-images.githubusercontent.com/86709333/227184418-e018f52a-b6af-4284-be3f-ec8928acc4f0.png)
   
![image](https://user-images.githubusercontent.com/86709333/227187512-763ec0fa-c088-42bb-8633-ade953d8ff9b.png)
   
![image](https://user-images.githubusercontent.com/86709333/227185484-89beb2c7-1e4b-4c93-9d5f-b73a79a0bb00.png)
   
![image](https://user-images.githubusercontent.com/86709333/227185639-a158f340-0d5b-4de3-af9e-07100e2df406.png)
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Flink version : 1.15.3
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes (EKS)
   
   
   **Additional context**
   
   ```
   connector = hudi,
   table.type = MERGE_ON_READ,
   hoodie.clean.async = true,
   hoodie.compact.inline = true,
   hoodie.compact.inline.max.delta.commits = 2,
   hoodie.clean.max.commits = 2,
   hoodie.cleaner.commits.retained = 3,
   hoodie.cleaner.policy = KEEP_LATEST_COMMITS,
   hoodie.parquet.small.file.limit = 104857600,
   clustering.schedule.enabled = true,
   clustering.async.enabled = true,
   hoodie.clustering.inline = true,
   hoodie.clustering.inline.max.commits = 2,
   hoodie.clustering.plan.strategy.max.bytes.per.group = 107374182400,
   hoodie.clustering.plan.strategy.max.num.groups = 1,
   hoodie.datasource.write.recordkey.field = id,installs,rejects,
   path = s3a://xxxxxxxx/xxxxxxx
   ```
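   
   For reference, options like these are typically passed through a Flink SQL `WITH` clause. The DDL below is a hypothetical sketch only: the table name, column types, and schema are assumptions (only the record key fields `id,installs,rejects` and the options appear in this report), and the path placeholder is kept as-is:
   
   ```sql
   -- Hypothetical DDL illustrating how the options above are usually supplied
   CREATE TABLE hudi_sink (
     id BIGINT,
     installs BIGINT,
     rejects BIGINT,
     PRIMARY KEY (id, installs, rejects) NOT ENFORCED
   ) WITH (
     'connector' = 'hudi',
     'path' = 's3a://xxxxxxxx/xxxxxxx',
     'table.type' = 'MERGE_ON_READ',
     'hoodie.compact.inline' = 'true',
     'hoodie.compact.inline.max.delta.commits' = '2',
     'hoodie.clean.async' = 'true'
     -- remaining hoodie.* / clustering.* options as listed above
   );
   ```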
   
   
   **Stacktrace**
   
   See the full stacktrace in the logs above.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
