Davis-Zhang-Onehouse opened a new pull request, #12358:
URL: https://github.com/apache/hudi/pull/12358
### Change Logs

In Hudi, `HoodieSparkEngineContext` overrides the Spark `jobGroupId` with the name of the active Hudi module. When a query cancellation happens, the Spark Thrift Server tries to find active jobs by the SQL statement id it originally set as the job group, finds nothing, and therefore cannot cancel the inflight job. Any user running Spark with Hudi has this issue, OSS included, not just OH. The issue can be simply avoided by not overriding `jobGroupId` with the active module name (see the sketches after the checklist below).

**Initial state:** a Spark job is inflight; Ctrl+C is entered in beeline.

**After the change:** the job group of the Spark tasks and the query statement id are the same, so the inflight job is aborted as soon as the interruption is acknowledged. This can be told by:
1. The failed job has 0 tasks completed, since I stopped the job at the very beginning of its execution.
2. The inflight job ends up as a "Failed Job", and the job execution throws an exception whose cause is an `InterruptedException` raised by `thread.cancel()` (the 1st figure).

Also, no new jobs are scheduled, which is shown by no new jobs appearing after the failed job occurs (the 1st figure). For comparison, the same query run with no interruption completes many more jobs (the 3rd figure).

**Before the change:** delivering the query interruption at the same spot, with a new Spark job inflight, we saw:
- [Same as after] No new jobs were scheduled after the interruption was acknowledged.
- [Diff from after] The inflight job continued to execute until it became a "Completed Job".
- [Diff from after] Each Spark job group id was overridden with the Hudi module currently running, instead of the SQL statement id.

(Screenshots: start of Spark jobs; after cancellation, the interrupted job shown as a "Completed Job".)

**Other coverage:**
- Terminating the beeline connection while a Spark job is inflight: same behavior as Ctrl+C.
- Killing the localhost port-forward (simulating connection loss): same behavior as cancelling the query with Ctrl+C. When Spark loses the connection with the client, the logs show the relevant job being cleaned up:

```
24/11/27 11:01:54 INFO Executor: Executor killed task 0.0 in stage 1.0 (TID 1), reason: Stage cancelled
24/11/27 11:01:54 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (10.0.0.72 executor driver): TaskKilled (Stage cancelled)
24/11/27 11:01:54 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
24/11/27 11:01:55 INFO DAGScheduler: Asked to cancel job group ebecc7ab-14f4-4d17-9060-7b7ed054fe63
24/11/27 11:01:55 WARN SparkExecuteStatementOperation: Ignore exception in terminal state with ebecc7ab-14f4-4d17-9060-7b7ed054fe63:
org.apache.hudi.exception.HoodieException: Failed to instantiate Metadata table  <--- the cause of this exception is thread interruption
```

### Impact

With this change, cancelling a Spark SQL query, whether by disconnecting the client from the network or by pressing Ctrl+C in tools like beeline, kills the entire query including the inflight Spark job. Without this change, a cancelled query does not exit until the current Spark job finishes.

### Risk level (write none, low medium or high below)

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
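For illustration, here is a minimal sketch of the pattern this PR targets. The class and the `setJobStatus` signature mirror Hudi's `HoodieSparkEngineContext`, but this is a hedged reconstruction of the before/after behavior, not the exact diff in this PR:

```java
import org.apache.spark.api.java.JavaSparkContext;

// Hedged sketch of the pattern described above; not the actual Hudi code.
public class JobStatusSketch {

  private final JavaSparkContext jsc;

  public JobStatusSketch(JavaSparkContext jsc) {
    this.jsc = jsc;
  }

  // BEFORE (problematic): overrides the job group id that the Spark Thrift
  // Server set to the SQL statement id. A later cancelJobGroup(statementId)
  // then finds no active jobs to kill.
  public void setJobStatusBefore(String activeModule, String activityDescription) {
    jsc.setJobGroup(activeModule, activityDescription);
  }

  // AFTER (sketch of the fix): annotate the jobs without touching the job
  // group id, e.g. by setting only the job description. The statement id
  // set by the Thrift Server stays in place, so cancellation can find and
  // interrupt the inflight job.
  public void setJobStatusAfter(String activeModule, String activityDescription) {
    jsc.sc().setJobDescription(activeModule + ": " + activityDescription);
  }
}
```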
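And a hedged sketch of the server-side mechanism the fix restores. The method names here are illustrative; the Spark Thrift Server does the equivalent internally when it runs and cancels a statement:

```java
import org.apache.spark.api.java.JavaSparkContext;

public class CancellationSketch {

  // The server tags all jobs of a statement with the statement id,
  // requesting interruptOnCancel so running task threads get interrupted.
  static void runStatement(JavaSparkContext jsc, String statementId) {
    jsc.setJobGroup(statementId, "Running SQL statement", true);
    // ... execute the query; every Spark job submitted from this thread
    // now carries statementId as its job group id, unless something
    // (like the old HoodieSparkEngineContext behavior) overrides it.
  }

  // On Ctrl+C / connection loss, cancellation targets the statement id.
  // If Hudi overrode the group id with its module name, this call finds
  // no matching active jobs and the inflight job keeps running.
  static void cancelStatement(JavaSparkContext jsc, String statementId) {
    jsc.cancelJobGroup(statementId);
  }
}
```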
