Davis-Zhang-Onehouse opened a new pull request, #12358:
URL: https://github.com/apache/hudi/pull/12358
### Change Logs

In Hudi, `HoodieSparkEngineContext` overrides the Spark `jobGroupId` with the name of the active Hudi module. When a query cancellation happens, the Spark Thrift Server tries to find active jobs by the SQL statement id it originally set as the job group, finds nothing, and therefore cannot cancel the inflight job. Any user running Spark with Hudi has this issue, OSS included, not just OH. The issue can be simply avoided by not overriding `jobGroupId` with the active module name (see the sketches after the checklist below).

**Initial state:** a Spark job is inflight; Ctrl+C is entered in beeline.

**After the change:** the job group of the Spark tasks and the query statement id are the same, so the inflight job is aborted as soon as the interruption is acknowledged. This can be told by:
1. The failed job has 0 tasks completed, since I stopped the job at the very beginning of its execution.
2. The inflight job ends up as a "Failed Job", and the job execution throws an exception whose cause is an `InterruptedException` raised by `thread.cancel()` (the 1st figure).

Also, no new jobs are scheduled, which is shown by no new jobs appearing after the failed job occurs (the 1st figure). For comparison, the same query run with no interruption completes many more jobs (the 3rd figure).

**Before the change:** delivering the query interruption at the same spot, with a new Spark job inflight, we saw:
- [Same as after] No new jobs were scheduled after the interruption was acknowledged.
- [Diff from after] The inflight job continued to execute until it became a "Completed Job".
- [Diff from after] Each Spark job group id was overridden with the Hudi module currently running, instead of the SQL statement id.

(Screenshots: start of Spark jobs; after cancellation, the interrupted job shown as a "Completed Job".)

**Other coverage:**
- Terminating the beeline connection while a Spark job is inflight: same behavior as Ctrl+C.
- Killing the localhost port-forward (simulating connection loss): same behavior as cancelling the query with Ctrl+C. When Spark loses the connection with the client, the logs show the relevant job being cleaned up:

```
24/11/27 11:01:54 INFO Executor: Executor killed task 0.0 in stage 1.0 (TID 1), reason: Stage cancelled
24/11/27 11:01:54 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (10.0.0.72 executor driver): TaskKilled (Stage cancelled)
24/11/27 11:01:54 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
24/11/27 11:01:55 INFO DAGScheduler: Asked to cancel job group ebecc7ab-14f4-4d17-9060-7b7ed054fe63
24/11/27 11:01:55 WARN SparkExecuteStatementOperation: Ignore exception in terminal state with ebecc7ab-14f4-4d17-9060-7b7ed054fe63:
org.apache.hudi.exception.HoodieException: Failed to instantiate Metadata table  <--- the cause of this exception is thread interruption
```

### Impact

With this change, cancelling a Spark SQL query, whether by disconnecting the client from the network or by pressing Ctrl+C in tools like beeline, kills the entire query including the inflight Spark job. Without this change, a cancelled query does not exit until the current Spark job finishes.

### Risk level (write none, low medium or high below)

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
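For illustration, here is a minimal sketch of the pattern this PR targets. The class and the `setJobStatus` signature mirror Hudi's `HoodieSparkEngineContext`, but this is a hedged reconstruction of the before/after behavior, not the exact diff in this PR:

```java
import org.apache.spark.api.java.JavaSparkContext;

// Hedged sketch of the pattern described above; not the actual Hudi code.
public class JobStatusSketch {

  private final JavaSparkContext jsc;

  public JobStatusSketch(JavaSparkContext jsc) {
    this.jsc = jsc;
  }

  // BEFORE (problematic): overrides the job group id that the Spark Thrift
  // Server set to the SQL statement id. A later cancelJobGroup(statementId)
  // then finds no active jobs to kill.
  public void setJobStatusBefore(String activeModule, String activityDescription) {
    jsc.setJobGroup(activeModule, activityDescription);
  }

  // AFTER (sketch of the fix): annotate the jobs without touching the job
  // group id, e.g. by setting only the job description. The statement id
  // set by the Thrift Server stays in place, so cancellation can find and
  // interrupt the inflight job.
  public void setJobStatusAfter(String activeModule, String activityDescription) {
    jsc.sc().setJobDescription(activeModule + ": " + activityDescription);
  }
}
```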
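And a hedged sketch of the server-side mechanism the fix restores. The method names here are illustrative; the Spark Thrift Server does the equivalent internally when it runs and cancels a statement:

```java
import org.apache.spark.api.java.JavaSparkContext;

public class CancellationSketch {

  // The server tags all jobs of a statement with the statement id,
  // requesting interruptOnCancel so running task threads get interrupted.
  static void runStatement(JavaSparkContext jsc, String statementId) {
    jsc.setJobGroup(statementId, "Running SQL statement", true);
    // ... execute the query; every Spark job submitted from this thread
    // now carries statementId as its job group id, unless something
    // (like the old HoodieSparkEngineContext behavior) overrides it.
  }

  // On Ctrl+C / connection loss, cancellation targets the statement id.
  // If Hudi overrode the group id with its module name, this call finds
  // no matching active jobs and the inflight job keeps running.
  static void cancelStatement(JavaSparkContext jsc, String statementId) {
    jsc.cancelJobGroup(statementId);
  }
}
```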
