Github user tdas commented on the issue:
https://github.com/apache/spark/pull/20622
@jose-torres I had a long offline chat with @zsxwing; kudos to him for
catching a corner case in the current solution. The following sequence of
events may occur:
- The query thread starts the epoch tracking thread.
- Before the query thread actually starts the Spark job, the epoch tracking
thread detects some sort of reconfiguration and calls cancelJob, so the
cancellation arrives before there is any job to cancel.
- The query thread then starts the Spark job and blocks on it forever,
because the earlier cancellation had nothing to act on.
Fundamentally, it's not a great setup when one thread starts the jobs and
another thread cancels them. Because of the asynchrony, we have no way of
reasoning about which attempt wins, the start or the cancel. Instead, let's
make sure that we start and cancel in the same thread, so that the ordering
is well defined. Here is an alternate solution:
- The epoch thread ONLY interrupts the query thread. It's not responsible
for any Spark state management (other than the enum state).
- The query thread cancels jobs and stops sources in the `finally` clause.
Race conditions that leave a Spark job uncanceled become much less likely,
because a single thread (the query thread) is responsible for all Spark state
management.
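A minimal sketch of the proposed design, again as a Python toy model rather than the actual Spark code. `ContinuousQuery`, `epoch_tracker`, `run_query`, `FakeScheduler`, and the `State` enum values are all hypothetical. A `threading.Event` stands in for Java's `Thread.interrupt()`: like a thread's interrupt status, the event stays set if it is signaled before the query thread blocks. The model assumes the real blocking call likewise reacts to a pending interrupt. The epoch tracker only flips the state and signals; starting the job, cancelling it, and stopping sources all happen on the query thread, with cleanup in `finally`.

```python
import enum
import threading


class State(enum.Enum):
    # Hypothetical values for the enum state the epoch thread may update.
    ACTIVE = 1
    RECONFIGURING = 2


class FakeScheduler:
    # Only the query thread touches this object, so no lock is needed.
    def __init__(self):
        self.active = set()

    def start_job(self, group):
        self.active.add(group)

    def cancel_job_group(self, group):
        self.active.discard(group)


class ContinuousQuery:
    def __init__(self):
        self.scheduler = FakeScheduler()
        self.state = State.ACTIVE
        self.interrupted = threading.Event()  # stands in for Thread.interrupt()
        self.sources_stopped = False

    def epoch_tracker(self):
        # Updates the enum state and signals the query thread; it never
        # cancels jobs or stops sources itself.
        self.state = State.RECONFIGURING
        self.interrupted.set()

    def run_query(self, group):
        try:
            self.scheduler.start_job(group)
            # Stands in for blocking on the running job; returns at once
            # if the signal was already set before we got here.
            self.interrupted.wait()
        finally:
            # All Spark state cleanup happens on the query thread itself.
            self.scheduler.cancel_job_group(group)
            self.sources_stopped = True


q = ContinuousQuery()
# Replay the bad ordering: the tracker fires before the job is started.
q.epoch_tracker()
query_thread = threading.Thread(target=q.run_query, args=("continuous-query",))
query_thread.start()
query_thread.join()
print(q.scheduler.active, q.sources_stopped)  # set() True
```

Because the signal is sticky and the cancel runs after the start on the same thread, the early-reconfiguration interleaving no longer leaks a job: the query thread starts the job, sees the pending signal, and its `finally` clause cancels what it just started.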