Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/19468
@foxish @mridulm Heads up - since the last review iteration, I wrote an
extra test in `KubernetesClusterSchedulerBackend` that exposed a bug: if
executors never register with the driver but end up in an error state, the
driver doesn't attempt to replace them in subsequent batches. The test case is
[here](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-e56a211862434414dd307a6366d793f0R362).
The fix is to ensure that all executors that hit an error state in our watch
are counted as "disconnected" executors, regardless of whether the driver
endpoint ever otherwise marked them as disconnected - see
[here](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-cb0798d511ec5504fc282407c993d5d4R334).
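For reference, a minimal sketch of the idea (the class, map, and label names here are illustrative, not necessarily the exact ones in the diff):

```scala
import java.util.concurrent.ConcurrentHashMap

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

object ExecutorLossSketch {
  // Illustrative label key; the real constant lives in the PR's constants file.
  val SPARK_EXECUTOR_ID_LABEL = "spark-exec-id"

  // Executors whose pods errored out and must be treated as disconnected so the
  // allocator requests replacements, even if they never registered with the driver.
  val disconnectedPodsByExecutorId = new ConcurrentHashMap[String, Pod]()

  class ExecutorPodsWatcher extends Watcher[Pod] {
    override def eventReceived(action: Action, pod: Pod): Unit = {
      val phase = Option(pod.getStatus).map(_.getPhase).getOrElse("")
      if (action == Action.ERROR || phase == "Failed") {
        // Account the pod as "disconnected" regardless of whether the driver
        // endpoint ever observed the executor disconnecting.
        Option(pod.getMetadata.getLabels.get(SPARK_EXECUTOR_ID_LABEL)).foreach { id =>
          disconnectedPodsByExecutorId.putIfAbsent(id, pod)
        }
      }
    }

    override def onClose(cause: KubernetesClientException): Unit = {}
  }
}
```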
Additionally, I was able to remove one of the data structures, which mapped
[executor pod names to executor
IDs](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-cb0798d511ec5504fc282407c993d5d4L58).
Instead, whenever we receive a Pod object, we can look up its executor ID via
the pod's executor ID label, which reduces the amount of state we have to keep
track of.
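Looking up the ID is then just a label read on the pod we already have in hand; a rough version (again, the label key is illustrative):

```scala
import io.fabric8.kubernetes.api.model.Pod

object ExecutorIdLookupSketch {
  // Derive the executor ID directly from the pod's labels instead of keeping a
  // separate pod-name -> executor-ID map in sync with pod lifecycle events.
  def executorIdFromPod(pod: Pod): Option[String] =
    Option(pod.getMetadata)
      .flatMap(m => Option(m.getLabels))
      .flatMap(labels => Option(labels.get("spark-exec-id")))
}
```

The nice part is that there's no extra bookkeeping to keep consistent as pods come and go.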