Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/19468
@foxish @mridulm Heads up - since the last review iteration, I wrote an
extra test in `KubernetesClusterSchedulerBackend` that exposed a bug: if
executors never register with the driver but end up in an error state, the
driver doesn't attempt to replace them in subsequent batches. The test case is
[here](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-e56a211862434414dd307a6366d793f0R362).
The fix is to ensure that all executors that hit an error state in our watch
are counted as "disconnected" executors, regardless of whether the driver
endpoint ever otherwise marked them as disconnected - see
[here](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-cb0798d511ec5504fc282407c993d5d4R334).
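For reference, a minimal sketch of the idea (the class, map, and label names here are illustrative, not necessarily the exact ones in the diff):

```scala
import java.util.concurrent.ConcurrentHashMap

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

object ExecutorLossSketch {
  // Illustrative label key; the real constant lives in the PR's constants file.
  val SPARK_EXECUTOR_ID_LABEL = "spark-exec-id"

  // Executors whose pods errored out and must be treated as disconnected so the
  // allocator requests replacements, even if they never registered with the driver.
  val disconnectedPodsByExecutorId = new ConcurrentHashMap[String, Pod]()

  class ExecutorPodsWatcher extends Watcher[Pod] {
    override def eventReceived(action: Action, pod: Pod): Unit = {
      val phase = Option(pod.getStatus).map(_.getPhase).getOrElse("")
      if (action == Action.ERROR || phase == "Failed") {
        // Account the pod as "disconnected" regardless of whether the driver
        // endpoint ever observed the executor disconnecting.
        Option(pod.getMetadata.getLabels.get(SPARK_EXECUTOR_ID_LABEL)).foreach { id =>
          disconnectedPodsByExecutorId.putIfAbsent(id, pod)
        }
      }
    }

    override def onClose(cause: KubernetesClientException): Unit = {}
  }
}
```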
Additionally, I was able to remove one of the data structures, which mapped
[executor pod names to executor
IDs](https://github.com/apache/spark/pull/19468/commits/4b3213422e6e67b11de7b627ad46d4031043be0e#diff-cb0798d511ec5504fc282407c993d5d4L58).
Instead, whenever we receive a Pod object, we can look up its executor ID via
the pod's executor ID label, which reduces the amount of state we have to keep
track of.
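Looking up the ID is then just a label read on the pod we already have in hand; a rough version (again, the label key is illustrative):

```scala
import io.fabric8.kubernetes.api.model.Pod

object ExecutorIdLookupSketch {
  // Derive the executor ID directly from the pod's labels instead of keeping a
  // separate pod-name -> executor-ID map in sync with pod lifecycle events.
  def executorIdFromPod(pod: Pod): Option[String] =
    Option(pod.getMetadata)
      .flatMap(m => Option(m.getLabels))
      .flatMap(labels => Option(labels.get("spark-exec-id")))
}
```

The nice part is that there's no extra bookkeeping to keep consistent as pods come and go.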