Github user squito commented on the issue:
https://github.com/apache/spark/pull/21068
> Actually, the only other thing I need to make sure of is that there aren't any delays if we now send the information from the yarn allocator back to the scheduler, since I assume the allocator would then need to get it back again from the scheduler. During that window the yarn allocator could be calling allocate() and updating things, so we need to make sure it gets the most up-to-date blacklist.
> Also, I need to double check, but the blacklist information isn't being sent to the yarn allocator when dynamic allocation is off, right? We would want that to happen.
Yeah, both good points. Actually, don't we want to update the general node blacklist on the yarn allocator even when dynamic allocation is off? I don't think it gets updated at all unless dynamic allocation is on; all the updates seem to originate in `ExecutorAllocationManager`, and the blacklist never actively pushes updates to the yarn allocator. That seems like an existing shortcoming (a rough sketch of what active pushing could look like is below).
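
To make that concrete, here is a purely illustrative sketch of the "push" direction; none of these names (`NodeBlacklistListener`, `DriverNodeBlacklist`, `FakeAllocator`) are real Spark classes, they just stand in for the roles of `BlacklistTracker` and `YarnAllocator`:

```scala
// Hypothetical sketch only: these types are stand-ins, not the actual Spark API.

// A listener the cluster-manager side (e.g. the YARN allocator) could register
// so it hears about node blacklist changes directly from the driver.
trait NodeBlacklistListener {
  def onNodeBlacklistUpdated(blacklistedNodes: Set[String]): Unit
}

// Driver-side tracker that pushes updates whenever a node is blacklisted,
// regardless of whether dynamic allocation is enabled.
class DriverNodeBlacklist {
  private var listeners = List.empty[NodeBlacklistListener]
  private var blacklisted = Set.empty[String]

  def addListener(l: NodeBlacklistListener): Unit = synchronized {
    listeners ::= l
    l.onNodeBlacklistUpdated(blacklisted) // send the current state right away
  }

  def blacklistNode(host: String): Unit = synchronized {
    blacklisted += host
    listeners.foreach(_.onNodeBlacklistUpdated(blacklisted))
  }
}

// Allocator-side stub: it just records the latest blacklist so the next
// allocate() call sees the most up-to-date view (the delay concern above).
class FakeAllocator extends NodeBlacklistListener {
  @volatile private var currentBlacklist = Set.empty[String]
  override def onNodeBlacklistUpdated(nodes: Set[String]): Unit =
    currentBlacklist = nodes
  def allocate(): Unit =
    println(s"allocating, avoiding nodes: $currentBlacklist")
}

object BlacklistPushDemo extends App {
  val tracker = new DriverNodeBlacklist
  val allocator = new FakeAllocator
  tracker.addListener(allocator)
  tracker.blacklistNode("bad-host-1")
  allocator.allocate() // prints: allocating, avoiding nodes: Set(bad-host-1)
}
```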
> do you know if mesos and/or kubernetes can provide this same information?
I don't know about Kubernetes at all. Mesos does provide info when a container fails. I don't think it lets you know the total cluster size, but that should be optional. Btw, node count is never going to be totally sufficient, as the remaining nodes might not actually be able to run your executors (smaller hardware, always taken up by higher-priority applications, other constraints in a framework like Mesos); it's always going to be best effort.
@attilapiros and I discussed this briefly yesterday: an alternative to moving everything into the BlacklistTracker on the driver is to just have some abstract base class, specialized slightly for each cluster manager. Then you could keep the flow like it is here, with the extra blacklisting still living in YarnAllocator.
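
Roughly, the abstract-base-class idea could look something like the sketch below; again, the names (`ClusterManagerBlacklist`, `YarnBlacklist`, `MesosBlacklist`) are hypothetical and only meant to show the shape of the split, not the classes in this PR:

```scala
// Hypothetical sketch of the "abstract base class" idea; names are made up.

// Common blacklist bookkeeping shared by every cluster manager.
abstract class ClusterManagerBlacklist {
  protected var blacklistedNodes = Set.empty[String]

  // Shared logic: record the node, then let the cluster manager react.
  final def blacklistNode(host: String): Unit = {
    blacklistedNodes += host
    onNodeBlacklisted(host)
  }

  // Each cluster manager plugs in its own behavior here.
  protected def onNodeBlacklisted(host: String): Unit
}

// The YARN flavor keeps its extra, allocator-level blacklisting (e.g. from
// allocation failures) local, as in this PR, and forwards the node to the
// resource manager's blacklist request.
class YarnBlacklist extends ClusterManagerBlacklist {
  override protected def onNodeBlacklisted(host: String): Unit =
    println(s"YARN: adding $host to the AM blacklist request")
}

// A Mesos flavor could instead decline offers from that node.
class MesosBlacklist extends ClusterManagerBlacklist {
  override protected def onNodeBlacklisted(host: String): Unit =
    println(s"Mesos: declining offers from $host")
}
```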