KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998300466
> hence if few nodes are really in bad state (hardware/network issues), the
> plan is to keep re-queueing them until more nodes are getting decommissioned
> than max tracked nodes right?
It's the opposite: the unhealthy nodes will only be re-queued when there are
more nodes being decommissioned than max tracked nodes. Otherwise, if there are
fewer nodes being decommissioned than max tracked nodes, then the unhealthy
nodes will not be re-queued because they do not risk blocking the
decommissioning of queued healthy nodes (i.e. because the queue is empty).
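
The re-queue condition described above can be sketched as follows. This is a minimal illustration, not the code in the pull request: `RequeueSketch`, `Node`, and `requeueUnhealthy` are hypothetical names, and the sketch assumes the number of decommissioning nodes equals tracked plus pending nodes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch; none of these types are Hadoop classes.
final class RequeueSketch {
    /** Stand-in for a decommissioning datanode. */
    static final class Node {
        final String name;
        final boolean healthy;

        Node(String name, boolean healthy) {
            this.name = name;
            this.healthy = healthy;
        }
    }

    /**
     * Moves unhealthy tracked nodes to the back of the pending queue, but only
     * when more nodes are decommissioning than can be tracked at once.
     * Returns the number of nodes that were re-queued.
     */
    static int requeueUnhealthy(List<Node> tracked, Queue<Node> pending, int maxTracked) {
        // Assumption: every decommissioning node is either tracked or pending.
        int numDecommissioning = tracked.size() + pending.size();
        if (numDecommissioning <= maxTracked) {
            // Nothing can be waiting behind the unhealthy nodes, so leave them tracked.
            return 0;
        }
        int requeued = 0;
        for (Iterator<Node> it = tracked.iterator(); it.hasNext();) {
            Node node = it.next();
            if (!node.healthy) {
                it.remove();
                pending.add(node);
                requeued++;
            }
        }
        return requeued;
    }

    public static void main(String[] args) {
        List<Node> tracked = new ArrayList<>(
            Arrays.asList(new Node("dn1", false), new Node("dn2", true)));
        Queue<Node> pending = new ArrayDeque<>();
        pending.add(new Node("dn3", true));
        System.out.println("re-queued: " + requeueUnhealthy(tracked, pending, 2));
    }
}
```

With an empty pending queue and the same `maxTracked`, the method returns 0 and the unhealthy node stays in the tracked set.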
One potential performance impact that comes to mind: if there are, say,
200 unhealthy decommissioning nodes & max tracked nodes = 100, then this may
cause some churn in the queueing/de-queueing process, because on each
DatanodeAdminMonitor tick all 100 tracked nodes will be re-queued & then 100
queued nodes will be de-queued/tracked. Note that this churn (and any
associated performance impact) will only occur when:
- there are more nodes being decommissioned than max tracked nodes
- AND either:
  - number of healthy decommissioning nodes < max tracked nodes
  - number of unhealthy decommissioning nodes > max tracked nodes
The number of re-queued/de-queued nodes per tick can be quantified as:
`numRequeue = numDecommissioning <= numTracked ? 0 : numDeadDecommissioning - (numDecommissioning - numTracked)`
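
As a worked example, the expression above can be evaluated with a small helper. This is only a sketch: `RequeueEstimate` is a hypothetical class, `numTracked` is read here as the max tracked nodes limit, and clamping negative results to zero is an assumption added for the case where the healthy nodes alone fill the tracked set.

```java
// Hypothetical helper; not part of Hadoop.
final class RequeueEstimate {
    /**
     * Estimates how many nodes are re-queued (and de-queued) per monitor tick.
     *
     * @param numDecommissioning     all decommissioning nodes (tracked + queued)
     * @param numTracked             the max tracked nodes limit
     * @param numDeadDecommissioning the unhealthy decommissioning nodes
     */
    static int numRequeue(int numDecommissioning, int numTracked, int numDeadDecommissioning) {
        if (numDecommissioning <= numTracked) {
            // The queue is empty, so nothing is re-queued.
            return 0;
        }
        int estimate = numDeadDecommissioning - (numDecommissioning - numTracked);
        // Assumption: a negative estimate means no churn, so clamp at zero.
        return Math.max(0, estimate);
    }

    public static void main(String[] args) {
        // The scenario from the comment: 200 unhealthy nodes, max tracked nodes = 100.
        System.out.println(numRequeue(200, 100, 200));
    }
}
```

For the 200-node scenario this gives 200 - (200 - 100) = 100 nodes re-queued per tick, matching the description above.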
This churn of queueing/de-queueing will not occur at all under typical
decommissioning scenarios (i.e. where there isn't a large number of dead
decommissioning nodes).
One idea to mitigate this is to have DatanodeAdminMonitor maintain counters
used to track the number of healthy nodes in the pendingNodes queue; then this
count can be used to make an improved re-queue decision. In particular,
unhealthy nodes are only re-queued if there are healthy nodes in the
pendingNodes queue. But this approach has some flaws, for example an unhealthy
node in the queue could come alive again, but then an unhealthy node in the
tracked set wouldn't be re-queued because the healthy queued node count hasn't
been updated. To solve this, we would need to periodically scan the
pendingNodes queue to update the healthy/unhealthy node counts, and this scan
could prove expensive.
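
The stale-counter flaw can be illustrated with a small sketch. Everything here is hypothetical (`PendingHealthCounter`, `Node`, `rescan`); it only models a cached healthy-node count that is updated on enqueue and on a full scan of the queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the counter idea; not part of Hadoop.
final class PendingHealthCounter {
    /** Stand-in for a queued datanode whose health can change over time. */
    static final class Node {
        boolean healthy;

        Node(boolean healthy) {
            this.healthy = healthy;
        }
    }

    private final Deque<Node> pending = new ArrayDeque<>();
    // Cached count of healthy queued nodes; only as fresh as the last update.
    private int healthyPending;

    void enqueue(Node node) {
        pending.add(node);
        if (node.healthy) {
            healthyPending++;
        }
    }

    /** Re-queue unhealthy tracked nodes only if a healthy node is believed to be waiting. */
    boolean shouldRequeueUnhealthy() {
        return healthyPending > 0;
    }

    /** Full scan of the queue to refresh the counter; cost grows with queue length. */
    void rescan() {
        int count = 0;
        for (Node node : pending) {
            if (node.healthy) {
                count++;
            }
        }
        healthyPending = count;
    }

    public static void main(String[] args) {
        PendingHealthCounter counter = new PendingHealthCounter();
        Node node = new Node(false);
        counter.enqueue(node);
        node.healthy = true; // the queued node comes alive again
        System.out.println("before rescan: " + counter.shouldRequeueUnhealthy());
        counter.rescan();
        System.out.println("after rescan: " + counter.shouldRequeueUnhealthy());
    }
}
```

Until `rescan()` runs, the counter still reports no healthy queued nodes, so an unhealthy tracked node would not be re-queued even though a healthy node is waiting.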
> Since unhealthy node getting decommissioned might anyways require some
> sort of retry, shall we requeue them even if the condition is not met (i.e.
> total no of decomm in progress < max tracked nodes) as a limited retries?
If there are fewer nodes being decommissioned than max tracked nodes, then
there are no nodes in the pendingNodes queue & all nodes are being tracked for
decommissioning. Therefore, there is no possibility that any healthy nodes are
blocked in the pendingNodes queue (preventing them from being decommissioned) &
so in my opinion there is no benefit to re-queueing the unhealthy nodes in this
case. Furthermore, this will negatively impact performance through frequent
re-queueing & de-queueing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]