[
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688857#comment-17688857
]
ASF GitHub Bot commented on HDFS-16918:
---------------------------------------
virajjasani commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430828452
> If the datanode is connected to observer namenode, it can serve requests,
why we need to shutdown
The observer namenode is a different case. I was actually thinking
about making this include the observer namenode too, i.e. if the datanode has
not received a heartbeat from the observer or active namenode in the last
e.g. 30s or so, then it should shut down. This is an option; no issues with it.
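Roughly, the check being described would look something like the sketch below.
This is only an illustration and not the PR code; the class name, time source,
and semantics of the timeout are my own assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch only (not the PR implementation): tracks the last time a heartbeat
 * response was received from an ACTIVE or OBSERVER namenode and decides
 * whether the datanode should shut itself down once that age exceeds a
 * configured, opt-in timeout. All names here are hypothetical.
 */
public class HeartbeatLivenessMonitor {

  private final long timeoutMs; // opt-in: <= 0 means the feature is disabled
  private final AtomicLong lastGoodHeartbeatMs =
      new AtomicLong(System.currentTimeMillis());

  public HeartbeatLivenessMonitor(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /** Call whenever a heartbeat response arrives from ACTIVE or OBSERVER. */
  public void recordGoodHeartbeat() {
    lastGoodHeartbeatMs.set(System.currentTimeMillis());
  }

  /** True only if the opt-in timeout elapsed without hearing from either. */
  public boolean shouldShutdown() {
    if (timeoutMs <= 0) {
      return false;
    }
    return System.currentTimeMillis() - lastGoodHeartbeatMs.get() > timeoutMs;
  }
}
```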
> Even if it is connected to standby, a failover happens and it will be in
good shape, else if you restart a bunch of datanodes, the new namenode will be
flooded by block reports and just increasing problems.
This problem would occur only if we select a fairly low value. The
recommendation is to set this config value high enough to include extra time
for a namenode failover.
> If something gets messed up with Active namenode, you shutdown all, the BR
are already heavy, you forced all other namenodes to handle them again, making
failover more difficult. and if it is some faulty datanodes which lost
connection, you didn't get that alarmed, and all Standby and Observers will
keep on getting flooded by BRs, so in case Active NN literally dies and tries
to failover to any of the Namenode which these Datanodes were connected, will
be fed with unnecessary loads of BlockReports. (BR has an option of initial
delay as well, it isn't like all bombard at once and you are sorted in 5-10
mins)
The moment the active namenode becomes unhealthy or dies is exactly when
the availability of the HDFS service can be impacted. So either the observer
namenode takes care of read requests in the meantime, or the failover needs to
happen. If neither of those happens, it's the datanode that is not really
useful by staying in the cluster for a longer duration. Let's say the namenode
goes bad and the failover does take time; the new active namenode is going to
take time processing BRs anyway, right?
> If something got messed with the datanode, that is why it isn't able to
connect to Active. If something is in Memory not persisted to disk, or some JMX
parameter or N/W parameters which can be used to figure out things gets lost.
Do you mean the hsync vs hflush kind of thing for in-progress files? Is that
not already taken care of?
> That is the reason most cluster administrator in not so cool situations,
show XYZ datanode is unhealthy or not, if in some case they don't it should be
handled over there.
The response from the cluster admin applications would take time. Why not
let the datanode auto-heal itself? Also, it's not that this change is going to
terminate the datanode abruptly; it is going to shut down properly.
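To make the "shut down properly" point concrete, here is a minimal sketch of
an orderly stop as opposed to killing the process outright. The component list
is a hypothetical stand-in, not the datanode's actual shutdown path.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

/**
 * Sketch only: illustrates "shut down properly" vs. abrupt termination.
 * The component list is a hypothetical stand-in for the datanode's real
 * services (data transfer server, IPC server, block pool actors, ...).
 */
final class GracefulShutdownSketch {

  static void gracefulShutdown(List<Closeable> componentsInStopOrder) {
    // Stop accepting new work first, then close the remaining components,
    // letting each one drain/flush instead of dropping work on the floor.
    for (Closeable c : componentsInStopOrder) {
      try {
        c.close();
      } catch (IOException e) {
        // Log and keep going so one bad component doesn't block shutdown.
        System.err.println("Failed to close " + c + ": " + e);
      }
    }
    // Only after an orderly stop does the process exit.
    System.exit(0);
  }
}
```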
> In case of shared datanodes in a federated setup, say it is connected to
Active for one Namespace and has completely lost touch with another, then?
Restart to get both working? Don't restart so that at least one stays working?
Both are correct in their own ways and situations, and the datanode shouldn't be
in a state to decide its fate for such reasons.
IMO any namespace that is not connected to the active namenode is not up for
serving requests from the active namenode and hence is not in a good state. I
get your point, but in a federated setup the health of a datanode should be
determined based on whether all BPs are connected to their active namenodes;
is that not the real factor determining the health of the datanode?
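As a sketch of that "all BPs connected to active" view of health in a
federated setup (self-contained, hypothetical types rather than the real
per-block-pool bookkeeping in the datanode):

```java
import java.util.List;

/**
 * Sketch only: in a federated setup the datanode registers with one block
 * pool per nameservice. The idea discussed above is that overall health
 * requires every block pool to have heard from its active namenode recently.
 * BlockPoolState is a hypothetical stand-in for the real bookkeeping.
 */
final class FederatedHealthCheck {

  /** Hypothetical per-namespace view of the last active-NN heartbeat. */
  static final class BlockPoolState {
    final String blockPoolId;
    final long lastActiveHeartbeatMs;

    BlockPoolState(String blockPoolId, long lastActiveHeartbeatMs) {
      this.blockPoolId = blockPoolId;
      this.lastActiveHeartbeatMs = lastActiveHeartbeatMs;
    }
  }

  /** Healthy only if all block pools heard from their active NN in time. */
  static boolean allBlockPoolsHealthy(List<BlockPoolState> pools,
                                      long nowMs, long timeoutMs) {
    for (BlockPoolState bp : pools) {
      if (nowMs - bp.lastActiveHeartbeatMs > timeoutMs) {
        return false; // this namespace lost touch with its active namenode
      }
    }
    return true;
  }
}
```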
> Making anything configurable doesn't justify having it in. if we are
letting any user to use this via any config as well, then we should be sure
enough it is necessary and good thing to do, we can not say ohh you configured
it, now it is your problem...
I am not making the claim only on the basis of this being made configurable,
but making it configurable is a reasonable way to determine the best course of
action for a given situation. The only recommendation I have is: the user
should be able to have the datanode decide whether it should shut down
gracefully when it has not heard anything from the active or observer
namenode for the past x sec (50/60s or so).
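For illustration only, the opt-in could be surfaced as configuration read via
the standard Hadoop Configuration API; the property names below are
hypothetical placeholders, not keys from the patch.

```java
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

// Sketch only: hypothetical config keys for the opt-in behavior.
public class ShutdownOnLostActiveConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Operators would set these in hdfs-site.xml; defaults keep the feature off.
    boolean enabled = conf.getBoolean(
        "dfs.datanode.shutdown.on.lost.active.enabled", false);
    // getTimeDuration accepts values like "60s"; pick something comfortably
    // larger than the expected failover window.
    long timeoutMs = conf.getTimeDuration(
        "dfs.datanode.shutdown.on.lost.active.timeout", 60_000L,
        TimeUnit.MILLISECONDS);
    System.out.println("enabled=" + enabled + ", timeoutMs=" + timeoutMs);
  }
}
```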
I have tried my best to answer the above questions. Please also take a look
at the Jira/PR description, where this idea has been taken from. We have seen
issues with specific infra where, until the datanodes are manually shut down,
we don't see any hope of improving availability; this has happened multiple
times. Please keep in mind that cluster administrators in cloud-native
environments do not have access to JMX metrics due to security constraints.
Really appreciate all your points and suggestions Ayush, please take a look
again.
> Optionally shut down datanode if it does not stay connected to active namenode
> ------------------------------------------------------------------------------
>
> Key: HDFS-16918
> URL: https://issues.apache.org/jira/browse/HDFS-16918
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
>
> While deploying HDFS on an Envoy proxy setup, depending on the socket timeout
> configured at Envoy, network connection issues or packet loss can be
> observed. All of the envoys basically form a transparent communication mesh in
> which each app sends and receives packets to and from localhost and is
> unaware of the network topology.
> The primary purpose of Envoy is to make the network transparent to
> applications and to help identify network issues reliably. However,
> sometimes such a proxy-based setup can result in socket connection issues
> between the datanode and namenode.
> Many deployment frameworks provide auto-start functionality when any of the
> Hadoop daemons are stopped. If a given datanode does not stay connected to the
> active namenode in the cluster, i.e. does not receive a heartbeat response in
> time from the active namenode (even though the active namenode is not
> terminated), it is not of much use. We should be able to provide configurable
> behavior such that if a given datanode cannot receive a heartbeat response
> from the active namenode within a configurable time duration, it should
> terminate itself to avoid impacting the availability SLA. This is specifically
> helpful when the underlying deployment or observability framework (e.g. K8S)
> can start up the datanode automatically upon its shutdown (unless it is being
> restarted as part of a rolling upgrade) and help the newly brought up datanode
> (in the case of K8s, a new pod with dynamically changing nodes) establish a
> new socket connection to the active and standby namenodes. This should be an
> opt-in behavior and not the default one.