[
https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565561#comment-14565561
]
Colin Patrick McCabe commented on HDFS-7923:
--------------------------------------------
bq. Missing config key documentation in hdfs-default.xml
added
bq. requestBlockReportLeaseId: empty catch for unregistered node, we could add
some more informative logging rather than relying on the warn below
added
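For the record, the new message is along these lines (a rough sketch, not the exact patch; {{java.util.logging}} stands in for Hadoop's logger, and the {{NodeData}} lookup is simplified):
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Rough sketch only: names and the lookup are simplified stand-ins
// for the patch's BlockReportLeaseManager code.
class LeaseRequestLoggingSketch {
  private static final Logger LOG =
      Logger.getLogger(LeaseRequestLoggingSketch.class.getName());

  static class NodeData { long leaseId; }

  private final Map<String, NodeData> nodes = new HashMap<>();
  private long nextId = 1;

  synchronized long requestLease(String datanodeUuid) {
    NodeData node = nodes.get(datanodeUuid);
    if (node == null) {
      // Previously an empty branch; now we say why the request is dropped.
      LOG.warning("Can't grant a block report lease to " + datanodeUuid
          + ": the node is not registered with the lease manager.");
      return 0; // lease ID 0 means "no lease granted"
    }
    node.leaseId = nextId++; // rate-limiting / deferral logic elided
    return node.leaseId;
  }
}
{code}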
bq. I discussed the NodeData structure with Colin offline, wondering why we
didn't use a standard Collection. Colin brought up the reason of reducing
garbage, which seems valid. I think we should consider implementing
IntrusiveCollection though rather than writing another.
yes, there will be quite a few of these requests coming in at any given point.
IntrusiveCollection is an interface rather than an implementation, so I don't
think it would help here (it's most useful when an element needs to be in
multiple lists at once, and when you need fancy operations like finding the
list from the element)
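To make the garbage argument concrete, here is a minimal sketch of the embedded-links idea (not the patch's actual {{NodeData}}): because the prev/next pointers live in the element itself, moving a node between the pending and deferred lists allocates nothing, whereas {{java.util.LinkedList}} allocates an internal entry object on every add.
{code:java}
// Minimal sketch, not the real NodeData.
class NodeListSketch {
  static class NodeData {
    final String datanodeUuid;
    NodeData prev, next; // links embedded in the element itself
    NodeData(String uuid) { this.datanodeUuid = uuid; }
  }

  private NodeData head, tail; // one list shown; the manager keeps two

  void addLast(NodeData n) {
    n.prev = tail;
    n.next = null;
    if (tail == null) { head = n; } else { tail.next = n; }
    tail = n;
  }

  void remove(NodeData n) {
    if (n.prev == null) { head = n.next; } else { n.prev.next = n.next; }
    if (n.next == null) { tail = n.prev; } else { n.next.prev = n.prev; }
    n.prev = n.next = null; // node can be re-queued without allocating
  }
}
{code}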
bq. I also asked about putting NodeData into DatanodeDescriptor. Not sure what
the conclusion was on this, it might reduce garbage since we don't need a
separate NodeData object.
The locking is easier to understand if all the lease data is inside
{{BlockReportLeaseManager}}.
bq. I prefer Precondition checks for invalid configuration values at startup,
so there aren't any surprises for the user. Not everyone reads the messages on
startup.
ok
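Something along these lines (a sketch; the key name, helper, and message wording here are illustrative, not the exact patch):
{code:java}
import com.google.common.base.Preconditions;

// Sketch of failing fast on a bad value at startup instead of
// logging and continuing.
class LeaseConfigSketch {
  static final String MAX_LEASES_KEY =
      "dfs.namenode.max.full.block.report.leases";

  static int checkMaxLeases(int configured) {
    Preconditions.checkArgument(configured > 0,
        "Cannot set %s to %s: it must be greater than 0.",
        MAX_LEASES_KEY, configured);
    return configured;
  }
}
{code}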
bq. requestLease has a check for isTraceEnabled, then logs at debug level
fixed
bq. In offerService, we ignore the new leaseID if we already have one. On the
NN though, a new request wipes out the old leaseID, and processReport checks
based on leaseID rather than node. This kind of bug makes me wonder why we
really need the leaseID at all, why not just attach a boolean to the node? Or
if it's in the deferred vs. pending list?
It's safer for the NameNode to wipe the old lease ID every time there is a new
request. It avoids problems where the DN went down while holding a lease, and
then came back up. We could potentially also avoid those problems by being
very careful with node (un)registration, but why make things more complicated
than they need to be? I do think that the DN should overwrite its old lease ID
if the NN gives it a new one, for the same reason. Let me change it to do
that... Of course, this code path should never be taken, since the NN should
never grant a new lease ID when none was requested. So calling this a "bug"
seems like a bit of a stretch.
I prefer lease IDs to simply checking against the datanode UUID, because lease
IDs let us match up the NN granting a lease with the DN accepting and using
it, which is very useful for debugging or for understanding what is happening
in production. They also make it very obvious when a DN is "cheating" by
sending a block report with leaseID = 0 to disable rate-limiting. That is a
use case we want to support, but we also want to know when it is happening.
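To sketch the DN-side rule (illustrative names, not the exact patch):
{code:java}
// Sketch of the DN-side handling described above.
class DnLeaseSketch {
  private long fullBlockReportLeaseId; // 0 == no lease held

  // Called from the heartbeat-response path (offerService in the patch).
  void onHeartbeatResponse(long grantedLeaseId) {
    if (grantedLeaseId != 0) {
      // Adopt the newest grant unconditionally, mirroring the NN wiping
      // the old lease on every new request; this avoids stale-lease
      // problems when a DN goes down holding a lease and comes back up.
      fullBlockReportLeaseId = grantedLeaseId;
    }
  }

  long leaseIdForFullBlockReport(boolean bypassRateLimiting) {
    // Sending leaseId == 0 disables rate-limiting ("cheating"); explicit
    // IDs make it obvious on the NN side when that is happening.
    return bypassRateLimiting ? 0 : fullBlockReportLeaseId;
  }
}
{code}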
bq. Can we fix the javadoc for scheduleBlockReport to mention randomness, and
not "send...at the next heartbeat?" Incorrect right now.
I looked pretty far back into the history of this code. It seems to go back to
at least 2009. The underlying ideas seem to be:
1. the first full block report can have a configurable delay in seconds
expressed by {{dfs.blockreport.initialDelay}}
2. the second full block report gets a random delay between 0 and
{{dfs.blockreport.intervalMsec}}
3. all other block reports get an interval of {{dfs.blockreport.intervalMsec}},
*unless* the previous block report had a longer interval than expected, in
which case the next one gets a shorter interval.
We can keep behavior #1... it's simple to implement and may be useful for
testing (although I think this patch makes it no longer necessary).
Behavior #2 seems like a workaround for the historical lack of congestion
control. In a world where the NN rate-limits full block reports, we don't need
this behavior to prevent FBRs from "clumping": the rate-limiting itself keeps
them from clumping.
Behavior #3 just seems incorrect, even without this patch. By definition, a
full block report contains all the information the NN needs to understand the
DN state. The fact that block report interval N was longer than expected is no
reason to shorten block report interval N+1. In fact, this behavior could lead
to congestion collapse... if the NN gets overloaded and can't handle block
reports for a while, a bunch of DNs will shorten the time between the current
block report and the next one, further increasing total NN load. Not good. Not
good at all.
I replaced this with a simpler scheme: randomize the first block report time
between 0 and {{dfs.blockreport.initialDelay}}, then try to send every
subsequent block report {{dfs.blockreport.intervalMsec}} ms after the previous
one. If the full block report interval was more than 2x what was configured,
we whine about it in the log file (this should only happen if the NN is under
extreme load).
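In code, the new policy amounts to roughly this (a sketch with illustrative names; {{java.util.logging}} stands in for Hadoop's logger):
{code:java}
import java.util.concurrent.ThreadLocalRandom;
import java.util.logging.Logger;

// Sketch of the replacement scheduling policy, not the actual patch.
class BlockReportSchedulerSketch {
  private static final Logger LOG =
      Logger.getLogger(BlockReportSchedulerSketch.class.getName());

  private final long intervalMs; // dfs.blockreport.intervalMsec
  private long lastBlockReportTimeMs;
  private long nextBlockReportTimeMs;

  BlockReportSchedulerSketch(long initialDelayMs, long intervalMs) {
    this.intervalMs = intervalMs;
    // First FBR: uniformly random in [0, dfs.blockreport.initialDelay).
    long jitter = initialDelayMs > 0
        ? ThreadLocalRandom.current().nextLong(initialDelayMs) : 0;
    this.nextBlockReportTimeMs = System.currentTimeMillis() + jitter;
  }

  boolean isBlockReportDue(long nowMs) {
    return nowMs >= nextBlockReportTimeMs;
  }

  // Called right after a full block report has been sent.
  void scheduleNextBlockReport(long nowMs) {
    if (lastBlockReportTimeMs > 0
        && nowMs - lastBlockReportTimeMs > 2 * intervalMs) {
      // Whine, but do NOT shorten the next interval: shortening under
      // load is exactly the congestion-collapse behavior we removed.
      LOG.warning("Last full block report interval was more than 2x the "
          + "configured interval of " + intervalMs + " ms.");
    }
    lastBlockReportTimeMs = nowMs;
    nextBlockReportTimeMs = nowMs + intervalMs;
  }
}
{code}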
bq. Have you thought about moving the BR scheduler to the NN side? We still
rely on the DNs to jitter themselves and do the initial delay, but we could
have the NN handle all this. This would also let the NN trigger FBRs whenever
it wants. We could also do better than random scheduling, i.e. stride it rather
than jitter. Incompatible, so we probably won't, but fun to think about
Yeah, we could do more on the NN to ensure fairness and so forth. I think it's
not really a big problem since datanodes aren't evil, and the existing method
of configuring BR period on the datanode side seems to be working well. We
also tend to assume uniform cluster load in HDFS, an assumption which makes
complex BR scheduling less interesting. But maybe some day...
bq. Could we do the BRLManager register/unregister in addDatanode and
removeDatanode? I think this is safe, since on a DN restart it'll provide a
lease ID of 0 and FBR, even without a reg/unreg.
Seems reasonable. I moved the BRLM register/unregister calls from
registerDatanode into addDatanode / removeDatanode.
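Roughly like this (a sketch of the relocation; the real {{DatanodeManager}} methods do much more, and the stub interface is illustrative):
{code:java}
// Sketch of the relocation; the stub stands in for BlockReportLeaseManager
// and the membership bookkeeping is elided.
class DatanodeManagerSketch {
  interface BrLeaseManager {
    void register(String datanodeUuid);
    void unregister(String datanodeUuid);
  }

  private final BrLeaseManager brLeaseManager;

  DatanodeManagerSketch(BrLeaseManager brLeaseManager) {
    this.brLeaseManager = brLeaseManager;
  }

  void addDatanode(String datanodeUuid) {
    // ... existing membership bookkeeping ...
    brLeaseManager.register(datanodeUuid);
  }

  void removeDatanode(String datanodeUuid) {
    brLeaseManager.unregister(datanodeUuid);
    // ... existing membership bookkeeping ...
    // Safe across DN restarts: a restarted DN sends lease ID 0 and a full
    // block report, so no explicit reg/unreg pairing is required.
  }
}
{code}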
> The DataNodes should rate-limit their full block reports by asking the NN on
> heartbeat messages
> -----------------------------------------------------------------------------------------------
>
> Key: HDFS-7923
> URL: https://issues.apache.org/jira/browse/HDFS-7923
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch,
> HDFS-7923.002.patch, HDFS-7923.003.patch
>
>
> The DataNodes should rate-limit their full block reports. They can do this
> by first sending a heartbeat message to the NN with an optional boolean set
> which requests permission to send a full block report. If the NN responds
> with another optional boolean set, the DN will send an FBR... if not, it will
> wait until later. This can be done compatibly with optional fields.