[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284681#comment-14284681
 ] 

Colin Patrick McCabe commented on HDFS-7575:
--------------------------------------------

So there are two approaches here:
1. silently (i.e., without user intervention), dedupe duplicate storage IDs 
when starting up the DataNode
2. create a new DataNode layout version and dedupe duplicate storage IDs during 
the upgrade.
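
For concreteness, here is a minimal sketch of what approach #1 boils down to (the class and method names are made up for illustration; this is not the actual DataNode startup code): scan the per-volume storage IDs at startup and regenerate any that collide.

{code:title=DedupeSketch.java (illustrative only)}
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class DedupeSketch {
  /**
   * Walk a volume -> storage ID map and regenerate any ID that has already
   * been seen, so that every volume ends up with a unique ID.  Returns true
   * if anything was rewritten (i.e. VERSION files would need updating).
   */
  static boolean dedupeStorageIds(Map<String, String> storageIdByVolume) {
    Set<String> seen = new HashSet<String>();
    boolean changed = false;
    for (Map.Entry<String, String> e : storageIdByVolume.entrySet()) {
      if (!seen.add(e.getValue())) {
        // Duplicate: silently give this volume a fresh random UUID-based ID.
        String fresh = "DS-" + UUID.randomUUID();
        e.setValue(fresh);
        seen.add(fresh);
        changed = true;
      }
    }
    return changed;
  }
}
{code}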

Arguments in favor of approach #1:
* Collisions might keep happening, so we would need to dedupe repeatedly.  This 
argument seems specious, since the probability of a collision is effectively 
lower than the chance of cosmic rays causing errors (as Nicholas pointed out).  
I think the probabilities outlined here make this argument a non-starter: 
https://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates 
(a back-of-the-envelope calculation follows after this list).  Also, approach 
#1 only dedupes within a single DataNode, but there can be many DataNodes in 
the cluster, so it wouldn't even catch a cross-node collision anyway.

* As Suresh pointed out, the old software can easily handle the case where the 
storage IDs are unique.  So a new layout version is not strictly required in 
order to flip back and forth between old and new software.  While this is true, 
we have bumped the layout version in the past even when the old software could 
handle the new layout.  For example, HDFS-6482 added a new DN layout version 
even though the old software could use the new blockid-based layout.  So this 
argument is basically just saying "approach #1 is viable."  But it doesn't tell 
us whether approach #1 is a good idea.

* Nobody has made this argument yet, but you could argue that the upgrade 
process will be faster with approach #1 than with approach #2.  However, we've 
done DataNode layout version upgrades on production clusters in the past, and 
upgrade time hasn't been an issue.  The JNI hardlink code (and soon, the Java 7 
hardlink code) eliminated the long delays that resulted from spawning shell 
commands.  So I don't think this argument is persuasive.
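
To put a number on the collision point in the first bullet above: a random 
(version 4) UUID has 122 random bits, so the birthday approximation gives 
P(any duplicate) ~= n*(n-1)/2^123 for n IDs cluster-wide.  A quick 
back-of-the-envelope, purely illustrative:

{code:title=Birthday bound for random UUIDs (back-of-the-envelope)}
public class UuidCollisionOdds {
  public static void main(String[] args) {
    double space = Math.pow(2, 122);   // random bits in a version-4 UUID
    double n = 100000d * 100d;         // 100,000 DNs x 100 volumes each
    // Birthday approximation: P(collision) ~= n*(n-1) / (2 * 2^122)
    double p = n * (n - 1) / (2 * space);
    System.out.println("P(any duplicate) ~= " + p);  // prints ~9.4E-24
  }
}
{code}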

Arguments in favor of approach #2:
* Changing the storage ID during startup basically changes the storage ID from 
being a permanent identifier into a temporary one.  This seems like a small 
change, but I would argue that it's really a big one, architecturally.  For 
example, suppose we wanted to persist information keyed by storage ID at some 
point.  We couldn't really do that if the ID can change on every restart.

* With approach #1, we have to carry the burden of the dedupe code forever.  We 
can't ever stop deduping, even in Hadoop 3.0, because for all we know the user 
has just upgraded from 2.6 (a version with the bug) and still has duplicate 
storage IDs that we have to correct.  The extra run time isn't an issue, but 
the complexity is.  What if our write to VERSION fails on one of the volume 
directories?  What do we do then?  And if volume failures are tolerated, that 
directory could later come back and be an issue.  The purpose of layout 
versions is so that we don't have to think about these kinds of "mix and match" 
issues.

* Approach #1 leaves us open to some weird scenarios.  For example, what if I 
have /storage1 -> /foo and /storage2 -> /foo?  In other words, what appear to 
be two volume root directories are really the same directory.  Approach #2 will 
complain, but approach #1 will happily rename the storage ID of the /foo 
directory and continue with the corrupt configuration.  This is what happens 
when you skimp on error checking.  (A sketch of an explicit check for this case 
follows below.)
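
On that last point, catching the symlinked-volume case explicitly is cheap.  A 
hypothetical check (not something either approach currently does) could simply 
compare the resolved real paths of the configured volume roots:

{code:title=Hypothetical duplicate-volume check}
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class DuplicateVolumeCheck {
  /** Fail fast if two configured volume roots resolve to the same directory. */
  static void checkDistinctVolumes(String... volumeRoots) throws IOException {
    Map<Path, String> seen = new HashMap<Path, String>();
    for (String root : volumeRoots) {
      // toRealPath() follows symlinks, so /storage1 -> /foo and
      // /storage2 -> /foo both resolve to /foo.
      Path real = Paths.get(root).toRealPath();
      String previous = seen.put(real, root);
      if (previous != null) {
        throw new IOException("Volume roots " + previous + " and " + root
            + " are really the same directory: " + real);
      }
    }
  }
}
{code}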

So in conclusion I would argue for approach #2.  Thoughts?

> NameNode not handling heartbeats properly after HDFS-2832
> ---------------------------------------------------------
>
>                 Key: HDFS-7575
>                 URL: https://issues.apache.org/jira/browse/HDFS-7575
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.4.0, 2.5.0, 2.6.0
>            Reporter: Lars Francke
>            Assignee: Arpit Agarwal
>            Priority: Critical
>         Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
> HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
> HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
> testUpgrade22via24GeneratesStorageIDs.tgz, 
> testUpgradeFrom22GeneratesStorageIDs.tgz, 
> testUpgradeFrom24PreservesStorageId.tgz
>
>
> Before HDFS-2832 each DataNode would have a unique storageId which included 
> its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
> storage directory which is just a random UUID.
> They send reports per storage directory in their heartbeats. These heartbeats 
> are processed on the NameNode in the 
> {{DatanodeDescriptor#updateHeartbeatState}} method. Pre-HDFS-2832 this would 
> just store the information per DataNode. After the patch, though, each 
> DataNode can have multiple different storages, so the information is stored 
> in a map keyed by the storage Id.
> This works fine for all clusters that were installed post-HDFS-2832, as they 
> get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
> different keys. On each heartbeat the map is searched and updated 
> ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
> {code:title=DatanodeStorageInfo}
>   void updateState(StorageReport r) {
>     capacity = r.getCapacity();
>     dfsUsed = r.getDfsUsed();
>     remaining = r.getRemaining();
>     blockPoolUsed = r.getBlockPoolUsed();
>   }
> {code}
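> To make the failure mode concrete, here is a tiny standalone illustration (not 
> the actual NameNode code) of what happens when every report from a DN carries 
> the same storage Id; the storage Id string below is made up:
> {code:title=Illustration of the single-entry overwrite (standalone)}
> import java.util.HashMap;
> import java.util.Map;
>
> public class StorageMapOverwrite {
>   public static void main(String[] args) {
>     Map<String, Long> capacityByStorageId = new HashMap<String, Long>();
>     // After an upgrade from pre-HDFS-2832, all 8 drives report the same Id.
>     String sharedId = "DS-1234-10.0.0.1-50010-1234567890";
>     long[] driveCapacities = {4000L, 4000L, 8000L, 2000L,
>                               4000L, 4000L, 4000L, 4000L};
>     for (long capacity : driveCapacities) {
>       // Same key every time, so each put() overwrites the previous report.
>       capacityByStorageId.put(sharedId, capacity);
>     }
>     // Prints a single entry holding only the last drive's capacity,
>     // not the sum across all 8 drives.
>     System.out.println(capacityByStorageId.size() + " entry: "
>         + capacityByStorageId);
>   }
> }
> {code}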
> On clusters that were upgraded from a pre-HDFS-2832 version, though, the 
> storage Id has not been rewritten (at least not on the four clusters I 
> checked), so each directory has the exact same storageId. That means there'll 
> be only a single entry in the {{storageMap}}, and it'll be overwritten by 
> whichever {{StorageReport}} from the DataNode happens to be processed last. 
> This can be seen in the {{updateState}} method above: it just assigns the 
> capacity from the received report, when it should probably sum the values 
> across all reports received in a heartbeat.
> The Balancer seems to be one of the only things that actually uses this 
> information, so it now considers the utilization of a random drive per 
> DataNode for balancing purposes.
> Things get even worse when a drive has been added or replaced, as it will now 
> get a new storage Id, so there'll be two entries in the storageMap. As new 
> drives are usually empty, this skews the Balancer's decision so that this 
> node will never be considered over-utilized.
> Another problem is that old StorageReports are never removed from the 
> storageMap. So if I replace a drive and it gets a new storage Id, the old one 
> will still be in place and used for all calculations by the Balancer until 
> the NameNode is restarted.
> I can try providing a patch that does the following:
> * Instead of using a map, either store the array of reports we receive as-is, 
> or sum up the values for reports with the same Id (a rough sketch of the 
> summing variant follows at the end of this description)
> * On each heartbeat, clear the map (so we know we have up-to-date information)
> Does that sound sensible?
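> A rough sketch of the summing variant (purely illustrative; the class and 
> field names here are made up, only the {{StorageReport}} getters are the ones 
> already shown above):
> {code:title=Sketch: aggregate all reports from one heartbeat}
> // Import path assumed; StorageReport is the class whose getters appear above.
> import org.apache.hadoop.hdfs.server.protocol.StorageReport;
>
> public class HeartbeatTotals {
>   long capacity;
>   long dfsUsed;
>   long remaining;
>   long blockPoolUsed;
>
>   /** Recompute node-wide totals from the full set of reports in a heartbeat. */
>   void updateFromHeartbeat(StorageReport[] reports) {
>     capacity = dfsUsed = remaining = blockPoolUsed = 0;
>     for (StorageReport r : reports) {
>       capacity      += r.getCapacity();
>       dfsUsed       += r.getDfsUsed();
>       remaining     += r.getRemaining();
>       blockPoolUsed += r.getBlockPoolUsed();
>     }
>   }
> }
> {code}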



