[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270423#comment-14270423 ]

Colin Patrick McCabe commented on HDFS-7575:
--------------------------------------------

Hi Arpit,

Thanks for taking this one on.  Is there any chance that you could do the patch 
without adding or moving binary files?  It seems like the main thing we're 
testing here is just that when we start up, we're going to modify the VERSION 
files as expected.  We shouldn't even need any block files to test that, right? 
 Just a few mkdirs in a unit test.
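
For example, something along these lines should be enough (just a sketch; the 
helper name and the VERSION keys are only indicative, not taken from the 
existing test utilities):

{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class UpgradeStorageSketch {
  /**
   * Sketch only: fabricate a pre-upgrade storage directory with plain mkdirs
   * and a hand-written VERSION file, instead of untarring a checked-in image.
   */
  static File createFakePreUpgradeStorage(File baseDir) throws IOException {
    File current = new File(baseDir, "data1/current");
    if (!current.mkdirs()) {
      throw new IOException("could not create " + current);
    }
    Properties version = new Properties();
    version.setProperty("layoutVersion", "-55");             // some pre-upgrade DN layout version
    version.setProperty("storageID", "ds-not-a-valid-uuid"); // the invalid ID we expect to be fixed
    version.setProperty("clusterID", "test-cluster-id");
    try (FileOutputStream out =
        new FileOutputStream(new File(current, "VERSION"))) {
      version.store(out, null);
    }
    return current;
  }
}
{code}

The test would then start a DataNode against that directory and assert that the 
VERSION file ends up with a valid storage ID.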

If we check in the patch as it stands, this 1.7 MB commit becomes part of the 
repo's history forever, which slows down downloads and git pulls.  I also think 
that untarring things during a test is kind of sluggish.

It seems like we never needed these tar files to begin with.  We could just 
have the test open up the txt files and generate a temporary directory based on 
them.  If you want the blocks to have contents, we could just generate them 
with a fixed random seed using java.util.Random, and always get the same 
contents.
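
i.e. a helper along these lines (again just a sketch, not a reference to any 
existing utility):

{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Random;

public class DeterministicBlockSketch {
  /**
   * Sketch only: deterministic block contents from a fixed seed, so the test
   * never needs binary fixtures checked into the repo.
   */
  static void writeDeterministicBlock(File blockFile, long seed, int length)
      throws IOException {
    byte[] data = new byte[length];
    new Random(seed).nextBytes(data);   // same seed => same bytes on every run
    try (FileOutputStream out = new FileOutputStream(blockFile)) {
      out.write(data);
    }
  }
}
{code}

On the patch itself: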

{code}
     if (this.layoutVersion > HdfsConstants.DATANODE_LAYOUT_VERSION) {
+
+      // Clusters previously upgraded from layout versions earlier than
+      // ADD_DATANODE_AND_STORAGE_UUIDS failed to correctly generate a
+      // new storage ID. We fix that now.
+
+      boolean haveValidStorageId =
+          DataNodeLayoutVersion.supports(
+              LayoutVersion.Feature.ADD_DATANODE_AND_STORAGE_UUIDS, layoutVersion) &&
+              DatanodeStorage.isValidStorageId(sd.getStorageUuid());
+
       doUpgrade(datanode, sd, nsInfo);  // upgrade
-      createStorageID(sd);
+      if (createStorageID(sd, !haveValidStorageId)) {
+        LOG.info("Generated new storageID " + sd.getStorageUuid() +
+                     " for directory " + sd.getRoot());
+      }
{code}

It would be good to add some logging for the various cases here.  If we are 
generating a new storage ID because the previous one was invalid, we should log 
somewhere that the previous one was invalid.
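
For example (illustration only, reusing the {{haveValidStorageId}} flag from the 
snippet above; the exact wording doesn't matter):

{code}
// Illustration only: make the "previous ID was invalid" case visible in the log.
if (!haveValidStorageId) {
  LOG.info("Storage directory " + sd.getRoot() + " has a missing or invalid " +
      "storageID (" + sd.getStorageUuid() + "); a new one will be generated.");
}
{code}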

{code}
     if (this.layoutVersion > HdfsConstants.DATANODE_LAYOUT_VERSION) {
{code}

Is this "if" statement really valid?  It seems like right now, there are 
clusters out there that are on the latest layout version but which don't have 
"valid" storage IDs.  We should either bump the datanode layout version, or 
unconditionally check that the storage ID is valid, right?
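
In other words, something roughly like this (a sketch of the second option only, 
using the same names as the snippet above, not a complete patch):

{code}
// Sketch: validate the storage ID on every startup, not only when the layout
// version is changing.
boolean haveValidStorageId =
    DatanodeStorage.isValidStorageId(sd.getStorageUuid());

if (this.layoutVersion > HdfsConstants.DATANODE_LAYOUT_VERSION) {
  doUpgrade(datanode, sd, nsInfo);  // upgrade
}

if (createStorageID(sd, !haveValidStorageId)) {
  LOG.info("Generated new storageID " + sd.getStorageUuid() +
      " for directory " + sd.getRoot());
}
{code}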

> NameNode not handling heartbeats properly after HDFS-2832
> ---------------------------------------------------------
>
>                 Key: HDFS-7575
>                 URL: https://issues.apache.org/jira/browse/HDFS-7575
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.4.0, 2.5.0, 2.6.0
>            Reporter: Lars Francke
>            Assignee: Arpit Agarwal
>            Priority: Critical
>         Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch
>
>
> Before HDFS-2832, each DataNode had a single unique storageId which included 
> its IP address. Since HDFS-2832, DataNodes have a unique storageId per 
> storage directory, which is just a random UUID.
> They send reports per storage directory in their heartbeats. These heartbeats 
> are processed on the NameNode in the 
> {{DatanodeDescriptor#updateHeartbeatState}} method. Pre-HDFS-2832 this would 
> just store the information per DataNode. After the patch, though, each 
> DataNode can have multiple different storages, so the information is stored 
> in a map keyed by the storage Id.
> This works fine for all clusters that were installed post-HDFS-2832, as they 
> get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
> different keys. On each heartbeat the map is searched and updated 
> ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
> {code:title=DatanodeStorageInfo}
>   void updateState(StorageReport r) {
>     capacity = r.getCapacity();
>     dfsUsed = r.getDfsUsed();
>     remaining = r.getRemaining();
>     blockPoolUsed = r.getBlockPoolUsed();
>   }
> {code}
> On clusters that were upgraded from a pre-HDFS-2832 version, though, the 
> storage Id has not been rewritten (at least not on the four clusters I 
> checked), so each directory has the exact same storageId. That means 
> there'll be only a single entry in the {{storageMap}}, and it'll be 
> overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
> in the {{updateState}} method above: it just assigns the capacity from the 
> received report; instead, it should probably sum the values up per received 
> heartbeat.
> The Balancer seems to be one of the only things that actually uses this 
> information, so it now considers the utilization of a random drive per 
> DataNode for balancing purposes.
> Things get even worse when a drive has been added or replaced, as this will 
> now get a new storage Id, so there'll be two entries in the storageMap. As 
> new drives are usually empty, this skews the balancer's decision in a way 
> that this node will never be considered over-utilized.
> Another problem is that old StorageReports are never removed from the 
> storageMap. So if I replace a drive and it gets a new storage Id, the old one 
> will still be in place and used for all calculations by the Balancer until a 
> restart of the NameNode.
> I can try providing a patch that does the following:
> * Instead of using a Map, I could just store the array we receive, or, 
> instead of storing an array, sum up the values for reports with the same Id
> * On each heartbeat, clear the map (so we know we have up-to-date 
> information)
> Does that sound sensible?


