[ 
https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277749#comment-14277749
 ] 

Colin Patrick McCabe commented on HDFS-7575:
--------------------------------------------

bq. Suresh wrote: I agree with Daryn Sharp that there is no need to change the 
layout here. Layout change is only necessary if the two layouts are not 
compatible and the downgrade does not work from newer release to older. Is that 
the case here?

The new layout used in HDFS-6482 is backwards compatible, in the sense that 
older versions of hadoop can run with it.  HDFS-6482 just added the invariant 
that block ID uniquely determines which subdir a block is in, but subdirs 
already existed.  Does that mean we shouldn't have changed the layout version 
for HDFS-6482?  I think the answer is clear.

bq. Daryn wrote: Since we know duplicate storage ids are bad, I think the 
correct logic is to always sanity check the storage ids at startup. If there 
are collisions, then the storage should be updated. Rollback should not restore 
a bug by reverting the storage id to a dup.

I'm surprised to hear you say that rollback should not be an option.  It seems 
like the conservative thing to do here is to allow the user to restore the old 
VERSION file.  Obviously we believe there will be no problems.  But we always 
believe that, or else we wouldn't have made the change.  Sometimes there are 
problems.

bq. BTW, UUID.randomUUID isn't guaranteed to return a unique id. It's highly 
improbable, but possible, although more likely due to older storages, user 
copying a storage, etc.

This is really not a good argument.  Collisions in 128-bit space are extremely 
unlikely.  You will never see one in your lifetime.  Up until HDFS-4645, HDFS 
used randomly generated block IDs drawn from a far smaller space (2^64), and 
we never had a problem.  Phrases like "billions and billions" and "total number 
of grains of sand in the world" don't begin to approach the size of 2^128.
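For a rough sense of scale, the birthday bound p ~ n^2 / 2d approximates the 
collision probability for n random IDs drawn from a space of size d.  A 
back-of-the-envelope sketch (mine, not from the discussion; a version-4 UUID 
actually has 122 random bits, which doesn't change the conclusion):

{code:title=Birthday-bound sketch}
// Approximate collision probability for a billion randomly drawn IDs.
double n = 1e9;
double p64  = (n * n) / (2 * Math.pow(2, 64));   // old 64-bit block ID space
double p128 = (n * n) / (2 * Math.pow(2, 128));  // 128-bit UUID space
System.out.printf("64-bit:  %.1e%n", p64);   // roughly 2.7e-02
System.out.printf("128-bit: %.1e%n", p128);  // roughly 1.5e-21
{code}

Even at a billion IDs, the 128-bit estimate is about nineteen orders of 
magnitude smaller than the 64-bit one.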

I think it's frustrating for storage IDs to change without warning just because 
HDFS was restarted.  It will make diagnosing problems by reading log files 
harder because storageIDs might morph at any time.  It also sets a bad 
precedent of not allowing downgrade and modifying VERSION files "on the fly" 
during startup.

> NameNode not handling heartbeats properly after HDFS-2832
> ---------------------------------------------------------
>
>                 Key: HDFS-7575
>                 URL: https://issues.apache.org/jira/browse/HDFS-7575
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.4.0, 2.5.0, 2.6.0
>            Reporter: Lars Francke
>            Assignee: Arpit Agarwal
>            Priority: Critical
>         Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
> HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
> HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
> testUpgrade22via24GeneratesStorageIDs.tgz, 
> testUpgradeFrom22GeneratesStorageIDs.tgz, 
> testUpgradeFrom24PreservesStorageId.tgz
>
>
> Before HDFS-2832 each DataNode would have a unique storageId which included 
> its IP address. Since HDFS-2832 the DataNodes have a unique storageId per 
> storage directory which is just a random UUID.
> They send reports per storage directory in their heartbeats. This heartbeat 
> is processed on the NameNode in the 
> {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would 
> just store the information per Datanode. After the patch though each DataNode 
> can have multiple different storages so it's stored in a map keyed by the 
> storage Id.
> This works fine for all clusters that have been installed post HDFS-2832 as 
> they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
> different keys. On each Heartbeat the Map is searched and updated 
> ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
> {code:title=DatanodeStorageInfo}
>   void updateState(StorageReport r) {
>     capacity = r.getCapacity();
>     dfsUsed = r.getDfsUsed();
>     remaining = r.getRemaining();
>     blockPoolUsed = r.getBlockPoolUsed();
>   }
> {code}
> On clusters that were upgraded from a pre HDFS-2832 version though the 
> storage Id has not been rewritten (at least not on the four clusters I 
> checked) so each directory will have the exact same storageId. That means 
> there'll be only a single entry in the {{storageMap}} and it'll be 
> overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
> in the {{updateState}} method above. This just assigns the capacity from the 
> received report; instead it should probably sum the values across all reports 
> received in a heartbeat.
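> A minimal sketch (hypothetical names, not the actual NameNode code) of how a 
> shared storage Id collapses every report into a single map entry:
> {code:title=Sketch}
> import java.util.HashMap;
> import java.util.Map;
> 
> // Three drives report under the same pre-HDFS-2832 storage Id; each put()
> // hits the same key, so only the last report survives.
> Map<String, Long> storageMap = new HashMap<>();
> long[] capacities = {1000L, 2000L, 3000L};
> for (long c : capacities) {
>     storageMap.put("DS-shared-legacy-id", c);
> }
> // storageMap now holds one entry with capacity 3000, not the 6000 total.
> {code}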
> The Balancer seems to be one of the only things that actually uses this 
> information so it now considers the utilization of a random drive per 
> DataNode for balancing purposes.
> Things get even worse when a drive has been added or replaced as this will 
> now get a new storage Id so there'll be two entries in the storageMap. As new 
> drives are usually empty, it skews the balancer's decision in a way that this 
> node will never be considered over-utilized.
> Another problem is that old StorageReports are never removed from the 
> storageMap. So if I replace a drive and it gets a new storage Id, the old one 
> will still be in place and used for all calculations by the Balancer until a 
> restart of the NameNode.
> I can try providing a patch that does the following:
> * Instead of using a Map, I could just store the array we receive, or, rather 
> than storing an array, sum up the values for reports with the same Id
> * On each heartbeat clear the map (so we know we have up to date information)
> Does that sound sensible?
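> The summing idea could be sketched like this (hypothetical names and data, 
> not a real patch):
> {code:title=Sketch}
> import java.util.HashMap;
> import java.util.Map;
> 
> // Rebuild the map on every heartbeat, summing reports that share an Id;
> // assigning the fresh map also drops stale entries from replaced drives.
> Map<String, Long> fresh = new HashMap<>();
> String[] ids = {"id-a", "id-a", "id-b"};  // two drives share a legacy Id
> long[] caps  = {1000L, 2000L, 4000L};
> for (int i = 0; i < ids.length; i++) {
>     fresh.merge(ids[i], caps[i], Long::sum);
> }
> // fresh now maps id-a -> 3000 and id-b -> 4000
> {code}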



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
