[jira] [Commented] (HDFS-14531) Datanode's ScanInfo requires excessive memory
[ https://issues.apache.org/jira/browse/HDFS-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858709#comment-16858709 ] Nathan Roberts commented on HDFS-14531: --- Actually, maybe disabling the DirectoryScanner is more than a workaround. Maybe that should be the default. What is this really protecting against these days? For large disks it's super expensive memory-wise and if there are enough blocks or enough system memory pressure it can cause tons of I/O as well. > Datanode's ScanInfo requires excessive memory > - > > Key: HDFS-14531 > URL: https://issues.apache.org/jira/browse/HDFS-14531 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.0.0-alpha >Reporter: Daryn Sharp >Priority: Major > Attachments: Screen Shot 2019-05-31 at 12.25.54 PM.png > > > The DirectoryScanner's ScanInfo map consumes ~4.5X as much memory as the > replica map. For 1.1M replicas: the replica map is ~91M while the scan info > is ~405M. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
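For anyone wanting to try the workaround discussed above, a minimal sketch of turning the periodic DirectoryScanner off. This assumes the datanode treats a negative dfs.datanode.directoryscan.interval as "disabled" (check the hdfs-default.xml of the release you run before relying on it); in practice the setting would go in hdfs-site.xml rather than code.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;

public class DisableDirectoryScanner {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Equivalent to setting dfs.datanode.directoryscan.interval to -1 in
    // hdfs-site.xml; a negative interval is assumed to disable the scanner.
    conf.setInt(DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_INTERVAL_KEY, -1);
    System.out.println(conf.getInt(
        DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_INTERVAL_KEY, 21600));
  }
}
{code}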
[jira] [Updated] (HDFS-12441) Suppress UnresolvedPathException in namenode log
[ https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-12441: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) > Suppress UnresolvedPathException in namenode log > > > Key: HDFS-12441 > URL: https://issues.apache.org/jira/browse/HDFS-12441 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Kihwal Lee >Assignee: Kihwal Lee >Priority: Minor > Fix For: 2.9.0, 3.0.0-beta1, 2.8.3, 3.1.0 > > Attachments: HDFS-12441.patch > > > {{UnresolvedPathException}} as a normal process of resolving symlinks. This > doesn't need to be logged at all. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12441) Suppress UnresolvedPathException in namenode log
[ https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168433#comment-16168433 ] Nathan Roberts commented on HDFS-12441: --- Cherry picked to branch-3.0, branch-2, and branch-2.8. > Suppress UnresolvedPathException in namenode log > > > Key: HDFS-12441 > URL: https://issues.apache.org/jira/browse/HDFS-12441 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Kihwal Lee >Assignee: Kihwal Lee >Priority: Minor > Fix For: 2.9.0, 3.0.0-beta1, 2.8.3, 3.1.0 > > Attachments: HDFS-12441.patch > > > {{UnresolvedPathException}} as a normal process of resolving symlinks. This > doesn't need to be logged at all. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12441) Suppress UnresolvedPathException in namenode log
[ https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-12441: -- Fix Version/s: 2.8.3 2.9.0 > Suppress UnresolvedPathException in namenode log > > > Key: HDFS-12441 > URL: https://issues.apache.org/jira/browse/HDFS-12441 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Kihwal Lee >Assignee: Kihwal Lee >Priority: Minor > Fix For: 2.9.0, 3.0.0-beta1, 2.8.3, 3.1.0 > > Attachments: HDFS-12441.patch > > > {{UnresolvedPathException}} as a normal process of resolving symlinks. This > doesn't need to be logged at all. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12441) Suppress UnresolvedPathException in namenode log
[ https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-12441: -- Fix Version/s: 3.0.0-beta1 > Suppress UnresolvedPathException in namenode log > > > Key: HDFS-12441 > URL: https://issues.apache.org/jira/browse/HDFS-12441 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Kihwal Lee >Assignee: Kihwal Lee >Priority: Minor > Fix For: 3.0.0-beta1, 3.1.0 > > Attachments: HDFS-12441.patch > > > {{UnresolvedPathException}} as a normal process of resolving symlinks. This > doesn't need to be logged at all. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12441) Suppress UnresolvedPathException in namenode log
[ https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-12441: -- Fix Version/s: 3.1.0 > Suppress UnresolvedPathException in namenode log > > > Key: HDFS-12441 > URL: https://issues.apache.org/jira/browse/HDFS-12441 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Kihwal Lee >Assignee: Kihwal Lee >Priority: Minor > Fix For: 3.1.0 > > Attachments: HDFS-12441.patch > > > {{UnresolvedPathException}} as a normal process of resolving symlinks. This > doesn't need to be logged at all. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12441) Suppress UnresolvedPathException in namenode log
[ https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168277#comment-16168277 ] Nathan Roberts commented on HDFS-12441: --- Thanks [~kihwal] for the patch and [~shahrs87] for the review. +1. I will commit this shortly. > Suppress UnresolvedPathException in namenode log > > > Key: HDFS-12441 > URL: https://issues.apache.org/jira/browse/HDFS-12441 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Kihwal Lee >Assignee: Kihwal Lee >Priority: Minor > Attachments: HDFS-12441.patch > > > {{UnresolvedPathException}} as a normal process of resolving symlinks. This > doesn't need to be logged at all. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block
[ https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111696#comment-16111696 ] Nathan Roberts commented on HDFS-12102: --- [~arpitagarwal] Hi Arpit. To provide a bit more background on this feature - we've seen multiple cases where there are many bad blocks stored on a disk. Just because of the way drives tend to fail, one bad block indicates there are probably many others. The volumeScanner will eventually find them over a multi-week period, but this leaves the cluster susceptible to data-loss due to lots of replicas being corrupt on a single misbehaving disk. The idea with this jira is to use a found corrupt block as a hint that there are likely more and we should do a scan over the drive at a faster rate to more quickly find other corrupt blocks on the drive. Thoughts? > VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt > block > > > Key: HDFS-12102 > URL: https://issues.apache.org/jira/browse/HDFS-12102 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, hdfs >Affects Versions: 2.8.2 >Reporter: Ashwin Ramesh >Priority: Minor > Fix For: 2.8.2 > > Attachments: HDFS-12102-001.patch, HDFS-12102-002.patch, > HDFS-12102-003.patch > > > When the Volume scanner sees a corrupt block, it restarts the scan and scans > the blocks at much faster rate with a negligible scan period. This is so that > it doesn't take 3 weeks to report blocks since a corrupt block means > increased likelihood that there are more corrupt blocks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-5042) Completed files lost after power failure
[ https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083970#comment-16083970 ] Nathan Roberts commented on HDFS-5042: -- Wondering if we should make this feature configurable. There are some filesystems (like ext4), where these fsync's are affecting much more than the datanode process. If YARN is using the same disks and is writing significant amounts of intermediate data or performing other disk-heavy operations, the entire system will see significantly degraded performance (like disks at 100% for 10s of minutes). > Completed files lost after power failure > > > Key: HDFS-5042 > URL: https://issues.apache.org/jira/browse/HDFS-5042 > Project: Hadoop HDFS > Issue Type: Bug > Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5) >Reporter: Dave Latham >Assignee: Vinayakumar B >Priority: Critical > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, > HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, > HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch, > HDFS-5042-branch-2.7-05.patch, HDFS-5042-branch-2.7-06.patch, > HDFS-5042-branch-2.8-05.patch, HDFS-5042-branch-2.8-06.patch, > HDFS-5042-branch-2.8-addendum.patch > > > We suffered a cluster wide power failure after which HDFS lost data that it > had acknowledged as closed and complete. > The client was HBase which compacted a set of HFiles into a new HFile, then > after closing the file successfully, deleted the previous versions of the > file. The cluster then lost power, and when brought back up the newly > created file was marked CORRUPT. > Based on reading the logs it looks like the replicas were created by the > DataNodes in the 'blocksBeingWritten' directory. Then when the file was > closed they were moved to the 'current' directory. After the power cycle > those replicas were again in the blocksBeingWritten directory of the > underlying file system (ext3). When those DataNodes reported in to the > NameNode it deleted those replicas and lost the file. > Some possible fixes could be having the DataNode fsync the directory(s) after > moving the block from blocksBeingWritten to current to ensure the rename is > durable or having the NameNode accept replicas from blocksBeingWritten under > certain circumstances. > Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode): > {noformat} > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: > Creating > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > with permission=rwxrwxrwx > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.allocateBlock: > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c. 
> blk_1395839728632046111_357084589 > DN 2013-06-29 11:16:06,832 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: > /10.0.5.237:50010 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to > blk_1395839728632046111_357084589 size 25418340 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327 > DN 2013-06-29 11:16:11,385 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block > blk_1395839728632046111_357084589 terminating > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing > lease on file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > from client DFSClient_hb_rs_hs745,60020,1372470111932 > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.completeFile: file > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: > Renaming compacted file at > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c > to >
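The fix discussed above fsyncs the containing directory after the finalize rename, so the directory entry for the completed block survives a power failure. A minimal JDK-level sketch of that operation follows; it is illustrative only (not the actual datanode code), and opening a directory via FileChannel for fsync works on Linux but is not guaranteed on every platform.

{code}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DirFsync {
  // Flush a directory's metadata (e.g. a just-renamed block file entry) to disk.
  static void fsyncDirectory(Path dir) throws IOException {
    try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
      ch.force(true);
    }
  }

  public static void main(String[] args) throws IOException {
    fsyncDirectory(Paths.get(args[0])); // e.g. the finalized block directory
  }
}
{code}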
[jira] [Commented] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block
[ https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078714#comment-16078714 ] Nathan Roberts commented on HDFS-12102: --- Hi [~aramesh2]. Thanks for the patch. A couple of quick comments (I'll look more closely early next week) - Need to change hdfs-default.xml to include new config options - LOG.debug statements should be surrounded by if (LOG.isDebugEnabled()) checks (reduces overhead) - Is it possible to disable the fast-scan behavior altogether (i.e. it might be good for the default behavior to remain the same. If someone wants fast-scan, they have to enable it). Maybe setting the period to -1 could be a way? - Description of corruptBlockThreshold doesn't really match its use. I think it's just a straight count. Maybe we don't even need it at all since it's not configurable and only ever set to 1. > VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt > block > > > Key: HDFS-12102 > URL: https://issues.apache.org/jira/browse/HDFS-12102 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, hdfs >Affects Versions: 2.8.2 >Reporter: Ashwin Ramesh >Priority: Minor > Fix For: 2.8.2 > > Attachments: HDFS-12102-001.patch > > > When the Volume scanner sees a corrupt block, it restarts the scan and scans > the blocks at much faster rate with a negligible scan period. This is so that > it doesn't take 3 weeks to report blocks since a corrupt block means > increased likelihood that there are more corrupt blocks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
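To illustrate the second review point above, the usual guard pattern for debug logging (class, method, and variable names here are made up for the example):

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class DebugGuardExample {
  private static final Log LOG = LogFactory.getLog(DebugGuardExample.class);

  static void reportCorruptBlock(String volume, long blockId) {
    // The guard skips the string concatenation entirely when debug is off.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Fast scan triggered on volume " + volume
          + " after corrupt block " + blockId);
    }
  }

  public static void main(String[] args) {
    reportCorruptBlock("/data/d1", 1234L);
  }
}
{code}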
[jira] [Updated] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block
[ https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-12102: -- Issue Type: New Feature (was: Improvement) > VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt > block > > > Key: HDFS-12102 > URL: https://issues.apache.org/jira/browse/HDFS-12102 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, hdfs >Affects Versions: 2.8.2 >Reporter: Ashwin Ramesh >Priority: Minor > Fix For: 2.8.2 > > Attachments: HDFS-12102-001.patch > > > When the Volume scanner sees a corrupt block, it restarts the scan and scans > the blocks at much faster rate with a negligible scan period. This is so that > it doesn't take 3 weeks to report blocks since a corrupt block means > increased likelihood that there are more corrupt blocks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block
[ https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-12102: -- Issue Type: Improvement (was: New Feature) > VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt > block > > > Key: HDFS-12102 > URL: https://issues.apache.org/jira/browse/HDFS-12102 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, hdfs >Affects Versions: 2.8.2 >Reporter: Ashwin Ramesh >Priority: Minor > Fix For: 2.8.2 > > Attachments: HDFS-12102-001.patch > > > When the Volume scanner sees a corrupt block, it restarts the scan and scans > the blocks at much faster rate with a negligible scan period. This is so that > it doesn't take 3 weeks to report blocks since a corrupt block means > increased likelihood that there are more corrupt blocks. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11818) TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-11818: -- Affects Version/s: 3.0.0-alpha2 Status: Patch Available (was: Open) > TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently > --- > > Key: HDFS-11818 > URL: https://issues.apache.org/jira/browse/HDFS-11818 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.2 >Reporter: Eric Badger >Assignee: Nathan Roberts > Attachments: HDFS-11818-branch-2.patch, HDFS-11818.patch > > > Saw a weird Mockito failure in last night's build with the following stack > trace: > {noformat} > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: > INodeFile cannot be returned by isRunning() > isRunning() should return boolean > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.addBlockOnNodes(TestBlockManager.java:555) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:404) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:397) > {noformat} > This is pretty confusing since we explicitly set isRunning() to return true > in TestBlockManager's \@Before method > {noformat} > 154Mockito.doReturn(true).when(fsn).isRunning(); > {noformat} > Also saw the following exception in the logs: > {noformat} > 2017-05-12 05:42:27,903 ERROR blockmanagement.BlockManager > (BlockManager.java:run(2796)) - Error while processing replication queues > async > org.mockito.exceptions.base.MockitoException: > 'writeLockInterruptibly' is a *void method* and it *cannot* be stubbed with a > *return value*! > Voids are usually stubbed with Throwables: > doThrow(exception).when(mock).someVoidMethod(); > If the method you are trying to stub is *overloaded* then make sure you are > calling the right overloaded version. > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatesAsync(BlockManager.java:2841) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.access$100(BlockManager.java:120) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$1.run(BlockManager.java:2792) > {noformat} > This is also weird since we don't do any explicit mocking with > {{writeLockInterruptibly}} via fsn in the test. It has to be something > changing the mocks or non-thread safe access or something like that. I can't > explain the failures otherwise. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11818) TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-11818: -- Attachment: HDFS-11818.patch HDFS-11818-branch-2.patch Patches for trunk and branch-2. branch-2 patch picks cleanly to 2.8 > TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently > --- > > Key: HDFS-11818 > URL: https://issues.apache.org/jira/browse/HDFS-11818 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.2 >Reporter: Eric Badger >Assignee: Nathan Roberts > Attachments: HDFS-11818-branch-2.patch, HDFS-11818.patch > > > Saw a weird Mockito failure in last night's build with the following stack > trace: > {noformat} > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: > INodeFile cannot be returned by isRunning() > isRunning() should return boolean > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.addBlockOnNodes(TestBlockManager.java:555) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:404) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:397) > {noformat} > This is pretty confusing since we explicitly set isRunning() to return true > in TestBlockManager's \@Before method > {noformat} > 154Mockito.doReturn(true).when(fsn).isRunning(); > {noformat} > Also saw the following exception in the logs: > {noformat} > 2017-05-12 05:42:27,903 ERROR blockmanagement.BlockManager > (BlockManager.java:run(2796)) - Error while processing replication queues > async > org.mockito.exceptions.base.MockitoException: > 'writeLockInterruptibly' is a *void method* and it *cannot* be stubbed with a > *return value*! > Voids are usually stubbed with Throwables: > doThrow(exception).when(mock).someVoidMethod(); > If the method you are trying to stub is *overloaded* then make sure you are > calling the right overloaded version. > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatesAsync(BlockManager.java:2841) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.access$100(BlockManager.java:120) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$1.run(BlockManager.java:2792) > {noformat} > This is also weird since we don't do any explicit mocking with > {{writeLockInterruptibly}} via fsn in the test. It has to be something > changing the mocks or non-thread safe access or something like that. I can't > explain the failures otherwise. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11818) TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008325#comment-16008325 ] Nathan Roberts commented on HDFS-11818: --- Know what the issue is. Will post a patch shortly. > TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently > --- > > Key: HDFS-11818 > URL: https://issues.apache.org/jira/browse/HDFS-11818 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Eric Badger >Assignee: Nathan Roberts > > Saw a weird Mockito failure in last night's build with the following stack > trace: > {noformat} > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: > INodeFile cannot be returned by isRunning() > isRunning() should return boolean > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.addBlockOnNodes(TestBlockManager.java:555) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:404) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:397) > {noformat} > This is pretty confusing since we explicitly set isRunning() to return true > in TestBlockManager's \@Before method > {noformat} > 154Mockito.doReturn(true).when(fsn).isRunning(); > {noformat} > Also saw the following exception in the logs: > {noformat} > 2017-05-12 05:42:27,903 ERROR blockmanagement.BlockManager > (BlockManager.java:run(2796)) - Error while processing replication queues > async > org.mockito.exceptions.base.MockitoException: > 'writeLockInterruptibly' is a *void method* and it *cannot* be stubbed with a > *return value*! > Voids are usually stubbed with Throwables: > doThrow(exception).when(mock).someVoidMethod(); > If the method you are trying to stub is *overloaded* then make sure you are > calling the right overloaded version. > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatesAsync(BlockManager.java:2841) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.access$100(BlockManager.java:120) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$1.run(BlockManager.java:2792) > {noformat} > This is also weird since we don't do any explicit mocking with > {{writeLockInterruptibly}} via fsn in the test. It has to be something > changing the mocks or non-thread safe access or something like that. I can't > explain the failures otherwise. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-11818) TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts reassigned HDFS-11818: - Assignee: Nathan Roberts > TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently > --- > > Key: HDFS-11818 > URL: https://issues.apache.org/jira/browse/HDFS-11818 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Eric Badger >Assignee: Nathan Roberts > > Saw a weird Mockito failure in last night's build with the following stack > trace: > {noformat} > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: > INodeFile cannot be returned by isRunning() > isRunning() should return boolean > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.addBlockOnNodes(TestBlockManager.java:555) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:404) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:397) > {noformat} > This is pretty confusing since we explicitly set isRunning() to return true > in TestBlockManager's \@Before method > {noformat} > 154Mockito.doReturn(true).when(fsn).isRunning(); > {noformat} > Also saw the following exception in the logs: > {noformat} > 2017-05-12 05:42:27,903 ERROR blockmanagement.BlockManager > (BlockManager.java:run(2796)) - Error while processing replication queues > async > org.mockito.exceptions.base.MockitoException: > 'writeLockInterruptibly' is a *void method* and it *cannot* be stubbed with a > *return value*! > Voids are usually stubbed with Throwables: > doThrow(exception).when(mock).someVoidMethod(); > If the method you are trying to stub is *overloaded* then make sure you are > calling the right overloaded version. > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatesAsync(BlockManager.java:2841) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.access$100(BlockManager.java:120) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$1.run(BlockManager.java:2792) > {noformat} > This is also weird since we don't do any explicit mocking with > {{writeLockInterruptibly}} via fsn in the test. It has to be something > changing the mocks or non-thread safe access or something like that. I can't > explain the failures otherwise. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
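For reference, the stubbing patterns the Mockito errors quoted above are complaining about, shown against a stand-in interface rather than the real FSNamesystem mock used by TestBlockManager:

{code}
import static org.mockito.Mockito.doNothing;
import static org.mockito.Mockito.doReturn;
import static org.mockito.Mockito.mock;

public class StubbingSketch {
  // Stand-in for the mocked FSNamesystem.
  interface Service {
    boolean isRunning();
    void writeLockInterruptibly() throws InterruptedException;
  }

  public static void main(String[] args) throws InterruptedException {
    Service svc = mock(Service.class);
    doReturn(true).when(svc).isRunning();            // value-returning method
    doNothing().when(svc).writeLockInterruptibly();  // void methods take doNothing()/doThrow()
    System.out.println(svc.isRunning());
  }
}
{code}

The errors in the description suggest the mock's stubbing state was being mutated concurrently with test setup, which is consistent with a non-thread-safe access pattern rather than an incorrect stub.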
[jira] [Commented] (HDFS-11755) Underconstruction blocks can be considered missing
[ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005127#comment-16005127 ] Nathan Roberts commented on HDFS-11755: --- The failing unit tests in trunk have been unstable in precommit: org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testMultipleVolFailuresOnNode org.apache.hadoop.hdfs.TestDFSRSDefault10x4StripedOutputStreamWithFailure.testMultipleDatanodeFailure56 The timed out test TestLeaseRecovery2 does not fail locally and has also been unstable across multiple precommit runs on this jira. > Underconstruction blocks can be considered missing > -- > > Key: HDFS-11755 > URL: https://issues.apache.org/jira/browse/HDFS-11755 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.1 >Reporter: Nathan Roberts >Assignee: Nathan Roberts > Attachments: HDFS-11755.001.patch, HDFS-11755.002.patch, > HDFS-11755-branch-2.002.patch, HDFS-11755-branch-2.8.002.patch > > > Following sequence of events can lead to a block underconstruction being > considered missing. > - pipeline of 3 DNs, DN1->DN2->DN3 > - DN3 has a failing disk so some updates take a long time > - Client writes entire block and is waiting for final ack > - DN1, DN2 and DN3 have all received the block > - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3 > - DN3 is having trouble finalizing the block due to the failing drive. It > does eventually succeed but it is VERY slow at doing so. > - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so > DN1 notices and does the same. Neither DN1 nor DN2 finalized the block. > - DN3 finally sends an IBR to the NN indicating the block has been received. > - Drive containing the block on DN3 fails enough that the DN takes it offline > and notifies NN of failed volume > - NN removes DN3's replica from the triplets and then declares the block > missing because there are no other replicas > Seems like we shouldn't consider uncompleted blocks for replication. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing
[ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-11755: -- Attachment: HDFS-11755-branch-2.002.patch HDFS-11755-branch-2.8.002.patch branch-2 and branch-2.8 patches. > Underconstruction blocks can be considered missing > -- > > Key: HDFS-11755 > URL: https://issues.apache.org/jira/browse/HDFS-11755 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.1 >Reporter: Nathan Roberts >Assignee: Nathan Roberts > Attachments: HDFS-11755.001.patch, HDFS-11755.002.patch, > HDFS-11755-branch-2.002.patch, HDFS-11755-branch-2.8.002.patch > > > Following sequence of events can lead to a block underconstruction being > considered missing. > - pipeline of 3 DNs, DN1->DN2->DN3 > - DN3 has a failing disk so some updates take a long time > - Client writes entire block and is waiting for final ack > - DN1, DN2 and DN3 have all received the block > - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3 > - DN3 is having trouble finalizing the block due to the failing drive. It > does eventually succeed but it is VERY slow at doing so. > - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so > DN1 notices and does the same. Neither DN1 nor DN2 finalized the block. > - DN3 finally sends an IBR to the NN indicating the block has been received. > - Drive containing the block on DN3 fails enough that the DN takes it offline > and notifies NN of failed volume > - NN removes DN3's replica from the triplets and then declares the block > missing because there are no other replicas > Seems like we shouldn't consider uncompleted blocks for replication. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing
[ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-11755: -- Attachment: HDFS-11755.002.patch Fixed Checkstyle Fixed testSetReplicationWhenBatchIBR because it was expecting a setReplication() on a file with only under construction blocks to cause underReplicated counts to increase. > Underconstruction blocks can be considered missing > -- > > Key: HDFS-11755 > URL: https://issues.apache.org/jira/browse/HDFS-11755 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.1 >Reporter: Nathan Roberts >Assignee: Nathan Roberts > Attachments: HDFS-11755.001.patch, HDFS-11755.002.patch > > > Following sequence of events can lead to a block underconstruction being > considered missing. > - pipeline of 3 DNs, DN1->DN2->DN3 > - DN3 has a failing disk so some updates take a long time > - Client writes entire block and is waiting for final ack > - DN1, DN2 and DN3 have all received the block > - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3 > - DN3 is having trouble finalizing the block due to the failing drive. It > does eventually succeed but it is VERY slow at doing so. > - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so > DN1 notices and does the same. Neither DN1 nor DN2 finalized the block. > - DN3 finally sends an IBR to the NN indicating the block has been received. > - Drive containing the block on DN3 fails enough that the DN takes it offline > and notifies NN of failed volume > - NN removes DN3's replica from the triplets and then declares the block > missing because there are no other replicas > Seems like we shouldn't consider uncompleted blocks for replication. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing
[ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-11755: -- Status: Patch Available (was: Open) > Underconstruction blocks can be considered missing > -- > > Key: HDFS-11755 > URL: https://issues.apache.org/jira/browse/HDFS-11755 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.1 >Reporter: Nathan Roberts >Assignee: Nathan Roberts > Attachments: HDFS-11755.001.patch, HDFS-11755.002.patch > > > Following sequence of events can lead to a block underconstruction being > considered missing. > - pipeline of 3 DNs, DN1->DN2->DN3 > - DN3 has a failing disk so some updates take a long time > - Client writes entire block and is waiting for final ack > - DN1, DN2 and DN3 have all received the block > - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3 > - DN3 is having trouble finalizing the block due to the failing drive. It > does eventually succeed but it is VERY slow at doing so. > - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so > DN1 notices and does the same. Neither DN1 nor DN2 finalized the block. > - DN3 finally sends an IBR to the NN indicating the block has been received. > - Drive containing the block on DN3 fails enough that the DN takes it offline > and notifies NN of failed volume > - NN removes DN3's replica from the triplets and then declares the block > missing because there are no other replicas > Seems like we shouldn't consider uncompleted blocks for replication. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing
[ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-11755: -- Status: Open (was: Patch Available) > Underconstruction blocks can be considered missing > -- > > Key: HDFS-11755 > URL: https://issues.apache.org/jira/browse/HDFS-11755 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.1 >Reporter: Nathan Roberts >Assignee: Nathan Roberts > Attachments: HDFS-11755.001.patch > > > Following sequence of events can lead to a block underconstruction being > considered missing. > - pipeline of 3 DNs, DN1->DN2->DN3 > - DN3 has a failing disk so some updates take a long time > - Client writes entire block and is waiting for final ack > - DN1, DN2 and DN3 have all received the block > - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3 > - DN3 is having trouble finalizing the block due to the failing drive. It > does eventually succeed but it is VERY slow at doing so. > - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so > DN1 notices and does the same. Neither DN1 nor DN2 finalized the block. > - DN3 finally sends an IBR to the NN indicating the block has been received. > - Drive containing the block on DN3 fails enough that the DN takes it offline > and notifies NN of failed volume > - NN removes DN3's replica from the triplets and then declares the block > missing because there are no other replicas > Seems like we shouldn't consider uncompleted blocks for replication. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11755) Underconstruction blocks can be considered missing
[ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003459#comment-16003459 ] Nathan Roberts commented on HDFS-11755: --- bq. Do you know which one makes more sense? Not an expert in this area but here's my understanding. When a block is completed and the client has received the necessary acks, the client either adds another block, or completes the file. Both cause the namenode to consider the block complete, and at that point the namenode will properly maintain replication of the completed block. If the pipeline fails while writing, the client may (depends on policy configured) rebuild the pipeline to maintain the desired level of replication in the pipeline. So, while a block is mutating, it is the client that is ultimately responsible for making sure enough datanodes remain in the pipeline and in-sync with the data. Once a block is complete, it becomes the namenode's responsibility to maintain replication. If a client dies and fails to complete the last block, after a timeout, lease recovery will cause the file to be closed and the blocks to be properly synchronized and committed if possible. There is also hsync(), which applications can use to enhance the durability guarantees at the datanode (via fsync). Hope that helps a little. > Underconstruction blocks can be considered missing > -- > > Key: HDFS-11755 > URL: https://issues.apache.org/jira/browse/HDFS-11755 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.1 >Reporter: Nathan Roberts >Assignee: Nathan Roberts > Attachments: HDFS-11755.001.patch > > > Following sequence of events can lead to a block underconstruction being > considered missing. > - pipeline of 3 DNs, DN1->DN2->DN3 > - DN3 has a failing disk so some updates take a long time > - Client writes entire block and is waiting for final ack > - DN1, DN2 and DN3 have all received the block > - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3 > - DN3 is having trouble finalizing the block due to the failing drive. It > does eventually succeed but it is VERY slow at doing so. > - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so > DN1 notices and does the same. Neither DN1 nor DN2 finalized the block. > - DN3 finally sends an IBR to the NN indicating the block has been received. > - Drive containing the block on DN3 fails enough that the DN takes it offline > and notifies NN of failed volume > - NN removes DN3's replica from the triplets and then declares the block > missing because there are no other replicas > Seems like we shouldn't consider uncompleted blocks for replication. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
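To make the hsync() point concrete, a small client-side sketch of opting into stronger durability while writing (path and payload are illustrative):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HsyncExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataOutputStream out = fs.create(new Path("/tmp/hsync-demo"))) {
      out.write("critical record\n".getBytes("UTF-8"));
      // hflush() makes the data visible to readers; hsync() additionally asks
      // each datanode in the pipeline to fsync the data to disk.
      out.hsync();
    }
  }
}
{code}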
[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing
[ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-11755: -- Status: Patch Available (was: Open) v1 of trunk patch. branch 2 will require a separate patch. > Underconstruction blocks can be considered missing > -- > > Key: HDFS-11755 > URL: https://issues.apache.org/jira/browse/HDFS-11755 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.1 >Reporter: Nathan Roberts >Assignee: Nathan Roberts > Attachments: HDFS-11755.001.patch > > > Following sequence of events can lead to a block underconstruction being > considered missing. > - pipeline of 3 DNs, DN1->DN2->DN3 > - DN3 has a failing disk so some updates take a long time > - Client writes entire block and is waiting for final ack > - DN1, DN2 and DN3 have all received the block > - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3 > - DN3 is having trouble finalizing the block due to the failing drive. It > does eventually succeed but it is VERY slow at doing so. > - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so > DN1 notices and does the same. Neither DN1 nor DN2 finalized the block. > - DN3 finally sends an IBR to the NN indicating the block has been received. > - Drive containing the block on DN3 fails enough that the DN takes it offline > and notifies NN of failed volume > - NN removes DN3's replica from the triplets and then declares the block > missing because there are no other replicas > Seems like we shouldn't consider uncompleted blocks for replication. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing
[ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-11755: -- Attachment: HDFS-11755.001.patch > Underconstruction blocks can be considered missing > -- > > Key: HDFS-11755 > URL: https://issues.apache.org/jira/browse/HDFS-11755 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.0.0-alpha2, 2.8.1 >Reporter: Nathan Roberts >Assignee: Nathan Roberts > Attachments: HDFS-11755.001.patch > > > Following sequence of events can lead to a block underconstruction being > considered missing. > - pipeline of 3 DNs, DN1->DN2->DN3 > - DN3 has a failing disk so some updates take a long time > - Client writes entire block and is waiting for final ack > - DN1, DN2 and DN3 have all received the block > - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3 > - DN3 is having trouble finalizing the block due to the failing drive. It > does eventually succeed but it is VERY slow at doing so. > - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so > DN1 notices and does the same. Neither DN1 nor DN2 finalized the block. > - DN3 finally sends an IBR to the NN indicating the block has been received. > - Drive containing the block on DN3 fails enough that the DN takes it offline > and notifies NN of failed volume > - NN removes DN3's replica from the triplets and then declares the block > missing because there are no other replicas > Seems like we shouldn't consider uncompleted blocks for replication. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-11755) Underconstruction blocks can be considered missing
Nathan Roberts created HDFS-11755: - Summary: Underconstruction blocks can be considered missing Key: HDFS-11755 URL: https://issues.apache.org/jira/browse/HDFS-11755 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.0.0-alpha2, 2.8.1 Reporter: Nathan Roberts Assignee: Nathan Roberts Following sequence of events can lead to a block underconstruction being considered missing. - pipeline of 3 DNs, DN1->DN2->DN3 - DN3 has a failing disk so some updates take a long time - Client writes entire block and is waiting for final ack - DN1, DN2 and DN3 have all received the block - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3 - DN3 is having trouble finalizing the block due to the failing drive. It does eventually succeed but it is VERY slow at doing so. - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so DN1 notices and does the same. Neither DN1 nor DN2 finalized the block. - DN3 finally sends an IBR to the NN indicating the block has been received. - Drive containing the block on DN3 fails enough that the DN takes it offline and notifies NN of failed volume - NN removes DN3's replica from the triplets and then declares the block missing because there are no other replicas Seems like we shouldn't consider uncompleted blocks for replication. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11752) getNonDfsUsed return 0 if reserved bigger than actualNonDfsUsed
[ https://issues.apache.org/jira/browse/HDFS-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996815#comment-15996815 ] Nathan Roberts commented on HDFS-11752: --- I think this is intentional. My understanding is that nonDFSUsed is supposed to be the amount of HDFS space that is consumed by non HDFS entities. Since the Reserved space is not available to HDFS, nonDFSUsed shouldn't include any usage covered by the reservation. Are you seeing issues in 2.7 or 3.0 because there have been fixes in 2.8 and 3.0 which change some of the calculations in this area? 2.7 definitely had problems where it was not correctly calculating the amount of remaining space. The 2.8 calculations seem correct (I didn't try 3.0 but as long as nothing regressed it should be ok as well). > getNonDfsUsed return 0 if reserved bigger than actualNonDfsUsed > --- > > Key: HDFS-11752 > URL: https://issues.apache.org/jira/browse/HDFS-11752 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, hdfs >Affects Versions: 2.7.1 >Reporter: maobaolong > Labels: datanode, hdfs > Fix For: 2.7.1 > > > {code} > public long getNonDfsUsed() throws IOException { > long actualNonDfsUsed = getActualNonDfsUsed(); > if (actualNonDfsUsed < reserved) { > return 0L; > } > return actualNonDfsUsed - reserved; > } > {code} > The code block above is the function that calculates nonDfsUsed, but it can > unexpectedly return 0L, as in the following situation: > du.reserved = 50G > Disk Capacity = 2048G > Disk Available = 2000G > Dfs used = 30G > usage.getUsed() = dirFile.getTotalSpace() - dirFile.getFreeSpace() > = 2048G - 2000G > = 48G > getActualNonDfsUsed = usage.getUsed() - getDfsUsed() > = 48G - 30G > = 18G > 18G < 50G, so in `getNonDfsUsed` actualNonDfsUsed < reserved and > getNonDfsUsed returns 0. Does that logic make sense? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11661) GetContentSummary uses excessive amounts of memory
[ https://issues.apache.org/jira/browse/HDFS-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977163#comment-15977163 ] Nathan Roberts edited comment on HDFS-11661 at 4/20/17 6:05 PM: [~jojochuang], sure, occasional `du -s` of very large directory trees, think 100s of millions of files/directories. was (Author: nroberts): [~jojochuang], sure, occasional `du -s` of very large directory trees, think many 100s of millions of files/directories. > GetContentSummary uses excessive amounts of memory > -- > > Key: HDFS-11661 > URL: https://issues.apache.org/jira/browse/HDFS-11661 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Nathan Roberts >Priority: Blocker > Attachments: Heap growth.png > > > ContentSummaryComputationContext::nodeIncluded() is being used to keep track > of all INodes visited during the current content summary calculation. This > can be all of the INodes in the filesystem, making for a VERY large hash > table. This simply won't work on large filesystems. > We noticed this after upgrading a namenode with ~100Million filesystem > objects was spending significantly more time in GC. Fortunately this system > had some memory breathing room, other clusters we have will not run with this > additional demand on memory. > This was added as part of HDFS-10797 as a way of keeping track of INodes that > have already been accounted for - to avoid double counting. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11661) GetContentSummary uses excessive amounts of memory
[ https://issues.apache.org/jira/browse/HDFS-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977163#comment-15977163 ] Nathan Roberts commented on HDFS-11661: --- [~jojochuang], sure, occasional `du -s` of very large directory trees, think many 100s of millions of files/directories. > GetContentSummary uses excessive amounts of memory > -- > > Key: HDFS-11661 > URL: https://issues.apache.org/jira/browse/HDFS-11661 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.8.0, 3.0.0-alpha2 >Reporter: Nathan Roberts >Priority: Blocker > Attachments: Heap growth.png > > > ContentSummaryComputationContext::nodeIncluded() is being used to keep track > of all INodes visited during the current content summary calculation. This > can be all of the INodes in the filesystem, making for a VERY large hash > table. This simply won't work on large filesystems. > We noticed this after upgrading a namenode with ~100Million filesystem > objects was spending significantly more time in GC. Fortunately this system > had some memory breathing room, other clusters we have will not run with this > additional demand on memory. > This was added as part of HDFS-10797 as a way of keeping track of INodes that > have already been accounted for - to avoid double counting. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
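For context, the `du -s` mentioned in the comment boils down to a single getContentSummary call, answered entirely by the namenode walking the subtree; a minimal client sketch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DuSummary {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Equivalent of `hdfs dfs -du -s <dir>`; the namenode produces these
    // totals by traversing every INode under the directory.
    ContentSummary cs = fs.getContentSummary(new Path(args[0]));
    System.out.println("dirs=" + cs.getDirectoryCount()
        + " files=" + cs.getFileCount()
        + " bytes=" + cs.getLength());
  }
}
{code}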
[jira] [Commented] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice
[ https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973105#comment-15973105 ] Nathan Roberts commented on HDFS-10797: --- After deploying this to a cluster with a few hundred nodes, we have discovered that this jira has caused significant memory bloat in the namenode. Filed 2.8.1 blocker for this issue - HDFS-11661. > Disk usage summary of snapshots causes renamed blocks to get counted twice > -- > > Key: HDFS-10797 > URL: https://issues.apache.org/jira/browse/HDFS-10797 > Project: Hadoop HDFS > Issue Type: Bug > Components: snapshots >Affects Versions: 2.8.0 >Reporter: Sean Mackrory >Assignee: Sean Mackrory > Fix For: 2.8.0, 3.0.0-alpha2 > > Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, > HDFS-10797.003.patch, HDFS-10797.004.patch, HDFS-10797.005.patch, > HDFS-10797.006.patch, HDFS-10797.007.patch, HDFS-10797.008.patch, > HDFS-10797.009.patch, HDFS-10797.010.patch, HDFS-10797.010.patch > > > DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how > much disk usage is used by a snapshot by tallying up the files in the > snapshot that have since been deleted (that way it won't overlap with regular > files whose disk usage is computed separately). However that is determined > from a diff that shows moved (to Trash or otherwise) or renamed files as a > deletion and a creation operation that may overlap with the list of blocks. > Only the deletion operation is taken into consideration, and this causes > those blocks to get represented twice in the disk usage tallying. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-11661) GetContentSummary uses excessive amounts of memory
Nathan Roberts created HDFS-11661: - Summary: GetContentSummary uses excessive amounts of memory Key: HDFS-11661 URL: https://issues.apache.org/jira/browse/HDFS-11661 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.8.0 Reporter: Nathan Roberts Priority: Blocker ContentSummaryComputationContext::nodeIncluded() is being used to keep track of all INodes visited during the current content summary calculation. This can be all of the INodes in the filesystem, making for a VERY large hash table. This simply won't work on large filesystems. We noticed this after upgrading a namenode with ~100Million filesystem objects was spending significantly more time in GC. Fortunately this system had some memory breathing room, other clusters we have will not run with this additional demand on memory. This was added as part of HDFS-10797 as a way of keeping track of INodes that have already been accounted for - to avoid double counting. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-4660) Block corruption can happen during pipeline recovery
[ https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-4660: - Attachment: periodic_hflush.patch > Block corruption can happen during pipeline recovery > > > Key: HDFS-4660 > URL: https://issues.apache.org/jira/browse/HDFS-4660 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Affects Versions: 2.0.3-alpha, 3.0.0-alpha1 >Reporter: Peng Zhang >Assignee: Kihwal Lee >Priority: Blocker > Fix For: 2.7.1, 2.6.4 > > Attachments: HDFS-4660.br26.patch, HDFS-4660.patch, HDFS-4660.patch, > HDFS-4660.v2.patch, periodic_hflush.patch > > > pipeline DN1 DN2 DN3 > stop DN2 > pipeline added node DN4 located at 2nd position > DN1 DN4 DN3 > recover RBW > DN4 after recover rbw > 2013-04-01 21:02:31,570 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover > RBW replica > BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004 > 2013-04-01 21:02:31,570 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: > Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW > getNumBytes() = 134144 > getBytesOnDisk() = 134144 > getVisibleLength()= 134144 > end at chunk (134144/512=262) > DN3 after recover rbw > 2013-04-01 21:02:31,575 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover > RBW replica > BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_10042013-04-01 > 21:02:31,575 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: > Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW > getNumBytes() = 134028 > getBytesOnDisk() = 134028 > getVisibleLength()= 134028 > client send packet after recover pipeline > offset=133632 len=1008 > DN4 after flush > 2013-04-01 21:02:31,779 DEBUG > org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file > offset:134640; meta offset:1063 > // meta end position should be floor(134640/512)*4 + 7 == 1059, but now it is > 1063. > DN3 after flush > 2013-04-01 21:02:31,782 DEBUG > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: > BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, > type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, > lastPacketInBlock=false, offsetInBlock=134640, > ackEnqueueNanoTime=8817026136871545) > 2013-04-01 21:02:31,782 DEBUG > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing > meta file offset of block > BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from > 1055 to 1051 > 2013-04-01 21:02:31,782 DEBUG > org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file > offset:134640; meta offset:1059 > After checking meta on DN4, I found checksum of chunk 262 is duplicated, but > data not. > Later after block was finalized, DN4's scanner detected bad block, and then > reported it to NM. NM send a command to delete this block, and replicate this > block from other DN in pipeline to satisfy duplication num. > I think this is because in BlockReceiver it skips data bytes already written, > but not skips checksum bytes already written. And function > adjustCrcFilePosition is only used for last non-completed chunk, but > not for this situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery
[ https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437036#comment-15437036 ] Nathan Roberts commented on HDFS-4660: -- Hi [~yzhangal]. Had to go back to an old git stash, but I'll attach a sample patch to TeraOutputFormat. > Block corruption can happen during pipeline recovery > > > Key: HDFS-4660 > URL: https://issues.apache.org/jira/browse/HDFS-4660 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Affects Versions: 2.0.3-alpha, 3.0.0-alpha1 >Reporter: Peng Zhang >Assignee: Kihwal Lee >Priority: Blocker > Fix For: 2.7.1, 2.6.4 > > Attachments: HDFS-4660.br26.patch, HDFS-4660.patch, HDFS-4660.patch, > HDFS-4660.v2.patch > > > pipeline DN1 DN2 DN3 > stop DN2 > pipeline added node DN4 located at 2nd position > DN1 DN4 DN3 > recover RBW > DN4 after recover rbw > 2013-04-01 21:02:31,570 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover > RBW replica > BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004 > 2013-04-01 21:02:31,570 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: > Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW > getNumBytes() = 134144 > getBytesOnDisk() = 134144 > getVisibleLength()= 134144 > end at chunk (134144/512=262) > DN3 after recover rbw > 2013-04-01 21:02:31,575 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover > RBW replica > BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_10042013-04-01 > 21:02:31,575 INFO > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: > Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW > getNumBytes() = 134028 > getBytesOnDisk() = 134028 > getVisibleLength()= 134028 > client send packet after recover pipeline > offset=133632 len=1008 > DN4 after flush > 2013-04-01 21:02:31,779 DEBUG > org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file > offset:134640; meta offset:1063 > // meta end position should be floor(134640/512)*4 + 7 == 1059, but now it is > 1063. > DN3 after flush > 2013-04-01 21:02:31,782 DEBUG > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: > BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, > type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, > lastPacketInBlock=false, offsetInBlock=134640, > ackEnqueueNanoTime=8817026136871545) > 2013-04-01 21:02:31,782 DEBUG > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing > meta file offset of block > BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from > 1055 to 1051 > 2013-04-01 21:02:31,782 DEBUG > org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file > offset:134640; meta offset:1059 > After checking meta on DN4, I found checksum of chunk 262 is duplicated, but > data not. > Later after block was finalized, DN4's scanner detected bad block, and then > reported it to NM. NM send a command to delete this block, and replicate this > block from other DN in pipeline to satisfy duplication num. > I think this is because in BlockReceiver it skips data bytes already written, > but not skips checksum bytes already written. And function > adjustCrcFilePosition is only used for last non-completed chunk, but > not for this situation. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218470#comment-15218470 ] Nathan Roberts commented on HDFS-9239: -- bq. Just to make sure I'm clear, are you talking about configuring the deadline scheduler as described here? Yes, those links are talking about the right parameters. We currently run with read_expire=1000, write_expire=1000, and writes_starved=1. Since our I/O workloads change dramatically over time, we didn't spend a lot of time looking for optimal values here. These have been working well for the last several months across multiple clusters. As an aside, a relatively easy way to reproduce this problem, is to put a heavy seek load on all the disks of a datanode (e.g. http://www.linuxinsight.com/how_fast_is_your_disk.html, I believe 5-10 copies of seeker were sufficient.) After a minute or so, system becomes almost unusable and datanode will be declared lost. This might be a good test to run against the lifeline protocol. My hunch is, with CFQ, the datanode will still be lost. > DataNode Lifeline Protocol: an alternative protocol for reporting DataNode > liveness > --- > > Key: HDFS-9239 > URL: https://issues.apache.org/jira/browse/HDFS-9239 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Reporter: Chris Nauroth >Assignee: Chris Nauroth > Fix For: 2.8.0 > > Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, > HDFS-9239.002.patch, HDFS-9239.003.patch > > > This issue proposes introduction of a new feature: the DataNode Lifeline > Protocol. This is an RPC protocol that is responsible for reporting liveness > and basic health information about a DataNode to a NameNode. Compared to the > existing heartbeat messages, it is lightweight and not prone to resource > contention problems that can harm accurate tracking of DataNode liveness > currently. The attached design document contains more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
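The deadline scheduler parameters mentioned above (read_expire, write_expire, writes_starved) are per-device sysfs tunables. As a sketch of how the quoted values could be applied on a RHEL-era Linux box (the device name sdb is illustrative, and the values are the ones from the comment rather than a general recommendation):
{noformat}
# The device must already be using the deadline elevator for these files to exist.
echo 1000 > /sys/block/sdb/queue/iosched/read_expire    # ms before a queued read expires
echo 1000 > /sys/block/sdb/queue/iosched/write_expire   # ms before a queued write expires
echo 1    > /sys/block/sdb/queue/iosched/writes_starved # read batches allowed before writes must be served
{noformat}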
[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness
[ https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218302#comment-15218302 ] Nathan Roberts commented on HDFS-9239: -- bq. However,making it lighter on the datanode side is a good idea. We have seen many cases where nodes are declared dead because the service actor thread is delayed/blocked. Just a quick update on this comment. Even after HDFS-7060 we still had cases where Datanodes would fail to heartbeat in. We eventually tracked this down to the RHEL CFQ I/O scheduler. There are situations where significant seek activity (like a massive shuffle) can cause this I/O scheduler to indefinitely starve writers. This eventually causes the datanode and/or nodemanager processes to completely stop (probably due to logging I/O backing up). So, no matter how smart we make these daemons, they are going to be lost from the NN/RM point of view in these situations. But, this is actually probably the right thing to do in these cases, these daemons are clearly not able to do their job so SHOULD be declared lost. In any event, the change which we found most valuable for this situation was to use the deadline I/O scheduler. This dramatically improved the number of lost datanodes and nodemanagers we were seeing. > DataNode Lifeline Protocol: an alternative protocol for reporting DataNode > liveness > --- > > Key: HDFS-9239 > URL: https://issues.apache.org/jira/browse/HDFS-9239 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, namenode >Reporter: Chris Nauroth >Assignee: Chris Nauroth > Fix For: 2.8.0 > > Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, > HDFS-9239.002.patch, HDFS-9239.003.patch > > > This issue proposes introduction of a new feature: the DataNode Lifeline > Protocol. This is an RPC protocol that is responsible for reporting liveness > and basic health information about a DataNode to a NameNode. Compared to the > existing heartbeat messages, it is lightweight and not prone to resource > contention problems that can harm accurate tracking of DataNode liveness > currently. The attached design document contains more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
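As a concrete illustration of the scheduler change described above, the elevator is a per-device setting that can be switched at runtime; a hedged example (persisting the choice across reboots via the elevator= boot parameter or a udev rule is distribution-specific and not shown):
{noformat}
# Show the active elevator for each disk (the bracketed entry is the one in use).
cat /sys/block/sd*/queue/scheduler

# Switch a single device from cfq to deadline at runtime.
echo deadline > /sys/block/sdb/queue/scheduler
{noformat}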
[jira] [Updated] (HDFS-4946) Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable
[ https://issues.apache.org/jira/browse/HDFS-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-4946: - Affects Version/s: 3.0.0 2.7.2 Target Version/s: 3.0.0, 2.8.0 Status: Patch Available (was: Reopened) > Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable > --- > > Key: HDFS-4946 > URL: https://issues.apache.org/jira/browse/HDFS-4946 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.7.2, 2.0.0-alpha, 3.0.0 >Reporter: James Kinley >Assignee: James Kinley > Attachments: HDFS-4946-1.patch, HDFS-4946-2.patch > > > Allow preferLocalNode in BlockPlacementPolicyDefault to be disabled in > configuration to prevent a client from writing the first replica of every > block (i.e. the entire file) to the local DataNode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (HDFS-4946) Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable
[ https://issues.apache.org/jira/browse/HDFS-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts reopened HDFS-4946: -- [~jrkinley], re-opening because this is a very useful patch. Let me know if you disagree or would like me to assign it to myself to close out any remaining issues. > Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable > --- > > Key: HDFS-4946 > URL: https://issues.apache.org/jira/browse/HDFS-4946 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: James Kinley >Assignee: James Kinley > Attachments: HDFS-4946-1.patch, HDFS-4946-2.patch > > > Allow preferLocalNode in BlockPlacementPolicyDefault to be disabled in > configuration to prevent a client from writing the first replica of every > block (i.e. the entire file) to the local DataNode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-4946) Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable
[ https://issues.apache.org/jira/browse/HDFS-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-4946: - Attachment: HDFS-4946-2.patch Uploaded a new version of this patch for trunk. We have found this config to be extremely useful across many large clusters. It avoids hot-spots for large files that can be quite problematic during localization and/or task scheduling. Hopefully folks will be agreeable to this simple config option. > Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable > --- > > Key: HDFS-4946 > URL: https://issues.apache.org/jira/browse/HDFS-4946 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.0.0-alpha >Reporter: James Kinley >Assignee: James Kinley > Attachments: HDFS-4946-1.patch, HDFS-4946-2.patch > > > Allow preferLocalNode in BlockPlacementPolicyDefault to be disabled in > configuration to prevent a client from writing the first replica of every > block (i.e. the entire file) to the local DataNode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018828#comment-15018828 ] Nathan Roberts commented on HDFS-7060: -- Has anyone given any further thought on this patch? It seems safe to me and it eliminates serious stability issues when datanodes' disks get very busy. _getUsed()_ returns a value that is calculated by a DU process that ran for a long time anyway, (so it's always somewhat out-of-sync). It's not very difficult to load up a datanode with disk-intensive tasks that prevent the datanode from getting a heartbeat in for several minutes, eventually being declared dead by the NN. We've seen this take out entire clusters with large Map/Reduce merges, as well as very large shuffles. > Avoid taking locks when sending heartbeats from the DataNode > > > Key: HDFS-7060 > URL: https://issues.apache.org/jira/browse/HDFS-7060 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haohui Mai >Assignee: Xinwei Qin > Labels: BB2015-05-TBR > Attachments: HDFS-7060-002.patch, HDFS-7060.000.patch, > HDFS-7060.001.patch > > > We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} > when the DN is under heavy load of writes: > {noformat} >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91) > - locked <0x000780612fd8> (a java.lang.Object) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743) > - waiting to lock <0x000780304fb8> (a > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) >java.lang.Thread.State: RUNNABLE > at java.io.UnixFileSystem.createFileExclusively(Native Method) > at java.io.File.createNewFile(File.java:1006) > at > org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753) > - locked <0x000780304fb8> (a > 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) > at java.lang.Thread.run(Thread.java:744) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
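The argument above is that the used-space number in a heartbeat already comes from a periodically refreshed du, so reading it should not need the dataset lock that writers hold. Below is a minimal, self-contained sketch of that idea only; it is not the HDFS-7060 patch, and all names in it are illustrative:
{noformat}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: publish an already-stale usage figure without holding the dataset lock. */
public class CachedUsageReporter {
  private final AtomicLong cachedDfsUsed = new AtomicLong(0);
  private final ScheduledExecutorService refresher =
      Executors.newSingleThreadScheduledExecutor();

  public CachedUsageReporter(long refreshIntervalSec) {
    // The expensive scan (a du over the volume in the real datanode) runs here,
    // off the heartbeat path; only the result is published atomically.
    refresher.scheduleWithFixedDelay(
        () -> cachedDfsUsed.set(expensiveScanOfVolume()),
        0, refreshIntervalSec, TimeUnit.SECONDS);
  }

  /** Heartbeat path: a lock-free read of the last published value. */
  public long getDfsUsed() {
    return cachedDfsUsed.get();
  }

  private long expensiveScanOfVolume() {
    return 42L; // placeholder for the real volume scan
  }
}
{noformat}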
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15014463#comment-15014463 ] Nathan Roberts commented on HDFS-8791: -- Thanks [~ctrezzo] for the patch! Nice writeup on the verification/performance measurements. +1 (non-binding) on the patch. It's nice how concise it was able to be. > block ID-based DN storage layout can be very slow for datanode on ext4 > -- > > Key: HDFS-8791 > URL: https://issues.apache.org/jira/browse/HDFS-8791 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Affects Versions: 2.6.0, 2.8.0, 2.7.1 >Reporter: Nathan Roberts >Assignee: Chris Trezzo >Priority: Critical > Attachments: 32x32DatanodeLayoutTesting-v1.pdf, > HDFS-8791-trunk-v1.patch > > > We are seeing cases where the new directory layout causes the datanode to > basically cause the disks to seek for 10s of minutes. This can be when the > datanode is running du, and it can also be when it is performing a > checkDirs(). Both of these operations currently scan all directories in the > block pool and that's very expensive in the new layout. > The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K > leaf directories where block files are placed. > So, what we have on disk is: > - 256 inodes for the first level directories > - 256 directory blocks for the first level directories > - 256*256 inodes for the second level directories > - 256*256 directory blocks for the second level directories > - Then the inodes and blocks to store the the HDFS blocks themselves. > The main problem is the 256*256 directory blocks. > inodes and dentries will be cached by linux and one can configure how likely > the system is to prune those entries (vfs_cache_pressure). However, ext4 > relies on the buffer cache to cache the directory blocks and I'm not aware of > any way to tell linux to favor buffer cache pages (even if it did I'm not > sure I would want it to in general). > Also, ext4 tries hard to spread directories evenly across the entire volume, > this basically means the 64K directory blocks are probably randomly spread > across the entire disk. A du type scan will look at directories one at a > time, so the ioscheduler can't optimize the corresponding seeks, meaning the > seeks will be random and far. > In a system I was using to diagnose this, I had 60K blocks. A DU when things > are hot is less than 1 second. When things are cold, about 20 minutes. > How do things get cold? > - A large set of tasks run on the node. This pushes almost all of the buffer > cache out, causing the next DU to hit this situation. We are seeing cases > where a large job can cause a seek storm across the entire cluster. > Why didn't the previous layout see this? > - It might have but it wasn't nearly as pronounced. The previous layout would > be a few hundred directory blocks. Even when completely cold, these would > only take a few a hundred seeks which would mean single digit seconds. > - With only a few hundred directories, the odds of the directory blocks > getting modified is quite high, this keeps those blocks hot and much less > likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
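For context on where the 256*256 number in the description comes from: the block-ID-based layout derives both subdirectory levels from bits of the block ID, roughly as sketched below (a paraphrase for illustration, not the exact DatanodeUtil code; the fix on this JIRA reduces the fan-out):
{noformat}
import java.io.File;

/** Paraphrased sketch of the 2-level, block-ID-based layout (256 x 256 leaf dirs). */
public class BlockDirLayout {
  static File idToBlockDir(File bpRoot, long blockId) {
    int d1 = (int) ((blockId >> 16) & 0xFF); // first-level subdir, 0..255
    int d2 = (int) ((blockId >> 8) & 0xFF);  // second-level subdir, 0..255
    return new File(bpRoot, "subdir" + d1 + File.separator + "subdir" + d2);
  }

  public static void main(String[] args) {
    System.out.println(idToBlockDir(new File("/data/1/hdfs/current"), 1073741915L));
  }
}
{noformat}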
[jira] [Commented] (HDFS-8873) throttle directoryScanner
[ https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906603#comment-14906603 ] Nathan Roberts commented on HDFS-8873: -- Thanks [~templedf]. I like that the stopwatch class makes this much cleaner. Just a couple of comments: - Shouldn't the isInterrupted() check throw an InterruptedException? Otherwise won't we just break out of one level? It would probably be good to test shutdown on an actual cluster if possible because you're exactly right that we could be in here a long time and it would be good to make sure we don't affect shutdown of the datanode. This has been a problem in the past and can have a serious impact on rolling upgrades. - nit but I find markRunning() and markWaiting() confusing (seem backwards to me because we call markRunning() just before going to sleep). - I'm kind of wondering if we should disallow extremely low duty cycles. Seems like it could take close to 24 hours with a minimum setting. A minimum of 20% should keep us within an hour. > throttle directoryScanner > - > > Key: HDFS-8873 > URL: https://issues.apache.org/jira/browse/HDFS-8873 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.7.1 >Reporter: Nathan Roberts >Assignee: Daniel Templeton > Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, > HDFS-8873.003.patch, HDFS-8873.004.patch, HDFS-8873.005.patch, > HDFS-8873.006.patch, HDFS-8873.007.patch, HDFS-8873.008.patch > > > The new 2-level directory layout can make directory scans expensive in terms > of disk seeks (see HDFS-8791) for details. > It would be good if the directoryScanner() had a configurable duty cycle that > would reduce its impact on disk performance (much like the approach in > HDFS-8617). > Without such a throttle, disks can go 100% busy for many minutes at a time > (assuming the common case of all inodes in cache but no directory blocks > cached, 64K seeks are required for full directory listing which translates to > 655 seconds) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8873) throttle directoryScanner
[ https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906919#comment-14906919 ] Nathan Roberts commented on HDFS-8873: -- Thanks [~templedf] for the update! I'm +1 (non-binding) for v9 of the patch. > throttle directoryScanner > - > > Key: HDFS-8873 > URL: https://issues.apache.org/jira/browse/HDFS-8873 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.7.1 >Reporter: Nathan Roberts >Assignee: Daniel Templeton > Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, > HDFS-8873.003.patch, HDFS-8873.004.patch, HDFS-8873.005.patch, > HDFS-8873.006.patch, HDFS-8873.007.patch, HDFS-8873.008.patch, > HDFS-8873.009.patch > > > The new 2-level directory layout can make directory scans expensive in terms > of disk seeks (see HDFS-8791) for details. > It would be good if the directoryScanner() had a configurable duty cycle that > would reduce its impact on disk performance (much like the approach in > HDFS-8617). > Without such a throttle, disks can go 100% busy for many minutes at a time > (assuming the common case of all inodes in cache but no directory blocks > cached, 64K seeks are required for full directory listing which translates to > 655 seconds) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8873) throttle directoryScanner
[ https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904605#comment-14904605 ] Nathan Roberts commented on HDFS-8873: -- Thanks [~templedf] for the update. I am sorry I haven't had a chance to review yet. I plan to do this Thursday AM. > throttle directoryScanner > - > > Key: HDFS-8873 > URL: https://issues.apache.org/jira/browse/HDFS-8873 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.7.1 >Reporter: Nathan Roberts >Assignee: Daniel Templeton > Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, > HDFS-8873.003.patch, HDFS-8873.004.patch, HDFS-8873.005.patch > > > The new 2-level directory layout can make directory scans expensive in terms > of disk seeks (see HDFS-8791) for details. > It would be good if the directoryScanner() had a configurable duty cycle that > would reduce its impact on disk performance (much like the approach in > HDFS-8617). > Without such a throttle, disks can go 100% busy for many minutes at a time > (assuming the common case of all inodes in cache but no directory blocks > cached, 64K seeks are required for full directory listing which translates to > 655 seconds) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8873) throttle directoryScanner
[ https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876736#comment-14876736 ] Nathan Roberts commented on HDFS-8873: -- bq. Sure. It won't matter since stop() is only called by shutdown(), which first sets shouldRunCompile to false. But for correctness, you're right. I was looking at v3 of the patch; I see you already fixed this in v4. Sorry for the noise. bq. The majority of the patch is refactoring the report compilers so that they can be throttled at all. The additional code to do the throttling isn't much. It's more formal than just a sleep, but it's also more testable and extensible. OK, I'll have to look at that a little deeper. I thought we were basically hitting FileUtil.listFiles(dir) really quickly in the original code, so it felt like the simplest thing to do would be a variable sleep right there based on the configured duty cycle. I need to look more into how the scanjob queue is working. It seems like all the worker threads could be working on the same volume, which doesn't seem like what we want. (It seems like we want every volume spending its duty cycle scanning, but I didn't catch how that was the case.) > throttle directoryScanner > - > > Key: HDFS-8873 > URL: https://issues.apache.org/jira/browse/HDFS-8873 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.7.1 >Reporter: Nathan Roberts >Assignee: Daniel Templeton > Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, > HDFS-8873.003.patch, HDFS-8873.004.patch > > > The new 2-level directory layout can make directory scans expensive in terms > of disk seeks (see HDFS-8791) for details. > It would be good if the directoryScanner() had a configurable duty cycle that > would reduce its impact on disk performance (much like the approach in > HDFS-8617). > Without such a throttle, disks can go 100% busy for many minutes at a time > (assuming the common case of all inodes in cache but no directory blocks > cached, 64K seeks are required for full directory listing which translates to > 655 seconds) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8873) throttle directoryScanner
[ https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876500#comment-14876500 ] Nathan Roberts commented on HDFS-8873: -- Thanks [~templedf] for the patch. A few comments. hdfs-default.xml - I personally would prefer the default to be 1000. In my mind 0 is a special out-of-range condition that we're allowing to mean "full rate". Just reading the default of 0 and then the first sentence of the description could easily lead one to believe the report threads are effectively off by default. Test - Any way to avoid the sleep(5000) in the test? Our tests already take a really long time, so any time we can avoid sleeping, it's better. Maybe wait at most 5 seconds for timeWaitingMs.get() to become > 0. directoryScannerThrottle - Shouldn't stop() call resume() instead of just notifyAll()? Will cycle() get out if we try to shut down while in that wait? - Did we hit this problem with too big of a hammer? Couldn't cycle() be implemented with a simple sleep? For example, with a 75% duty cycle, {noformat} n = Time.monotonicNow() % 1000; if (n > 1000 * 0.75) sleep(1000 - n); {noformat} Seems like it could be as simple as a config and a couple of lines of code. Maybe I'm missing something or there are grander plans for the throttle. > throttle directoryScanner > - > > Key: HDFS-8873 > URL: https://issues.apache.org/jira/browse/HDFS-8873 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.7.1 >Reporter: Nathan Roberts >Assignee: Daniel Templeton > Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, > HDFS-8873.003.patch, HDFS-8873.004.patch > > > The new 2-level directory layout can make directory scans expensive in terms > of disk seeks (see HDFS-8791) for details. > It would be good if the directoryScanner() had a configurable duty cycle that > would reduce its impact on disk performance (much like the approach in > HDFS-8617). > Without such a throttle, disks can go 100% busy for many minutes at a time > (assuming the common case of all inodes in cache but no directory blocks > cached, 64K seeks are required for full directory listing which translates to > 655 seconds) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
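A runnable expansion of the sleep-based sketch in the comment above, assuming a fixed 1-second period and a configurable duty cycle. It is meant only to show that the throttle can be a few lines; it does not mirror the implementation that was ultimately committed for HDFS-8873:
{noformat}
/** Sketch: hold the scan thread during the "off" portion of each 1-second period. */
public class DutyCycleThrottle {
  private final long runMsPerSecond; // e.g. 750 for a 75% duty cycle

  public DutyCycleThrottle(int dutyCyclePercent) {
    this.runMsPerSecond = 1000L * dutyCyclePercent / 100;
  }

  /** Call between units of scan work, e.g. between directory listings. */
  public void throttle() throws InterruptedException {
    long posInPeriod = (System.nanoTime() / 1_000_000) % 1000; // monotonic ms within the period
    if (posInPeriod >= runMsPerSecond) {
      Thread.sleep(1000 - posInPeriod); // sleep out the remainder of this period
    }
  }

  public static void main(String[] args) throws InterruptedException {
    DutyCycleThrottle throttle = new DutyCycleThrottle(75);
    for (int i = 0; i < 1000; i++) {
      throttle.throttle();
      // ... one slice of scan work would go here ...
    }
  }
}
{noformat}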
[jira] [Created] (HDFS-8894) Set SO_KEEPALIVE on DN server sockets
Nathan Roberts created HDFS-8894: Summary: Set SO_KEEPALIVE on DN server sockets Key: HDFS-8894 URL: https://issues.apache.org/jira/browse/HDFS-8894 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.1 Reporter: Nathan Roberts SO_KEEPALIVE is not set on things like datastreamer sockets which can cause lingering ESTABLISHED sockets when there is a network glitch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
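A minimal illustration of the socket option this JIRA asks for: with SO_KEEPALIVE enabled, the kernel probes idle connections and eventually closes them if the peer has disappeared, instead of leaving them ESTABLISHED indefinitely. Where exactly the option should be applied inside the datanode is the subject of the JIRA; the snippet shows only the option itself, and the host and port are placeholders:
{noformat}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class KeepAliveExample {
  public static void main(String[] args) throws IOException {
    Socket s = new Socket();
    s.setKeepAlive(true); // SO_KEEPALIVE: let the kernel detect dead peers
    s.connect(new InetSocketAddress("example.com", 80), 5000);
    System.out.println("keepalive=" + s.getKeepAlive());
    s.close();
  }
}
{noformat}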
[jira] [Created] (HDFS-8873) throttle directoryScanner
Nathan Roberts created HDFS-8873: Summary: throttle directoryScanner Key: HDFS-8873 URL: https://issues.apache.org/jira/browse/HDFS-8873 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.1 Reporter: Nathan Roberts The new 2-level directory layout can make directory scans expensive in terms of disk seeks (see HDFS-8791) for details. It would be good if the directoryScanner() had a configurable duty cycle that would reduce its impact on disk performance (much like the approach in HDFS-8617). Without such a throttle, disks can go 100% busy for many minutes at a time (assuming the common case of all inodes in cache but no directory blocks cached, 64K seeks are required for full directory listing which translates to 655 seconds) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
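The 655-second figure in the description follows from the seek count and a typical spinning-disk random seek time; the ~10 ms/seek number is an assumption implied by the arithmetic rather than stated in the JIRA:
{noformat}
65,536 cold directory reads x ~10 ms per random seek ~= 655 seconds per full scan
{noformat}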
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654426#comment-14654426 ] Nathan Roberts commented on HDFS-8791: -- My preference would be to take a smaller incremental step. How about: - New layout where n x m levels are configurable (today 256x256) - n x m is recorded in version file - Upgrade path is taken if configured n x m is different from n x m in VERSION file Seems like most of the code will work without too much modification (and the risk that comes with it). I fear if we try to take too much of a step at this point, it will take significant time to settle on the new layout, and then it will end up being either extremely close to what we have now OR it will be radically different and require a lot of investment of time and resources to even get there. In other words, I think we need a short term layout change that is low-risk and quick to integrate. block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the the HDFS blocks themselves. The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few a hundred seeks which would mean single digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified is quite high, this keeps those blocks hot and much less likely to be evicted. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
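A sketch of what the "n x m levels are configurable" proposal above could look like: the shifts of the current layout, generalized to configurable fan-outs that a datanode would record in its VERSION file so that a mismatch triggers the upgrade path. The class and the modulo arithmetic are assumptions of this sketch, not part of the proposal text:
{noformat}
import java.io.File;

/** Sketch: block-ID-based layout with a configurable fan-out per level. */
public class ConfigurableBlockDirLayout {
  private final int level1Dirs; // e.g. 32 instead of 256
  private final int level2Dirs;

  ConfigurableBlockDirLayout(int level1Dirs, int level2Dirs) {
    this.level1Dirs = level1Dirs;
    this.level2Dirs = level2Dirs;
  }

  File idToBlockDir(File bpRoot, long blockId) {
    // floorMod keeps the result non-negative even for negative block IDs.
    int d1 = (int) Math.floorMod(blockId >> 16, (long) level1Dirs);
    int d2 = (int) Math.floorMod(blockId >> 8, (long) level2Dirs);
    return new File(bpRoot, "subdir" + d1 + File.separator + "subdir" + d2);
  }

  public static void main(String[] args) {
    // 32 x 32 = 1,024 leaf directories instead of 65,536.
    ConfigurableBlockDirLayout layout = new ConfigurableBlockDirLayout(32, 32);
    System.out.println(layout.idToBlockDir(new File("/data/1/hdfs/current"), -9076133543772600337L));
  }
}
{noformat}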
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653974#comment-14653974 ] Nathan Roberts commented on HDFS-8791: -- Curious what folks would think about going back to previous layout? I understand there was some benefit to the new layout but maybe there are nearly equivalent and less-intrusive ways to achieve the same benefits. I'm confident the current layout is going to cause significant performance issues for HDFS, and latency sensitive applications (e.g. Hbase) are going to feel this in a big way. block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the the HDFS blocks themselves. The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few a hundred seeks which would mean single digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified is quite high, this keeps those blocks hot and much less likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-6407) new namenode UI, lost ability to sort columns in datanode tab
[ https://issues.apache.org/jira/browse/HDFS-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635299#comment-14635299 ] Nathan Roberts commented on HDFS-6407: -- My understanding is that the Legacy UI was removed in 2.7. With the legacy UI gone, we've lost very valuable functionality. I use the sort capability all of the time to do things like: find nodes running different versions during a rolling upgrade, evaluate how the balancer is doing by sorting on capacity, find very full nodes to see how their disks are performing, sort on Admin state to find all decommissioning nodes. I don't think it's a blocker for a release, but a loss of commonly used functionality can be very annoying for users. new namenode UI, lost ability to sort columns in datanode tab - Key: HDFS-6407 URL: https://issues.apache.org/jira/browse/HDFS-6407 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Nathan Roberts Assignee: Benoy Antony Priority: Minor Labels: BB2015-05-TBR Attachments: 002-datanodes-sorted-capacityUsed.png, 002-datanodes.png, 002-filebrowser.png, 002-snapshots.png, HDFS-6407-002.patch, HDFS-6407-003.patch, HDFS-6407.patch, browse_directory.png, datanodes.png, snapshots.png The old UI supported clicking on a column header to sort on that column. The new UI seems to have dropped this very useful feature. There are a few tables in the Namenode UI to display datanode information, directory listings and snapshots. When there are many items in the tables, it is useful to have the ability to sort on the different columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635908#comment-14635908 ] Nathan Roberts commented on HDFS-8791: -- bq. I'm having trouble understanding these kernel settings. http://www.gluster.org/community/documentation/index.php/Linux_Kernel_Tuning says that When vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes. So that would seem to indicate that vfs_cache_pressure does have control over dentries (i.e. the directory blocks which contain the list of child inodes). What settings have you used for vfs_cache_pressure so far? Not a linux filesystem expert, but here's where I think the confusion is: - inodes are cached in ext4_inode slab - dentries are cached in dentry slab - directory blocks are cached in the buffer cache - lookups (e.g. stat /subdir1/subdir2/blk_0) can be satisfied with the dentry+inode cache - readdir cannot be satisfied by the dentry cache, it needs to see the blocks from the disk (hence the buffer cache) I can somewhat protect the inode+dentry by setting vfs_cache_pressure to 1 (setting to 0 can be very bad because negative dentries can fill up your entire memory, I think). I tried setting vfs_cache_pressure to 0, and it didn't seem to help the case we are seeing. I used blktrace to capture what was happening when a node was doing this. I then dumped the raw data at the offsets captured by blktrace. The data showed that the seeks were all the result of reading directory blocks, not inodes. bq. I think if we're going to change the on-disk layout format again, we should change the way we name meta files. Currently, we encode the genstamp in the file name, like blk_1073741915_1091.meta. This means that to look up the meta file for block 1073741915, we have to iterate through every file in the subdirectory until we find it. Instead, we could simply name the meta file as blk_107374191.meta and put the genstamp number in the meta file header. This would allow us to move to a scheme which had a very large number of blocks in each directory (perhaps a simple 1-level hashing scheme) and the dentries would always be hot. ext4 and other modern Linux filesystems deal very effectively with large directories-- it's only ext2 and ext3 without certain options enabled that had problems. I'm a little confused about iterating to find the meta file. Don't we already keep track of the genstamp we discovered during startup? If so, it seems like a simple stat is sufficient. I haven't tried xfs, but that would also be a REALLY heavy hammer in our case;) block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. 
So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the the HDFS blocks themselves. The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20
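For anyone repeating the experiments described above, the knobs and observation points mentioned are standard Linux interfaces; an illustrative session (the values are the ones discussed in the comment, not recommendations, and everything here needs root):
{noformat}
# Bias the kernel toward keeping dentries/inodes (0 can exhaust memory via negative dentries).
sysctl -w vm.vfs_cache_pressure=1

# The dentry and ext4 inode slabs referred to in the comment.
grep -E 'dentry|ext4_inode_cache' /proc/slabinfo

# Directory blocks live in the page/buffer cache, not in a slab; dropping the
# page cache forces the cold-read behaviour being measured.
sync; echo 1 > /proc/sys/vm/drop_caches
{noformat}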
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633669#comment-14633669 ] Nathan Roberts commented on HDFS-8791: -- Hi [~cmccabe]. Thanks for the idea. Yes, I had actually tried something like that. I actually just kept a loop of DU's running on the node (outside of the datanode process for simplicity sake). I thought this would prevent it from happening but it turns out it still gets into this situation. I suspect the reason is that when there is memory pressure, it will start to seek a little, and then once it starts to seek a little the system quickly degrades because buffers are being thrown away faster than the disks can seek. block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the the HDFS blocks themselves. The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few a hundred seeks which would mean single digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified is quite high, this keeps those blocks hot and much less likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
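The "loop of DU's running on the node" described above (an attempt to keep the directory metadata warm from outside the datanode process) amounts to something like the following; the paths and the interval are illustrative:
{noformat}
# Periodically walk the block pool directories so their metadata stays cached.
while true; do
  du -sk /data/*/hdfs/current > /dev/null 2>&1
  sleep 300
done
{noformat}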
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633804#comment-14633804 ] Nathan Roberts commented on HDFS-8791: -- I agree we should optimize all the potential scans (du, checkDirs, directoryScanner, etc.). I also think we need to do something more general because I feel like people will trip on this in all sorts of ways. Even tools outside of the DN process that do periodic scans will be affected and will in turn adversely affect the datanode's performance. Also, it's hard to see this problem until you're running at scale, so it will be difficult to catch jiras that introduce yet another scan, because they run really fast when everything is in memory. I'm wondering if we shouldn't move to a hashing scheme that is more dynamic and grows/shrinks based on the number of blocks in the volume. A consistent hash to minimize renames, plus some logic that knows how to look in two places (old hash, new hash), seems like it might work. We could set a threshold of an average of 100 blocks per directory; when we cross that threshold, we add enough subdirs to bring the average down to 95. I think ext2 and ext3 will see a similar problem. Are you seeing something different? I'll admit that my understanding of the differences isn't exhaustive, but it sure seems like all of them rely on the buffer cache to maintain directory blocks and all of them try to spread directories across the disk, so they'd all be subject to the same sort of thing. block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the the HDFS blocks themselves. The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. 
This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few a hundred seeks which would mean single digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified is quite high, this keeps those blocks hot and much less likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
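To make the growth rule in the comment above concrete, this is the arithmetic it implies, written as a small self-contained sketch (the 100/95 numbers are the ones in the comment; the class and method names are made up for illustration):
{noformat}
/** Sketch of the proposed grow-when-too-dense rule; not an actual HDFS change. */
public class DirGrowthRule {
  static final int GROW_THRESHOLD = 100; // average blocks/dir that triggers growth
  static final int GROW_TARGET = 95;     // average blocks/dir to aim for afterwards

  /** Returns the directory count to use after applying the rule. */
  static long directoriesNeeded(long blockCount, long currentDirs) {
    double avg = (double) blockCount / currentDirs;
    if (avg <= GROW_THRESHOLD) {
      return currentDirs; // still dense enough; no change
    }
    return (long) Math.ceil((double) blockCount / GROW_TARGET); // bring the average down to ~95
  }

  public static void main(String[] args) {
    // e.g. 200,000 blocks over 1,024 dirs (avg ~195) -> grow to 2,106 dirs (avg ~95)
    System.out.println(directoriesNeeded(200_000, 1_024));
  }
}
{noformat}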
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633675#comment-14633675 ] Nathan Roberts commented on HDFS-8791: -- I forgot to mention that I'm pretty confident it's not the inodes, but rather the directory blocks. inodes have their own cache that I can control with vfs_cache_pressure. directory blocks however are just cached via the buffer cache (afaik), and the buffer cache is much more difficult to have any control over. block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the the HDFS blocks themselves. The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few a hundred seeks which would mean single digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified is quite high, this keeps those blocks hot and much less likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631334#comment-14631334 ] Nathan Roberts commented on HDFS-8791: -- Sure. Randomly sampled node had 4 4TB drives, each had right around 38250 directories with at least one block, so 64K-38250 were empty. The drives were about 80% full. I see how there might be an optimization there, but I think we need to find a way to solve it for the more general case. Either the DN must never scan (or at least scan at a rate that will not be intrusive), or maybe we should reconsider the 64K breadth - a small number of files per directory is probably going to cause performance issues on many filesystems. block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.1 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the the HDFS blocks themselves. The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few a hundred seeks which would mean single digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified is quite high, this keeps those blocks hot and much less likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
Nathan Roberts created HDFS-8791: Summary: block ID-based DN storage layout can be very slow for datanode on ext4 Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.1 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout causes the datanode to basically cause the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the the HDFS blocks themselves. The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few a hundred seeks which would mean single digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified is quite high, this keeps those blocks hot and much less likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630493#comment-14630493 ] Nathan Roberts commented on HDFS-8791: -- For reference, the stack trace when du is obviously blocking on disk I/O:
{noformat}
[811bf1a0] sync_buffer+0x40/0x50
[811bf156] __wait_on_buffer+0x26/0x30
[a02fd9a4] ext4_bread+0x64/0x80 [ext4]
[a0302aa8] htree_dirblock_to_tree+0x38/0x190 [ext4]
[a0303548] ext4_htree_fill_tree+0xa8/0x260 [ext4]
[a02f43c7] ext4_readdir+0x127/0x700 [ext4]
[8119f030] vfs_readdir+0xc0/0xe0
[8119f1b9] sys_getdents+0x89/0xf0
[8100b072] system_call_fastpath+0x16/0x1b
{noformat}
block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.1 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout causes the datanode to keep the disks seeking for tens of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool, and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs - essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first-level directories - 256 directory blocks for the first-level directories - 256*256 inodes for the second-level directories - 256*256 directory blocks for the second-level directories - Then the inodes and blocks to store the HDFS blocks themselves. The main problem is the 256*256 directory blocks. Inodes and dentries will be cached by Linux, and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks, and I'm not aware of any way to tell Linux to favor buffer cache pages (even if there were, I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume, so the 64K directory blocks are probably randomly spread across the entire disk. A du-type scan will look at directories one at a time, so the I/O scheduler can't optimize the corresponding seeks, meaning the seeks will be random and far apart. In a system I was using to diagnose this, I had 60K blocks. A du when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next du to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have, but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few hundred seeks, which would mean single-digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified are quite high, which keeps those blocks hot and much less likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
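For readers unfamiliar with the access pattern behind that stack trace, the sketch below (illustrative only, not DataNode code) shows the directory-at-a-time walk that a du/checkDirs-style scan performs over the 256*256 leaf directories. Each listFiles() call becomes a getdents() that, on a cold cache, can require reading a far-apart ext4 directory block, which is exactly where the trace shows the scan blocking.
{code}
import java.io.File;

// Illustrative only: the scan pattern that hits the cold-cache cost described
// above -- visiting each of the 256*256 leaf directories in turn, so every
// listFiles() may require a separate, far-apart directory-block read.
public class LeafDirScanSketch {
  public static long scan(File finalizedDir) {
    long files = 0;
    for (int d1 = 0; d1 < 256; d1++) {
      for (int d2 = 0; d2 < 256; d2++) {
        File leaf = new File(finalizedDir, "subdir" + d1 + "/subdir" + d2);
        File[] entries = leaf.listFiles();   // getdents() -> ext4_readdir()
        if (entries != null) {
          files += entries.length;
        }
      }
    }
    return files;
  }
}
{code}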
[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery
[ https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588695#comment-14588695 ] Nathan Roberts commented on HDFS-4660: -- +1 on the patch. I have reviewed the patch previously and it is currently running in production at scale. The stress test we ran against this in https://issues.apache.org/jira/browse/HDFS-4660?focusedCommentId=14542862page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14542862 heavily exercised this path. Block corruption can happen during pipeline recovery Key: HDFS-4660 URL: https://issues.apache.org/jira/browse/HDFS-4660 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 3.0.0, 2.0.3-alpha Reporter: Peng Zhang Assignee: Kihwal Lee Priority: Blocker Attachments: HDFS-4660.patch, HDFS-4660.patch, HDFS-4660.v2.patch pipeline DN1 DN2 DN3 stop DN2 pipeline added node DN4 located at 2nd position DN1 DN4 DN3 recover RBW DN4 after recover rbw 2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004 2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW getNumBytes() = 134144 getBytesOnDisk() = 134144 getVisibleLength()= 134144 end at chunk (134144/512=262) DN3 after recover rbw 2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_10042013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW getNumBytes() = 134028 getBytesOnDisk() = 134028 getVisibleLength()= 134028 client send packet after recover pipeline offset=133632 len=1008 DN4 after flush 2013-04-01 21:02:31,779 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1063 // meta end position should be floor(134640/512)*4 + 7 == 1059, but now it is 1063. DN3 after flush 2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, lastPacketInBlock=false, offsetInBlock=134640, ackEnqueueNanoTime=8817026136871545) 2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing meta file offset of block BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 1055 to 1051 2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1059 After checking meta on DN4, I found checksum of chunk 262 is duplicated, but data not. Later after block was finalized, DN4's scanner detected bad block, and then reported it to NM. NM send a command to delete this block, and replicate this block from other DN in pipeline to satisfy duplication num. I think this is because in BlockReceiver it skips data bytes already written, but not skips checksum bytes already written. And function adjustCrcFilePosition is only used for last non-completed chunk, but not for this situation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
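The meta-file offsets quoted in the HDFS-4660 description follow from the usual layout of a .meta file: a short header followed by one 4-byte checksum per 512-byte data chunk. A small sketch of that arithmetic, assuming a 7-byte header, CRC32 checksums, and 512-byte chunks:
{code}
// Illustrative arithmetic for the offsets in the description above, assuming
// the usual meta layout: 7-byte header + one 4-byte checksum per 512-byte chunk.
public class MetaOffsetSketch {
  static final int BYTES_PER_CHUNK = 512;
  static final int CHECKSUM_SIZE = 4;
  static final int HEADER_LEN = 7;

  // Expected meta-file length once 'dataLen' bytes (including a partial last
  // chunk) have been flushed.
  static long expectedMetaLen(long dataLen) {
    long chunks = (dataLen + BYTES_PER_CHUNK - 1) / BYTES_PER_CHUNK; // ceiling
    return HEADER_LEN + chunks * CHECKSUM_SIZE;
  }

  public static void main(String[] args) {
    System.out.println(expectedMetaLen(134640)); // 1059, vs. the 1063 seen on DN4
    System.out.println(expectedMetaLen(134144)); // 1055, the pre-recovery value
  }
}
{code}
With those assumptions, flushing up to data offset 134640 should leave the meta file at 1059 bytes, so the 1063 observed on DN4 is consistent with a duplicated checksum for chunk 262, as described.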
[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp
[ https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-8404: - Attachment: HDFS-8404-v1.patch Thanks [~kihwal]. Updated based on latest trunk. pending block replication can get stuck using older genstamp Key: HDFS-8404 URL: https://issues.apache.org/jira/browse/HDFS-8404 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0, 2.7.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-8404-v0.patch, HDFS-8404-v1.patch If an under-replicated block gets into the pending-replication list, but later the genstamp of that block ends up being newer than the one originally submitted for replication, the block will fail replication until the NN is restarted. It will be safer if processPendingReplications() gets up-to-date blockinfo before resubmitting replication work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp
[ https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-8404: - Status: Open (was: Patch Available) pending block replication can get stuck using older genstamp Key: HDFS-8404 URL: https://issues.apache.org/jira/browse/HDFS-8404 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.0, 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-8404-v0.patch If an under-replicated block gets into the pending-replication list, but later the genstamp of that block ends up being newer than the one originally submitted for replication, the block will fail replication until the NN is restarted. It will be safer if processPendingReplications() gets up-to-date blockinfo before resubmitting replication work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp
[ https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-8404: - Status: Patch Available (was: Open) pending block replication can get stuck using older genstamp Key: HDFS-8404 URL: https://issues.apache.org/jira/browse/HDFS-8404 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.0, 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-8404-v0.patch, HDFS-8404-v1.patch If an under-replicated block gets into the pending-replication list, but later the genstamp of that block ends up being newer than the one originally submitted for replication, the block will fail replication until the NN is restarted. It will be safer if processPendingReplications() gets up-to-date blockinfo before resubmitting replication work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8404) pending block replication can get stuck using older genstamp
Nathan Roberts created HDFS-8404: Summary: pending block replication can get stuck using older genstamp Key: HDFS-8404 URL: https://issues.apache.org/jira/browse/HDFS-8404 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.0, 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts If an under-replicated block gets into the pending-replication list, but later the genstamp of that block ends up being newer than the one originally submitted for replication, the block will fail replication until the NN is restarted. It will be safer if processPendingReplications() gets up-to-date blockinfo before resubmitting replication work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
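A toy sketch of the idea in the description (not the attached patch; the class and field names here are made up for illustration): when pending replication times out, re-resolve the block against the current map keyed by block ID, so the latest genstamp is what gets re-queued instead of the stale copy originally submitted.
{code}
import java.util.*;

// Toy model only: re-resolve timed-out blocks against the current block map
// before re-queueing replication work, so a newer genstamp is picked up.
class StaleGenstampSketch {
  static class BlockInfo {
    final long blockId;
    long genstamp;
    BlockInfo(long id, long gs) { blockId = id; genstamp = gs; }
  }

  final Map<Long, BlockInfo> blocksMap = new HashMap<>();        // current state
  final Deque<BlockInfo> neededReplications = new ArrayDeque<>(); // re-queue target

  void processTimedOut(List<BlockInfo> timedOutItems) {
    for (BlockInfo stale : timedOutItems) {
      BlockInfo current = blocksMap.get(stale.blockId);  // up-to-date info
      if (current == null) {
        continue;                                        // block deleted meanwhile
      }
      neededReplications.add(current);                   // not the stale copy
    }
  }
}
{code}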
[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp
[ https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-8404: - Attachment: HDFS-8404-v0.patch pending block replication can get stuck using older genstamp Key: HDFS-8404 URL: https://issues.apache.org/jira/browse/HDFS-8404 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0, 2.7.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-8404-v0.patch If an under-replicated block gets into the pending-replication list, but later the genstamp of that block ends up being newer than the one originally submitted for replication, the block will fail replication until the NN is restarted. It will be safer if processPendingReplications() gets up-to-date blockinfo before resubmitting replication work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp
[ https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-8404: - Status: Patch Available (was: Open) pending block replication can get stuck using older genstamp Key: HDFS-8404 URL: https://issues.apache.org/jira/browse/HDFS-8404 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.0, 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-8404-v0.patch If an under-replicated block gets into the pending-replication list, but later the genstamp of that block ends up being newer than the one originally submitted for replication, the block will fail replication until the NN is restarted. It will be safer if processPendingReplications() gets up-to-date blockinfo before resubmitting replication work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8041) Consider remaining space during block blockplacement if dfs space is highly utilized
[ https://issues.apache.org/jira/browse/HDFS-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393437#comment-14393437 ] Nathan Roberts commented on HDFS-8041: -- Hi [~kihwal]. Some minor comments on the patch: + Can we bounds-check the new config? I think it works fine even without it, but just to be safe against a change to the algorithm in the future. + I wish there were a way to make this config refreshable. Unfortunately, I don't think that's possible today. + Should we protect against stats.getNumDatanodesInService() being 0? Again, probably OK as it is today, but just to avoid a future patch breaking the assumptions. + Node-local writes are not impacted by the change. Maybe we should also have rack-local writes avoid this check so that the 2nd and 3rd replicas remain in the same rack. I think just having this impact the completely random target selections might be enough to avoid the problem while minimizing the effects on block placement. Consider remaining space during block blockplacement if dfs space is highly utilized Key: HDFS-8041 URL: https://issues.apache.org/jira/browse/HDFS-8041 Project: Hadoop HDFS Issue Type: Improvement Reporter: Kihwal Lee Assignee: Kihwal Lee Attachments: HDFS-8041.v1.patch, HDFS-8041.v2.patch This feature is helpful in avoiding smaller nodes (i.e. in a heterogeneous environment) constantly being full when the overall space utilization is over a certain threshold. When the utilization is low, the balancer can keep up, but once the average per-node bytes go over the capacity of the smaller nodes, they fill up quickly even after a perfect balance. This jira proposes an improvement that can be optionally enabled in order to slow down the rate of space usage growth of smaller nodes if the overall storage utilization is over a configured threshold. It will not replace the balancer; rather, it will help the balancer keep up. Also, the primary replica placement will not be affected. Only the replicas typically placed in a remote rack will be subject to this check. The appropriate threshold is cluster-configuration specific. There is no generally good value to set, thus it is disabled by default. We have seen cases where a threshold of 85% - 90% would help. Figuring out when {{totalSpaceUsed / numNodes}} becomes close to the capacity of a smaller node is helpful in determining the threshold. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
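A small, self-contained sketch of the guard rails suggested in the comments above, with made-up names: clamp the configured threshold into a sane range and avoid dividing by zero when no datanodes are in service. This is illustrative only, not the HDFS-8041 patch.
{code}
// Sketch of the review suggestions, not the actual patch: bounds-check the
// threshold config and guard against zero datanodes before computing utilization.
class SpaceAwarePlacementSketch {
  private final double utilizationThreshold;

  SpaceAwarePlacementSketch(double configuredThreshold) {
    // Bounds-check the new config so a future algorithm change stays safe.
    this.utilizationThreshold = Math.min(1.0, Math.max(0.0, configuredThreshold));
  }

  /** @return true if remaining space should influence remote-rack target choice. */
  boolean considerRemainingSpace(long totalUsedBytes, long totalCapacityBytes,
                                 int datanodesInService) {
    if (datanodesInService <= 0 || totalCapacityBytes <= 0) {
      return false;                       // nothing sensible to compute
    }
    double utilization = (double) totalUsedBytes / totalCapacityBytes;
    return utilization >= utilizationThreshold;
  }
}
{code}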
[jira] [Commented] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods
[ https://issues.apache.org/jira/browse/HDFS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386724#comment-14386724 ] Nathan Roberts commented on HDFS-7742: -- Thanks for the review nicholas! There is a test in the patch. Are you asking for a specific test case to be added? The test failure from the QA bot (TestMalformedURLs) should be unrelated. favoring decommissioning node for replication can cause a block to stay underreplicated for long periods Key: HDFS-7742 URL: https://issues.apache.org/jira/browse/HDFS-7742 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-7742-v0.patch When choosing a source node to replicate a block from, a decommissioning node is favored. The reason for the favoritism is that decommissioning nodes aren't servicing any writes so in-theory they are less loaded. However, the same selection algorithm also tries to make sure it doesn't get stuck on any particular node: {noformat} // switch to a different node randomly // this to prevent from deterministically selecting the same node even // if the node failed to replicate the block on previous iterations {noformat} Unfortunately, the decommissioning check is prior to this randomness so the algorithm can get stuck trying to replicate from a decommissioning node. We've seen this in practice where a decommissioning datanode was failing to replicate a block for many days, when other viable replicas of the block were available. Given that we limit the number of streams we'll assign to a given node (default soft limit of 2, hard limit of 4), It doesn't seem like favoring a decommissioning node has significant benefit. i.e. when there is significant replication work to do, we'll quickly hit the stream limit of the decommissioning nodes and use other nodes in the cluster anyway; when there isn't significant replication work then in theory we've got plenty of replication bandwidth available so choosing a decommissioning node isn't much of a win. I see two choices: 1) Change the algorithm to still favor decommissioning nodes but with some level of randomness that will avoid always selecting the decommissioning node 2) Remove the favoritism for decommissioning nodes I prefer #2. It simplifies the algorithm, and given the other throttles we have in place, I'm not sure there is a significant benefit to selecting decommissioning nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods
[ https://issues.apache.org/jira/browse/HDFS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-7742: - Target Version/s: 3.0.0 (was: 3.0.0, 2.7.0) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods Key: HDFS-7742 URL: https://issues.apache.org/jira/browse/HDFS-7742 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-7742-v0.patch When choosing a source node to replicate a block from, a decommissioning node is favored. The reason for the favoritism is that decommissioning nodes aren't servicing any writes so in-theory they are less loaded. However, the same selection algorithm also tries to make sure it doesn't get stuck on any particular node: {noformat} // switch to a different node randomly // this to prevent from deterministically selecting the same node even // if the node failed to replicate the block on previous iterations {noformat} Unfortunately, the decommissioning check is prior to this randomness so the algorithm can get stuck trying to replicate from a decommissioning node. We've seen this in practice where a decommissioning datanode was failing to replicate a block for many days, when other viable replicas of the block were available. Given that we limit the number of streams we'll assign to a given node (default soft limit of 2, hard limit of 4), It doesn't seem like favoring a decommissioning node has significant benefit. i.e. when there is significant replication work to do, we'll quickly hit the stream limit of the decommissioning nodes and use other nodes in the cluster anyway; when there isn't significant replication work then in theory we've got plenty of replication bandwidth available so choosing a decommissioning node isn't much of a win. I see two choices: 1) Change the algorithm to still favor decommissioning nodes but with some level of randomness that will avoid always selecting the decommissioning node 2) Remove the favoritism for decommissioning nodes I prefer #2. It simplifies the algorithm, and given the other throttles we have in place, I'm not sure there is a significant benefit to selecting decommissioning nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods
[ https://issues.apache.org/jira/browse/HDFS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-7742: - Attachment: HDFS-7742-v0.patch Attached patch. Favors decommissioning nodes a bit by allowing them to go up to hard limit, otherwise not at all. favoring decommissioning node for replication can cause a block to stay underreplicated for long periods Key: HDFS-7742 URL: https://issues.apache.org/jira/browse/HDFS-7742 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-7742-v0.patch When choosing a source node to replicate a block from, a decommissioning node is favored. The reason for the favoritism is that decommissioning nodes aren't servicing any writes so in-theory they are less loaded. However, the same selection algorithm also tries to make sure it doesn't get stuck on any particular node: {noformat} // switch to a different node randomly // this to prevent from deterministically selecting the same node even // if the node failed to replicate the block on previous iterations {noformat} Unfortunately, the decommissioning check is prior to this randomness so the algorithm can get stuck trying to replicate from a decommissioning node. We've seen this in practice where a decommissioning datanode was failing to replicate a block for many days, when other viable replicas of the block were available. Given that we limit the number of streams we'll assign to a given node (default soft limit of 2, hard limit of 4), It doesn't seem like favoring a decommissioning node has significant benefit. i.e. when there is significant replication work to do, we'll quickly hit the stream limit of the decommissioning nodes and use other nodes in the cluster anyway; when there isn't significant replication work then in theory we've got plenty of replication bandwidth available so choosing a decommissioning node isn't much of a win. I see two choices: 1) Change the algorithm to still favor decommissioning nodes but with some level of randomness that will avoid always selecting the decommissioning node 2) Remove the favoritism for decommissioning nodes I prefer #2. It simplifies the algorithm, and given the other throttles we have in place, I'm not sure there is a significant benefit to selecting decommissioning nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
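A toy illustration of the behavior described for the v0 patch, assuming the limits mentioned in the description (soft limit of 2 streams, hard limit of 4): a decommissioning node is still favored as a replication source, but only up to the hard limit, so source selection can still move on to other replicas.
{code}
// Toy sketch of the described behavior only, not the attached patch.
class SourceChoiceSketch {
  static final int SOFT_LIMIT = 2;   // default soft stream limit per the description
  static final int HARD_LIMIT = 4;   // default hard stream limit per the description

  static boolean isGoodSource(int activeStreams, boolean decommissioning) {
    if (activeStreams < SOFT_LIMIT) {
      return true;                   // lightly loaded node: always a fine source
    }
    // Beyond the soft limit, only keep favoring decommissioning nodes, and only
    // up to the hard limit, so selection can fall through to other replicas.
    return decommissioning && activeStreams < HARD_LIMIT;
  }
}
{code}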
[jira] [Updated] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods
[ https://issues.apache.org/jira/browse/HDFS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-7742: - Target Version/s: 3.0.0, 2.7.0 (was: 3.0.0) Status: Patch Available (was: Open) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods Key: HDFS-7742 URL: https://issues.apache.org/jira/browse/HDFS-7742 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-7742-v0.patch When choosing a source node to replicate a block from, a decommissioning node is favored. The reason for the favoritism is that decommissioning nodes aren't servicing any writes so in-theory they are less loaded. However, the same selection algorithm also tries to make sure it doesn't get stuck on any particular node: {noformat} // switch to a different node randomly // this to prevent from deterministically selecting the same node even // if the node failed to replicate the block on previous iterations {noformat} Unfortunately, the decommissioning check is prior to this randomness so the algorithm can get stuck trying to replicate from a decommissioning node. We've seen this in practice where a decommissioning datanode was failing to replicate a block for many days, when other viable replicas of the block were available. Given that we limit the number of streams we'll assign to a given node (default soft limit of 2, hard limit of 4), It doesn't seem like favoring a decommissioning node has significant benefit. i.e. when there is significant replication work to do, we'll quickly hit the stream limit of the decommissioning nodes and use other nodes in the cluster anyway; when there isn't significant replication work then in theory we've got plenty of replication bandwidth available so choosing a decommissioning node isn't much of a win. I see two choices: 1) Change the algorithm to still favor decommissioning nodes but with some level of randomness that will avoid always selecting the decommissioning node 2) Remove the favoritism for decommissioning nodes I prefer #2. It simplifies the algorithm, and given the other throttles we have in place, I'm not sure there is a significant benefit to selecting decommissioning nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods
Nathan Roberts created HDFS-7742: Summary: favoring decommissioning node for replication can cause a block to stay underreplicated for long periods Key: HDFS-7742 URL: https://issues.apache.org/jira/browse/HDFS-7742 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.6.0 Reporter: Nathan Roberts Assignee: Nathan Roberts When choosing a source node to replicate a block from, a decommissioning node is favored. The reason for the favoritism is that decommissioning nodes aren't servicing any writes so in-theory they are less loaded. However, the same selection algorithm also tries to make sure it doesn't get stuck on any particular node: {noformat} // switch to a different node randomly // this to prevent from deterministically selecting the same node even // if the node failed to replicate the block on previous iterations {noformat} Unfortunately, the decommissioning check is prior to this randomness so the algorithm can get stuck trying to replicate from a decommissioning node. We've seen this in practice where a decommissioning datanode was failing to replicate a block for many days, when other viable replicas of the block were available. Given that we limit the number of streams we'll assign to a given node (default soft limit of 2, hard limit of 4), It doesn't seem like favoring a decommissioning node has significant benefit. i.e. when there is significant replication work to do, we'll quickly hit the stream limit of the decommissioning nodes and use other nodes in the cluster anyway; when there isn't significant replication work then in theory we've got plenty of replication bandwidth available so choosing a decommissioning node isn't much of a win. I see two choices: 1) Change the algorithm to still favor decommissioning nodes but with some level of randomness that will avoid always selecting the decommissioning node 2) Remove the favoritism for decommissioning nodes I prefer #2. It simplifies the algorithm, and given the other throttles we have in place, I'm not sure there is a significant benefit to selecting decommissioning nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-7645) Rolling upgrade is restoring blocks from trash multiple times
Nathan Roberts created HDFS-7645: Summary: Rolling upgrade is restoring blocks from trash multiple times Key: HDFS-7645 URL: https://issues.apache.org/jira/browse/HDFS-7645 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts When performing an HDFS rolling upgrade, the trash directory is getting restored twice when, under normal circumstances, it shouldn't need to be restored at all. IIUC, the only time these blocks should be restored is if we need to roll back a rolling upgrade. On a busy cluster, this can cause significant and unnecessary block churn both on the datanodes, and more importantly in the namenode. The two times this happens are: 1) restart of DN onto new software
{code}
  private void doTransition(DataNode datanode, StorageDirectory sd,
      NamespaceInfo nsInfo, StartupOption startOpt) throws IOException {
    if (startOpt == StartupOption.ROLLBACK && sd.getPreviousDir().exists()) {
      Preconditions.checkState(!getTrashRootDir(sd).exists(),
          sd.getPreviousDir() + " and " + getTrashRootDir(sd) + " should not " +
          " both be present.");
      doRollback(sd, nsInfo); // rollback if applicable
    } else {
      // Restore all the files in the trash. The restored files are retained
      // during rolling upgrade rollback. They are deleted during rolling
      // upgrade downgrade.
      int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
      LOG.info("Restored " + restored + " block files from trash.");
    }
{code}
2) When the heartbeat response no longer indicates a rolling upgrade is in progress
{code}
  /**
   * Signal the current rolling upgrade status as indicated by the NN.
   * @param inProgress true if a rolling upgrade is in progress
   */
  void signalRollingUpgrade(boolean inProgress) throws IOException {
    String bpid = getBlockPoolId();
    if (inProgress) {
      dn.getFSDataset().enableTrash(bpid);
      dn.getFSDataset().setRollingUpgradeMarker(bpid);
    } else {
      dn.getFSDataset().restoreTrash(bpid);
      dn.getFSDataset().clearRollingUpgradeMarker(bpid);
    }
  }
{code}
HDFS-6800 and HDFS-6981 modified this behavior, making it not completely clear whether this is somehow intentional. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it
[ https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273764#comment-14273764 ] Nathan Roberts commented on HDFS-7548: -- - I think we need to prioritize a scan for that block. - Also, some comments on addBlockToFirstLocation(). - imo, WARN should be INFO. - If this block has been scanned in the last 5 minutes (or some reasonable time frame), then maybe we shouldn't add it back to the list of blocks to be scanned. If all IOExceptions are going to re-prioritize the scan of a block, having a minimum delay between scans would avoid corner cases where a network glitch or badly behaving clients are causing IOExceptions that don't really warrant rescans. Corrupt block reporting delayed until datablock scanner thread detects it - Key: HDFS-7548 URL: https://issues.apache.org/jira/browse/HDFS-7548 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.5.0 Reporter: Rushabh S Shah Assignee: Rushabh S Shah Attachments: HDFS-7548.patch When there is one datanode holding the block and that block happened to be corrupt, namenode would keep on trying to replicate the block repeatedly but it would only report the block as corrupt only when the data block scanner thread of the datanode picks up this bad block. Requesting improvement in namenode reporting so that corrupt replica would be reported when there is only 1 replica and the replication of that replica keeps on failing with the checksum error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
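A sketch of the minimum-rescan-delay suggestion above (illustrative only; the class and method names are made up): skip re-prioritizing a block that was scanned within the last few minutes, so transient IOExceptions from network glitches or badly behaving clients don't cause rescan storms.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Sketch of the suggestion only: when an IOException makes us want to re-scan a
// block immediately, skip the re-queue if the block was scanned very recently.
class RescanThrottleSketch {
  private static final long MIN_RESCAN_INTERVAL_MS = TimeUnit.MINUTES.toMillis(5);
  private final Map<Long, Long> lastScanTimeMs = new ConcurrentHashMap<>();

  boolean shouldPrioritizeScan(long blockId) {
    long now = System.currentTimeMillis();
    Long last = lastScanTimeMs.get(blockId);
    if (last != null && now - last < MIN_RESCAN_INTERVAL_MS) {
      return false;            // scanned recently; most likely a transient error
    }
    lastScanTimeMs.put(blockId, now);
    return true;               // move this block to the front of the scan queue
  }
}
{code}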
[jira] [Commented] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it
[ https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268347#comment-14268347 ] Nathan Roberts commented on HDFS-7548: -- I think we need to handle the java.io.IOException: Input/output error case as well since this is what we'll see if having trouble reading from disk. Corrupt block reporting delayed until datablock scanner thread detects it - Key: HDFS-7548 URL: https://issues.apache.org/jira/browse/HDFS-7548 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.5.0 Reporter: Rushabh S Shah Assignee: Rushabh S Shah Attachments: HDFS-7548.patch When there is one datanode holding the block and that block happened to be corrupt, namenode would keep on trying to replicate the block repeatedly but it would only report the block as corrupt only when the data block scanner thread of the datanode picks up this bad block. Requesting improvement in namenode reporting so that corrupt replica would be reported when there is only 1 replica and the replication of that replica keeps on failing with the checksum error. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-2825) Add test hook to turn off the writer preferring its local DN
[ https://issues.apache.org/jira/browse/HDFS-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248733#comment-14248733 ] Nathan Roberts commented on HDFS-2825: -- [~tlipcon] - It might help. I'm a little concerned about the additional round trip and how we might enable it globally for an entire cluster. We'll be running some experiments with a global config for the block placement policy, and then go from there. Add test hook to turn off the writer preferring its local DN Key: HDFS-2825 URL: https://issues.apache.org/jira/browse/HDFS-2825 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 0.23.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor Fix For: 0.24.0, 0.23.1 Attachments: hdfs-2825.txt, hdfs-2825.txt Currently, the default block placement policy always places the first replica in the pipeline on the local node if there is a valid DN running there. In some network designs, within-rack bandwidth is never constrained so this doesn't give much of an advantage. It would also be really useful to disable this for MiniDFSCluster tests, since currently if you start a multi-DN cluster and write with replication level 1, all of the replicas go to the same DN. _[per discussion below, this was changed to not add a config, but only to add a hook for testing]_ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-2825) Add test hook to turn off the writer preferring its local DN
[ https://issues.apache.org/jira/browse/HDFS-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232257#comment-14232257 ] Nathan Roberts commented on HDFS-2825: -- Any chance folks have altered their view on this and would accept a config option in the default placement policy? I'm happy to file the jira and put up the patch. It's very straightforward and makes sense for a lot of cases. imho, it actually makes more sense as the default than node-local (except for obvious things like Hbase). Today with node-local-first we wind up with some very hot nodes that dramatically slow down subsequent jobs, all because there is an entire copy of a large file on a single node of a rack. Add test hook to turn off the writer preferring its local DN Key: HDFS-2825 URL: https://issues.apache.org/jira/browse/HDFS-2825 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 0.23.0 Reporter: Todd Lipcon Assignee: Todd Lipcon Priority: Minor Fix For: 0.24.0, 0.23.1 Attachments: hdfs-2825.txt, hdfs-2825.txt Currently, the default block placement policy always places the first replica in the pipeline on the local node if there is a valid DN running there. In some network designs, within-rack bandwidth is never constrained so this doesn't give much of an advantage. It would also be really useful to disable this for MiniDFSCluster tests, since currently if you start a multi-DN cluster and write with replication level 1, all of the replicas go to the same DN. _[per discussion below, this was changed to not add a config, but only to add a hook for testing]_ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
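Purely hypothetical illustration of what the cluster-wide switch discussed here could look like; HDFS-2825 itself only added a test hook, and the property name and method below are made up.
{code}
import java.util.List;
import java.util.Random;

// Hypothetical only: a config-driven choice between classic node-local-first
// placement and spreading the first replica across the writer's rack.
class FirstReplicaChoiceSketch {
  static final String AVOID_LOCAL_KEY = "dfs.block.placement.avoid-writer-local"; // made up

  static String chooseFirstReplica(String writerNode, List<String> rackLocalNodes,
                                   boolean avoidWriterLocal, Random rand) {
    if (rackLocalNodes.isEmpty()) {
      return writerNode;                            // nothing else to pick from
    }
    if (!avoidWriterLocal && rackLocalNodes.contains(writerNode)) {
      return writerNode;                            // classic node-local-first behavior
    }
    // Spread the first replica across the writer's rack so one node does not
    // end up holding a full copy of a hot file.
    return rackLocalNodes.get(rand.nextInt(rackLocalNodes.size()));
  }
}
{code}
Spreading the first replica rack-locally rather than node-locally addresses the hot-node problem described in the comment while keeping the write within the rack.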
[jira] [Commented] (HDFS-6658) Namenode memory optimization - Block replicas list
[ https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064095#comment-14064095 ] Nathan Roberts commented on HDFS-6658: -- {quote} I guess my argument is that (in the short or medium term) we don't actually need to reduce the amount of RAM the NameNode uses. I've seen machines with 300 GB of RAM, and sizes continue to increase at a steady clip every year. We do need to reduce the amount of Java heap that the NameNode uses, since otherwise we get 10 minute long GC pauses. {quote} This is a pretty sizable improvement though so it seems well worth considering. * One thing I'm concerned about is the increased RAM requirements that have been going on in the NN. For example, moving from 0.23 releases to 2.x releases requires about 9% more RAM (I'm assuming it's something similar when going from 1.x to 2.x). This is a pretty big deal and can cause some folks to fail their upgrade if they were living close to the edge. In my opinion we need to be very careful whenever we increase the RAM requirements of the NN. For every increase there should be a corresponding optimization so the net increase stays as close to 0 as possible. Otherwise, some upgrades will certainly fail. * I'm not totally convinced of the long GC argument. It's true that a worst case full-gc will be much longer. However, isn't it also the case that we should almost never be doing worst case full-GCs? On a large and busy NN, we see a GC greater than 2 seconds maybe once every couple of days. Usually the big outliers are the result of a very large application doing something bad - in which case even if you solve the GC problem, something else is liable to cause the NN to be unresponsive. Namenode memory optimization - Block replicas list --- Key: HDFS-6658 URL: https://issues.apache.org/jira/browse/HDFS-6658 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.4.1 Reporter: Amir Langer Assignee: Amir Langer Attachments: Namenode Memory Optimizations - Block replicas list.docx Part of the memory consumed by every BlockInfo object in the Namenode is a linked list of block references for every DatanodeStorageInfo (called triplets). We propose to change the way we store the list in memory. Using primitive integer indexes instead of object references will reduce the memory needed for every block replica (when compressed oops is disabled) and in our new design the list overhead will be per DatanodeStorageInfo and not per block replica. see attached design doc. for details and evaluation results. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6658) Namenode memory optimization - Block replicas list
[ https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060740#comment-14060740 ] Nathan Roberts commented on HDFS-6658: -- Maybe have a simple fragmentation metric, and if it exceeds X% for an extended period of time (like hours), then clean it up. Yes, some clients will see higher latency, but it's only once in many hours and I doubt it lasts very long anyway (milliseconds). It's kind of a bizarre situation, so I don't think we're in a hurry to clean it up, but it's also better if we don't let it sit around forever. Namenode memory optimization - Block replicas list --- Key: HDFS-6658 URL: https://issues.apache.org/jira/browse/HDFS-6658 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.4.1 Reporter: Amir Langer Assignee: Amir Langer Attachments: Namenode Memory Optimizations - Block replicas list.docx Part of the memory consumed by every BlockInfo object in the Namenode is a linked list of block references for every DatanodeStorageInfo (called triplets). We propose to change the way we store the list in memory. Using primitive integer indexes instead of object references will reduce the memory needed for every block replica (when compressed oops is disabled), and in our new design the list overhead will be per DatanodeStorageInfo and not per block replica. See the attached design doc for details and evaluation results. -- This message was sent by Atlassian JIRA (v6.2#6252)
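A sketch of the "simple fragmentation metric" idea above, with assumed fields: treat the fraction of empty slots in the per-DatanodeStorageInfo index array as the metric, and compact only if it stays above a threshold for an extended period. Illustrative only; not from the design doc.
{code}
// Toy sketch of the fragmentation-metric idea; all names here are assumptions.
class FragmentationSketch {
  int capacity;        // total slots in the per-DatanodeStorageInfo index array
  int liveEntries;     // slots currently referencing a block

  double fragmentation() {
    return capacity == 0 ? 0.0 : 1.0 - ((double) liveEntries / capacity);
  }

  // Clean up only if fragmentation has exceeded the threshold for a long time,
  // matching the "exceeds X% for hours" suggestion in the comment.
  boolean shouldCompact(double thresholdFraction, long msAboveThreshold,
                        long minMsAboveThreshold) {
    return fragmentation() > thresholdFraction
        && msAboveThreshold >= minMsAboveThreshold;
  }
}
{code}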
[jira] [Commented] (HDFS-6584) Support archival storage
[ https://issues.apache.org/jira/browse/HDFS-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060815#comment-14060815 ] Nathan Roberts commented on HDFS-6584: -- As part of this work, would it make sense to extend the policy functionality so that we can control the BlockPlacementPolicy used during file creation? Here's the use case which we run into quite frequently: 1-2GB file generated by JobA is used as a distributed cache file for Job B which is rather large (several thousand tasks). Assuming a single task from JobA writes this file, an entire copy of the file will be on a single node, and in no other nodes in that same rack (default policy is 1st blk local, 2nd replica on remote rack, 3rd replica on same rack as 2nd). When a large job needs this distributed cache file, the node where there is a single copy will become a significant bottleneck and is likely to cause localization timeouts. This is with replication factors set to 50+, so just increasing the replication factor does not solve this problem. It would be good if JobA could specify a BlockPlacementPolicy which would do 1st replica rack local, 2nd replica remote rack, 3rd replica same as 2nd (in general though it would be good if JobA could ask for any 1 of n placement policies). Support archival storage Key: HDFS-6584 URL: https://issues.apache.org/jira/browse/HDFS-6584 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, namenode Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: HDFSArchivalStorageDesign20140623.pdf In most of the Hadoop clusters, as more and more data is stored for longer time, the demand for storage is outstripping the compute. Hadoop needs a cost effective and easy to manage solution to meet this demand for storage. Current solution is: - Delete the old unused data. This comes at operational cost of identifying unnecessary data and deleting them manually. - Add more nodes to the clusters. This adds along with storage capacity unnecessary compute capacity to the cluster. Hadoop needs a solution to decouple growing storage capacity from compute capacity. Nodes with higher density and less expensive storage with low compute power are becoming available and can be used as cold storage in the clusters. Based on policy the data from hot storage can be moved to cold storage. Adding more nodes to the cold storage can grow the storage independent of the compute capacity in the cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6658) Namenode memory optimization - Block replicas list
[ https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057858#comment-14057858 ] Nathan Roberts commented on HDFS-6658: -- Thanks Amir. Very attractive memory savings. Namenode memory optimization - Block replicas list --- Key: HDFS-6658 URL: https://issues.apache.org/jira/browse/HDFS-6658 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.4.1 Reporter: Amir Langer Assignee: Amir Langer Attachments: Namenode Memory Optimizations - Block replicas list.docx Part of the memory consumed by every BlockInfo object in the Namenode is a linked list of block references for every DatanodeStorageInfo (called triplets). We propose to change the way we store the list in memory. Using primitive integer indexes instead of object references will reduce the memory needed for every block replica (when compressed oops is disabled) and in our new design the list overhead will be per DatanodeStorageInfo and not per block replica. see attached design doc. for details and evaluation results. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6407) new namenode UI, lost ability to sort columns in datanode tab
Nathan Roberts created HDFS-6407: Summary: new namenode UI, lost ability to sort columns in datanode tab Key: HDFS-6407 URL: https://issues.apache.org/jira/browse/HDFS-6407 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.4.0 Reporter: Nathan Roberts Priority: Minor The old UI supported clicking on a column header to sort on that column. The new UI seems to have dropped this very useful feature. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6247) Avoid timeouts for replaceBlock() call by sending intermediate responses to Balancer
[ https://issues.apache.org/jira/browse/HDFS-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971837#comment-13971837 ] Nathan Roberts commented on HDFS-6247: -- +1 This will be better than the longish timeout that's in there currently. It would be good to shorten the read timeout back down once this sort of heartbeat is in place. Avoid timeouts for replaceBlock() call by sending intermediate responses to Balancer Key: HDFS-6247 URL: https://issues.apache.org/jira/browse/HDFS-6247 Project: Hadoop HDFS Issue Type: Bug Components: balancer, datanode Affects Versions: 2.4.0 Reporter: Vinayakumar B Assignee: Vinayakumar B Currently there is no response sent from the target Datanode to the Balancer for replaceBlock() calls. Since block movement for balancing is throttled, a complete block movement will take time, and this could result in a timeout at the Balancer, which will be trying to read the status message. To avoid this, while a replaceBlock() call is in progress the Datanode can send IN_PROGRESS status messages to the Balancer, so that the Balancer does not time out and treat the block movement as failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
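A toy sketch of the proposal (not the committed protocol change; the status codes and helper interface below are made up): while the throttled copy is still running, periodically write an intermediate "in progress" response so the Balancer's socket read does not time out waiting for the final status.
{code}
import java.io.DataOutputStream;
import java.io.IOException;

// Toy illustration only: keepalive-style intermediate responses during a
// throttled block copy. Status values and ThrottledCopy are made up.
class ReplaceBlockKeepaliveSketch {
  static final int IN_PROGRESS = 3;   // illustrative status code, not the real one
  static final int SUCCESS = 0;

  static void copyWithKeepalive(DataOutputStream toBalancer, ThrottledCopy copy)
      throws IOException {
    while (!copy.isDone()) {
      copy.copySomeBytes();             // throttled; may take a long time overall
      toBalancer.writeInt(IN_PROGRESS); // intermediate response: still working
      toBalancer.flush();
    }
    toBalancer.writeInt(SUCCESS);       // final status once the replace finishes
    toBalancer.flush();
  }

  interface ThrottledCopy {
    boolean isDone();
    void copySomeBytes() throws IOException;
  }
}
{code}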
[jira] [Updated] (HDFS-6166) revisit balancer so_timeout
[ https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-6166: - Attachment: HDFS-6166-branch23.patch Patch for branch 23 revisit balancer so_timeout Key: HDFS-6166 URL: https://issues.apache.org/jira/browse/HDFS-6166 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 3.0.0, 2.3.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Priority: Blocker Fix For: 2.4.0 Attachments: HDFS-6166-branch23.patch, HDFS-6166.patch HDFS-5806 changed the socket read timeout for the balancer connection to DN to 60 seconds. This works as long as balancer bandwidth is such that it's safe to assume that the DN will easily complete the operation within this time. Obviously this isn't a good assumption. When this assumption isn't valid, the balancer will timeout the cmd BUT it will then be out-of-sync with the datanode (balancer thinks the DN has room to do more work, DN is still working on the request and will fail any subsequent requests with threads quota exceeded errors). This causes expensive NN traffic via getBlocks() and also causes lots of WARNS int the balancer log. Unfortunately the protocol is such that it's impossible to tell if the DN is busy working on replacing the block, OR is in bad shape and will never finish. So, in the interest of a small change to deal with both situations, I propose the following two changes: * Crank of the socket read timeout to 20 minutes * Delay looking at a node for a bit if we did timeout in this way (the DN could still have xceiver threads working on the replace -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6166) revisit balancer so_timeout
[ https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951461#comment-13951461 ] Nathan Roberts commented on HDFS-6166: -- Tested on a 400 node cluster with a bandwidth of 500K/sec. Verified that there are still occasional timeouts BUT there is not the flood of thread quota exceeded warnings. revisit balancer so_timeout Key: HDFS-6166 URL: https://issues.apache.org/jira/browse/HDFS-6166 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 3.0.0, 2.3.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Priority: Blocker Attachments: HDFS-6166.patch HDFS-5806 changed the socket read timeout for the balancer connection to DN to 60 seconds. This works as long as balancer bandwidth is such that it's safe to assume that the DN will easily complete the operation within this time. Obviously this isn't a good assumption. When this assumption isn't valid, the balancer will timeout the cmd BUT it will then be out-of-sync with the datanode (balancer thinks the DN has room to do more work, DN is still working on the request and will fail any subsequent requests with threads quota exceeded errors). This causes expensive NN traffic via getBlocks() and also causes lots of WARNS int the balancer log. Unfortunately the protocol is such that it's impossible to tell if the DN is busy working on replacing the block, OR is in bad shape and will never finish. So, in the interest of a small change to deal with both situations, I propose the following two changes: * Crank of the socket read timeout to 20 minutes * Delay looking at a node for a bit if we did timeout in this way (the DN could still have xceiver threads working on the replace -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6166) revisit balancer so_timeout
[ https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951493#comment-13951493 ] Nathan Roberts commented on HDFS-6166: -- Maybe our two comments passed in the mail. Yes I tested internally. It's been running on a 400 node cluster for 1 day. I ran with bandwidths of 500K, 6MB, 20MB. With 500K there were timeouts, but no thread quota exceeded failures. revisit balancer so_timeout Key: HDFS-6166 URL: https://issues.apache.org/jira/browse/HDFS-6166 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 3.0.0, 2.3.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Priority: Blocker Attachments: HDFS-6166.patch HDFS-5806 changed the socket read timeout for the balancer connection to DN to 60 seconds. This works as long as balancer bandwidth is such that it's safe to assume that the DN will easily complete the operation within this time. Obviously this isn't a good assumption. When this assumption isn't valid, the balancer will timeout the cmd BUT it will then be out-of-sync with the datanode (balancer thinks the DN has room to do more work, DN is still working on the request and will fail any subsequent requests with threads quota exceeded errors). This causes expensive NN traffic via getBlocks() and also causes lots of WARNS int the balancer log. Unfortunately the protocol is such that it's impossible to tell if the DN is busy working on replacing the block, OR is in bad shape and will never finish. So, in the interest of a small change to deal with both situations, I propose the following two changes: * Crank of the socket read timeout to 20 minutes * Delay looking at a node for a bit if we did timeout in this way (the DN could still have xceiver threads working on the replace -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6166) revisit balancer so_timeout
[ https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951509#comment-13951509 ] Nathan Roberts commented on HDFS-6166: -- The blocks don't have to be very large. There is a quota of 5 threads per DN, at the default bandwidth of 1MB/sec, it can take (block size) / (1MB/5) seconds to move a block (something like 640 seconds for a 128MB block). The bandwidth is dynamically settable and the block size is not constant either, so I went with the very simple approach that will cover the normal situations. revisit balancer so_timeout Key: HDFS-6166 URL: https://issues.apache.org/jira/browse/HDFS-6166 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 3.0.0, 2.3.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Priority: Blocker Attachments: HDFS-6166.patch HDFS-5806 changed the socket read timeout for the balancer connection to DN to 60 seconds. This works as long as balancer bandwidth is such that it's safe to assume that the DN will easily complete the operation within this time. Obviously this isn't a good assumption. When this assumption isn't valid, the balancer will timeout the cmd BUT it will then be out-of-sync with the datanode (balancer thinks the DN has room to do more work, DN is still working on the request and will fail any subsequent requests with threads quota exceeded errors). This causes expensive NN traffic via getBlocks() and also causes lots of WARNS int the balancer log. Unfortunately the protocol is such that it's impossible to tell if the DN is busy working on replacing the block, OR is in bad shape and will never finish. So, in the interest of a small change to deal with both situations, I propose the following two changes: * Crank of the socket read timeout to 20 minutes * Delay looking at a node for a bit if we did timeout in this way (the DN could still have xceiver threads working on the replace -- This message was sent by Atlassian JIRA (v6.2#6252)
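The arithmetic in the comment above, generalized into a small helper (illustrative only): with N xceiver threads sharing the configured balancer bandwidth, a single block move can take blockSize / (bandwidth / N) seconds, which is why a fixed 60-second read timeout is easily exceeded at low bandwidth settings.
{code}
// Illustrative helper for the worst-case block-move time discussed above.
class BalancerTimeoutMath {
  static long worstCaseMoveSeconds(long blockSizeBytes, long bandwidthBytesPerSec,
                                   int concurrentMoves) {
    if (bandwidthBytesPerSec <= 0 || concurrentMoves <= 0) {
      throw new IllegalArgumentException("bandwidth and concurrency must be positive");
    }
    double perMoveBandwidth = (double) bandwidthBytesPerSec / concurrentMoves;
    return (long) Math.ceil(blockSizeBytes / perMoveBandwidth);
  }

  public static void main(String[] args) {
    // 128MB block, 1MB/s balancer bandwidth shared by 5 threads -> ~640 seconds
    System.out.println(worstCaseMoveSeconds(128L << 20, 1L << 20, 5));
  }
}
{code}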
[jira] [Created] (HDFS-6166) revisit balancer so_timeout
Nathan Roberts created HDFS-6166: Summary: revisit balancer so_timeout Key: HDFS-6166 URL: https://issues.apache.org/jira/browse/HDFS-6166 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.3.0, 3.0.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Priority: Blocker HDFS-5806 changed the socket read timeout for the balancer connection to DN to 60 seconds. This works as long as balancer bandwidth is such that it's safe to assume that the DN will easily complete the operation within this time. Obviously this isn't a good assumption. When this assumption isn't valid, the balancer will timeout the cmd BUT it will then be out-of-sync with the datanode (balancer thinks the DN has room to do more work, DN is still working on the request and will fail any subsequent requests with threads quota exceeded errors). This causes expensive NN traffic via getBlocks() and also causes lots of WARNS int the balancer log. Unfortunately the protocol is such that it's impossible to tell if the DN is busy working on replacing the block, OR is in bad shape and will never finish. So, in the interest of a small change to deal with both situations, I propose the following two changes: * Crank of the socket read timeout to 20 minutes * Delay looking at a node for a bit if we did timeout in this way (the DN could still have xceiver threads working on the replace -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6166) revisit balancer so_timeout
[ https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-6166: - Attachment: HDFS-6166.patch Proposed patch. revisit balancer so_timeout Key: HDFS-6166 URL: https://issues.apache.org/jira/browse/HDFS-6166 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 3.0.0, 2.3.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Priority: Blocker Attachments: HDFS-6166.patch HDFS-5806 changed the socket read timeout for the balancer connection to the DN to 60 seconds. This works as long as balancer bandwidth is such that it's safe to assume that the DN will easily complete the operation within this time. Obviously this isn't a good assumption. When this assumption isn't valid, the balancer will time out the cmd BUT it will then be out of sync with the datanode (the balancer thinks the DN has room to do more work, while the DN is still working on the request and will fail any subsequent requests with threads quota exceeded errors). This causes expensive NN traffic via getBlocks() and also causes lots of WARNs in the balancer log. Unfortunately the protocol is such that it's impossible to tell if the DN is busy working on replacing the block, OR is in bad shape and will never finish. So, in the interest of a small change to deal with both situations, I propose the following two changes: * Crank up the socket read timeout to 20 minutes * Delay looking at a node for a bit if we did time out in this way (the DN could still have xceiver threads working on the replace). -- This message was sent by Atlassian JIRA (v6.2#6252)
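As a rough illustration of the two proposed changes (a sketch of the idea, not the attached HDFS-6166.patch itself), the hypothetical code below cranks up the read timeout on the DN connection and remembers nodes that timed out so they are left alone for a while; the class name, field names, and durations are made up for the example:
{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BalancerTimeoutSketch {
    static final int READ_TIMEOUT_MS = 20 * 60 * 1000;   // change 1: crank up the read timeout to 20 minutes
    static final long DEFER_MS = 10 * 60 * 1000;         // how long to leave a timed-out DN alone
    final Map<InetSocketAddress, Long> deferredNodes = new ConcurrentHashMap<>();

    boolean shouldSkip(InetSocketAddress dn) {
        Long until = deferredNodes.get(dn);
        return until != null && System.currentTimeMillis() < until;
    }

    void replaceBlock(InetSocketAddress dn) throws IOException {
        try (Socket sock = new Socket()) {
            sock.connect(dn, 60_000);
            sock.setSoTimeout(READ_TIMEOUT_MS);
            // ... send the replaceBlock request and read the reply here ...
        } catch (SocketTimeoutException e) {
            // change 2: the DN may still be working on the move, so back off for a while
            deferredNodes.put(dn, System.currentTimeMillis() + DEFER_MS);
            throw e;
        }
    }
}
{code}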
[jira] [Updated] (HDFS-6166) revisit balancer so_timeout
[ https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-6166: - Target Version/s: 3.0.0, 2.4.0 (was: 2.4.0) Status: Patch Available (was: Open) revisit balancer so_timeout Key: HDFS-6166 URL: https://issues.apache.org/jira/browse/HDFS-6166 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.3.0, 3.0.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Priority: Blocker Attachments: HDFS-6166.patch HDFS-5806 changed the socket read timeout for the balancer connection to the DN to 60 seconds. This works as long as balancer bandwidth is such that it's safe to assume that the DN will easily complete the operation within this time. Obviously this isn't a good assumption. When this assumption isn't valid, the balancer will time out the cmd BUT it will then be out of sync with the datanode (the balancer thinks the DN has room to do more work, while the DN is still working on the request and will fail any subsequent requests with threads quota exceeded errors). This causes expensive NN traffic via getBlocks() and also causes lots of WARNs in the balancer log. Unfortunately the protocol is such that it's impossible to tell if the DN is busy working on replacing the block, OR is in bad shape and will never finish. So, in the interest of a small change to deal with both situations, I propose the following two changes: * Crank up the socket read timeout to 20 minutes * Delay looking at a node for a bit if we did time out in this way (the DN could still have xceiver threads working on the replace). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs
[ https://issues.apache.org/jira/browse/HDFS-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-5806: - Attachment: HDFS-5806-0.23.patch 0.23 version of patch balancer should set SoTimeout to avoid indefinite hangs --- Key: HDFS-5806 URL: https://issues.apache.org/jira/browse/HDFS-5806 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 3.0.0, 2.2.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Fix For: 2.3.0 Attachments: HDFS-5806-0.23.patch, HDFS-5806.patch Simple patch to avoid the balancer hanging when datanode stops responding to requests. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5788) listLocatedStatus response can be very large
[ https://issues.apache.org/jira/browse/HDFS-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-5788: - Attachment: HDFS-5788.patch patch for trunk. listLocatedStatus response can be very large Key: HDFS-5788 URL: https://issues.apache.org/jira/browse/HDFS-5788 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.0.0, 0.23.10, 2.2.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-5788.patch Currently we limit the size of listStatus requests to a default of 1000 entries. This works fine except in the case of listLocatedStatus where the location information can be quite large. As an example, a directory with 7000 entries, 4 blocks each, 3 way replication - a listLocatedStatus response is over 1MB. This can chew up very large amounts of memory in the NN if lots of clients try to do this simultaneously. Seems like it would be better if we also considered the amount of location information being returned when deciding how many files to return. Patch will follow shortly. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
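A back-of-the-envelope check of the "over 1MB" figure in the description; the per-block and per-replica byte counts below are deliberately conservative assumptions for illustration (the real serialized located-block data is larger), so the actual response is bigger still:
{code:java}
public class ListingSizeEstimate {
    public static void main(String[] args) {
        int entries = 7000, blocksPerFile = 4, replication = 3;
        int bytesPerBlockHeader = 16;      // assumed block id/length/genstamp overhead
        int bytesPerReplicaLocation = 8;   // assumed cost per replica reference

        long locationBytes = (long) entries * blocksPerFile
                * (bytesPerBlockHeader + replication * bytesPerReplicaLocation);
        // 7000 * 4 * (16 + 3*8) = 1,120,000 bytes of location data alone,
        // before any of the per-file status fields are counted.
        System.out.println("Approx. location payload: " + locationBytes + " bytes");
    }
}
{code}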
[jira] [Updated] (HDFS-5788) listLocatedStatus response can be very large
[ https://issues.apache.org/jira/browse/HDFS-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-5788: - Status: Patch Available (was: Open) listLocatedStatus response can be very large Key: HDFS-5788 URL: https://issues.apache.org/jira/browse/HDFS-5788 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.2.0, 0.23.10, 3.0.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-5788.patch Currently we limit the size of listStatus requests to a default of 1000 entries. This works fine except in the case of listLocatedStatus where the location information can be quite large. As an example, a directory with 7000 entries, 4 blocks each, 3 way replication - a listLocatedStatus response is over 1MB. This can chew up very large amounts of memory in the NN if lots of clients try to do this simultaneously. Seems like it would be better if we also considered the amount of location information being returned when deciding how many files to return. Patch will follow shortly. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs
[ https://issues.apache.org/jira/browse/HDFS-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13879273#comment-13879273 ] Nathan Roberts commented on HDFS-5806: -- Andrew, thanks for taking a look. Sorry about not mentioning the testing. Didn't have great ideas on how to test. Basically I did the following: - Changed the balancer so that the sotimeout was 1 second - Changed the balancer so that the sleep time between iterations was 2 seconds - Changed dispatch() within the balancer to randomly not send the request; this causes the response read to time out due to the sotimeout - Made sure TestBalancer still worked balancer should set SoTimeout to avoid indefinite hangs --- Key: HDFS-5806 URL: https://issues.apache.org/jira/browse/HDFS-5806 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 3.0.0, 2.2.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-5806.patch Simple patch to avoid the balancer hanging when datanode stops responding to requests. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
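The timeout-forcing trick can be reproduced outside the balancer; the standalone demo below (not TestBalancer, and using no Hadoop classes) has a fake peer that randomly drops its reply while the client reads with a 1-second SO_TIMEOUT:
{code:java}
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.Random;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            new Thread(() -> {
                try (Socket peer = server.accept()) {
                    if (new Random().nextBoolean()) {
                        peer.getOutputStream().write(1);   // normal reply
                    } else {
                        Thread.sleep(5_000);               // "forget" to answer, like the skipped dispatch
                    }
                } catch (Exception ignored) { }
            }).start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                client.setSoTimeout(1_000);                // 1 second, as in the test setup above
                System.out.println("got reply: " + client.getInputStream().read());
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out -- exercises the timeout handling path");
            }
        }
    }
}
{code}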
[jira] [Created] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs
Nathan Roberts created HDFS-5806: Summary: balancer should set SoTimeout to avoid indefinite hangs Key: HDFS-5806 URL: https://issues.apache.org/jira/browse/HDFS-5806 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.2.0, 3.0.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Simple patch to avoid the balancer hanging when datanode stops responding to requests. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs
[ https://issues.apache.org/jira/browse/HDFS-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-5806: - Attachment: HDFS-5806.patch use setSoTimeout() to avoid read hangs. balancer should set SoTimeout to avoid indefinite hangs --- Key: HDFS-5806 URL: https://issues.apache.org/jira/browse/HDFS-5806 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 3.0.0, 2.2.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-5806.patch Simple patch to avoid the balancer hanging when datanode stops responding to requests. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs
[ https://issues.apache.org/jira/browse/HDFS-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-5806: - Status: Patch Available (was: Open) balancer should set SoTimeout to avoid indefinite hangs --- Key: HDFS-5806 URL: https://issues.apache.org/jira/browse/HDFS-5806 Project: Hadoop HDFS Issue Type: Bug Components: balancer Affects Versions: 2.2.0, 3.0.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Attachments: HDFS-5806.patch Simple patch to avoid the balancer hanging when datanode stops responding to requests. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HDFS-5788) listLocatedStatus response can be very large
Nathan Roberts created HDFS-5788: Summary: listLocatedStatus response can be very large Key: HDFS-5788 URL: https://issues.apache.org/jira/browse/HDFS-5788 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.2.0, 0.23.10, 3.0.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Currently we limit the size of listStatus requests to a default of 1000 entries. This works fine except in the case of listLocatedStatus where the location information can be quite large. As an example, a directory with 7000 entries, 4 blocks each, 3 way replication - a listLocatedStatus response is over 1MB. This can chew up very large amounts of memory in the NN if lots of clients try to do this simultaneously. Seems like it would be better if we also considered the amount of location information being returned when deciding how many files to return. Patch will follow shortly. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5788) listLocatedStatus response can be very large
[ https://issues.apache.org/jira/browse/HDFS-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13874107#comment-13874107 ] Nathan Roberts commented on HDFS-5788: -- A simple solution is: Restrict the size to dfs.ls.limit (default 1000) files OR dfs.ls.limit block locations, whichever comes first (obviously always returning only whole entries, so we could send more than this number of locations). Yes, it will require more RPCs. However, it would seem to lower the risk of a DoS. listLocatedStatus response can be very large Key: HDFS-5788 URL: https://issues.apache.org/jira/browse/HDFS-5788 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.0.0, 0.23.10, 2.2.0 Reporter: Nathan Roberts Assignee: Nathan Roberts Currently we limit the size of listStatus requests to a default of 1000 entries. This works fine except in the case of listLocatedStatus where the location information can be quite large. As an example, a directory with 7000 entries, 4 blocks each, 3-way replication - a listLocatedStatus response is over 1MB. This can chew up very large amounts of memory in the NN if lots of clients try to do this simultaneously. Seems like it would be better if we also considered the amount of location information being returned when deciding how many files to return. Patch will follow shortly. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
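A simplified sketch of that limit, with FileEntry and lsLimit as hypothetical stand-ins for the real namenode types and dfs.ls.limit: each batch stops as soon as either the entry count or the accumulated location count reaches the limit, while still returning whole entries.
{code:java}
import java.util.ArrayList;
import java.util.List;

class PartialListingSketch {
    static class FileEntry {
        final String name;
        final int blockLocations;           // blocks * replication for this file
        FileEntry(String name, int blockLocations) {
            this.name = name;
            this.blockLocations = blockLocations;
        }
    }

    /** Children to send in one response; the client resumes from the next name. */
    static List<FileEntry> nextBatch(List<FileEntry> children, int startIndex, int lsLimit) {
        List<FileEntry> batch = new ArrayList<>();
        int locations = 0;
        for (int i = startIndex; i < children.size(); i++) {
            FileEntry e = children.get(i);
            batch.add(e);                   // whole entries only, so the location limit may be slightly exceeded
            locations += e.blockLocations;
            if (batch.size() >= lsLimit || locations >= lsLimit) {
                break;                      // stop at whichever limit is hit first
            }
        }
        return batch;
    }
}
{code}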
[jira] [Updated] (HDFS-5477) Block manager as a service
[ https://issues.apache.org/jira/browse/HDFS-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-5477: - Attachment: Standalone BM.pdf Re-attach standalone pdf to fix graphics. Block manager as a service -- Key: HDFS-5477 URL: https://issues.apache.org/jira/browse/HDFS-5477 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Attachments: Proposal.pdf, Proposal.pdf, Standalone BM.pdf, Standalone BM.pdf The block manager needs to evolve towards having the ability to run as a standalone service to improve NN vertical and horizontal scalability. The goal is reducing the memory footprint of the NN proper to support larger namespaces, and improve overall performance by decoupling the block manager from the namespace and its lock. Ideally, a distinct BM will be transparent to clients and DNs. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (HDFS-5477) Block manager as a service
[ https://issues.apache.org/jira/browse/HDFS-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-5477: - Attachment: Proposal.pdf Fix formatting problems in PDF. Block manager as a service -- Key: HDFS-5477 URL: https://issues.apache.org/jira/browse/HDFS-5477 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Attachments: Proposal.pdf, Proposal.pdf, Standalone BM.pdf The block manager needs to evolve towards having the ability to run as a standalone service to improve NN vertical and horizontal scalability. The goal is reducing the memory footprint of the NN proper to support larger namespaces, and improve overall performance by decoupling the block manager from the namespace and its lock. Ideally, a distinct BM will be transparent to clients and DNs. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (HDFS-5535) Umbrella jira for improved HDFS rolling upgrades
[ https://issues.apache.org/jira/browse/HDFS-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836540#comment-13836540 ] Nathan Roberts commented on HDFS-5535: -- Hi. Initial draft is still a little rough. Will try to get it up Tuesday for initial comments and suggestions. Umbrella jira for improved HDFS rolling upgrades Key: HDFS-5535 URL: https://issues.apache.org/jira/browse/HDFS-5535 Project: Hadoop HDFS Issue Type: New Feature Components: datanode, ha, hdfs-client, namenode Affects Versions: 3.0.0, 2.2.0 Reporter: Nathan Roberts In order to roll a new HDFS release through a large cluster quickly and safely, a few enhancements are needed in HDFS. An initial High level design document will be attached to this jira, and sub-jiras will itemize the individual tasks. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5496) Make replication queue initialization asynchronous
[ https://issues.apache.org/jira/browse/HDFS-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan Roberts updated HDFS-5496: - Issue Type: Sub-task (was: Improvement) Parent: HDFS-5535 Make replication queue initialization asynchronous -- Key: HDFS-5496 URL: https://issues.apache.org/jira/browse/HDFS-5496 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Reporter: Kihwal Lee Today, initialization of replication queues blocks safe mode exit and certain HA state transitions. For a big name space, this can take hundreds of seconds with the FSNamesystem write lock held. During this time, important requests (e.g. initial block reports, heartbeat, etc) are blocked. The effect of delaying the initialization would be not starting replication right away, but I think the benefit outweighs the cost. If we make it asynchronous, the work per iteration should be limited, so that the lock duration is capped. If full/incremental block reports and any other requests that modify block state properly perform replication checks while the blocks are scanned and the queues are populated in the background, every block will be processed. (Some may be processed twice.) The replication monitor should run even before all blocks are processed. This will allow the namenode to exit safe mode and start serving immediately even with a big name space. It will also reduce the HA failover latency. -- This message was sent by Atlassian JIRA (v6.1#6144)
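A minimal sketch of the bounded-work idea, assuming a chunked scan under the write lock; the lock, chunk size, and block iterator here are stand-ins rather than actual namenode classes:
{code:java}
import java.util.Iterator;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class AsyncQueueInitSketch {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    private static final int BLOCKS_PER_ITERATION = 10_000;   // caps how long the write lock is held

    void initializeReplicationQueues(Iterator<Long> allBlockIds) {
        while (allBlockIds.hasNext()) {
            fsLock.writeLock().lock();
            try {
                for (int i = 0; i < BLOCKS_PER_ITERATION && allBlockIds.hasNext(); i++) {
                    checkReplication(allBlockIds.next());      // queue under/over-replicated blocks
                }
            } finally {
                fsLock.writeLock().unlock();
            }
            // Lock released between chunks: block reports, heartbeats, etc. can get in,
            // and the replication monitor can already drain whatever queues exist so far.
        }
    }

    private void checkReplication(long blockId) {
        // placeholder for the real per-block replication check
    }
}
{code}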