[jira] [Commented] (HDFS-14531) Datanode's ScanInfo requires excessive memory

2019-06-07 Thread Nathan Roberts (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858709#comment-16858709
 ] 

Nathan Roberts commented on HDFS-14531:
---

Actually, maybe disabling the DirectoryScanner is more than a workaround. Maybe 
that should be the default. What is this really protecting against these days? 
For large disks it's super expensive memory-wise, and if there are enough blocks 
or enough system memory pressure it can cause tons of I/O as well.
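
If we go that route, a minimal sketch of the workaround (assuming the standard 
dfs.datanode.directoryscan.interval key, and that a negative value is treated as 
"periodic directory scanning disabled" -- worth verifying on the branch in use):
{code}
import org.apache.hadoop.conf.Configuration;

public class DisableDirectoryScanner {
  public static void main(String[] args) {
    // Assumption: a negative scan interval disables the DirectoryScanner.
    // The same value can be set in hdfs-site.xml on each datanode.
    Configuration conf = new Configuration();
    conf.setLong("dfs.datanode.directoryscan.interval", -1L); // seconds
    System.out.println("directoryscan.interval = "
        + conf.getLong("dfs.datanode.directoryscan.interval", 21600L));
  }
}
{code}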

 

> Datanode's ScanInfo requires excessive memory
> -
>
> Key: HDFS-14531
> URL: https://issues.apache.org/jira/browse/HDFS-14531
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Priority: Major
> Attachments: Screen Shot 2019-05-31 at 12.25.54 PM.png
>
>
> The DirectoryScanner's ScanInfo map consumes ~4.5X as much memory as the 
> replica map.  For 1.1M replicas, the replica map is ~91MB while the scan info 
> map is ~405MB.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12441) Suppress UnresolvedPathException in namenode log

2017-09-15 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-12441:
--
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

> Suppress UnresolvedPathException in namenode log
> 
>
> Key: HDFS-12441
> URL: https://issues.apache.org/jira/browse/HDFS-12441
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Minor
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.3, 3.1.0
>
> Attachments: HDFS-12441.patch
>
>
> {{UnresolvedPathException}} is thrown as a normal part of resolving symlinks. This 
> doesn't need to be logged at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12441) Suppress UnresolvedPathException in namenode log

2017-09-15 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168433#comment-16168433
 ] 

Nathan Roberts commented on HDFS-12441:
---

Cherry picked to branch-3.0, branch-2, and branch-2.8.


> Suppress UnresolvedPathException in namenode log
> 
>
> Key: HDFS-12441
> URL: https://issues.apache.org/jira/browse/HDFS-12441
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Minor
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.3, 3.1.0
>
> Attachments: HDFS-12441.patch
>
>
> {{UnresolvedPathException}} is thrown as a normal part of resolving symlinks. This 
> doesn't need to be logged at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12441) Suppress UnresolvedPathException in namenode log

2017-09-15 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-12441:
--
Fix Version/s: 2.8.3
   2.9.0

> Suppress UnresolvedPathException in namenode log
> 
>
> Key: HDFS-12441
> URL: https://issues.apache.org/jira/browse/HDFS-12441
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Minor
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.3, 3.1.0
>
> Attachments: HDFS-12441.patch
>
>
> {{UnresolvedPathException}} is thrown as a normal part of resolving symlinks. This 
> doesn't need to be logged at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12441) Suppress UnresolvedPathException in namenode log

2017-09-15 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-12441:
--
Fix Version/s: 3.0.0-beta1

> Suppress UnresolvedPathException in namenode log
> 
>
> Key: HDFS-12441
> URL: https://issues.apache.org/jira/browse/HDFS-12441
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Minor
> Fix For: 3.0.0-beta1, 3.1.0
>
> Attachments: HDFS-12441.patch
>
>
> {{UnresolvedPathException}} is thrown as a normal part of resolving symlinks. This 
> doesn't need to be logged at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12441) Suppress UnresolvedPathException in namenode log

2017-09-15 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-12441:
--
Fix Version/s: 3.1.0

> Suppress UnresolvedPathException in namenode log
> 
>
> Key: HDFS-12441
> URL: https://issues.apache.org/jira/browse/HDFS-12441
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: HDFS-12441.patch
>
>
> {{UnresolvedPathException}} is thrown as a normal part of resolving symlinks. This 
> doesn't need to be logged at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12441) Suppress UnresolvedPathException in namenode log

2017-09-15 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168277#comment-16168277
 ] 

Nathan Roberts commented on HDFS-12441:
---

Thanks [~kihwal] for the patch and [~shahrs87] for the review. +1. I will 
commit this shortly.

> Suppress UnresolvedPathException in namenode log
> 
>
> Key: HDFS-12441
> URL: https://issues.apache.org/jira/browse/HDFS-12441
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Minor
> Attachments: HDFS-12441.patch
>
>
> {{UnresolvedPathException}} is thrown as a normal part of resolving symlinks. This 
> doesn't need to be logged at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block

2017-08-02 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111696#comment-16111696
 ] 

Nathan Roberts commented on HDFS-12102:
---

[~arpitagarwal] Hi Arpit. To provide a bit more background on this feature - 
we've seen multiple cases where there are many bad blocks stored on a disk. 
Just because of the way drives tend to fail, one bad block indicates there are 
probably many others. The volumeScanner will eventually find them over a 
multi-week period, but this leaves the cluster susceptible to data-loss due to 
lots of replicas being corrupt on a single misbehaving disk. The idea with this 
jira is to use a found corrupt block as a hint that there are likely more and 
we should do a scan over the drive at a faster rate to more quickly find other 
corrupt blocks on the drive. Thoughts?
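
To make the intent concrete, here is a rough sketch of the policy (hypothetical 
names only, not the attached patch):
{code}
/**
 * Rough sketch with hypothetical names -- not the HDFS-12102 patch itself.
 * A corrupt replica is treated as a hint that the volume may be going bad,
 * so the scan period is dropped until a clean pass completes.
 */
public class FastScanPolicy {
  private static final long NORMAL_SCAN_PERIOD_MS = 3L * 7 * 24 * 60 * 60 * 1000; // ~3 weeks
  private static final long FAST_SCAN_PERIOD_MS   = 60L * 60 * 1000;              // ~1 hour

  private long scanPeriodMs = NORMAL_SCAN_PERIOD_MS;

  /** Called when the scanner finds a corrupt replica on this volume. */
  public synchronized void onCorruptBlockFound() {
    scanPeriodMs = FAST_SCAN_PERIOD_MS; // drop the throttle and rescan the drive quickly
  }

  /** Called when a full pass finishes without finding further corruption. */
  public synchronized void onCleanPassCompleted() {
    scanPeriodMs = NORMAL_SCAN_PERIOD_MS;
  }

  public synchronized long getScanPeriodMs() {
    return scanPeriodMs;
  }
}
{code}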

> VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt 
> block
> 
>
> Key: HDFS-12102
> URL: https://issues.apache.org/jira/browse/HDFS-12102
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Affects Versions: 2.8.2
>Reporter: Ashwin Ramesh
>Priority: Minor
> Fix For: 2.8.2
>
> Attachments: HDFS-12102-001.patch, HDFS-12102-002.patch, 
> HDFS-12102-003.patch
>
>
> When the VolumeScanner sees a corrupt block, it restarts the scan and scans 
> the blocks at a much faster rate with a negligible scan period. This is so that 
> it doesn't take 3 weeks to report blocks, since a corrupt block means an 
> increased likelihood that there are more corrupt blocks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-5042) Completed files lost after power failure

2017-07-12 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083970#comment-16083970
 ] 

Nathan Roberts commented on HDFS-5042:
--

Wondering if we should make this feature configurable. There are some 
filesystems (like ext4) where these fsyncs affect much more than the 
datanode process. If YARN is using the same disks and is writing significant 
amounts of intermediate data or performing other disk-heavy operations, the 
entire system will see significantly degraded performance (e.g. disks at 100% 
utilization for tens of minutes). 
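
For context, a minimal sketch of the kind of guarded directory fsync this would 
control (the property name is hypothetical; the committed change may differ):
{code}
import java.io.File;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

/** Sketch only; "demo.sync.on.finalize" is a made-up flag for illustration. */
public class GuardedDirSync {
  private static final boolean SYNC_ON_FINALIZE =
      Boolean.parseBoolean(System.getProperty("demo.sync.on.finalize", "true"));

  /** Fsync a directory so a completed rename into it survives power loss (POSIX). */
  static void fsyncDirectory(File dir) throws IOException {
    if (!SYNC_ON_FINALIZE) {
      return; // let operators trade durability for less disk contention
    }
    try (FileChannel ch = FileChannel.open(dir.toPath(), StandardOpenOption.READ)) {
      ch.force(true); // may throw or be a no-op on non-POSIX filesystems
    }
  }

  public static void main(String[] args) throws IOException {
    fsyncDirectory(new File(args.length > 0 ? args[0] : "."));
  }
}
{code}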

> Completed files lost after power failure
> 
>
> Key: HDFS-5042
> URL: https://issues.apache.org/jira/browse/HDFS-5042
> Project: Hadoop HDFS
>  Issue Type: Bug
> Environment: ext3 on CentOS 5.7 (kernel 2.6.18-274.el5)
>Reporter: Dave Latham
>Assignee: Vinayakumar B
>Priority: Critical
> Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>
> Attachments: HDFS-5042-01.patch, HDFS-5042-02.patch, 
> HDFS-5042-03.patch, HDFS-5042-04.patch, HDFS-5042-05-branch-2.patch, 
> HDFS-5042-05.patch, HDFS-5042-branch-2-01.patch, HDFS-5042-branch-2-05.patch, 
> HDFS-5042-branch-2.7-05.patch, HDFS-5042-branch-2.7-06.patch, 
> HDFS-5042-branch-2.8-05.patch, HDFS-5042-branch-2.8-06.patch, 
> HDFS-5042-branch-2.8-addendum.patch
>
>
> We suffered a cluster wide power failure after which HDFS lost data that it 
> had acknowledged as closed and complete.
> The client was HBase which compacted a set of HFiles into a new HFile, then 
> after closing the file successfully, deleted the previous versions of the 
> file.  The cluster then lost power, and when brought back up the newly 
> created file was marked CORRUPT.
> Based on reading the logs it looks like the replicas were created by the 
> DataNodes in the 'blocksBeingWritten' directory.  Then when the file was 
> closed they were moved to the 'current' directory.  After the power cycle 
> those replicas were again in the blocksBeingWritten directory of the 
> underlying file system (ext3).  When those DataNodes reported in to the 
> NameNode it deleted those replicas and lost the file.
> Some possible fixes could be having the DataNode fsync the directory(s) after 
> moving the block from blocksBeingWritten to current to ensure the rename is 
> durable or having the NameNode accept replicas from blocksBeingWritten under 
> certain circumstances.
> Log snippets from RS (RegionServer), NN (NameNode), DN (DataNode):
> {noformat}
> RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: 
> Creating 
> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>  with permission=rwxrwxrwx
> NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> NameSystem.allocateBlock: 
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>  blk_1395839728632046111_357084589
> DN 2013-06-29 11:16:06,832 INFO 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block 
> blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: 
> /10.0.5.237:50010
> NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to 
> blk_1395839728632046111_357084589 size 25418340
> NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to 
> blk_1395839728632046111_357084589 size 25418340
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to 
> blk_1395839728632046111_357084589 size 25418340
> DN 2013-06-29 11:16:11,385 INFO 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Received block 
> blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
> DN 2013-06-29 11:16:11,385 INFO 
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block 
> blk_1395839728632046111_357084589 terminating
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: Removing 
> lease on  file 
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>  from client DFSClient_hb_rs_hs745,60020,1372470111932
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR* 
> NameSystem.completeFile: file 
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>  is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store: 
> Renaming compacted file at 
> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>  to 
> 

[jira] [Commented] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block

2017-07-07 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078714#comment-16078714
 ] 

Nathan Roberts commented on HDFS-12102:
---

Hi [~aramesh2]. Thanks for the patch. A couple of quick comments (I'll look more 
closely early next week):
- Need to change hdfs-default.xml to include the new config options.
- LOG.debug statements should be surrounded by if (LOG.isDebugEnabled()) checks 
to reduce overhead (see the sketch after this list).
- Is it possible to disable the fast-scan behavior altogether? It might be 
good for the default behavior to remain the same; if someone wants fast-scan, 
they have to enable it. Maybe setting the period to -1 could be a way?
- The description of corruptBlockThreshold doesn't really match its use. I think 
it's just a straight count. Maybe we don't even need it at all since it's not 
configurable and only ever set to 1.
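
For the LOG.debug bullet, this is the guard pattern I mean (generic sketch, 
assuming an slf4j-style logger):
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DebugGuardExample {
  private static final Logger LOG = LoggerFactory.getLogger(DebugGuardExample.class);

  void reportSuspectBlock(String volume, long blockId) {
    // Guarding the call avoids building the message string when debug is off.
    if (LOG.isDebugEnabled()) {
      LOG.debug("Dropping scan throttle on volume " + volume
          + " after corrupt block " + blockId);
    }
  }
}
{code}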





> VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt 
> block
> 
>
> Key: HDFS-12102
> URL: https://issues.apache.org/jira/browse/HDFS-12102
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Affects Versions: 2.8.2
>Reporter: Ashwin Ramesh
>Priority: Minor
> Fix For: 2.8.2
>
> Attachments: HDFS-12102-001.patch
>
>
> When the VolumeScanner sees a corrupt block, it restarts the scan and scans 
> the blocks at a much faster rate with a negligible scan period. This is so that 
> it doesn't take 3 weeks to report blocks, since a corrupt block means an 
> increased likelihood that there are more corrupt blocks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block

2017-07-07 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-12102:
--
Issue Type: New Feature  (was: Improvement)

> VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt 
> block
> 
>
> Key: HDFS-12102
> URL: https://issues.apache.org/jira/browse/HDFS-12102
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Affects Versions: 2.8.2
>Reporter: Ashwin Ramesh
>Priority: Minor
> Fix For: 2.8.2
>
> Attachments: HDFS-12102-001.patch
>
>
> When the VolumeScanner sees a corrupt block, it restarts the scan and scans 
> the blocks at a much faster rate with a negligible scan period. This is so that 
> it doesn't take 3 weeks to report blocks, since a corrupt block means an 
> increased likelihood that there are more corrupt blocks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block

2017-07-07 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-12102:
--
Issue Type: Improvement  (was: New Feature)

> VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt 
> block
> 
>
> Key: HDFS-12102
> URL: https://issues.apache.org/jira/browse/HDFS-12102
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Affects Versions: 2.8.2
>Reporter: Ashwin Ramesh
>Priority: Minor
> Fix For: 2.8.2
>
> Attachments: HDFS-12102-001.patch
>
>
> When the VolumeScanner sees a corrupt block, it restarts the scan and scans 
> the blocks at a much faster rate with a negligible scan period. This is so that 
> it doesn't take 3 weeks to report blocks, since a corrupt block means an 
> increased likelihood that there are more corrupt blocks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11818) TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently

2017-05-12 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-11818:
--
Affects Version/s: 3.0.0-alpha2
   Status: Patch Available  (was: Open)

> TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently
> ---
>
> Key: HDFS-11818
> URL: https://issues.apache.org/jira/browse/HDFS-11818
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.2
>Reporter: Eric Badger
>Assignee: Nathan Roberts
> Attachments: HDFS-11818-branch-2.patch, HDFS-11818.patch
>
>
> Saw a weird Mockito failure in last night's build with the following stack 
> trace:
> {noformat}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> INodeFile cannot be returned by isRunning()
> isRunning() should return boolean
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.addBlockOnNodes(TestBlockManager.java:555)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:404)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:397)
> {noformat}
> This is pretty confusing since we explicitly set isRunning() to return true 
> in TestBlockManager's \@Before method
> {noformat}
> 154Mockito.doReturn(true).when(fsn).isRunning();
> {noformat}
> Also saw the following exception in the logs:
> {noformat}
> 2017-05-12 05:42:27,903 ERROR blockmanagement.BlockManager 
> (BlockManager.java:run(2796)) - Error while processing replication queues 
> async
> org.mockito.exceptions.base.MockitoException: 
> 'writeLockInterruptibly' is a *void method* and it *cannot* be stubbed with a 
> *return value*!
> Voids are usually stubbed with Throwables:
> doThrow(exception).when(mock).someVoidMethod();
> If the method you are trying to stub is *overloaded* then make sure you are 
> calling the right overloaded version.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatesAsync(BlockManager.java:2841)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.access$100(BlockManager.java:120)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$1.run(BlockManager.java:2792)
> {noformat}
> This is also weird since we don't do any explicit mocking with 
> {{writeLockInterruptibly}} via fsn in the test. It has to be something 
> changing the mocks or non-thread safe access or something like that. I can't 
> explain the failures otherwise. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11818) TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently

2017-05-12 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-11818:
--
Attachment: HDFS-11818.patch
HDFS-11818-branch-2.patch

Patches for trunk and branch-2. The branch-2 patch picks cleanly to 2.8.

> TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently
> ---
>
> Key: HDFS-11818
> URL: https://issues.apache.org/jira/browse/HDFS-11818
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.2
>Reporter: Eric Badger
>Assignee: Nathan Roberts
> Attachments: HDFS-11818-branch-2.patch, HDFS-11818.patch
>
>
> Saw a weird Mockito failure in last night's build with the following stack 
> trace:
> {noformat}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> INodeFile cannot be returned by isRunning()
> isRunning() should return boolean
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.addBlockOnNodes(TestBlockManager.java:555)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:404)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:397)
> {noformat}
> This is pretty confusing since we explicitly set isRunning() to return true 
> in TestBlockManager's \@Before method
> {noformat}
> 154Mockito.doReturn(true).when(fsn).isRunning();
> {noformat}
> Also saw the following exception in the logs:
> {noformat}
> 2017-05-12 05:42:27,903 ERROR blockmanagement.BlockManager 
> (BlockManager.java:run(2796)) - Error while processing replication queues 
> async
> org.mockito.exceptions.base.MockitoException: 
> 'writeLockInterruptibly' is a *void method* and it *cannot* be stubbed with a 
> *return value*!
> Voids are usually stubbed with Throwables:
> doThrow(exception).when(mock).someVoidMethod();
> If the method you are trying to stub is *overloaded* then make sure you are 
> calling the right overloaded version.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatesAsync(BlockManager.java:2841)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.access$100(BlockManager.java:120)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$1.run(BlockManager.java:2792)
> {noformat}
> This is also weird since we don't do any explicit mocking with 
> {{writeLockInterruptibly}} via fsn in the test. It has to be something 
> changing the mocks or non-thread safe access or something like that. I can't 
> explain the failures otherwise. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11818) TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently

2017-05-12 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16008325#comment-16008325
 ] 

Nathan Roberts commented on HDFS-11818:
---

Know what the issue is. Will post a patch shortly. 

> TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently
> ---
>
> Key: HDFS-11818
> URL: https://issues.apache.org/jira/browse/HDFS-11818
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Nathan Roberts
>
> Saw a weird Mockito failure in last night's build with the following stack 
> trace:
> {noformat}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> INodeFile cannot be returned by isRunning()
> isRunning() should return boolean
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.addBlockOnNodes(TestBlockManager.java:555)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:404)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:397)
> {noformat}
> This is pretty confusing since we explicitly set isRunning() to return true 
> in TestBlockManager's \@Before method
> {noformat}
> 154Mockito.doReturn(true).when(fsn).isRunning();
> {noformat}
> Also saw the following exception in the logs:
> {noformat}
> 2017-05-12 05:42:27,903 ERROR blockmanagement.BlockManager 
> (BlockManager.java:run(2796)) - Error while processing replication queues 
> async
> org.mockito.exceptions.base.MockitoException: 
> 'writeLockInterruptibly' is a *void method* and it *cannot* be stubbed with a 
> *return value*!
> Voids are usually stubbed with Throwables:
> doThrow(exception).when(mock).someVoidMethod();
> If the method you are trying to stub is *overloaded* then make sure you are 
> calling the right overloaded version.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatesAsync(BlockManager.java:2841)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.access$100(BlockManager.java:120)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$1.run(BlockManager.java:2792)
> {noformat}
> This is also weird since we don't do any explicit mocking with 
> {{writeLockInterruptibly}} via fsn in the test. It has to be something 
> changing the mocks or non-thread safe access or something like that. I can't 
> explain the failures otherwise. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-11818) TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently

2017-05-12 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts reassigned HDFS-11818:
-

Assignee: Nathan Roberts

> TestBlockManager.testSufficientlyReplBlocksUsesNewRack fails intermittently
> ---
>
> Key: HDFS-11818
> URL: https://issues.apache.org/jira/browse/HDFS-11818
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Eric Badger
>Assignee: Nathan Roberts
>
> Saw a weird Mockito failure in last night's build with the following stack 
> trace:
> {noformat}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> INodeFile cannot be returned by isRunning()
> isRunning() should return boolean
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.addBlockOnNodes(TestBlockManager.java:555)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:404)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testSufficientlyReplBlocksUsesNewRack(TestBlockManager.java:397)
> {noformat}
> This is pretty confusing since we explicitly set isRunning() to return true 
> in TestBlockManager's \@Before method
> {noformat}
> 154Mockito.doReturn(true).when(fsn).isRunning();
> {noformat}
> Also saw the following exception in the logs:
> {noformat}
> 2017-05-12 05:42:27,903 ERROR blockmanagement.BlockManager 
> (BlockManager.java:run(2796)) - Error while processing replication queues 
> async
> org.mockito.exceptions.base.MockitoException: 
> 'writeLockInterruptibly' is a *void method* and it *cannot* be stubbed with a 
> *return value*!
> Voids are usually stubbed with Throwables:
> doThrow(exception).when(mock).someVoidMethod();
> If the method you are trying to stub is *overloaded* then make sure you are 
> calling the right overloaded version.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatesAsync(BlockManager.java:2841)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.access$100(BlockManager.java:120)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$1.run(BlockManager.java:2792)
> {noformat}
> This is also weird since we don't do any explicit mocking with 
> {{writeLockInterruptibly}} via fsn in the test. It has to be something 
> changing the mocks or non-thread safe access or something like that. I can't 
> explain the failures otherwise. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-10 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005127#comment-16005127
 ] 

Nathan Roberts commented on HDFS-11755:
---

The failing unit tests in trunk have been unstable in precommit:
org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testMultipleVolFailuresOnNode
org.apache.hadoop.hdfs.TestDFSRSDefault10x4StripedOutputStreamWithFailure.testMultipleDatanodeFailure56
The timed out test TestLeaseRecovery2 does not fail locally and has also been 
unstable across multiple precommit runs on this jira.


> Underconstruction blocks can be considered missing
> --
>
> Key: HDFS-11755
> URL: https://issues.apache.org/jira/browse/HDFS-11755
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: HDFS-11755.001.patch, HDFS-11755.002.patch, 
> HDFS-11755-branch-2.002.patch, HDFS-11755-branch-2.8.002.patch
>
>
> The following sequence of events can lead to an under-construction block being 
> considered missing.
> - pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk so some updates take a long time
> - Client writes entire block and is waiting for final ack
> - DN1, DN2 and DN3 have all received the block 
> - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It 
> does eventually succeed but it is VERY slow at doing so. 
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
> DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - Drive containing the block on DN3 fails enough that the DN takes it offline 
> and notifies NN of failed volume
> - NN removes DN3's replica from the triplets and then declares the block 
> missing because there are no other replicas
> Seems like we shouldn't consider uncompleted blocks for replication.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-10 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-11755:
--
Attachment: HDFS-11755-branch-2.002.patch
HDFS-11755-branch-2.8.002.patch

branch-2 and branch-2.8 patches.

> Underconstruction blocks can be considered missing
> --
>
> Key: HDFS-11755
> URL: https://issues.apache.org/jira/browse/HDFS-11755
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: HDFS-11755.001.patch, HDFS-11755.002.patch, 
> HDFS-11755-branch-2.002.patch, HDFS-11755-branch-2.8.002.patch
>
>
> The following sequence of events can lead to an under-construction block being 
> considered missing.
> - pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk so some updates take a long time
> - Client writes entire block and is waiting for final ack
> - DN1, DN2 and DN3 have all received the block 
> - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It 
> does eventually succeed but it is VERY slow at doing so. 
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
> DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - Drive containing the block on DN3 fails enough that the DN takes it offline 
> and notifies NN of failed volume
> - NN removes DN3's replica from the triplets and then declares the block 
> missing because there are no other replicas
> Seems like we shouldn't consider uncompleted blocks for replication.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-10 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-11755:
--
Attachment: HDFS-11755.002.patch

- Fixed Checkstyle issues.
- Fixed testSetReplicationWhenBatchIBR because it was expecting a 
setReplication() on a file with only under-construction blocks to cause 
underReplicated counts to increase.

> Underconstruction blocks can be considered missing
> --
>
> Key: HDFS-11755
> URL: https://issues.apache.org/jira/browse/HDFS-11755
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: HDFS-11755.001.patch, HDFS-11755.002.patch
>
>
> The following sequence of events can lead to an under-construction block being 
> considered missing.
> - pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk so some updates take a long time
> - Client writes entire block and is waiting for final ack
> - DN1, DN2 and DN3 have all received the block 
> - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It 
> does eventually succeed but it is VERY slow at doing so. 
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
> DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - Drive containing the block on DN3 fails enough that the DN takes it offline 
> and notifies NN of failed volume
> - NN removes DN3's replica from the triplets and then declares the block 
> missing because there are no other replicas
> Seems like we shouldn't consider uncompleted blocks for replication.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-10 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-11755:
--
Status: Patch Available  (was: Open)

> Underconstruction blocks can be considered missing
> --
>
> Key: HDFS-11755
> URL: https://issues.apache.org/jira/browse/HDFS-11755
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: HDFS-11755.001.patch, HDFS-11755.002.patch
>
>
> The following sequence of events can lead to an under-construction block being 
> considered missing.
> - pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk so some updates take a long time
> - Client writes entire block and is waiting for final ack
> - DN1, DN2 and DN3 have all received the block 
> - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It 
> does eventually succeed but it is VERY slow at doing so. 
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
> DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - Drive containing the block on DN3 fails enough that the DN takes it offline 
> and notifies NN of failed volume
> - NN removes DN3's replica from the triplets and then declares the block 
> missing because there are no other replicas
> Seems like we shouldn't consider uncompleted blocks for replication.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-10 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-11755:
--
Status: Open  (was: Patch Available)

> Underconstruction blocks can be considered missing
> --
>
> Key: HDFS-11755
> URL: https://issues.apache.org/jira/browse/HDFS-11755
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: HDFS-11755.001.patch
>
>
> The following sequence of events can lead to an under-construction block being 
> considered missing.
> - pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk so some updates take a long time
> - Client writes entire block and is waiting for final ack
> - DN1, DN2 and DN3 have all received the block 
> - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It 
> does eventually succeed but it is VERY slow at doing so. 
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
> DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - Drive containing the block on DN3 fails enough that the DN takes it offline 
> and notifies NN of failed volume
> - NN removes DN3's replica from the triplets and then declares the block 
> missing because there are no other replicas
> Seems like we shouldn't consider uncompleted blocks for replication.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-09 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003459#comment-16003459
 ] 

Nathan Roberts commented on HDFS-11755:
---

bq. Do you know which one makes more sense?
Not an expert in this area but here's my understanding. When a block is 
completed and the client has received the necessary acks, the client either 
adds another block, or completes the file. Both cause the namenode to consider 
the block complete, and at that point the namenode will properly maintain 
replication of the completed block. If the pipeline fails while writing, the 
client may (depends on policy configured) rebuild the pipeline to maintain the 
desired level of replication in the pipeline. So, while a block is mutating, it 
is the client that is ultimately responsible for making sure enough datanodes 
remain in the pipeline and in-sync with the data. Once a block is complete, it 
becomes the namenode's responsibility to maintain replication. 

If a client dies and fails to complete the last block, after a timeout, lease 
recovery will cause the file to be closed and the blocks to be properly 
synchronized and committed if possible.  

There is also hsync(), which applications can use to enhance the durability 
guarantees at the datanode (via fsync).

Hope that helps a little.
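
For completeness, a minimal sketch of the hsync() call mentioned above (the path 
is just an example):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HsyncExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataOutputStream out = fs.create(new Path("/tmp/hsync-demo"))) {
      out.writeBytes("important record\n");
      // hsync() asks each datanode in the pipeline to flush and fsync the data,
      // strengthening durability for the bytes written so far.
      out.hsync();
    }
  }
}
{code}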


> Underconstruction blocks can be considered missing
> --
>
> Key: HDFS-11755
> URL: https://issues.apache.org/jira/browse/HDFS-11755
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: HDFS-11755.001.patch
>
>
> The following sequence of events can lead to an under-construction block being 
> considered missing.
> - pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk so some updates take a long time
> - Client writes entire block and is waiting for final ack
> - DN1, DN2 and DN3 have all received the block 
> - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It 
> does eventually succeed but it is VERY slow at doing so. 
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
> DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - Drive containing the block on DN3 fails enough that the DN takes it offline 
> and notifies NN of failed volume
> - NN removes DN3's replica from the triplets and then declares the block 
> missing because there are no other replicas
> Seems like we shouldn't consider uncompleted blocks for replication.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-09 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-11755:
--
Status: Patch Available  (was: Open)

v1 of the trunk patch. branch-2 will require a separate patch.

> Underconstruction blocks can be considered missing
> --
>
> Key: HDFS-11755
> URL: https://issues.apache.org/jira/browse/HDFS-11755
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: HDFS-11755.001.patch
>
>
> The following sequence of events can lead to an under-construction block being 
> considered missing.
> - pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk so some updates take a long time
> - Client writes entire block and is waiting for final ack
> - DN1, DN2 and DN3 have all received the block 
> - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It 
> does eventually succeed but it is VERY slow at doing so. 
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
> DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - Drive containing the block on DN3 fails enough that the DN takes it offline 
> and notifies NN of failed volume
> - NN removes DN3's replica from the triplets and then declares the block 
> missing because there are no other replicas
> Seems like we shouldn't consider uncompleted blocks for replication.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-09 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-11755:
--
Attachment: HDFS-11755.001.patch

> Underconstruction blocks can be considered missing
> --
>
> Key: HDFS-11755
> URL: https://issues.apache.org/jira/browse/HDFS-11755
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha2, 2.8.1
>Reporter: Nathan Roberts
>Assignee: Nathan Roberts
> Attachments: HDFS-11755.001.patch
>
>
> The following sequence of events can lead to an under-construction block being 
> considered missing.
> - pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk so some updates take a long time
> - Client writes entire block and is waiting for final ack
> - DN1, DN2 and DN3 have all received the block 
> - DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It 
> does eventually succeed but it is VERY slow at doing so. 
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
> DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - Drive containing the block on DN3 fails enough that the DN takes it offline 
> and notifies NN of failed volume
> - NN removes DN3's replica from the triplets and then declares the block 
> missing because there are no other replicas
> Seems like we shouldn't consider uncompleted blocks for replication.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11755) Underconstruction blocks can be considered missing

2017-05-04 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-11755:
-

 Summary: Underconstruction blocks can be considered missing
 Key: HDFS-11755
 URL: https://issues.apache.org/jira/browse/HDFS-11755
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0-alpha2, 2.8.1
Reporter: Nathan Roberts
Assignee: Nathan Roberts


The following sequence of events can lead to an under-construction block being 
considered missing.

- pipeline of 3 DNs, DN1->DN2->DN3
- DN3 has a failing disk so some updates take a long time
- Client writes entire block and is waiting for final ack
- DN1, DN2 and DN3 have all received the block 
- DN1 is waiting for ACK from DN2 who is waiting for ACK from DN3
- DN3 is having trouble finalizing the block due to the failing drive. It does 
eventually succeed but it is VERY slow at doing so. 
- DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so 
DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
- DN3 finally sends an IBR to the NN indicating the block has been received.
- Drive containing the block on DN3 fails enough that the DN takes it offline 
and notifies NN of failed volume
- NN removes DN3's replica from the triplets and then declares the block 
missing because there are no other replicas

Seems like we shouldn't consider uncompleted blocks for replication.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11752) getNonDfsUsed return 0 if reserved bigger than actualNonDfsUsed

2017-05-04 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996815#comment-15996815
 ] 

Nathan Roberts commented on HDFS-11752:
---

I think this is intentional. My understanding is that nonDFSUsed is supposed to 
be the amount of space on an HDFS volume that is consumed by non-HDFS entities. 
Since the reserved space is not available to HDFS, nonDFSUsed shouldn't include 
any usage covered by the reservation. 

Are you seeing issues on 2.7 or 3.0? There have been fixes in 2.8 and 3.0 that 
change some of the calculations in this area. 2.7 definitely had problems where 
it was not correctly calculating the amount of remaining space. The 2.8 
calculations seem correct (I didn't try 3.0, but as long as nothing regressed it 
should be OK as well). 

> getNonDfsUsed return 0 if reserved bigger than actualNonDfsUsed
> ---
>
> Key: HDFS-11752
> URL: https://issues.apache.org/jira/browse/HDFS-11752
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.7.1
>Reporter: maobaolong
>  Labels: datanode, hdfs
> Fix For: 2.7.1
>
>
> {code}
> public long getNonDfsUsed() throws IOException {
> long actualNonDfsUsed = getActualNonDfsUsed();
> if (actualNonDfsUsed < reserved) {
>   return 0L;
> }
> return actualNonDfsUsed - reserved;
>   }
> {code}
> The code block above is the function that calculates nonDfsUsed, but in fact it 
> can unexpectedly return 0L. Consider the following situation:
> du.reserved  = 50G
> Disk Capacity = 2048G
> Disk Available = 2000G
> Dfs used = 30G
> usage.getUsed() = dirFile.getTotalSpace() - dirFile.getFreeSpace()
> = 2048G - 2000G
> = 48G
> getActualNonDfsUsed  =  usage.getUsed() - getDfsUsed()
>   =  48G - 30G
>   = 18G
> 18G < 50G, so in getNonDfsUsed we have actualNonDfsUsed < reserved, and 
> nonDfsUsed is reported as 0. Does that logic make sense?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11661) GetContentSummary uses excessive amounts of memory

2017-04-20 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977163#comment-15977163
 ] 

Nathan Roberts edited comment on HDFS-11661 at 4/20/17 6:05 PM:


[~jojochuang], sure, occasional `du -s` of very large directory trees, think 
100s of millions of files/directories. 





was (Author: nroberts):
[~jojochuang], sure, occasional `du -s` of very large directory trees, think 
many 100s of millions of files/directories. 




> GetContentSummary uses excessive amounts of memory
> --
>
> Key: HDFS-11661
> URL: https://issues.apache.org/jira/browse/HDFS-11661
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 3.0.0-alpha2
>Reporter: Nathan Roberts
>Priority: Blocker
> Attachments: Heap growth.png
>
>
> ContentSummaryComputationContext::nodeIncluded() is being used to keep track 
> of all INodes visited during the current content summary calculation. This 
> can be all of the INodes in the filesystem, making for a VERY large hash 
> table. This simply won't work on large filesystems. 
> We noticed this after an upgrade: a namenode with ~100 million filesystem 
> objects was spending significantly more time in GC. Fortunately this system 
> had some memory breathing room; other clusters we have will not run with this 
> additional demand on memory.
> This was added as part of HDFS-10797 as a way of keeping track of INodes that 
> have already been accounted for - to avoid double counting.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11661) GetContentSummary uses excessive amounts of memory

2017-04-20 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977163#comment-15977163
 ] 

Nathan Roberts commented on HDFS-11661:
---

[~jojochuang], sure, occasional `du -s` of very large directory trees, think 
many 100s of millions of files/directories. 




> GetContentSummary uses excessive amounts of memory
> --
>
> Key: HDFS-11661
> URL: https://issues.apache.org/jira/browse/HDFS-11661
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 3.0.0-alpha2
>Reporter: Nathan Roberts
>Priority: Blocker
> Attachments: Heap growth.png
>
>
> ContentSummaryComputationContext::nodeIncluded() is being used to keep track 
> of all INodes visited during the current content summary calculation. This 
> can be all of the INodes in the filesystem, making for a VERY large hash 
> table. This simply won't work on large filesystems. 
> We noticed this after an upgrade: a namenode with ~100 million filesystem 
> objects was spending significantly more time in GC. Fortunately this system 
> had some memory breathing room; other clusters we have will not run with this 
> additional demand on memory.
> This was added as part of HDFS-10797 as a way of keeping track of INodes that 
> have already been accounted for - to avoid double counting.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10797) Disk usage summary of snapshots causes renamed blocks to get counted twice

2017-04-18 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973105#comment-15973105
 ] 

Nathan Roberts commented on HDFS-10797:
---

After deploying this to a cluster with a few hundred nodes, we have discovered 
that this jira has caused significant memory bloat in the namenode. Filed a 2.8.1 
blocker for this issue: HDFS-11661. 

> Disk usage summary of snapshots causes renamed blocks to get counted twice
> --
>
> Key: HDFS-10797
> URL: https://issues.apache.org/jira/browse/HDFS-10797
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: snapshots
>Affects Versions: 2.8.0
>Reporter: Sean Mackrory
>Assignee: Sean Mackrory
> Fix For: 2.8.0, 3.0.0-alpha2
>
> Attachments: HDFS-10797.001.patch, HDFS-10797.002.patch, 
> HDFS-10797.003.patch, HDFS-10797.004.patch, HDFS-10797.005.patch, 
> HDFS-10797.006.patch, HDFS-10797.007.patch, HDFS-10797.008.patch, 
> HDFS-10797.009.patch, HDFS-10797.010.patch, HDFS-10797.010.patch
>
>
> DirectoryWithSnapshotFeature.computeContentSummary4Snapshot calculates how 
> much disk usage is used by a snapshot by tallying up the files in the 
> snapshot that have since been deleted (that way it won't overlap with regular 
> files whose disk usage is computed separately). However that is determined 
> from a diff that shows moved (to Trash or otherwise) or renamed files as a 
> deletion and a creation operation that may overlap with the list of blocks. 
> Only the deletion operation is taken into consideration, and this causes 
> those blocks to get represented twice in the disk usage tallying.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11661) GetContentSummary uses excessive amounts of memory

2017-04-17 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-11661:
-

 Summary: GetContentSummary uses excessive amounts of memory
 Key: HDFS-11661
 URL: https://issues.apache.org/jira/browse/HDFS-11661
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.8.0
Reporter: Nathan Roberts
Priority: Blocker


ContentSummaryComputationContext::nodeIncluded() is being used to keep track of 
all INodes visited during the current content summary calculation. This can be 
all of the INodes in the filesystem, making for a VERY large hash table. This 
simply won't work on large filesystems. 

We noticed this after upgrading: a namenode with ~100 million filesystem objects 
was spending significantly more time in GC. Fortunately this system had some 
memory breathing room; other clusters we run will not tolerate this additional 
demand on memory.

This was added as part of HDFS-10797 as a way of keeping track of INodes that 
have already been accounted for - to avoid double counting.
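
As a rough, purely illustrative back-of-envelope (assumed per-entry HashSet 
overhead, not numbers taken from the heap dump), tracking every visited INode in 
a hash table costs on the order of gigabytes at this scale:

{noformat}
// Illustrative sketch only -- assumes ~48 bytes of HashSet/HashMap.Node overhead
// per visited INode on a 64-bit JVM, before counting the INode objects themselves.
public class VisitedSetCost {
  public static void main(String[] args) {
    long inodes = 100_000_000L;          // ~100M filesystem objects, as above
    long bytesPerEntry = 48L;            // assumed per-entry hash table overhead
    System.out.printf("~%.1f GB of extra heap just to remember visited INodes%n",
        inodes * bytesPerEntry / 1e9);   // prints ~4.8 GB
  }
}
{noformat}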



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-4660) Block corruption can happen during pipeline recovery

2016-08-25 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-4660:
-
Attachment: periodic_hflush.patch

> Block corruption can happen during pipeline recovery
> 
>
> Key: HDFS-4660
> URL: https://issues.apache.org/jira/browse/HDFS-4660
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.0.3-alpha, 3.0.0-alpha1
>Reporter: Peng Zhang
>Assignee: Kihwal Lee
>Priority: Blocker
> Fix For: 2.7.1, 2.6.4
>
> Attachments: HDFS-4660.br26.patch, HDFS-4660.patch, HDFS-4660.patch, 
> HDFS-4660.v2.patch, periodic_hflush.patch
>
>
> pipeline DN1  DN2  DN3
> stop DN2
> pipeline added node DN4 located at 2nd position
> DN1  DN4  DN3
> recover RBW
> DN4 after recover rbw
> 2013-04-01 21:02:31,570 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
> RBW replica 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
> 2013-04-01 21:02:31,570 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
>   getNumBytes() = 134144
>   getBytesOnDisk() = 134144
>   getVisibleLength()= 134144
> end at chunk (134144/512=262)
> DN3 after recover rbw
> 2013-04-01 21:02:31,575 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
> RBW replica 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_10042013-04-01
>  21:02:31,575 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
>   getNumBytes() = 134028 
>   getBytesOnDisk() = 134028
>   getVisibleLength()= 134028
> client send packet after recover pipeline
> offset=133632  len=1008
> DN4 after flush 
> 2013-04-01 21:02:31,779 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
> offset:134640; meta offset:1063
> // meta end position should be ceil(134640/512)*4 + 7 == 1059, but now it is 
> 1063.
> DN3 after flush
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, 
> type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, 
> lastPacketInBlock=false, offsetInBlock=134640, 
> ackEnqueueNanoTime=8817026136871545)
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing 
> meta file offset of block 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 
> 1055 to 1051
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
> offset:134640; meta offset:1059
> After checking the meta file on DN4, I found that the checksum of chunk 262 is 
> duplicated, but the data is not.
> Later, after the block was finalized, DN4's scanner detected the bad block and 
> reported it to the NN. The NN sent a command to delete this block and 
> re-replicate it from another DN in the pipeline to satisfy the replication factor.
> I think this is because BlockReceiver skips the data bytes already written, 
> but does not skip the checksum bytes already written. And the function 
> adjustCrcFilePosition is only used for the last non-completed chunk, 
> not for this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2016-08-25 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437036#comment-15437036
 ] 

Nathan Roberts commented on HDFS-4660:
--

Hi [~yzhangal]. Had to go back to an old git stash, but I'll attach a sample 
patch to TeraOutputFormat.

> Block corruption can happen during pipeline recovery
> 
>
> Key: HDFS-4660
> URL: https://issues.apache.org/jira/browse/HDFS-4660
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.0.3-alpha, 3.0.0-alpha1
>Reporter: Peng Zhang
>Assignee: Kihwal Lee
>Priority: Blocker
> Fix For: 2.7.1, 2.6.4
>
> Attachments: HDFS-4660.br26.patch, HDFS-4660.patch, HDFS-4660.patch, 
> HDFS-4660.v2.patch
>
>
> pipeline DN1  DN2  DN3
> stop DN2
> pipeline added node DN4 located at 2nd position
> DN1  DN4  DN3
> recover RBW
> DN4 after recover rbw
> 2013-04-01 21:02:31,570 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
> RBW replica 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
> 2013-04-01 21:02:31,570 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
>   getNumBytes() = 134144
>   getBytesOnDisk() = 134144
>   getVisibleLength()= 134144
> end at chunk (134144/512=262)
> DN3 after recover rbw
> 2013-04-01 21:02:31,575 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
> RBW replica 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_10042013-04-01
>  21:02:31,575 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
>   getNumBytes() = 134028 
>   getBytesOnDisk() = 134028
>   getVisibleLength()= 134028
> client send packet after recover pipeline
> offset=133632  len=1008
> DN4 after flush 
> 2013-04-01 21:02:31,779 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
> offset:134640; meta offset:1063
> // meta end position should be ceil(134640/512)*4 + 7 == 1059, but now it is 
> 1063.
> DN3 after flush
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, 
> type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, 
> lastPacketInBlock=false, offsetInBlock=134640, 
> ackEnqueueNanoTime=8817026136871545)
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing 
> meta file offset of block 
> BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 
> 1055 to 1051
> 2013-04-01 21:02:31,782 DEBUG 
> org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
> offset:134640; meta offset:1059
> After checking the meta file on DN4, I found that the checksum of chunk 262 is 
> duplicated, but the data is not.
> Later, after the block was finalized, DN4's scanner detected the bad block and 
> reported it to the NN. The NN sent a command to delete this block and 
> re-replicate it from another DN in the pipeline to satisfy the replication factor.
> I think this is because BlockReceiver skips the data bytes already written, 
> but does not skip the checksum bytes already written. And the function 
> adjustCrcFilePosition is only used for the last non-completed chunk, 
> not for this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-30 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218470#comment-15218470
 ] 

Nathan Roberts commented on HDFS-9239:
--

bq. Just to make sure I'm clear, are you talking about configuring the deadline 
scheduler as described here?

Yes, those links are talking about the right parameters. 

We currently run with read_expire=1000, write_expire=1000, and 
writes_starved=1. Since our I/O workloads change dramatically over time, we 
didn't spend a lot of time looking for optimal values here. These have been 
working well for the last several months across multiple clusters.

As an aside, a relatively easy way to reproduce this problem is to put a heavy 
seek load on all the disks of a datanode (e.g. 
http://www.linuxinsight.com/how_fast_is_your_disk.html; I believe 5-10 copies 
of seeker were sufficient). After a minute or so, the system becomes almost 
unusable and the datanode will be declared lost. This might be a good test to 
run against the lifeline protocol. My hunch is that, with CFQ, the datanode 
will still be lost. 
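
For anyone who wants to try the same tunables, here is a minimal, hypothetical 
sketch of setting them programmatically (the device name, class name, and the 
choice to do this from Java are all illustrative; the sysfs paths are the 
standard deadline-scheduler knobs and require root):

{noformat}
// Minimal sketch, assuming a block device such as "sdb" -- writes the standard
// deadline I/O scheduler tunables mentioned above via sysfs. Requires root.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DeadlineTuner {
  static void set(String path, String value) throws IOException {
    Files.write(Paths.get(path), value.getBytes());
  }

  public static void main(String[] args) throws IOException {
    String dev = args.length > 0 ? args[0] : "sdb";   // example device name
    String q = "/sys/block/" + dev + "/queue/";
    set(q + "scheduler", "deadline");                 // switch away from CFQ
    set(q + "iosched/read_expire", "1000");           // values quoted above
    set(q + "iosched/write_expire", "1000");
    set(q + "iosched/writes_starved", "1");
  }
}
{noformat}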

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9239) DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness

2016-03-30 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218302#comment-15218302
 ] 

Nathan Roberts commented on HDFS-9239:
--

bq. However,making it lighter on the datanode side is a good idea. We have seen 
many cases where nodes are declared dead because the service actor thread is 
delayed/blocked. 

Just a quick update on this comment. Even after HDFS-7060 we still had cases 
where Datanodes would fail to heartbeat in. We eventually tracked this down to 
the RHEL CFQ I/O scheduler. There are situations where significant seek 
activity (like a massive shuffle) can cause this I/O scheduler to indefinitely 
starve writers. This eventually causes the datanode and/or nodemanager 
processes to completely stop (probably due to logging I/O backing up). So, no 
matter how smart we make these daemons, they are going to be lost from the 
NN/RM point of view in these situations. But this is actually probably the 
right thing to do in these cases: these daemons are clearly not able to do 
their job, so they SHOULD be declared lost. 

In any event, the change which we found most valuable for this situation was to 
use the deadline I/O scheduler. This dramatically improved the number of lost 
datanodes and nodemanagers we were seeing.
 

> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> ---
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch, 
> HDFS-9239.002.patch, HDFS-9239.003.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-4946) Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable

2016-02-18 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-4946:
-
Affects Version/s: 3.0.0
   2.7.2
 Target Version/s: 3.0.0, 2.8.0
   Status: Patch Available  (was: Reopened)

> Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable
> ---
>
> Key: HDFS-4946
> URL: https://issues.apache.org/jira/browse/HDFS-4946
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.2, 2.0.0-alpha, 3.0.0
>Reporter: James Kinley
>Assignee: James Kinley
> Attachments: HDFS-4946-1.patch, HDFS-4946-2.patch
>
>
> Allow preferLocalNode in BlockPlacementPolicyDefault to be disabled in 
> configuration to prevent a client from writing the first replica of every 
> block (i.e. the entire file) to the local DataNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-4946) Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable

2016-02-18 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts reopened HDFS-4946:
--

[~jrkinley], re-opening because this is a very useful patch. Let me know if you 
disagree or would like me to assign it to myself to close out any remaining 
issues.

> Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable
> ---
>
> Key: HDFS-4946
> URL: https://issues.apache.org/jira/browse/HDFS-4946
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: James Kinley
>Assignee: James Kinley
> Attachments: HDFS-4946-1.patch, HDFS-4946-2.patch
>
>
> Allow preferLocalNode in BlockPlacementPolicyDefault to be disabled in 
> configuration to prevent a client from writing the first replica of every 
> block (i.e. the entire file) to the local DataNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-4946) Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable

2016-02-18 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-4946:
-
Attachment: HDFS-4946-2.patch

Uploaded a new version of this patch for trunk. We have found this config to be 
extremely useful across many large clusters. It avoids hot-spots for large 
files that can be quite problematic during localization and/or task scheduling.

Hopefully folks will be agreeable to this simple config option.

> Allow preferLocalNode in BlockPlacementPolicyDefault to be configurable
> ---
>
> Key: HDFS-4946
> URL: https://issues.apache.org/jira/browse/HDFS-4946
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: James Kinley
>Assignee: James Kinley
> Attachments: HDFS-4946-1.patch, HDFS-4946-2.patch
>
>
> Allow preferLocalNode in BlockPlacementPolicyDefault to be disabled in 
> configuration to prevent a client from writing the first replica of every 
> block (i.e. the entire file) to the local DataNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2015-11-20 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15018828#comment-15018828
 ] 

Nathan Roberts commented on HDFS-7060:
--

Has anyone given any further thought to this patch? It seems safe to me, and it 
eliminates serious stability issues when datanodes' disks get very busy. 
_getUsed()_ returns a value that is calculated by a long-running DU process 
anyway (so it's always somewhat out of sync). It's not very difficult to load up 
a datanode with disk-intensive tasks that prevent it from getting a heartbeat in 
for several minutes, eventually being declared dead by the NN. We've seen this 
take out entire clusters during large Map/Reduce merges, as well as very large 
shuffles. 
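
To be explicit about why this looks safe to me, here is a toy sketch (not the 
actual HDFS-7060 patch; names are mine) of the idea that the heartbeat can read 
a value the DU thread has already published, without taking the dataset lock:

{noformat}
// Toy sketch only: the periodic DU refreshes a cached value; the heartbeat path
// reads it lock-free instead of entering the FsDatasetImpl monitor.
import java.util.concurrent.atomic.AtomicLong;

public class CachedDfsUsed {
  private final AtomicLong dfsUsed = new AtomicLong();

  /** Called by the slow, periodic DU thread. */
  public void refresh(long bytesUsed) {
    dfsUsed.set(bytesUsed);
  }

  /** Called from the heartbeat path -- no dataset lock needed. */
  public long getDfsUsed() {
    return dfsUsed.get();
  }
}
{noformat}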

> Avoid taking locks when sending heartbeats from the DataNode
> 
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haohui Mai
>Assignee: Xinwei Qin 
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7060-002.patch, HDFS-7060.000.patch, 
> HDFS-7060.001.patch
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under heavy load of writes:
> {noformat}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-11-19 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15014463#comment-15014463
 ] 

Nathan Roberts commented on HDFS-8791:
--

Thanks [~ctrezzo] for the patch! Nice writeup on the verification/performance 
measurements.
+1 (non-binding) on the patch. It's nice how concise it was able to be. 


> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Critical
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> HDFS-8791-trunk-v1.patch
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> basically cause the disks to seek for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8873) throttle directoryScanner

2015-09-24 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906603#comment-14906603
 ] 

Nathan Roberts commented on HDFS-8873:
--

Thanks [~templedf]. I like that the stopwatch class makes this much cleaner. 
Just a couple of comments:
- Shouldn't the isInterrupted() check throw an InterruptedException? Otherwise 
won't we just break out of one level? It would probably be good to test 
shutdown on an actual cluster if possible because you're exactly right that we 
could be in here a long time and it would be good to make sure we don't affect 
shutdown of the datanode. This has been a problem in the past and can have a 
serious impact on rolling upgrades.
- Nit, but I find markRunning() and markWaiting() confusing (they seem backwards 
to me because we call markRunning() just before going to sleep).
- I'm kind of wondering if we should disallow extremely low duty cycles. Seems 
like it could take close to 24 hours at the minimum setting. A minimum of 20% 
should keep us within an hour (rough arithmetic below).
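
Rough arithmetic behind that last point, taking the ~655 second cold-scan figure 
from the issue description as the baseline and assuming wall time simply scales 
as scanTime/dutyCycle (the duty cycles below are illustrative, not values from 
the patch):

{noformat}
// Rough arithmetic only -- 655 s is the cold 64K-seek full scan from the
// description; wall time is assumed to be scanTime / dutyCycle.
public class ScanTimeEstimate {
  public static void main(String[] args) {
    double fullScanSecs = 655.0;
    for (double duty : new double[] {1.0, 0.20, 0.01}) {
      System.out.printf("duty=%.0f%% -> ~%.1f hours%n",
          duty * 100, fullScanSecs / duty / 3600.0);  // 0.2 h, 0.9 h, 18.2 h
    }
  }
}
{noformat}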

> throttle directoryScanner
> -
>
> Key: HDFS-8873
> URL: https://issues.apache.org/jira/browse/HDFS-8873
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.7.1
>Reporter: Nathan Roberts
>Assignee: Daniel Templeton
> Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, 
> HDFS-8873.003.patch, HDFS-8873.004.patch, HDFS-8873.005.patch, 
> HDFS-8873.006.patch, HDFS-8873.007.patch, HDFS-8873.008.patch
>
>
> The new 2-level directory layout can make directory scans expensive in terms 
> of disk seeks (see HDFS-8791) for details. 
> It would be good if the directoryScanner() had a configurable duty cycle that 
> would reduce its impact on disk performance (much like the approach in 
> HDFS-8617). 
> Without such a throttle, disks can go 100% busy for many minutes at a time 
> (assuming the common case of all inodes in cache but no directory blocks 
> cached, 64K seeks are required for full directory listing which translates to 
> 655 seconds) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8873) throttle directoryScanner

2015-09-24 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14906919#comment-14906919
 ] 

Nathan Roberts commented on HDFS-8873:
--

Thanks [~templedf] for the update! I'm +1 (non-binding) for v9 of the patch.

> throttle directoryScanner
> -
>
> Key: HDFS-8873
> URL: https://issues.apache.org/jira/browse/HDFS-8873
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.7.1
>Reporter: Nathan Roberts
>Assignee: Daniel Templeton
> Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, 
> HDFS-8873.003.patch, HDFS-8873.004.patch, HDFS-8873.005.patch, 
> HDFS-8873.006.patch, HDFS-8873.007.patch, HDFS-8873.008.patch, 
> HDFS-8873.009.patch
>
>
> The new 2-level directory layout can make directory scans expensive in terms 
> of disk seeks (see HDFS-8791) for details. 
> It would be good if the directoryScanner() had a configurable duty cycle that 
> would reduce its impact on disk performance (much like the approach in 
> HDFS-8617). 
> Without such a throttle, disks can go 100% busy for many minutes at a time 
> (assuming the common case of all inodes in cache but no directory blocks 
> cached, 64K seeks are required for full directory listing which translates to 
> 655 seconds) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8873) throttle directoryScanner

2015-09-23 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904605#comment-14904605
 ] 

Nathan Roberts commented on HDFS-8873:
--

Thanks [~templedf] for the update. I am sorry I haven't had a chance to review 
yet. I plan to do this Thursday AM. 

> throttle directoryScanner
> -
>
> Key: HDFS-8873
> URL: https://issues.apache.org/jira/browse/HDFS-8873
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.7.1
>Reporter: Nathan Roberts
>Assignee: Daniel Templeton
> Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, 
> HDFS-8873.003.patch, HDFS-8873.004.patch, HDFS-8873.005.patch
>
>
> The new 2-level directory layout can make directory scans expensive in terms 
> of disk seeks (see HDFS-8791) for details. 
> It would be good if the directoryScanner() had a configurable duty cycle that 
> would reduce its impact on disk performance (much like the approach in 
> HDFS-8617). 
> Without such a throttle, disks can go 100% busy for many minutes at a time 
> (assuming the common case of all inodes in cache but no directory blocks 
> cached, 64K seeks are required for full directory listing which translates to 
> 655 seconds) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8873) throttle directoryScanner

2015-09-18 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876736#comment-14876736
 ] 

Nathan Roberts commented on HDFS-8873:
--

bq. Sure. It won't matter since stop() is only called by shutdown(), which 
first sets shouldRunCompile to false. But for correctness, you're right.
I was looking at v3 of the patch; I see you already fixed this in v4. Sorry for 
the noise.

bq. The majority of the patch is refactoring the report compilers so that they 
can be throttled at all. The additional code to do the throttling isn't much. 
It's more formal than just a sleep, but it's also more testable and extensible.
OK, I'll have to look at that a little deeper. I thought we were basically 
hitting FileUtil.listFiles(dir) in rapid succession in the original code, so it 
felt like the simplest thing to do was a variable sleep right there, based on 
the configured duty cycle. 

I need to look more into how the scan-job queue is working. It seems like all 
the worker threads could be working on the same volume, which doesn't seem like 
what we want. (It seems like we want every volume spending its duty cycle 
scanning, but I didn't catch how that was the case.)





> throttle directoryScanner
> -
>
> Key: HDFS-8873
> URL: https://issues.apache.org/jira/browse/HDFS-8873
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.7.1
>Reporter: Nathan Roberts
>Assignee: Daniel Templeton
> Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, 
> HDFS-8873.003.patch, HDFS-8873.004.patch
>
>
> The new 2-level directory layout can make directory scans expensive in terms 
> of disk seeks (see HDFS-8791) for details. 
> It would be good if the directoryScanner() had a configurable duty cycle that 
> would reduce its impact on disk performance (much like the approach in 
> HDFS-8617). 
> Without such a throttle, disks can go 100% busy for many minutes at a time 
> (assuming the common case of all inodes in cache but no directory blocks 
> cached, 64K seeks are required for full directory listing which translates to 
> 655 seconds) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8873) throttle directoryScanner

2015-09-18 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876500#comment-14876500
 ] 

Nathan Roberts commented on HDFS-8873:
--

Thanks [~templedf] for the patch. A few comments.

hdfs-default.xml
- I personally would prefer the default to be 1000. In my mind 0 is a special 
out-of-range condition that we're allowing to mean "full rate". Just reading 
the default of 0 and then the first sentence of the description could easily 
lead one to believe the report threads are effectively off by default. 
Test
- Any way to avoid the sleep(5000) in the test? Our tests already take a really 
long time, so anytime we can avoid sleeping it's better. Maybe wait at most 5 
seconds for timeWaitingMs.get() to become > 0
directoryScannerThrottle
- Shouldn't stop() call resume() instead of just notifyAll()? Will cycle() get 
out if we try to shut down while in that wait?

- Did we hit this problem with too big of a hammer? Couldn't cycle() be 
implemented with a simple sleep? For example, with a 75% duty cycle, 
{noformat}
n = Time.monotonicNow() % 1000;
if (n > 1000 * 0.75) sleep(1000- n)
{noformat} 
Seems like it could be as simple as a config and a couple of lines of code. 
Maybe I'm missing something or there are grander plans for the throttle.
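
To make that concrete, a self-contained sketch of the same idea (plain JDK clock 
instead of Hadoop's Time.monotonicNow(); class and method names are mine, not 
from the patch):

{noformat}
// Sketch of the sleep-based duty cycle suggested above, not the HDFS-8873 patch.
public final class SimpleDutyCycle {
  private final double dutyCycle;                 // e.g. 0.75 = run 75% of each second

  public SimpleDutyCycle(double dutyCycle) {
    this.dutyCycle = dutyCycle;
  }

  /** Call from the scan loop; blocks for the idle slice of the current second. */
  public void maybeSleep() throws InterruptedException {
    long n = System.currentTimeMillis() % 1000L;  // position within this second
    if (n > (long) (1000 * dutyCycle)) {
      Thread.sleep(1000L - n);                    // idle until the next second starts
    }
  }
}
{noformat}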



> throttle directoryScanner
> -
>
> Key: HDFS-8873
> URL: https://issues.apache.org/jira/browse/HDFS-8873
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.7.1
>Reporter: Nathan Roberts
>Assignee: Daniel Templeton
> Attachments: HDFS-8873.001.patch, HDFS-8873.002.patch, 
> HDFS-8873.003.patch, HDFS-8873.004.patch
>
>
> The new 2-level directory layout can make directory scans expensive in terms 
> of disk seeks (see HDFS-8791) for details. 
> It would be good if the directoryScanner() had a configurable duty cycle that 
> would reduce its impact on disk performance (much like the approach in 
> HDFS-8617). 
> Without such a throttle, disks can go 100% busy for many minutes at a time 
> (assuming the common case of all inodes in cache but no directory blocks 
> cached, 64K seeks are required for full directory listing which translates to 
> 655 seconds) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8894) Set SO_KEEPALIVE on DN server sockets

2015-08-13 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-8894:


 Summary: Set SO_KEEPALIVE on DN server sockets
 Key: HDFS-8894
 URL: https://issues.apache.org/jira/browse/HDFS-8894
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.7.1
Reporter: Nathan Roberts


SO_KEEPALIVE is not set on things like datastreamer sockets which can cause 
lingering ESTABLISHED sockets when there is a network glitch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8873) throttle directoryScanner

2015-08-07 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-8873:


 Summary: throttle directoryScanner
 Key: HDFS-8873
 URL: https://issues.apache.org/jira/browse/HDFS-8873
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.7.1
Reporter: Nathan Roberts


The new 2-level directory layout can make directory scans expensive in terms of 
disk seeks (see HDFS-8791) for details. 

It would be good if the directoryScanner() had a configurable duty cycle that 
would reduce its impact on disk performance (much like the approach in 
HDFS-8617). 

Without such a throttle, disks can go 100% busy for many minutes at a time 
(assuming the common case of all inodes in cache but no directory blocks 
cached, 64K seeks are required for full directory listing which translates to 
655 seconds) 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-08-04 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654426#comment-14654426
 ] 

Nathan Roberts commented on HDFS-8791:
--

My preference would be to take a smaller incremental step.

How about:
- New layout where n x m levels are configurable (today 256x256)
- n x m is recorded in version file
- Upgrade path is taken if configured n x m is different from n x m in VERSION 
file

Seems like most of the code will work without too much modification (and the 
risk that comes with it).

I fear if we try to take too much of a step at this point, it will take 
significant time to settle on the new layout, and then it will end up being 
either extremely close to what we have now OR it will be radically different 
and require a lot of investment of time and resources to even get there.

In other words, I think we need a short-term layout change that is low-risk and 
quick to integrate.
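
To illustrate the shape of the change (purely a sketch with made-up names, not a 
patch): the path computation just needs the two fan-outs as parameters instead of 
the hard-coded 256x256.

{noformat}
// Sketch only (class/method names are made up). The current layout keys the two
// directory levels off bits of the block ID; a configurable n x m variant can do
// the same with modular arithmetic on the shifted ID.
public class ConfigurableIdToDir {
  static String idToBlockDir(long blockId, int level1Dirs, int level2Dirs) {
    long d1 = Math.floorMod(blockId >> 16, (long) level1Dirs);  // first-level bucket
    long d2 = Math.floorMod(blockId >> 8, (long) level2Dirs);   // second-level bucket
    return "subdir" + d1 + "/subdir" + d2;                      // e.g. subdir17/subdir203
  }
}
{noformat}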



 block ID-based DN storage layout can be very slow for datanode on ext4
 --

 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Priority: Critical

 We are seeing cases where the new directory layout causes the datanode to 
 basically cause the disks to seek for 10s of minutes. This can be when the 
 datanode is running du, and it can also be when it is performing a 
 checkDirs(). Both of these operations currently scan all directories in the 
 block pool and that's very expensive in the new layout.
 The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
 leaf directories where block files are placed.
 So, what we have on disk is:
 - 256 inodes for the first level directories
 - 256 directory blocks for the first level directories
 - 256*256 inodes for the second level directories
 - 256*256 directory blocks for the second level directories
 - Then the inodes and blocks to store the HDFS blocks themselves.
 The main problem is the 256*256 directory blocks. 
 inodes and dentries will be cached by linux and one can configure how likely 
 the system is to prune those entries (vfs_cache_pressure). However, ext4 
 relies on the buffer cache to cache the directory blocks and I'm not aware of 
 any way to tell linux to favor buffer cache pages (even if it did I'm not 
 sure I would want it to in general).
 Also, ext4 tries hard to spread directories evenly across the entire volume, 
 this basically means the 64K directory blocks are probably randomly spread 
 across the entire disk. A du type scan will look at directories one at a 
 time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
 seeks will be random and far. 
 In a system I was using to diagnose this, I had 60K blocks. A DU when things 
 are hot is less than 1 second. When things are cold, about 20 minutes.
 How do things get cold?
 - A large set of tasks run on the node. This pushes almost all of the buffer 
 cache out, causing the next DU to hit this situation. We are seeing cases 
 where a large job can cause a seek storm across the entire cluster.
 Why didn't the previous layout see this?
 - It might have but it wasn't nearly as pronounced. The previous layout would 
 be a few hundred directory blocks. Even when completely cold, these would 
 only take a few hundred seeks, which would mean single-digit seconds.  
 - With only a few hundred directories, the odds of the directory blocks 
 getting modified is quite high, this keeps those blocks hot and much less 
 likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-08-04 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653974#comment-14653974
 ] 

Nathan Roberts commented on HDFS-8791:
--

Curious what folks would think about going back to the previous layout? I 
understand there was some benefit to the new layout, but maybe there are nearly 
equivalent and less intrusive ways to achieve the same benefits. I'm confident 
the current layout is going to cause significant performance issues for HDFS, 
and latency-sensitive applications (e.g. HBase) are going to feel this in a big 
way.



 block ID-based DN storage layout can be very slow for datanode on ext4
 --

 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Priority: Critical

 We are seeing cases where the new directory layout causes the datanode to 
 basically cause the disks to seek for 10s of minutes. This can be when the 
 datanode is running du, and it can also be when it is performing a 
 checkDirs(). Both of these operations currently scan all directories in the 
 block pool and that's very expensive in the new layout.
 The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
 leaf directories where block files are placed.
 So, what we have on disk is:
 - 256 inodes for the first level directories
 - 256 directory blocks for the first level directories
 - 256*256 inodes for the second level directories
 - 256*256 directory blocks for the second level directories
 - Then the inodes and blocks to store the HDFS blocks themselves.
 The main problem is the 256*256 directory blocks. 
 inodes and dentries will be cached by linux and one can configure how likely 
 the system is to prune those entries (vfs_cache_pressure). However, ext4 
 relies on the buffer cache to cache the directory blocks and I'm not aware of 
 any way to tell linux to favor buffer cache pages (even if it did I'm not 
 sure I would want it to in general).
 Also, ext4 tries hard to spread directories evenly across the entire volume, 
 this basically means the 64K directory blocks are probably randomly spread 
 across the entire disk. A du type scan will look at directories one at a 
 time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
 seeks will be random and far. 
 In a system I was using to diagnose this, I had 60K blocks. A DU when things 
 are hot is less than 1 second. When things are cold, about 20 minutes.
 How do things get cold?
 - A large set of tasks run on the node. This pushes almost all of the buffer 
 cache out, causing the next DU to hit this situation. We are seeing cases 
 where a large job can cause a seek storm across the entire cluster.
 Why didn't the previous layout see this?
 - It might have but it wasn't nearly as pronounced. The previous layout would 
 be a few hundred directory blocks. Even when completely cold, these would 
 only take a few hundred seeks, which would mean single-digit seconds.  
 - With only a few hundred directories, the odds of the directory blocks 
 getting modified is quite high, this keeps those blocks hot and much less 
 likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6407) new namenode UI, lost ability to sort columns in datanode tab

2015-07-21 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635299#comment-14635299
 ] 

Nathan Roberts commented on HDFS-6407:
--

My understanding is that the legacy UI was removed in 2.7. With the legacy UI 
gone, we've lost very valuable functionality. I use the sort capability all of 
the time to do things like: find nodes running different versions during a 
rolling upgrade, evaluate how the balancer is doing by sorting on capacity, 
find very full nodes to see how their disks are performing, and sort on Admin 
state to find all decommissioning nodes. I don't think it's a blocker for a 
release, but a loss of commonly used functionality can be very annoying for 
users.  

 new namenode UI, lost ability to sort columns in datanode tab
 -

 Key: HDFS-6407
 URL: https://issues.apache.org/jira/browse/HDFS-6407
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Nathan Roberts
Assignee: Benoy Antony
Priority: Minor
  Labels: BB2015-05-TBR
 Attachments: 002-datanodes-sorted-capacityUsed.png, 
 002-datanodes.png, 002-filebrowser.png, 002-snapshots.png, 
 HDFS-6407-002.patch, HDFS-6407-003.patch, HDFS-6407.patch, 
 browse_directory.png, datanodes.png, snapshots.png


 old ui supported clicking on column header to sort on that column. The new ui 
 seems to have dropped this very useful feature.
 There are a few tables in the Namenode UI to display  datanodes information, 
 directory listings and snapshots.
 When there are many items in the tables, it is useful to have ability to sort 
 on the different columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-07-21 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635908#comment-14635908
 ] 

Nathan Roberts commented on HDFS-8791:
--

bq. I'm having trouble understanding these kernel settings. 
http://www.gluster.org/community/documentation/index.php/Linux_Kernel_Tuning 
says that When vfs_cache_pressure=0, the kernel will never reclaim dentries 
and inodes due to memory pressure and this can easily lead to out-of-memory 
conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to 
prefer to reclaim dentries and inodes. So that would seem to indicate that 
vfs_cache_pressure does have control over dentries (i.e. the directory blocks 
which contain the list of child inodes). What settings have you used for 
vfs_cache_pressure so far?

Not a linux filesystem expert, but here's where I think the confusion is:
- inodes are cached in ext4_inode slab
- dentries are cached in dentry slab
- directory blocks are cached in the buffer cache
- lookups (e.g. stat /subdir1/subdir2/blk_0) can be satisfied with the 
dentry+inode cache
- readdir cannot be satisfied by the dentry cache, it needs to see the blocks 
from the disk (hence the buffer cache)

I can somewhat protect the inode+dentry by setting vfs_cache_pressure to 1 
(setting to 0 can be very bad because negative dentries can fill up your entire 
memory, I think). I tried setting vfs_cache_pressure to 0, and it didn't seem 
to help the case we are seeing.

I used blktrace to capture what was happening when a node was doing this. I 
then dumped the raw data at the offsets captured by blktrace. The data showed 
that the seeks were all the result of reading directory blocks, not inodes.

bq. I think if we're going to change the on-disk layout format again, we should 
change the way we name meta files. Currently, we encode the genstamp in the 
file name, like blk_1073741915_1091.meta. This means that to look up the meta 
file for block 1073741915, we have to iterate through every file in the 
subdirectory until we find it. Instead, we could simply name the meta file as 
blk_107374191.meta and put the genstamp number in the meta file header. This 
would allow us to move to a scheme which had a very large number of blocks in 
each directory (perhaps a simple 1-level hashing scheme) and the dentries would 
always be hot. ext4 and other modern Linux filesystems deal very effectively 
with large directories-- it's only ext2 and ext3 without certain options 
enabled that had problems.

I'm a little confused about iterating to find the meta file. Don't we already 
keep track of the genstamp we discovered during startup? If so, it seems like a 
simple stat is sufficient.
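
For example (illustrative names only; the blk_<id>_<genstamp>.meta naming is as 
quoted above), looking up the meta file with a known genstamp is a single 
exists()/stat call rather than a directory listing:

{noformat}
// Illustrative sketch only: with the genstamp already held in memory, locating
// the meta file does not require iterating the subdirectory.
import java.io.File;

public class MetaLookup {
  static boolean metaExists(File blockDir, long blockId, long genStamp) {
    File meta = new File(blockDir, "blk_" + blockId + "_" + genStamp + ".meta");
    return meta.exists();   // one stat of e.g. blk_1073741915_1091.meta
  }
}
{noformat}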

I haven't tried xfs, but that would also be a REALLY heavy hammer in our case;) 
 





 block ID-based DN storage layout can be very slow for datanode on ext4
 --

 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Priority: Critical

 We are seeing cases where the new directory layout causes the datanode to 
 basically cause the disks to seek for 10s of minutes. This can be when the 
 datanode is running du, and it can also be when it is performing a 
 checkDirs(). Both of these operations currently scan all directories in the 
 block pool and that's very expensive in the new layout.
 The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
 leaf directories where block files are placed.
 So, what we have on disk is:
 - 256 inodes for the first level directories
 - 256 directory blocks for the first level directories
 - 256*256 inodes for the second level directories
 - 256*256 directory blocks for the second level directories
 - Then the inodes and blocks to store the HDFS blocks themselves.
 The main problem is the 256*256 directory blocks. 
 inodes and dentries will be cached by linux and one can configure how likely 
 the system is to prune those entries (vfs_cache_pressure). However, ext4 
 relies on the buffer cache to cache the directory blocks and I'm not aware of 
 any way to tell linux to favor buffer cache pages (even if it did I'm not 
 sure I would want it to in general).
 Also, ext4 tries hard to spread directories evenly across the entire volume, 
 this basically means the 64K directory blocks are probably randomly spread 
 across the entire disk. A du type scan will look at directories one at a 
 time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
 seeks will be random and far. 
 In a system I was using to diagnose this, I had 60K blocks. A DU when things 
 are hot is less than 1 second. When things are cold, about 20 minutes.

[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-07-20 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633669#comment-14633669
 ] 

Nathan Roberts commented on HDFS-8791:
--

Hi [~cmccabe]. Thanks for the idea. Yes, I had actually tried something like 
that. I actually just kept a loop of DU's running on the node (outside of the 
datanode process for simplicity sake). I thought this would prevent it from 
happening but it turns out it still gets into this situation. I suspect the 
reason is that when there is memory pressure, it will start to seek a little, 
and then once it starts to seek a little the system quickly degrades because 
buffers are being thrown away faster than the disks can seek. 

 block ID-based DN storage layout can be very slow for datanode on ext4
 --

 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Priority: Critical

 We are seeing cases where the new directory layout causes the datanode to 
 basically cause the disks to seek for 10s of minutes. This can be when the 
 datanode is running du, and it can also be when it is performing a 
 checkDirs(). Both of these operations currently scan all directories in the 
 block pool and that's very expensive in the new layout.
 The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
 leaf directories where block files are placed.
 So, what we have on disk is:
 - 256 inodes for the first level directories
 - 256 directory blocks for the first level directories
 - 256*256 inodes for the second level directories
 - 256*256 directory blocks for the second level directories
 - Then the inodes and blocks to store the HDFS blocks themselves.
 The main problem is the 256*256 directory blocks. 
 inodes and dentries will be cached by linux and one can configure how likely 
 the system is to prune those entries (vfs_cache_pressure). However, ext4 
 relies on the buffer cache to cache the directory blocks and I'm not aware of 
 any way to tell linux to favor buffer cache pages (even if it did I'm not 
 sure I would want it to in general).
 Also, ext4 tries hard to spread directories evenly across the entire volume, 
 this basically means the 64K directory blocks are probably randomly spread 
 across the entire disk. A du type scan will look at directories one at a 
 time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
 seeks will be random and far. 
 In a system I was using to diagnose this, I had 60K blocks. A DU when things 
 are hot is less than 1 second. When things are cold, about 20 minutes.
 How do things get cold?
 - A large set of tasks run on the node. This pushes almost all of the buffer 
 cache out, causing the next DU to hit this situation. We are seeing cases 
 where a large job can cause a seek storm across the entire cluster.
 Why didn't the previous layout see this?
 - It might have but it wasn't nearly as pronounced. The previous layout would 
 be a few hundred directory blocks. Even when completely cold, these would 
 only take a few hundred seeks, which would mean single-digit seconds.  
 - With only a few hundred directories, the odds of the directory blocks 
 getting modified is quite high, this keeps those blocks hot and much less 
 likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-07-20 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633804#comment-14633804
 ] 

Nathan Roberts commented on HDFS-8791:
--

I agree we should optimize all the potential scans (du, checkDirs, 
directoryScanner, etc)

I also think we need to do something more general, because I feel like people 
will trip on this in all sorts of ways. Even tools outside of the DN process 
that do periodic scans will be affected and will in turn adversely affect the 
datanode's performance. Also, it's hard to see this problem until you're 
running at scale, so it will be difficult to catch jiras that introduce yet 
another scan, because they run really fast when everything is in memory.

I'm wondering if we shouldn't move to a hashing scheme that is more dynamic and 
grows/shrinks based on the number of blocks in the volume. A consistent hash to 
minimize renames, plus some logic that knows how to look in two places (old 
hash, new hash), seems like it might work. We could set a threshold of avg 100 
blocks per directory; when we cross that threshold, we add enough subdirs 
to bring the avg down to 95 (rough sketch below). 
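
A strawman of that grow rule (the 100/95 numbers are just the ones above, 
nothing more):

{noformat}
// Strawman only: grow the leaf-directory count once the average blocks per
// directory exceeds 100, sizing so the average falls back to roughly 95.
public class LeafDirSizer {
  static long targetDirCount(long blockCount, long currentDirs) {
    if (blockCount <= 100L * currentDirs) {
      return currentDirs;               // still under the threshold
    }
    return (blockCount + 94) / 95;      // ceil(blocks / 95)
  }
}
{noformat}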

I think ext2 and ext3 will see a similar problem. Are you seeing something 
different? I'll admit that my understanding of the differences isn't 
exhaustive, but it sure seems like all of them rely on the buffer cache to 
maintain directory blocks and all of them try to spread directories across the 
disk, so they'd all be subject to the same sort of thing. 


 block ID-based DN storage layout can be very slow for datanode on ext4
 --

 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Priority: Critical

 We are seeing cases where the new directory layout causes the datanode to 
 basically cause the disks to seek for 10s of minutes. This can be when the 
 datanode is running du, and it can also be when it is performing a 
 checkDirs(). Both of these operations currently scan all directories in the 
 block pool and that's very expensive in the new layout.
 The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
 leaf directories where block files are placed.
 So, what we have on disk is:
 - 256 inodes for the first level directories
 - 256 directory blocks for the first level directories
 - 256*256 inodes for the second level directories
 - 256*256 directory blocks for the second level directories
 - Then the inodes and blocks to store the HDFS blocks themselves.
 The main problem is the 256*256 directory blocks. 
 inodes and dentries will be cached by linux and one can configure how likely 
 the system is to prune those entries (vfs_cache_pressure). However, ext4 
 relies on the buffer cache to cache the directory blocks and I'm not aware of 
 any way to tell linux to favor buffer cache pages (even if it did I'm not 
 sure I would want it to in general).
 Also, ext4 tries hard to spread directories evenly across the entire volume, 
 this basically means the 64K directory blocks are probably randomly spread 
 across the entire disk. A du type scan will look at directories one at a 
 time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
 seeks will be random and far. 
 In a system I was using to diagnose this, I had 60K blocks. A DU when things 
 are hot is less than 1 second. When things are cold, about 20 minutes.
 How do things get cold?
 - A large set of tasks run on the node. This pushes almost all of the buffer 
 cache out, causing the next DU to hit this situation. We are seeing cases 
 where a large job can cause a seek storm across the entire cluster.
 Why didn't the previous layout see this?
 - It might have but it wasn't nearly as pronounced. The previous layout would 
 be a few hundred directory blocks. Even when completely cold, these would 
 only take a few hundred seeks which would mean single digit seconds.
 - With only a few hundred directories, the odds of the directory blocks 
 getting modified is quite high, this keeps those blocks hot and much less 
 likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-07-20 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633675#comment-14633675
 ] 

Nathan Roberts commented on HDFS-8791:
--

I forgot to mention that I'm pretty confident it's not the inodes, but rather 
the directory blocks. Inodes have their own cache that I can control with 
vfs_cache_pressure. Directory blocks, however, are just cached via the buffer 
cache (afaik), and the buffer cache is much more difficult to have any control 
over.

 block ID-based DN storage layout can be very slow for datanode on ext4
 --

 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Priority: Critical

 We are seeing cases where the new directory layout causes the datanode to 
 basically cause the disks to seek for 10s of minutes. This can be when the 
 datanode is running du, and it can also be when it is performing a 
 checkDirs(). Both of these operations currently scan all directories in the 
 block pool and that's very expensive in the new layout.
 The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
 leaf directories where block files are placed.
 So, what we have on disk is:
 - 256 inodes for the first level directories
 - 256 directory blocks for the first level directories
 - 256*256 inodes for the second level directories
 - 256*256 directory blocks for the second level directories
 - Then the inodes and blocks to store the HDFS blocks themselves.
 The main problem is the 256*256 directory blocks. 
 inodes and dentries will be cached by linux and one can configure how likely 
 the system is to prune those entries (vfs_cache_pressure). However, ext4 
 relies on the buffer cache to cache the directory blocks and I'm not aware of 
 any way to tell linux to favor buffer cache pages (even if it did I'm not 
 sure I would want it to in general).
 Also, ext4 tries hard to spread directories evenly across the entire volume, 
 this basically means the 64K directory blocks are probably randomly spread 
 across the entire disk. A du type scan will look at directories one at a 
 time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
 seeks will be random and far. 
 In a system I was using to diagnose this, I had 60K blocks. A DU when things 
 are hot is less than 1 second. When things are cold, about 20 minutes.
 How do things get cold?
 - A large set of tasks run on the node. This pushes almost all of the buffer 
 cache out, causing the next DU to hit this situation. We are seeing cases 
 where a large job can cause a seek storm across the entire cluster.
 Why didn't the previous layout see this?
 - It might have but it wasn't nearly as pronounced. The previous layout would 
 be a few hundred directory blocks. Even when completely cold, these would 
 only take a few hundred seeks which would mean single digit seconds.
 - With only a few hundred directories, the odds of the directory blocks 
 getting modified is quite high, this keeps those blocks hot and much less 
 likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-07-17 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631334#comment-14631334
 ] 

Nathan Roberts commented on HDFS-8791:
--

Sure. A randomly sampled node had four 4TB drives; each had right around 38250 
directories with at least one block, so 64K-38250 were empty. The drives were 
about 80% full.

I see how there might be an optimization there, but I think we need to find a 
way to solve it for the more general case. Either the DN must never scan (or at 
least scan at a rate that will not be intrusive), or maybe we should reconsider 
the 64K breadth - a small number of files per directory is probably going to 
cause performance issues on many filesystems. 
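
As a rough illustration of the "scan at a rate that will not be intrusive" option,
a paced directory walk is sketched below. The 50 ms pacing and the standalone class
are assumptions for illustration, not the DirectoryScanner's actual structure.

{code}
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

/** Throttled directory walk: bounds the seek rate a cold scan can impose on a disk. */
class ThrottledScanner {
  private static final long DELAY_BETWEEN_DIRS_MS = 50; // assumed pacing, tune per disk

  static long countBlockFiles(File root) throws InterruptedException {
    long files = 0;
    Deque<File> pending = new ArrayDeque<>();
    pending.push(root);
    while (!pending.isEmpty()) {
      File dir = pending.pop();
      File[] entries = dir.listFiles();    // one readdir, roughly one seek when cold
      if (entries == null) {
        continue;                          // unreadable or concurrently removed directory
      }
      for (File e : entries) {
        if (e.isDirectory()) {
          pending.push(e);
        } else {
          files++;
        }
      }
      Thread.sleep(DELAY_BETWEEN_DIRS_MS); // spread the I/O out instead of a seek storm
    }
    return files;
  }
}
{code}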


 block ID-based DN storage layout can be very slow for datanode on ext4
 --

 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.1
Reporter: Nathan Roberts
Priority: Critical

 We are seeing cases where the new directory layout causes the datanode to 
 basically cause the disks to seek for 10s of minutes. This can be when the 
 datanode is running du, and it can also be when it is performing a 
 checkDirs(). Both of these operations currently scan all directories in the 
 block pool and that's very expensive in the new layout.
 The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
 leaf directories where block files are placed.
 So, what we have on disk is:
 - 256 inodes for the first level directories
 - 256 directory blocks for the first level directories
 - 256*256 inodes for the second level directories
 - 256*256 directory blocks for the second level directories
 - Then the inodes and blocks to store the HDFS blocks themselves.
 The main problem is the 256*256 directory blocks. 
 inodes and dentries will be cached by linux and one can configure how likely 
 the system is to prune those entries (vfs_cache_pressure). However, ext4 
 relies on the buffer cache to cache the directory blocks and I'm not aware of 
 any way to tell linux to favor buffer cache pages (even if it did I'm not 
 sure I would want it to in general).
 Also, ext4 tries hard to spread directories evenly across the entire volume, 
 this basically means the 64K directory blocks are probably randomly spread 
 across the entire disk. A du type scan will look at directories one at a 
 time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
 seeks will be random and far. 
 In a system I was using to diagnose this, I had 60K blocks. A DU when things 
 are hot is less than 1 second. When things are cold, about 20 minutes.
 How do things get cold?
 - A large set of tasks run on the node. This pushes almost all of the buffer 
 cache out, causing the next DU to hit this situation. We are seeing cases 
 where a large job can cause a seek storm across the entire cluster.
 Why didn't the previous layout see this?
 - It might have but it wasn't nearly as pronounced. The previous layout would 
 be a few hundred directory blocks. Even when completely cold, these would 
 only take a few hundred seeks which would mean single digit seconds.
 - With only a few hundred directories, the odds of the directory blocks 
 getting modified is quite high, this keeps those blocks hot and much less 
 likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-07-16 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-8791:


 Summary: block ID-based DN storage layout can be very slow for 
datanode on ext4
 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.1
Reporter: Nathan Roberts
Priority: Critical


We are seeing cases where the new directory layout causes the datanode to 
basically cause the disks to seek for 10s of minutes. This can be when the 
datanode is running du, and it can also be when it is performing a checkDirs(). 
Both of these operations currently scan all directories in the block pool and 
that's very expensive in the new layout.

The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf 
directories where block files are placed.

So, what we have on disk is:
- 256 inodes for the first level directories
- 256 directory blocks for the first level directories
- 256*256 inodes for the second level directories
- 256*256 directory blocks for the second level directories
- Then the inodes and blocks to store the HDFS blocks themselves.

The main problem is the 256*256 directory blocks. 

inodes and dentries will be cached by linux and one can configure how likely 
the system is to prune those entries (vfs_cache_pressure). However, ext4 relies 
on the buffer cache to cache the directory blocks and I'm not aware of any way 
to tell linux to favor buffer cache pages (even if it did I'm not sure I would 
want it to in general).

Also, ext4 tries hard to spread directories evenly across the entire volume, 
this basically means the 64K directory blocks are probably randomly spread 
across the entire disk. A du type scan will look at directories one at a time, 
so the ioscheduler can't optimize the corresponding seeks, meaning the seeks 
will be random and far. 

In a system I was using to diagnose this, I had 60K blocks. A DU when things 
are hot is less than 1 second. When things are cold, about 20 minutes.

How do things get cold?
- A large set of tasks run on the node. This pushes almost all of the buffer 
cache out, causing the next DU to hit this situation. We are seeing cases where 
a large job can cause a seek storm across the entire cluster.

Why didn't the previous layout see this?
- It might have but it wasn't nearly as pronounced. The previous layout would 
be a few hundred directory blocks. Even when completely cold, these would only 
take a few hundred seeks which would mean single digit seconds.  
- With only a few hundred directories, the odds of the directory blocks getting 
modified is quite high, this keeps those blocks hot and much less likely to be 
evicted.
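
A back-of-the-envelope check of the numbers above, assuming roughly 10 ms per cold
directory-block read on a spinning disk (the 10 ms figure is an assumption; the
directory counts come from the description):

{code}
/** Rough cold-scan cost estimate: one seek per uncached directory block. */
class ColdScanEstimate {
  public static void main(String[] args) {
    double seekMs = 10.0;                    // assumed average seek+read per directory block
    long newLayoutDirs = 256L + 256L * 256L; // first level plus 64K leaf directories
    long oldLayoutDirs = 300L;               // "a few hundred" directory blocks previously

    System.out.printf("new layout: ~%.1f minutes%n", newLayoutDirs * seekMs / 1000 / 60);
    System.out.printf("old layout: ~%.1f seconds%n", oldLayoutDirs * seekMs / 1000);
  }
}
{code}

That puts the new layout at around 11 minutes and the old layout at a few seconds,
the same order of magnitude as the 20 minutes versus single-digit seconds reported
above.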




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2015-07-16 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630493#comment-14630493
 ] 

Nathan Roberts commented on HDFS-8791:
--

For reference, the stack trace when du is obviously blocking on disk I/O
{noformat}
[811bf1a0] sync_buffer+0x40/0x50
[811bf156] __wait_on_buffer+0x26/0x30
[a02fd9a4] ext4_bread+0x64/0x80 [ext4]
[a0302aa8] htree_dirblock_to_tree+0x38/0x190 [ext4]
[a0303548] ext4_htree_fill_tree+0xa8/0x260 [ext4]
[a02f43c7] ext4_readdir+0x127/0x700 [ext4]
[8119f030] vfs_readdir+0xc0/0xe0
[8119f1b9] sys_getdents+0x89/0xf0
[8100b072] system_call_fastpath+0x16/0x1b
{noformat}

 block ID-based DN storage layout can be very slow for datanode on ext4
 --

 Key: HDFS-8791
 URL: https://issues.apache.org/jira/browse/HDFS-8791
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.1
Reporter: Nathan Roberts
Priority: Critical

 We are seeing cases where the new directory layout causes the datanode to 
 basically cause the disks to seek for 10s of minutes. This can be when the 
 datanode is running du, and it can also be when it is performing a 
 checkDirs(). Both of these operations currently scan all directories in the 
 block pool and that's very expensive in the new layout.
 The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
 leaf directories where block files are placed.
 So, what we have on disk is:
 - 256 inodes for the first level directories
 - 256 directory blocks for the first level directories
 - 256*256 inodes for the second level directories
 - 256*256 directory blocks for the second level directories
 - Then the inodes and blocks to store the HDFS blocks themselves.
 The main problem is the 256*256 directory blocks. 
 inodes and dentries will be cached by linux and one can configure how likely 
 the system is to prune those entries (vfs_cache_pressure). However, ext4 
 relies on the buffer cache to cache the directory blocks and I'm not aware of 
 any way to tell linux to favor buffer cache pages (even if it did I'm not 
 sure I would want it to in general).
 Also, ext4 tries hard to spread directories evenly across the entire volume, 
 this basically means the 64K directory blocks are probably randomly spread 
 across the entire disk. A du type scan will look at directories one at a 
 time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
 seeks will be random and far. 
 In a system I was using to diagnose this, I had 60K blocks. A DU when things 
 are hot is less than 1 second. When things are cold, about 20 minutes.
 How do things get cold?
 - A large set of tasks run on the node. This pushes almost all of the buffer 
 cache out, causing the next DU to hit this situation. We are seeing cases 
 where a large job can cause a seek storm across the entire cluster.
 Why didn't the previous layout see this?
 - It might have but it wasn't nearly as pronounced. The previous layout would 
 be a few hundred directory blocks. Even when completely cold, these would 
 only take a few hundred seeks which would mean single digit seconds.
 - With only a few hundred directories, the odds of the directory blocks 
 getting modified is quite high, this keeps those blocks hot and much less 
 likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4660) Block corruption can happen during pipeline recovery

2015-06-16 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14588695#comment-14588695
 ] 

Nathan Roberts commented on HDFS-4660:
--

+1 on the patch. I have reviewed the patch previously and it is currently 
running in production at scale. 

The stress test we ran against this in 
https://issues.apache.org/jira/browse/HDFS-4660?focusedCommentId=14542862page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14542862
 heavily exercised this path. 


 Block corruption can happen during pipeline recovery
 

 Key: HDFS-4660
 URL: https://issues.apache.org/jira/browse/HDFS-4660
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0, 2.0.3-alpha
Reporter: Peng Zhang
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-4660.patch, HDFS-4660.patch, HDFS-4660.v2.patch


 pipeline DN1 > DN2 > DN3
 stop DN2
 pipeline added node DN4 located at 2nd position
 DN1 > DN4 > DN3
 recover RBW
 DN4 after recover rbw
 2013-04-01 21:02:31,570 INFO 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
 RBW replica 
 BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
 2013-04-01 21:02:31,570 INFO 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
 Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
   getNumBytes() = 134144
   getBytesOnDisk() = 134144
   getVisibleLength()= 134144
 end at chunk (134144/512=262)
 DN3 after recover rbw
 2013-04-01 21:02:31,575 INFO 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
 RBW replica 
 BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_10042013-04-01
  21:02:31,575 INFO 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
 Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
   getNumBytes() = 134028 
   getBytesOnDisk() = 134028
   getVisibleLength()= 134028
 client send packet after recover pipeline
 offset=133632  len=1008
 DN4 after flush 
 2013-04-01 21:02:31,779 DEBUG 
 org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
 offset:134640; meta offset:1063
 // meta end position should be floor(134640/512)*4 + 7 == 1059, but now it is 
 1063.
 DN3 after flush
 2013-04-01 21:02:31,782 DEBUG 
 org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: 
 BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, 
 type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, 
 lastPacketInBlock=false, offsetInBlock=134640, 
 ackEnqueueNanoTime=8817026136871545)
 2013-04-01 21:02:31,782 DEBUG 
 org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing 
 meta file offset of block 
 BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 
 1055 to 1051
 2013-04-01 21:02:31,782 DEBUG 
 org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file 
 offset:134640; meta offset:1059
 After checking meta on DN4, I found checksum of chunk 262 is duplicated, but 
 data not.
 Later after block was finalized, DN4's scanner detected bad block, and then 
 reported it to NM. NM send a command to delete this block, and replicate this 
 block from other DN in pipeline to satisfy duplication num.
 I think this is because in BlockReceiver it skips data bytes already written, 
 but not skips checksum bytes already written. And function 
 adjustCrcFilePosition is only used for last non-completed chunk, but
 not for this situation.
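
For readers following the offsets above, a small worked check of the expected
meta-file position, assuming the 7-byte header and one 4-byte CRC per 512-byte
chunk (counting the partial final chunk) implied by the description:

{code}
/** Expected meta-file end position for a given number of data bytes on disk. */
class MetaOffsetCheck {
  static long expectedMetaOffset(long dataBytes, int bytesPerChecksum,
                                 int checksumSize, int headerSize) {
    long chunks = (dataBytes + bytesPerChecksum - 1) / bytesPerChecksum; // ceil, partial chunk too
    return headerSize + chunks * checksumSize;
  }

  public static void main(String[] args) {
    // 134640 data bytes -> 263 chunks -> 7 + 263*4 = 1059, matching DN3's meta offset;
    // DN4's 1063 is exactly one duplicated 4-byte checksum beyond that.
    System.out.println(expectedMetaOffset(134640, 512, 4, 7));
  }
}
{code}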



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp

2015-05-15 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-8404:
-
Attachment: HDFS-8404-v1.patch

Thanks [~kihwal]. Updated based on latest trunk.

 pending block replication can get stuck using older genstamp
 

 Key: HDFS-8404
 URL: https://issues.apache.org/jira/browse/HDFS-8404
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0, 2.7.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-8404-v0.patch, HDFS-8404-v1.patch


 If an under-replicated block gets into the pending-replication list, but 
 later the  genstamp of that block ends up being newer than the one originally 
 submitted for replication, the block will fail replication until the NN is 
 restarted. 
 It will be safer if processPendingReplications()  gets up-to-date blockinfo 
 before resubmitting replication work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp

2015-05-15 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-8404:
-
Status: Open  (was: Patch Available)

 pending block replication can get stuck using older genstamp
 

 Key: HDFS-8404
 URL: https://issues.apache.org/jira/browse/HDFS-8404
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.7.0, 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-8404-v0.patch


 If an under-replicated block gets into the pending-replication list, but 
 later the  genstamp of that block ends up being newer than the one originally 
 submitted for replication, the block will fail replication until the NN is 
 restarted. 
 It will be safer if processPendingReplications()  gets up-to-date blockinfo 
 before resubmitting replication work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp

2015-05-15 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-8404:
-
Status: Patch Available  (was: Open)

 pending block replication can get stuck using older genstamp
 

 Key: HDFS-8404
 URL: https://issues.apache.org/jira/browse/HDFS-8404
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.7.0, 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-8404-v0.patch, HDFS-8404-v1.patch


 If an under-replicated block gets into the pending-replication list, but 
 later the  genstamp of that block ends up being newer than the one originally 
 submitted for replication, the block will fail replication until the NN is 
 restarted. 
 It will be safer if processPendingReplications()  gets up-to-date blockinfo 
 before resubmitting replication work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8404) pending block replication can get stuck using older genstamp

2015-05-14 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-8404:


 Summary: pending block replication can get stuck using older 
genstamp
 Key: HDFS-8404
 URL: https://issues.apache.org/jira/browse/HDFS-8404
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.7.0, 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


If an under-replicated block gets into the pending-replication list, but later 
the  genstamp of that block ends up being newer than the one originally 
submitted for replication, the block will fail replication until the NN is 
restarted. 

It will be safer if processPendingReplications()  gets up-to-date blockinfo 
before resubmitting replication work.
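
A standalone sketch of that idea; the map-based lookup stands in for the NameNode's
block map, and none of the names here are the real BlockManager fields:

{code}
import java.util.List;
import java.util.Map;

/** Sketch: resolve the current block info (and genstamp) before re-queueing timed-out work. */
class PendingResubmitSketch {
  static class BlockInfo {
    final long id;
    final long genstamp; // the stale copy in the pending list may carry an older genstamp
    BlockInfo(long id, long genstamp) { this.id = id; this.genstamp = genstamp; }
  }

  static void resubmitTimedOut(Iterable<BlockInfo> timedOut, Map<Long, BlockInfo> blocksMap,
                               List<BlockInfo> neededReplications) {
    for (BlockInfo stale : timedOut) {
      BlockInfo current = blocksMap.get(stale.id); // up-to-date info, current genstamp
      if (current != null) {
        neededReplications.add(current);           // never re-queue the stale object
      }
    }
  }
}
{code}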






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp

2015-05-14 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-8404:
-
Attachment: HDFS-8404-v0.patch

 pending block replication can get stuck using older genstamp
 

 Key: HDFS-8404
 URL: https://issues.apache.org/jira/browse/HDFS-8404
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0, 2.7.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-8404-v0.patch


 If an under-replicated block gets into the pending-replication list, but 
 later the  genstamp of that block ends up being newer than the one originally 
 submitted for replication, the block will fail replication until the NN is 
 restarted. 
 It will be safer if processPendingReplications()  gets up-to-date blockinfo 
 before resubmitting replication work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8404) pending block replication can get stuck using older genstamp

2015-05-14 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-8404:
-
Status: Patch Available  (was: Open)

 pending block replication can get stuck using older genstamp
 

 Key: HDFS-8404
 URL: https://issues.apache.org/jira/browse/HDFS-8404
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.7.0, 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-8404-v0.patch


 If an under-replicated block gets into the pending-replication list, but 
 later the  genstamp of that block ends up being newer than the one originally 
 submitted for replication, the block will fail replication until the NN is 
 restarted. 
 It will be safer if processPendingReplications()  gets up-to-date blockinfo 
 before resubmitting replication work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8041) Consider remaining space during block blockplacement if dfs space is highly utilized

2015-04-02 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14393437#comment-14393437
 ] 

Nathan Roberts commented on HDFS-8041:
--

Hi [~kihwal]. Some minor comments on the patch
+ Can we bounds check the new config? I think it works fine even without it but 
just to be safe against a change to the algorithm in the future.
+ I wish there was a way to make this config refreshable. Unfortunately I don't 
think that's possible today. 
+ Should we protect against stats.getNumDatanodesInService being 0? Again, 
probably ok as it is today but just to avoid a future patch from breaking the 
assumptions (a defensive sketch of this and the bounds check is below).
+ Node local writes are not impacted by the change. Maybe we should also have 
rack-local writes avoid this check so that the 2nd and 3rd replicas remain in 
the same rack. I think just having this impact the completely random target 
selections might be enough to avoid the problem while minimizing the effects on 
block placement.
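
A defensive sketch of the first and third points; the percent units, the clamping
behavior, and the method names are placeholders for illustration, not the actual
keys or logic in the patch.

{code}
/** Defensive handling of the new threshold config and of an empty cluster. */
class PlacementThresholdSketch {
  /** Clamp a misconfigured threshold instead of letting it skew the placement math. */
  static double sanitizeThresholdPercent(double configured) {
    if (configured < 0.0) return 0.0;
    if (configured > 100.0) return 100.0;
    return configured;
  }

  /** True if overall utilization is above the threshold; safe when no DNs are in service. */
  static boolean overThreshold(long totalUsed, long totalCapacity,
                               int datanodesInService, double thresholdPercent) {
    if (datanodesInService <= 0 || totalCapacity <= 0) {
      return false; // nothing in service yet; don't divide by zero or alter placement
    }
    return 100.0 * totalUsed / totalCapacity > thresholdPercent;
  }
}
{code}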

 Consider remaining space during block blockplacement if dfs space is highly 
 utilized
 

 Key: HDFS-8041
 URL: https://issues.apache.org/jira/browse/HDFS-8041
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Kihwal Lee
Assignee: Kihwal Lee
 Attachments: HDFS-8041.v1.patch, HDFS-8041.v2.patch


 This feature is helpful in avoiding smaller nodes (i.e. heterogeneous 
 environment) getting constantly being full when the overall space utilization 
 is over a certain threshold.  When the utilization is low, balancer can keep 
 up, but once the average per-node byte goes over the capacity of the smaller 
 nodes, they get full so quickly even after perfect balance.
 This jira proposes an improvement that can be optionally enabled in order to 
 slow down the rate of space usage growth of smaller nodes if the overall 
 storage utilization is over a configured threshold.  It will not replace 
 balancer, rather will help balancer keep up. Also, the primary replica 
 placement will not be affected. Only the replicas typically placed in a 
 remote rack will be subject to this check.
 The appropriate threshold is cluster configuration specific. There is no 
 generally good value to set, thus it is disabled by default. We have seen 
 cases where the threshold of 85% - 90% would help. Figuring when 
 {{totalSpaceUsed / numNodes}} becomes close to the capacity of a smaller node 
 is helpful in determining the threshold.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods

2015-03-30 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386724#comment-14386724
 ] 

Nathan Roberts commented on HDFS-7742:
--

Thanks for the review, Nicholas!

There is a test in the patch. Are you asking for a specific test case to be 
added?

The test failure from the QA bot (TestMalformedURLs) should be unrelated. 

 favoring decommissioning node for replication can cause a block to stay 
 underreplicated for long periods
 

 Key: HDFS-7742
 URL: https://issues.apache.org/jira/browse/HDFS-7742
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-7742-v0.patch


 When choosing a source node to replicate a block from, a decommissioning node 
 is favored. The reason for the favoritism is that decommissioning nodes 
 aren't servicing any writes so in-theory they are less loaded.
 However, the same selection algorithm also tries to make sure it doesn't get 
 stuck on any particular node:
 {noformat}
   // switch to a different node randomly
   // this to prevent from deterministically selecting the same node even
   // if the node failed to replicate the block on previous iterations
 {noformat}
 Unfortunately, the decommissioning check is prior to this randomness so the 
 algorithm can get stuck trying to replicate from a decommissioning node. 
 We've seen this in practice where a decommissioning datanode was failing to 
 replicate a block for many days, when other viable replicas of the block were 
 available.
 Given that we limit the number of streams we'll assign to a given node 
 (default soft limit of 2, hard limit of 4), it doesn't seem like favoring a 
 decommissioning node has significant benefit. i.e. when there is significant 
 replication work to do, we'll quickly hit the stream limit of the 
 decommissioning nodes and use other nodes in the cluster anyway; when there 
 isn't significant replication work then in theory we've got plenty of 
 replication bandwidth available so choosing a decommissioning node isn't much 
 of a win.
 I see two choices:
 1) Change the algorithm to still favor decommissioning nodes but with some 
 level of randomness that will avoid always selecting the decommissioning node
 2) Remove the favoritism for decommissioning nodes
 I prefer #2. It simplifies the algorithm, and given the other throttles we 
 have in place, I'm not sure there is a significant benefit to selecting 
 decommissioning nodes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods

2015-03-27 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-7742:
-
Target Version/s: 3.0.0  (was: 3.0.0, 2.7.0)

 favoring decommissioning node for replication can cause a block to stay 
 underreplicated for long periods
 

 Key: HDFS-7742
 URL: https://issues.apache.org/jira/browse/HDFS-7742
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-7742-v0.patch


 When choosing a source node to replicate a block from, a decommissioning node 
 is favored. The reason for the favoritism is that decommissioning nodes 
 aren't servicing any writes so in-theory they are less loaded.
 However, the same selection algorithm also tries to make sure it doesn't get 
 stuck on any particular node:
 {noformat}
   // switch to a different node randomly
   // this to prevent from deterministically selecting the same node even
   // if the node failed to replicate the block on previous iterations
 {noformat}
 Unfortunately, the decommissioning check is prior to this randomness so the 
 algorithm can get stuck trying to replicate from a decommissioning node. 
 We've seen this in practice where a decommissioning datanode was failing to 
 replicate a block for many days, when other viable replicas of the block were 
 available.
 Given that we limit the number of streams we'll assign to a given node 
 (default soft limit of 2, hard limit of 4), it doesn't seem like favoring a 
 decommissioning node has significant benefit. i.e. when there is significant 
 replication work to do, we'll quickly hit the stream limit of the 
 decommissioning nodes and use other nodes in the cluster anyway; when there 
 isn't significant replication work then in theory we've got plenty of 
 replication bandwidth available so choosing a decommissioning node isn't much 
 of a win.
 I see two choices:
 1) Change the algorithm to still favor decommissioning nodes but with some 
 level of randomness that will avoid always selecting the decommissioning node
 2) Remove the favoritism for decommissioning nodes
 I prefer #2. It simplifies the algorithm, and given the other throttles we 
 have in place, I'm not sure there is a significant benefit to selecting 
 decommissioning nodes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods

2015-03-27 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-7742:
-
Attachment: HDFS-7742-v0.patch

Attached patch. Favors decommissioning nodes a bit by allowing them to go up to 
hard limit, otherwise not at all.
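
Roughly what that behavior amounts to, as a standalone sketch; the 2/4 defaults
come from the description below, and nothing here is the actual BlockManager code.

{code}
/** Sketch: a decommissioning replica is only preferred while under the hard stream limit. */
class SourceChoiceSketch {
  static final int SOFT_LIMIT = 2; // default replication streams per node
  static final int HARD_LIMIT = 4; // absolute cap per node

  static boolean isEligibleSource(boolean decommissioning, int activeStreams) {
    int limit = decommissioning ? HARD_LIMIT : SOFT_LIMIT;
    return activeStreams < limit;
  }
}
{code}

Combined with the existing random switch among the remaining candidates, a busy
decommissioning replica can no longer pin a block's replication indefinitely.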

 favoring decommissioning node for replication can cause a block to stay 
 underreplicated for long periods
 

 Key: HDFS-7742
 URL: https://issues.apache.org/jira/browse/HDFS-7742
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-7742-v0.patch


 When choosing a source node to replicate a block from, a decommissioning node 
 is favored. The reason for the favoritism is that decommissioning nodes 
 aren't servicing any writes so in-theory they are less loaded.
 However, the same selection algorithm also tries to make sure it doesn't get 
 stuck on any particular node:
 {noformat}
   // switch to a different node randomly
   // this to prevent from deterministically selecting the same node even
   // if the node failed to replicate the block on previous iterations
 {noformat}
 Unfortunately, the decommissioning check is prior to this randomness so the 
 algorithm can get stuck trying to replicate from a decommissioning node. 
 We've seen this in practice where a decommissioning datanode was failing to 
 replicate a block for many days, when other viable replicas of the block were 
 available.
 Given that we limit the number of streams we'll assign to a given node 
 (default soft limit of 2, hard limit of 4), it doesn't seem like favoring a 
 decommissioning node has significant benefit. i.e. when there is significant 
 replication work to do, we'll quickly hit the stream limit of the 
 decommissioning nodes and use other nodes in the cluster anyway; when there 
 isn't significant replication work then in theory we've got plenty of 
 replication bandwidth available so choosing a decommissioning node isn't much 
 of a win.
 I see two choices:
 1) Change the algorithm to still favor decommissioning nodes but with some 
 level of randomness that will avoid always selecting the decommissioning node
 2) Remove the favoritism for decommissioning nodes
 I prefer #2. It simplifies the algorithm, and given the other throttles we 
 have in place, I'm not sure there is a significant benefit to selecting 
 decommissioning nodes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods

2015-03-27 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-7742:
-
Target Version/s: 3.0.0, 2.7.0  (was: 3.0.0)
  Status: Patch Available  (was: Open)

 favoring decommissioning node for replication can cause a block to stay 
 underreplicated for long periods
 

 Key: HDFS-7742
 URL: https://issues.apache.org/jira/browse/HDFS-7742
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-7742-v0.patch


 When choosing a source node to replicate a block from, a decommissioning node 
 is favored. The reason for the favoritism is that decommissioning nodes 
 aren't servicing any writes so in-theory they are less loaded.
 However, the same selection algorithm also tries to make sure it doesn't get 
 stuck on any particular node:
 {noformat}
   // switch to a different node randomly
   // this to prevent from deterministically selecting the same node even
   // if the node failed to replicate the block on previous iterations
 {noformat}
 Unfortunately, the decommissioning check is prior to this randomness so the 
 algorithm can get stuck trying to replicate from a decommissioning node. 
 We've seen this in practice where a decommissioning datanode was failing to 
 replicate a block for many days, when other viable replicas of the block were 
 available.
 Given that we limit the number of streams we'll assign to a given node 
 (default soft limit of 2, hard limit of 4), it doesn't seem like favoring a 
 decommissioning node has significant benefit. i.e. when there is significant 
 replication work to do, we'll quickly hit the stream limit of the 
 decommissioning nodes and use other nodes in the cluster anyway; when there 
 isn't significant replication work then in theory we've got plenty of 
 replication bandwidth available so choosing a decommissioning node isn't much 
 of a win.
 I see two choices:
 1) Change the algorithm to still favor decommissioning nodes but with some 
 level of randomness that will avoid always selecting the decommissioning node
 2) Remove the favoritism for decommissioning nodes
 I prefer #2. It simplifies the algorithm, and given the other throttles we 
 have in place, I'm not sure there is a significant benefit to selecting 
 decommissioning nodes. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7742) favoring decommissioning node for replication can cause a block to stay underreplicated for long periods

2015-02-06 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-7742:


 Summary: favoring decommissioning node for replication can cause a 
block to stay underreplicated for long periods
 Key: HDFS-7742
 URL: https://issues.apache.org/jira/browse/HDFS-7742
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


When choosing a source node to replicate a block from, a decommissioning node 
is favored. The reason for the favoritism is that decommissioning nodes aren't 
servicing any writes so in-theory they are less loaded.

However, the same selection algorithm also tries to make sure it doesn't get 
stuck on any particular node:
{noformat}
  // switch to a different node randomly
  // this to prevent from deterministically selecting the same node even
  // if the node failed to replicate the block on previous iterations
{noformat}
Unfortunately, the decommissioning check is prior to this randomness so the 
algorithm can get stuck trying to replicate from a decommissioning node. We've 
seen this in practice where a decommissioning datanode was failing to replicate 
a block for many days, when other viable replicas of the block were available.

Given that we limit the number of streams we'll assign to a given node (default 
soft limit of 2, hard limit of 4), it doesn't seem like favoring a 
decommissioning node has significant benefit. i.e. when there is significant 
replication work to do, we'll quickly hit the stream limit of the 
decommissioning nodes and use other nodes in the cluster anyway; when there 
isn't significant replication work then in theory we've got plenty of 
replication bandwidth available so choosing a decommissioning node isn't much 
of a win.

I see two choices:
1) Change the algorithm to still favor decommissioning nodes but with some 
level of randomness that will avoid always selecting the decommissioning node
2) Remove the favoritism for decommissioning nodes

I prefer #2. It simplifies the algorithm, and given the other throttles we have 
in place, I'm not sure there is a significant benefit to selecting 
decommissioning nodes. 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7645) Rolling upgrade is restoring blocks from trash multiple times

2015-01-20 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-7645:


 Summary: Rolling upgrade is restoring blocks from trash multiple 
times
 Key: HDFS-7645
 URL: https://issues.apache.org/jira/browse/HDFS-7645
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
Reporter: Nathan Roberts


When performing an HDFS rolling upgrade, the trash directory is getting 
restored twice when under normal circumstances it shouldn't need to be restored 
at all. iiuc, the only time these blocks should be restored is if we need to 
rollback a rolling upgrade. 

On a busy cluster, this can cause significant and unnecessary block churn both 
on the datanodes, and more importantly in the namenode.

The two times this happens are:
1) restart of DN onto new software
{code}
  private void doTransition(DataNode datanode, StorageDirectory sd,
      NamespaceInfo nsInfo, StartupOption startOpt) throws IOException {
    if (startOpt == StartupOption.ROLLBACK && sd.getPreviousDir().exists()) {
      Preconditions.checkState(!getTrashRootDir(sd).exists(),
          sd.getPreviousDir() + " and " + getTrashRootDir(sd) + " should not " +
          " both be present.");
      doRollback(sd, nsInfo); // rollback if applicable
    } else {
      // Restore all the files in the trash. The restored files are retained
      // during rolling upgrade rollback. They are deleted during rolling
      // upgrade downgrade.
      int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
      LOG.info("Restored " + restored + " block files from trash.");
    }
{code}

2) When heartbeat response no longer indicates a rollingupgrade is in progress
{code}
  /**
   * Signal the current rolling upgrade status as indicated by the NN.
   * @param inProgress true if a rolling upgrade is in progress
   */
  void signalRollingUpgrade(boolean inProgress) throws IOException {
String bpid = getBlockPoolId();
if (inProgress) {
  dn.getFSDataset().enableTrash(bpid);
  dn.getFSDataset().setRollingUpgradeMarker(bpid);
} else {
  dn.getFSDataset().restoreTrash(bpid);
  dn.getFSDataset().clearRollingUpgradeMarker(bpid);
}
  }
{code}

HDFS-6800 and HDFS-6981 modified this behavior, making it not completely clear 
whether this is somehow intentional. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-12 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273764#comment-14273764
 ] 

Nathan Roberts commented on HDFS-7548:
--

- I think we need to prioritize a scan for that block.

- Also, some comments on addBlockToFirstLocation().
  - imo, WARN should be INFO. 
  - If this block has been scanned in the last 5 minutes (or some reasonable 
time frame), then maybe we shouldn't add it back to the list of blocks to be 
scanned. If all IOExceptions are going to re-prioritize the scan of a block, 
having a minimum delay between scans would avoid corner cases where a network 
glitch or badly behaving clients are causing IOExceptions that don't really 
warrant rescans (a minimal sketch of that check is below).
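
A minimal sketch of that minimum-delay check; the class and method names are
illustrative, not part of the block scanner.

{code}
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: only re-queue a block for an immediate scan if it hasn't been scanned recently. */
class RescanThrottle {
  private static final long MIN_RESCAN_INTERVAL_MS = 5 * 60 * 1000; // the "last 5 minutes" above
  private final ConcurrentHashMap<Long, Long> lastScanTime = new ConcurrentHashMap<>();

  boolean shouldRescan(long blockId, long nowMs) {
    Long last = lastScanTime.get(blockId);
    if (last != null && nowMs - last < MIN_RESCAN_INTERVAL_MS) {
      return false; // scanned recently; this IOException is probably not new corruption
    }
    lastScanTime.put(blockId, nowMs);
    return true;
  }
}
{code}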

 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Attachments: HDFS-7548.patch


 When there is one datanode holding the block and that block happened to be
 corrupt, namenode would keep on trying to replicate the block repeatedly but 
 it would only report the block as corrupt only when the data block scanner 
 thread of the datanode picks up this bad block.
 Requesting improvement in namenode reporting so that corrupt replica would be 
 reported when there is only 1 replica and the replication of that replica 
 keeps on failing with the checksum error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7548) Corrupt block reporting delayed until datablock scanner thread detects it

2015-01-07 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268347#comment-14268347
 ] 

Nathan Roberts commented on HDFS-7548:
--

I think we need to handle the java.io.IOException: Input/output error case as 
well, since this is what we'll see when having trouble reading from disk.
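
A small sketch of how that case could be recognized; matching on the exception
message text is an assumption for illustration rather than anything the current
code does.

{code}
import java.io.IOException;

/** Sketch: treat a low-level EIO like a checksum failure for reporting purposes. */
class ReadFailureClassifier {
  static boolean looksLikeDiskError(IOException e) {
    String msg = e.getMessage();
    // "Input/output error" is the message referenced in the comment above.
    return msg != null && msg.contains("Input/output error");
  }
}
{code}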

 Corrupt block reporting delayed until datablock scanner thread detects it
 -

 Key: HDFS-7548
 URL: https://issues.apache.org/jira/browse/HDFS-7548
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Rushabh S Shah
Assignee: Rushabh S Shah
 Attachments: HDFS-7548.patch


 When there is one datanode holding the block and that block happened to be
 corrupt, namenode would keep on trying to replicate the block repeatedly but 
 it would only report the block as corrupt only when the data block scanner 
 thread of the datanode picks up this bad block.
 Requesting improvement in namenode reporting so that corrupt replica would be 
 reported when there is only 1 replica and the replication of that replica 
 keeps on failing with the checksum error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-2825) Add test hook to turn off the writer preferring its local DN

2014-12-16 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248733#comment-14248733
 ] 

Nathan Roberts commented on HDFS-2825:
--

[~tlipcon] - It might help. I'm a little concerned about the additional round 
trip and how we might enable it globally for an entire cluster. We'll be 
running some experiments with a global config for the block placement policy, 
and then go from there.

 Add test hook to turn off the writer preferring its local DN
 

 Key: HDFS-2825
 URL: https://issues.apache.org/jira/browse/HDFS-2825
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 0.23.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor
 Fix For: 0.24.0, 0.23.1

 Attachments: hdfs-2825.txt, hdfs-2825.txt


 Currently, the default block placement policy always places the first replica 
 in the pipeline on the local node if there is a valid DN running there. In 
 some network designs, within-rack bandwidth is never constrained so this 
 doesn't give much of an advantage. It would also be really useful to disable 
 this for MiniDFSCluster tests, since currently if you start a multi-DN 
 cluster and write with replication level 1, all of the replicas go to the 
 same DN.
 _[per discussion below, this was changed to not add a config, but only to add 
 a hook for testing]_



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-2825) Add test hook to turn off the writer preferring its local DN

2014-12-02 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232257#comment-14232257
 ] 

Nathan Roberts commented on HDFS-2825:
--

Any chance folks have altered their view on this and would accept a config 
option in the default placement policy? I'm happy to file the jira and put up 
the patch. It's very straightforward and makes sense for a lot of cases. imho, 
it actually makes more sense as the default than node-local (except for obvious 
things like Hbase). Today with node-local-first we wind up with some very hot 
nodes that dramatically slow down subsequent jobs, all because there is an 
entire copy of a large file on a single node of a rack.

 Add test hook to turn off the writer preferring its local DN
 

 Key: HDFS-2825
 URL: https://issues.apache.org/jira/browse/HDFS-2825
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 0.23.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor
 Fix For: 0.24.0, 0.23.1

 Attachments: hdfs-2825.txt, hdfs-2825.txt


 Currently, the default block placement policy always places the first replica 
 in the pipeline on the local node if there is a valid DN running there. In 
 some network designs, within-rack bandwidth is never constrained so this 
 doesn't give much of an advantage. It would also be really useful to disable 
 this for MiniDFSCluster tests, since currently if you start a multi-DN 
 cluster and write with replication level 1, all of the replicas go to the 
 same DN.
 _[per discussion below, this was changed to not add a config, but only to add 
 a hook for testing]_



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6658) Namenode memory optimization - Block replicas list

2014-07-16 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064095#comment-14064095
 ] 

Nathan Roberts commented on HDFS-6658:
--

{quote}
I guess my argument is that (in the short or medium term) we don't actually 
need to reduce the amount of RAM the NameNode uses. I've seen machines with 300 
GB of RAM, and sizes continue to increase at a steady clip every year. We do 
need to reduce the amount of Java heap that the NameNode uses, since otherwise 
we get 10 minute long GC pauses.
{quote}
This is a pretty sizable improvement though so it seems well worth considering. 
* One thing I'm concerned about is the increased RAM requirements that have 
been going on in the NN. For example, moving from 0.23 releases to 2.x releases 
requires about 9% more RAM (I'm assuming it's something similar when going from 
1.x to 2.x). This is a pretty big deal and can cause some folks to fail their 
upgrade if they were living close to the edge. In my opinion we need to be very 
careful whenever we increase the RAM requirements of the NN. For every increase 
there should be a corresponding optimization so the net increase stays as close 
to 0 as possible. Otherwise, some upgrades will certainly fail. 
* I'm not totally convinced of the long GC argument. It's true that a worst 
case full-gc will be much longer. However, isn't it also the case that we 
should almost never be doing worst case full-GCs? On a large and busy NN, we 
see a GC greater than 2 seconds maybe once every couple of days. Usually the 
big outliers are the result of a very large application doing something bad - 
in which case even if you solve the GC problem, something else is liable to 
cause the NN to be unresponsive. 


 Namenode memory optimization - Block replicas list 
 ---

 Key: HDFS-6658
 URL: https://issues.apache.org/jira/browse/HDFS-6658
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.4.1
Reporter: Amir Langer
Assignee: Amir Langer
 Attachments: Namenode Memory Optimizations - Block replicas list.docx


 Part of the memory consumed by every BlockInfo object in the Namenode is a 
 linked list of block references for every DatanodeStorageInfo (called 
 triplets). 
 We propose to change the way we store the list in memory. 
 Using primitive integer indexes instead of object references will reduce the 
 memory needed for every block replica (when compressed oops is disabled) and 
 in our new design the list overhead will be per DatanodeStorageInfo and not 
 per block replica.
 see attached design doc. for details and evaluation results.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6658) Namenode memory optimization - Block replicas list

2014-07-14 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060740#comment-14060740
 ] 

Nathan Roberts commented on HDFS-6658:
--

Maybe have a simple fragmentation metric and if it exceeds X% for an extended 
period of time (like hours), then clean it up. Yes, some client will have 
higher latency. But it's only once in many hours and I doubt it's for very long 
anyway (milliseconds). It's kind of a bizarre situation so I don't think we're 
in a hurry to clean it up, but it's also better if we don't let it sit around 
forever.
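
A sketch of the kind of metric being suggested; the 30% threshold, the 4-hour
window, and the used/allocated bookkeeping are all assumptions about the proposed
index-array replica list, not existing NameNode code.

{code}
/** Sketch: compact an index-based replica list only after it stays fragmented for hours. */
class FragmentationPolicy {
  private static final double MAX_FRAGMENTATION = 0.30;              // the "X%" above, assumed
  private static final long MAX_FRAGMENTED_MS = 4 * 60 * 60 * 1000L; // "like hours", assumed
  private long fragmentedSinceMs = -1;

  /** fragmentation = unused slots / allocated slots in the replica index array. */
  boolean shouldCompact(int usedSlots, int allocatedSlots, long nowMs) {
    double fragmentation = allocatedSlots == 0
        ? 0.0 : 1.0 - (double) usedSlots / allocatedSlots;
    if (fragmentation <= MAX_FRAGMENTATION) {
      fragmentedSinceMs = -1;    // healthy again; reset the clock
      return false;
    }
    if (fragmentedSinceMs < 0) {
      fragmentedSinceMs = nowMs; // just crossed the threshold; start the clock
    }
    return nowMs - fragmentedSinceMs >= MAX_FRAGMENTED_MS;
  }
}
{code}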

 Namenode memory optimization - Block replicas list 
 ---

 Key: HDFS-6658
 URL: https://issues.apache.org/jira/browse/HDFS-6658
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.4.1
Reporter: Amir Langer
Assignee: Amir Langer
 Attachments: Namenode Memory Optimizations - Block replicas list.docx


 Part of the memory consumed by every BlockInfo object in the Namenode is a 
 linked list of block references for every DatanodeStorageInfo (called 
 triplets). 
 We propose to change the way we store the list in memory. 
 Using primitive integer indexes instead of object references will reduce the 
 memory needed for every block replica (when compressed oops is disabled) and 
 in our new design the list overhead will be per DatanodeStorageInfo and not 
 per block replica.
 see attached design doc. for details and evaluation results.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6584) Support archival storage

2014-07-14 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14060815#comment-14060815
 ] 

Nathan Roberts commented on HDFS-6584:
--

As part of this work, would it make sense to extend the policy functionality so 
that we can control the BlockPlacementPolicy used during file creation? Here's 
the use case which we run into quite frequently: 1-2GB file generated by JobA 
is used as a distributed cache file for Job B which is rather large (several 
thousand tasks). Assuming a single task from JobA writes this file, an entire 
copy of the file will be on a single node, and in no other nodes in that same 
rack (default policy is 1st blk local, 2nd replica on remote rack, 3rd replica 
on same rack as 2nd). When a large job needs this distributed cache file, the 
node where there is a single copy will become a significant bottleneck and is 
likely to cause localization timeouts. This is with replication factors set to 
50+, so just increasing the replication factor does not solve this problem. It 
would be good if JobA could specify a BlockPlacementPolicy which would do 1st 
replica rack local, 2nd replica remote rack, 3rd replica same as 2nd (in 
general though it would be good if JobA could ask for any 1 of n placement 
policies). 
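
A standalone sketch of the requested placement (1st replica rack-local but off the
writer, 2nd on a remote rack, 3rd on the 2nd's rack). It assumes the writer's rack
has at least one other node and that at least one remote rack exists, and it is not
tied to the real BlockPlacementPolicy API.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Sketch: spread the first replica across the writer's rack instead of the writer's node. */
class RackLocalFirstPlacement {
  static class Node {
    final String name, rack;
    Node(String name, String rack) { this.name = name; this.rack = rack; }
  }

  static List<Node> chooseTargets(Node writer, List<Node> cluster, Random rand) {
    List<Node> sameRack = new ArrayList<>();
    List<Node> otherRacks = new ArrayList<>();
    for (Node n : cluster) {
      if (n.name.equals(writer.name)) continue;                 // never the writer itself
      if (n.rack.equals(writer.rack)) sameRack.add(n); else otherRacks.add(n);
    }
    List<Node> targets = new ArrayList<>();
    targets.add(sameRack.get(rand.nextInt(sameRack.size())));      // rack-local, not node-local
    Node second = otherRacks.get(rand.nextInt(otherRacks.size())); // remote rack for rack tolerance
    targets.add(second);
    List<Node> secondsRack = new ArrayList<>();
    for (Node n : otherRacks) {
      if (n != second && n.rack.equals(second.rack)) secondsRack.add(n);
    }
    if (!secondsRack.isEmpty()) {                                  // 3rd replica on the 2nd's rack
      targets.add(secondsRack.get(rand.nextInt(secondsRack.size())));
    }
    return targets;
  }
}
{code}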

 Support archival storage
 

 Key: HDFS-6584
 URL: https://issues.apache.org/jira/browse/HDFS-6584
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, namenode
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Attachments: HDFSArchivalStorageDesign20140623.pdf


 In most of the Hadoop clusters, as more and more data is stored for longer 
 time, the demand for storage is outstripping the compute. Hadoop needs a cost 
 effective and easy to manage solution to meet this demand for storage. 
 Current solution is:
 - Delete the old unused data. This comes at operational cost of identifying 
 unnecessary data and deleting them manually.
 - Add more nodes to the clusters. This adds along with storage capacity 
 unnecessary compute capacity to the cluster.
 Hadoop needs a solution to decouple growing storage capacity from compute 
 capacity. Nodes with higher density and less expensive storage with low 
 compute power are becoming available and can be used as cold storage in the 
 clusters. Based on policy the data from hot storage can be moved to cold 
 storage. Adding more nodes to the cold storage can grow the storage 
 independent of the compute capacity in the cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6658) Namenode memory optimization - Block replicas list

2014-07-10 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057858#comment-14057858
 ] 

Nathan Roberts commented on HDFS-6658:
--

Thanks Amir. Very attractive memory savings. 

 Namenode memory optimization - Block replicas list 
 ---

 Key: HDFS-6658
 URL: https://issues.apache.org/jira/browse/HDFS-6658
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.4.1
Reporter: Amir Langer
Assignee: Amir Langer
 Attachments: Namenode Memory Optimizations - Block replicas list.docx


 Part of the memory consumed by every BlockInfo object in the Namenode is a 
 linked list of block references for every DatanodeStorageInfo (called 
 triplets). 
 We propose to change the way we store the list in memory. 
 Using primitive integer indexes instead of object references will reduce the 
 memory needed for every block replica (when compressed oops is disabled) and 
 in our new design the list overhead will be per DatanodeStorageInfo and not 
 per block replica.
 See the attached design doc for details and evaluation results.
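
(To make the proposal concrete, a rough illustrative sketch of the int-index 
idea - not the actual design from the attached doc: next/prev pointers kept as 
primitive int slots in parallel arrays instead of per-block object references.)

{code}
import java.util.Arrays;

// Illustrative only: a doubly-linked list over block slots stored as
// primitive int indexes, avoiding two object references per block replica.
class IntLinkedBlockList {
  private final int[] next;  // next[slot] = following slot, -1 if none
  private final int[] prev;  // prev[slot] = preceding slot, -1 if none
  private int head = -1;

  IntLinkedBlockList(int capacity) {
    next = new int[capacity];
    prev = new int[capacity];
    Arrays.fill(next, -1);
    Arrays.fill(prev, -1);
  }

  /** Insert the block occupying the given slot at the head of the list. */
  void addHead(int slot) {
    next[slot] = head;
    prev[slot] = -1;
    if (head != -1) {
      prev[head] = slot;
    }
    head = slot;
  }
}
{code}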



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HDFS-6407) new namenode UI, lost ability to sort columns in datanode tab

2014-05-16 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-6407:


 Summary: new namenode UI, lost ability to sort columns in datanode 
tab
 Key: HDFS-6407
 URL: https://issues.apache.org/jira/browse/HDFS-6407
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0
Reporter: Nathan Roberts
Priority: Minor


The old UI supported clicking on a column header to sort on that column. The new 
UI seems to have dropped this very useful feature.





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6247) Avoid timeouts for replaceBlock() call by sending intermediate responses to Balancer

2014-04-16 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971837#comment-13971837
 ] 

Nathan Roberts commented on HDFS-6247:
--

+1. This will be better than the longish timeout that's in there currently. 
Would be good to shorten the read timeout back down once this sort of 
heartbeat is in place. 

 Avoid timeouts for replaceBlock() call by sending intermediate responses to 
 Balancer
 

 Key: HDFS-6247
 URL: https://issues.apache.org/jira/browse/HDFS-6247
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer, datanode
Affects Versions: 2.4.0
Reporter: Vinayakumar B
Assignee: Vinayakumar B

 Currently no response is sent from the target Datanode to the Balancer for 
 replaceBlock() calls.
 Since block movement for balancing is throttled, a complete block movement 
 can take time, and this could result in a timeout at the Balancer, which will 
 be trying to read the status message.
  
 To avoid this, while a replaceBlock() call is in progress the Datanode can 
 send IN_PROGRESS status messages to the Balancer, so that the Balancer does 
 not time out and treat the block movement as failed.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6166) revisit balancer so_timeout

2014-03-31 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-6166:
-

Attachment: HDFS-6166-branch23.patch

Patch for branch 23

 revisit balancer so_timeout 
 

 Key: HDFS-6166
 URL: https://issues.apache.org/jira/browse/HDFS-6166
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 3.0.0, 2.3.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
Priority: Blocker
 Fix For: 2.4.0

 Attachments: HDFS-6166-branch23.patch, HDFS-6166.patch


 HDFS-5806 changed the socket read timeout for the balancer connection to DN 
 to 60 seconds. This works as long as balancer bandwidth is such that it's 
 safe to assume that the DN will easily complete the operation within this 
 time. Obviously this isn't a good assumption. When this assumption isn't 
 valid, the balancer will timeout the cmd BUT it will then be out-of-sync with 
 the datanode (balancer thinks the DN has room to do more work, DN is still 
 working on the request and will fail any subsequent requests with threads 
 quota exceeded errors). This causes expensive NN traffic via getBlocks() and 
 also causes lots of WARNs in the balancer log.
 Unfortunately the protocol is such that it's impossible to tell if the DN is 
 busy working on replacing the block, OR is in bad shape and will never finish.
 So, in the interest of a small change to deal with both situations, I propose 
 the following two changes:
 * Crank up the socket read timeout to 20 minutes
 * Delay looking at a node for a bit if we did time out in this way (the DN 
 could still have xceiver threads working on the replace). 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6166) revisit balancer so_timeout

2014-03-28 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951461#comment-13951461
 ] 

Nathan Roberts commented on HDFS-6166:
--

Tested on a 400-node cluster with a bandwidth of 500K/sec. Verified that there 
are still occasional timeouts, BUT there is no longer a flood of thread quota 
exceeded warnings.

 revisit balancer so_timeout 
 

 Key: HDFS-6166
 URL: https://issues.apache.org/jira/browse/HDFS-6166
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 3.0.0, 2.3.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
Priority: Blocker
 Attachments: HDFS-6166.patch


 HDFS-5806 changed the socket read timeout for the balancer connection to DN 
 to 60 seconds. This works as long as balancer bandwidth is such that it's 
 safe to assume that the DN will easily complete the operation within this 
 time. Obviously this isn't a good assumption. When this assumption isn't 
 valid, the balancer will timeout the cmd BUT it will then be out-of-sync with 
 the datanode (balancer thinks the DN has room to do more work, DN is still 
 working on the request and will fail any subsequent requests with threads 
 quota exceeded errors). This causes expensive NN traffic via getBlocks() and 
 also causes lots of WARNs in the balancer log.
 Unfortunately the protocol is such that it's impossible to tell if the DN is 
 busy working on replacing the block, OR is in bad shape and will never finish.
 So, in the interest of a small change to deal with both situations, I propose 
 the following two changes:
 * Crank up the socket read timeout to 20 minutes
 * Delay looking at a node for a bit if we did time out in this way (the DN 
 could still have xceiver threads working on the replace). 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6166) revisit balancer so_timeout

2014-03-28 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951493#comment-13951493
 ] 

Nathan Roberts commented on HDFS-6166:
--

Maybe our two comments crossed in the mail. Yes, I tested internally. It's been 
running on a 400-node cluster for 1 day. I ran with bandwidths of 500K, 6MB, and 
20MB. With 500K there were timeouts, but no thread quota exceeded failures.

 revisit balancer so_timeout 
 

 Key: HDFS-6166
 URL: https://issues.apache.org/jira/browse/HDFS-6166
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 3.0.0, 2.3.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
Priority: Blocker
 Attachments: HDFS-6166.patch


 HDFS-5806 changed the socket read timeout for the balancer connection to DN 
 to 60 seconds. This works as long as balancer bandwidth is such that it's 
 safe to assume that the DN will easily complete the operation within this 
 time. Obviously this isn't a good assumption. When this assumption isn't 
 valid, the balancer will timeout the cmd BUT it will then be out-of-sync with 
 the datanode (balancer thinks the DN has room to do more work, DN is still 
 working on the request and will fail any subsequent requests with threads 
 quota exceeded errors). This causes expensive NN traffic via getBlocks() and 
 also causes lots of WARNs in the balancer log.
 Unfortunately the protocol is such that it's impossible to tell if the DN is 
 busy working on replacing the block, OR is in bad shape and will never finish.
 So, in the interest of a small change to deal with both situations, I propose 
 the following two changes:
 * Crank up the socket read timeout to 20 minutes
 * Delay looking at a node for a bit if we did time out in this way (the DN 
 could still have xceiver threads working on the replace). 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HDFS-6166) revisit balancer so_timeout

2014-03-28 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951509#comment-13951509
 ] 

Nathan Roberts commented on HDFS-6166:
--

The blocks don't have to be very large. There is a quota of 5 threads per DN; 
at the default bandwidth of 1MB/sec, it can take (block size) / (1MB/5) 
seconds to move a block (something like 640 seconds for a 128MB block). The 
bandwidth is dynamically settable and the block size is not constant either, so 
I went with the very simple approach that covers the normal situations.
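
(A quick back-of-the-envelope check of that arithmetic; the numbers come from 
the comment above, the class below is just illustrative.)

{code}
// Rough worked example: 5 mover threads sharing the default 1MB/sec balancer
// bandwidth means each block effectively moves at ~0.2MB/sec.
public class BlockMoveEstimate {
  public static void main(String[] args) {
    long blockSizeBytes = 128L * 1024 * 1024;      // 128MB block
    long bandwidthBytesPerSec = 1L * 1024 * 1024;  // 1MB/sec default bandwidth
    int threadsPerDn = 5;                          // per-DN thread quota

    double perThread = (double) bandwidthBytesPerSec / threadsPerDn;
    double seconds = blockSizeBytes / perThread;   // 640.0

    System.out.printf("~%.0f seconds to move one 128MB block%n", seconds);
  }
}
{code}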

 revisit balancer so_timeout 
 

 Key: HDFS-6166
 URL: https://issues.apache.org/jira/browse/HDFS-6166
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 3.0.0, 2.3.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
Priority: Blocker
 Attachments: HDFS-6166.patch


 HDFS-5806 changed the socket read timeout for the balancer connection to DN 
 to 60 seconds. This works as long as balancer bandwidth is such that it's 
 safe to assume that the DN will easily complete the operation within this 
 time. Obviously this isn't a good assumption. When this assumption isn't 
 valid, the balancer will timeout the cmd BUT it will then be out-of-sync with 
 the datanode (balancer thinks the DN has room to do more work, DN is still 
 working on the request and will fail any subsequent requests with threads 
 quota exceeded errors). This causes expensive NN traffic via getBlocks() and 
 also causes lots of WARNs in the balancer log.
 Unfortunately the protocol is such that it's impossible to tell if the DN is 
 busy working on replacing the block, OR is in bad shape and will never finish.
 So, in the interest of a small change to deal with both situations, I propose 
 the following two changes:
 * Crank up the socket read timeout to 20 minutes
 * Delay looking at a node for a bit if we did time out in this way (the DN 
 could still have xceiver threads working on the replace). 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HDFS-6166) revisit balancer so_timeout

2014-03-27 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-6166:


 Summary: revisit balancer so_timeout 
 Key: HDFS-6166
 URL: https://issues.apache.org/jira/browse/HDFS-6166
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 2.3.0, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
Priority: Blocker


HDFS-5806 changed the socket read timeout for the balancer connection to DN to 
60 seconds. This works as long as balancer bandwidth is such that it's safe to 
assume that the DN will easily complete the operation within this time. 
Obviously this isn't a good assumption. When this assumption isn't valid, the 
balancer will timeout the cmd BUT it will then be out-of-sync with the datanode 
(balancer thinks the DN has room to do more work, DN is still working on the 
request and will fail any subsequent requests with threads quota exceeded 
errors). This causes expensive NN traffic via getBlocks() and also causes lots 
of WARNs in the balancer log.

Unfortunately the protocol is such that it's impossible to tell if the DN is 
busy working on replacing the block, OR is in bad shape and will never finish.

So, in the interest of a small change to deal with both situations, I propose 
the following two changes:
* Crank up the socket read timeout to 20 minutes
* Delay looking at a node for a bit if we did time out in this way (the DN could 
still have xceiver threads working on the replace). 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6166) revisit balancer so_timeout

2014-03-27 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-6166:
-

Attachment: HDFS-6166.patch

proposed patch.

 revisit balancer so_timeout 
 

 Key: HDFS-6166
 URL: https://issues.apache.org/jira/browse/HDFS-6166
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 3.0.0, 2.3.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
Priority: Blocker
 Attachments: HDFS-6166.patch


 HDFS-5806 changed the socket read timeout for the balancer connection to DN 
 to 60 seconds. This works as long as balancer bandwidth is such that it's 
 safe to assume that the DN will easily complete the operation within this 
 time. Obviously this isn't a good assumption. When this assumption isn't 
 valid, the balancer will timeout the cmd BUT it will then be out-of-sync with 
 the datanode (balancer thinks the DN has room to do more work, DN is still 
 working on the request and will fail any subsequent requests with threads 
 quota exceeded errors). This causes expensive NN traffic via getBlocks() and 
 also causes lots of WARNs in the balancer log.
 Unfortunately the protocol is such that it's impossible to tell if the DN is 
 busy working on replacing the block, OR is in bad shape and will never finish.
 So, in the interest of a small change to deal with both situations, I propose 
 the following two changes:
 * Crank up the socket read timeout to 20 minutes
 * Delay looking at a node for a bit if we did time out in this way (the DN 
 could still have xceiver threads working on the replace). 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6166) revisit balancer so_timeout

2014-03-27 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-6166:
-

Target Version/s: 3.0.0, 2.4.0  (was: 2.4.0)
  Status: Patch Available  (was: Open)

 revisit balancer so_timeout 
 

 Key: HDFS-6166
 URL: https://issues.apache.org/jira/browse/HDFS-6166
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 2.3.0, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
Priority: Blocker
 Attachments: HDFS-6166.patch


 HDFS-5806 changed the socket read timeout for the balancer connection to DN 
 to 60 seconds. This works as long as balancer bandwidth is such that it's 
 safe to assume that the DN will easily complete the operation within this 
 time. Obviously this isn't a good assumption. When this assumption isn't 
 valid, the balancer will timeout the cmd BUT it will then be out-of-sync with 
 the datanode (balancer thinks the DN has room to do more work, DN is still 
 working on the request and will fail any subsequent requests with threads 
 quota exceeded errors). This causes expensive NN traffic via getBlocks() and 
 also causes lots of WARNs in the balancer log.
 Unfortunately the protocol is such that it's impossible to tell if the DN is 
 busy working on replacing the block, OR is in bad shape and will never finish.
 So, in the interest of a small change to deal with both situations, I propose 
 the following two changes:
 * Crank up the socket read timeout to 20 minutes
 * Delay looking at a node for a bit if we did time out in this way (the DN 
 could still have xceiver threads working on the replace). 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs

2014-02-26 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-5806:
-

Attachment: HDFS-5806-0.23.patch

0.23 version of patch

 balancer should set SoTimeout to avoid indefinite hangs
 ---

 Key: HDFS-5806
 URL: https://issues.apache.org/jira/browse/HDFS-5806
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 3.0.0, 2.2.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Fix For: 2.3.0

 Attachments: HDFS-5806-0.23.patch, HDFS-5806.patch


 Simple patch to avoid the balancer hanging when datanode stops responding to 
 requests. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5788) listLocatedStatus response can be very large

2014-01-22 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-5788:
-

Attachment: HDFS-5788.patch

patch for trunk.

 listLocatedStatus response can be very large
 

 Key: HDFS-5788
 URL: https://issues.apache.org/jira/browse/HDFS-5788
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.0.0, 0.23.10, 2.2.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-5788.patch


 Currently we limit the size of listStatus requests to a default of 1000 
 entries. This works fine except in the case of listLocatedStatus where the 
 location information can be quite large. As an example, a directory with 7000 
 entries, 4 blocks each, 3 way replication - a listLocatedStatus response is 
 over 1MB. This can chew up very large amounts of memory in the NN if lots of 
 clients try to do this simultaneously.
 Seems like it would be better if we also considered the amount of location 
 information being returned when deciding how many files to return.
 Patch will follow shortly.
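
(A quick order-of-magnitude check of the 1MB claim; the bytes-per-location 
figure below is purely an assumption to show the scale, not a measured value.)

{code}
// Order-of-magnitude check for the listLocatedStatus example above.
public class ResponseSizeEstimate {
  public static void main(String[] args) {
    int files = 7000;
    int blocksPerFile = 4;
    int replication = 3;
    long locations = (long) files * blocksPerFile * replication; // 84,000

    int assumedBytesPerLocation = 13; // assumption: rough serialized size per replica location
    long locationBytes = locations * assumedBytesPerLocation;    // ~1.09MB from locations alone

    System.out.println(locations + " locations, ~" + locationBytes + " bytes");
  }
}
{code}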



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5788) listLocatedStatus response can be very large

2014-01-22 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-5788:
-

Status: Patch Available  (was: Open)

 listLocatedStatus response can be very large
 

 Key: HDFS-5788
 URL: https://issues.apache.org/jira/browse/HDFS-5788
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.2.0, 0.23.10, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-5788.patch


 Currently we limit the size of listStatus requests to a default of 1000 
 entries. This works fine except in the case of listLocatedStatus where the 
 location information can be quite large. As an example, a directory with 7000 
 entries, 4 blocks each, 3 way replication - a listLocatedStatus response is 
 over 1MB. This can chew up very large amounts of memory in the NN if lots of 
 clients try to do this simultaneously.
 Seems like it would be better if we also considered the amount of location 
 information being returned when deciding how many files to return.
 Patch will follow shortly.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs

2014-01-22 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13879273#comment-13879273
 ] 

Nathan Roberts commented on HDFS-5806:
--

Andrew, thanks for taking a look. Sorry about not mentioning the testing. 

Didn't have great ideas on how to test. Basically did the following:
- Changed the balancer so that the SoTimeout was 1 second
- Changed the balancer so that the sleep time between iterations was 2 seconds
- Changed dispatch() within the balancer to randomly not send the request - this 
causes the response read to time out due to the SoTimeout
- Made sure TestBalancer still worked


 balancer should set SoTimeout to avoid indefinite hangs
 ---

 Key: HDFS-5806
 URL: https://issues.apache.org/jira/browse/HDFS-5806
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 3.0.0, 2.2.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-5806.patch


 Simple patch to avoid the balancer hanging when datanode stops responding to 
 requests. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs

2014-01-21 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-5806:


 Summary: balancer should set SoTimeout to avoid indefinite hangs
 Key: HDFS-5806
 URL: https://issues.apache.org/jira/browse/HDFS-5806
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 2.2.0, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


Simple patch to avoid the balancer hanging when datanode stops responding to 
requests. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs

2014-01-21 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-5806:
-

Attachment: HDFS-5806.patch

use setSoTimeout() to avoid read hangs.
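
(For illustration, a minimal sketch of the kind of change described; the host, 
port, and timeout values are placeholders, not the actual patch.)

{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal sketch: with a read timeout set, a blocked read() eventually throws
// SocketTimeoutException instead of hanging the balancer indefinitely.
public class SoTimeoutSketch {
  public static void main(String[] args) throws IOException {
    try (Socket sock = new Socket()) {
      sock.connect(new InetSocketAddress("datanode.example.com", 50010), 60_000);
      sock.setSoTimeout(60_000); // placeholder 60-second read timeout
      // ... issue the replaceBlock request and read the response here ...
    }
  }
}
{code}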

 balancer should set SoTimeout to avoid indefinite hangs
 ---

 Key: HDFS-5806
 URL: https://issues.apache.org/jira/browse/HDFS-5806
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 3.0.0, 2.2.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-5806.patch


 Simple patch to avoid the balancer hanging when datanode stops responding to 
 requests. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5806) balancer should set SoTimeout to avoid indefinite hangs

2014-01-21 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-5806:
-

Status: Patch Available  (was: Open)

 balancer should set SoTimeout to avoid indefinite hangs
 ---

 Key: HDFS-5806
 URL: https://issues.apache.org/jira/browse/HDFS-5806
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer
Affects Versions: 2.2.0, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts
 Attachments: HDFS-5806.patch


 Simple patch to avoid the balancer hanging when datanode stops responding to 
 requests. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (HDFS-5788) listLocatedStatus response can be very large

2014-01-16 Thread Nathan Roberts (JIRA)
Nathan Roberts created HDFS-5788:


 Summary: listLocatedStatus response can be very large
 Key: HDFS-5788
 URL: https://issues.apache.org/jira/browse/HDFS-5788
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.2.0, 0.23.10, 3.0.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts


Currently we limit the size of listStatus requests to a default of 1000 
entries. This works fine except in the case of listLocatedStatus where the 
location information can be quite large. As an example, a directory with 7000 
entries, 4 blocks each, 3 way replication - a listLocatedStatus response is 
over 1MB. This can chew up very large amounts of memory in the NN if lots of 
clients try to do this simultaneously.

Seems like it would be better if we also considered the amount of location 
information being returned when deciding how many files to return.

Patch will follow shortly.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HDFS-5788) listLocatedStatus response can be very large

2014-01-16 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13874107#comment-13874107
 ] 

Nathan Roberts commented on HDFS-5788:
--

A simple solution is:
Restrict the size to dfs.ls.limit (default 1000) files OR dfs.ls.limit block 
locations, whichever comes first (obviously always returning only whole 
entries, so we could send more than this number of locations).

Yes, it will require more RPCs. However, it would seem to lower the risk of a 
DoS.
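
(A rough sketch of the limiting loop described above; the names and structure 
are illustrative, not the actual NameNode code.)

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Illustrative only: stop a directory listing batch once either the entry
// limit or the block-location limit is reached, whichever comes first.
class ListingLimiter {
  static <E> List<E> limitBatch(List<E> children, int lsLimit,
                                ToIntFunction<E> locationCount) {
    List<E> batch = new ArrayList<>();
    int locations = 0;
    for (E entry : children) {
      batch.add(entry);                             // whole entries only
      locations += locationCount.applyAsInt(entry);
      if (batch.size() >= lsLimit || locations >= lsLimit) {
        break;                                      // caller fetches the rest with another RPC
      }
    }
    return batch;
  }
}
{code}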

 listLocatedStatus response can be very large
 

 Key: HDFS-5788
 URL: https://issues.apache.org/jira/browse/HDFS-5788
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.0.0, 0.23.10, 2.2.0
Reporter: Nathan Roberts
Assignee: Nathan Roberts

 Currently we limit the size of listStatus requests to a default of 1000 
 entries. This works fine except in the case of listLocatedStatus where the 
 location information can be quite large. As an example, a directory with 7000 
 entries, 4 blocks each, 3 way replication - a listLocatedStatus response is 
 over 1MB. This can chew up very large amounts of memory in the NN if lots of 
 clients try to do this simultaneously.
 Seems like it would be better if we also considered the amount of location 
 information being returned when deciding how many files to return.
 Patch will follow shortly.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-5477) Block manager as a service

2013-12-17 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-5477:
-

Attachment: Standalone BM.pdf

Re-attach standalone pdf to fix graphics.

 Block manager as a service
 --

 Key: HDFS-5477
 URL: https://issues.apache.org/jira/browse/HDFS-5477
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
 Attachments: Proposal.pdf, Proposal.pdf, Standalone BM.pdf, 
 Standalone BM.pdf


 The block manager needs to evolve towards having the ability to run as a 
 standalone service to improve NN vertical and horizontal scalability.  The 
 goal is reducing the memory footprint of the NN proper to support larger 
 namespaces, and improve overall performance by decoupling the block manager 
 from the namespace and its lock.  Ideally, a distinct BM will be transparent 
 to clients and DNs.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HDFS-5477) Block manager as a service

2013-12-11 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-5477:
-

Attachment: Proposal.pdf

Fix formatting problems in PDF.

 Block manager as a service
 --

 Key: HDFS-5477
 URL: https://issues.apache.org/jira/browse/HDFS-5477
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
 Attachments: Proposal.pdf, Proposal.pdf, Standalone BM.pdf


 The block manager needs to evolve towards having the ability to run as a 
 standalone service to improve NN vertical and horizontal scalability.  The 
 goal is reducing the memory footprint of the NN proper to support larger 
 namespaces, and improve overall performance by decoupling the block manager 
 from the namespace and its lock.  Ideally, a distinct BM will be transparent 
 to clients and DNs.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HDFS-5535) Umbrella jira for improved HDFS rolling upgrades

2013-12-02 Thread Nathan Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836540#comment-13836540
 ] 

Nathan Roberts commented on HDFS-5535:
--

Hi. Initial draft is still a little rough. Will try to get it up Tuesday for 
initial comments and suggestions.


 Umbrella jira for improved HDFS rolling upgrades
 

 Key: HDFS-5535
 URL: https://issues.apache.org/jira/browse/HDFS-5535
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, ha, hdfs-client, namenode
Affects Versions: 3.0.0, 2.2.0
Reporter: Nathan Roberts

 In order to roll a new HDFS release through a large cluster quickly and 
 safely, a few enhancements are needed in HDFS. An initial high-level design 
 document will be attached to this jira, and sub-jiras will itemize the 
 individual tasks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HDFS-5496) Make replication queue initialization asynchronous

2013-11-21 Thread Nathan Roberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Roberts updated HDFS-5496:
-

Issue Type: Sub-task  (was: Improvement)
Parent: HDFS-5535

 Make replication queue initialization asynchronous
 --

 Key: HDFS-5496
 URL: https://issues.apache.org/jira/browse/HDFS-5496
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Reporter: Kihwal Lee

 Today, initialization of the replication queues blocks safe mode exit and 
 certain HA state transitions. For a big namespace, this can take hundreds of 
 seconds with the FSNamesystem write lock held. During this time, important 
 requests (e.g. initial block reports, heartbeats, etc.) are blocked.
 The effect of delaying the initialization would be not starting replication 
 right away, but I think the benefit outweighs the cost. If we make it 
 asynchronous, the work per iteration should be limited, so that the lock 
 duration is capped.
 If full/incremental block reports and any other requests that modify block 
 state properly perform replication checks while the blocks are scanned and 
 the queues are populated in the background, every block will be processed 
 (some may be processed twice). The replication monitor should run even before 
 all blocks are processed.
 This will allow the namenode to exit safe mode and start serving immediately 
 even with a big namespace. It will also reduce the HA failover latency.
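
(A hedged sketch of the chunked, lock-bounded background scan described above; 
the class and method names are illustrative stand-ins, not the real 
FSNamesystem API.)

{code}
// Illustrative sketch: process the block map in bounded chunks so the
// namesystem write lock is only held briefly per iteration.
class AsyncReplicationQueueInitializer implements Runnable {
  private static final int BLOCKS_PER_ITERATION = 10000; // assumed tuning knob

  interface Namesystem {                 // stand-in for FSNamesystem
    void writeLock();
    void writeUnlock();
    void checkReplication(long blockId); // queues under/over-replicated blocks
  }

  interface BlockIterator {              // stand-in for a block map cursor
    boolean hasNext();
    long next();
  }

  private final Namesystem ns;
  private final BlockIterator blocks;

  AsyncReplicationQueueInitializer(Namesystem ns, BlockIterator blocks) {
    this.ns = ns;
    this.blocks = blocks;
  }

  @Override
  public void run() {
    while (blocks.hasNext()) {
      ns.writeLock();
      try {
        for (int i = 0; i < BLOCKS_PER_ITERATION && blocks.hasNext(); i++) {
          ns.checkReplication(blocks.next());
        }
      } finally {
        ns.writeUnlock();
      }
      // Lock is released between chunks so block reports and heartbeats proceed.
    }
  }
}
{code}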



--
This message was sent by Atlassian JIRA
(v6.1#6144)

