[jira] [Commented] (HDFS-7721) The HDFS BlockScanner may run fast during the first hour

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302876#comment-14302876
 ] 

Hadoop QA commented on HDFS-7721:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12696032/HDFS-7721.001.patch
  against trunk revision 8cb4731.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9408//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9408//console

This message is automatically generated.

> The HDFS BlockScanner may run fast during the first hour
> 
>
> Key: HDFS-7721
> URL: https://issues.apache.org/jira/browse/HDFS-7721
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Colin Patrick McCabe
> Attachments: HDFS-7721.001.patch
>
>
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9375//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9365//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> {code}
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestBlockScanner.testScanRateLimit(TestBlockScanner.java:439)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7726) Parse and check the configuration settings of edit log to prevent runtime errors

2015-02-02 Thread Tianyin Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianyin Xu updated HDFS-7726:
-
Attachment: check_config_EditLogTailer.patch

The refined patch

> Parse and check the configuration settings of edit log to prevent runtime 
> errors
> 
>
> Key: HDFS-7726
> URL: https://issues.apache.org/jira/browse/HDFS-7726
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Tianyin Xu
>Priority: Minor
> Attachments: check_config_EditLogTailer.patch, 
> check_config_val_EditLogTailer.patch.1
>
>
> 
> Problem
> -
> Similar to the following two issues addressed in 2.7.0,
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> The edit-log-related configuration settings should be checked in the 
> constructor rather than being applied directly at runtime; otherwise, wrong 
> values cause runtime failures.
> Take "dfs.ha.tail-edits.period" as an example. Currently in 
> EditLogTailer.java, its value is not checked but directly used in doWork(), 
> as in the following code snippet. Any negative value causes an 
> IllegalArgumentException (which is not caught) and impairs the component. 
> {code:title=EditLogTailer.java|borderStyle=solid}
> private void doWork() {
>     ......
>     Thread.sleep(sleepTimeMs);
>     ......
> }
> {code}
> Another example is "dfs.ha.log-roll.rpc.timeout". Right now, we use getInt() 
> to parse the value at runtime in the getActiveNodeProxy() function, which is 
> called by doWork(), as shown below. Any erroneous setting (e.g., an 
> ill-formatted integer) causes an exception.
> {code:title=EditLogTailer.java|borderStyle=solid}
> private NamenodeProtocol getActiveNodeProxy() throws IOException {
>     ......
>     int rpcTimeout = conf.getInt(
>         DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
>         DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
>     ......
> }
> {code}
> 
> Solution (the attached patch)
> -
> The idea of the attached patch is to move the parsing and checking logic 
> into the constructor so that errors are exposed at initialization rather 
> than remaining latent at runtime (same as YARN-2165 and YARN-2166).
> I'm not familiar with the 2.7.0 implementation. It seems there are checking 
> utilities such as the validatePositiveNonZero function in YARN-2165. If so, 
> we could use that one to make the checking more systematic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7726) Parse and check the configuration settings of edit log to prevent runtime errors

2015-02-02 Thread Tianyin Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302822#comment-14302822
 ] 

Tianyin Xu commented on HDFS-7726:
--

Thanks a lot, Zhe! I refined the patch based on your feedback.

1. rpcTimeout could be negative. It's passed to the RPC protocol 
(org.apache.hadoop.ipc.RPC.java), which does not explicitly define the range of 
rpcTimeout. I tested negative values and it works fine. Let me know if you 
want rpcTimeout to be checked as positive as well. 
2. Yes, the new patch uses Preconditions.checkArgument (a minimal sketch of 
this style of check follows below).
3. Fixed!
4. Removed the line.
5. Now it ends with .patch :)
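
For reference, a minimal sketch of the constructor-time check discussed above. 
It is illustrative only, not the exact patch: the tail-edits default (60s) and 
the use of the raw key string are assumptions, while the log-roll keys are the 
ones quoted in the description.

{code:title=Constructor-time validation (illustrative sketch)|borderStyle=solid}
// Illustrative sketch only -- not the actual patch.
import com.google.common.base.Preconditions;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;

class EditLogTailerConfigSketch {
  private final long sleepTimeMs;
  private final int rpcTimeout;

  EditLogTailerConfigSketch(Configuration conf) {
    // Assumed default of 60 seconds, used here only for illustration.
    long periodSec = conf.getLong("dfs.ha.tail-edits.period", 60);
    // Fail fast at initialization instead of inside doWork().
    Preconditions.checkArgument(periodSec >= 0,
        "Negative value configured for dfs.ha.tail-edits.period: %s", periodSec);
    sleepTimeMs = periodSec * 1000;

    // getInt() throws NumberFormatException for ill-formatted values, so a bad
    // setting now surfaces at construction time rather than inside doWork().
    rpcTimeout = conf.getInt(
        DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
        DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
  }
}
{code}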

> Parse and check the configuration settings of edit log to prevent runtime 
> errors
> 
>
> Key: HDFS-7726
> URL: https://issues.apache.org/jira/browse/HDFS-7726
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Tianyin Xu
>Priority: Minor
> Attachments: check_config_EditLogTailer.patch, 
> check_config_val_EditLogTailer.patch.1
>
>
> 
> Problem
> -
> Similar to the following two issues addressed in 2.7.0,
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> The edit-log-related configuration settings should be checked in the 
> constructor rather than being applied directly at runtime; otherwise, wrong 
> values cause runtime failures.
> Take "dfs.ha.tail-edits.period" as an example. Currently in 
> EditLogTailer.java, its value is not checked but directly used in doWork(), 
> as in the following code snippet. Any negative value causes an 
> IllegalArgumentException (which is not caught) and impairs the component. 
> {code:title=EditLogTailer.java|borderStyle=solid}
> private void doWork() {
>     ......
>     Thread.sleep(sleepTimeMs);
>     ......
> }
> {code}
> Another example is "dfs.ha.log-roll.rpc.timeout". Right now, we use getInt() 
> to parse the value at runtime in the getActiveNodeProxy() function, which is 
> called by doWork(), as shown below. Any erroneous setting (e.g., an 
> ill-formatted integer) causes an exception.
> {code:title=EditLogTailer.java|borderStyle=solid}
> private NamenodeProtocol getActiveNodeProxy() throws IOException {
>     ......
>     int rpcTimeout = conf.getInt(
>         DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
>         DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
>     ......
> }
> {code}
> 
> Solution (the attached patch)
> -
> The idea of the attached patch is to move the parsing and checking logic 
> into the constructor so that errors are exposed at initialization rather 
> than remaining latent at runtime (same as YARN-2165 and YARN-2166).
> I'm not familiar with the 2.7.0 implementation. It seems there are checking 
> utilities such as the validatePositiveNonZero function in YARN-2165. If so, 
> we could use that one to make the checking more systematic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7730) knox-env.sh script should exit with a proper error message if JAVA is not set.

2015-02-02 Thread J.Andreina (JIRA)
J.Andreina created HDFS-7730:


 Summary: knox-env.sh script should exit with a proper error message 
if JAVA is not set. 
 Key: HDFS-7730
 URL: https://issues.apache.org/jira/browse/HDFS-7730
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: J.Andreina
Assignee: J.Andreina


The knox-env.sh script does not exit when JAVA is not set.

Hence other scripts (which invoke knox-env.sh to set JAVA), when executed in an 
environment that does not contain JAVA, continue running and log unhelpful 
messages such as the following:
{noformat}
Execution of gateway.sh:

nohup: invalid option -- 'j'
Try `nohup --help' for more information.
{noformat}
{noformat}
Execution of knoxcli.sh :

./knoxcli.sh: line 61: -jar: command not found
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7709) Fix Findbug Warnings

2015-02-02 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302812#comment-14302812
 ] 

Rakesh R commented on HDFS-7709:


Thanks [~cmccabe] for the comments.

Found the MAPREDUCE-6225 jira to fix 
https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html,
 so I am skipping that part here.

Attached a patch fixing {{hadoop-hdfs-httpfs}} and {{hadoop-rumen}}. Please 
review, thanks!

> Fix Findbug Warnings
> 
>
> Key: HDFS-7709
> URL: https://issues.apache.org/jira/browse/HDFS-7709
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Rakesh R
>Assignee: Rakesh R
> Attachments: HDFS-7709.patch
>
>
> There are many Findbugs warnings related to the following warning types: 
> - DM_DEFAULT_ENCODING, 
> - RCN_REDUNDANT_NULLCHECK_OF_NONNULL_VALUE,
> - RCN_REDUNDANT_NULLCHECK_WOULD_HAVE_BEEN_A_NPE
> https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs-httpfs.html
> https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-rumen.html
> https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7709) Fix Findbug Warnings

2015-02-02 Thread Rakesh R (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh R updated HDFS-7709:
---
Status: Patch Available  (was: Open)

> Fix Findbug Warnings
> 
>
> Key: HDFS-7709
> URL: https://issues.apache.org/jira/browse/HDFS-7709
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Rakesh R
>Assignee: Rakesh R
> Attachments: HDFS-7709.patch
>
>
> There are many Findbugs warnings related to the following warning types: 
> - DM_DEFAULT_ENCODING, 
> - RCN_REDUNDANT_NULLCHECK_OF_NONNULL_VALUE,
> - RCN_REDUNDANT_NULLCHECK_WOULD_HAVE_BEEN_A_NPE
> https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs-httpfs.html
> https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-rumen.html
> https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302811#comment-14302811
 ] 

Yongjun Zhang commented on HDFS-7707:
-

I found that the test failure of {{TestCommitBlockSynchronization}} is caused 
by incorrect behaviour of {{parent.addChild(file)}} on a mocked 
{{INodeDirectory}} instance {{parent}}. That is, after {{parent.addChild(file)}}, 
{{parent}} doesn't have the child {{file}}. I wonder if anyone knows why.

I worked around it by creating a new {{INodeDirectory}} instance, and uploaded 
patch rev 002.

Thanks.


> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7707.001.patch, HDFS-7707.002.patch, 
> reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to the HDFS-6825 fix, if dirX was deleted recursively, an OP_CLOSE could 
> get into the edit log for fileY under dirX, thus corrupting the edit log 
> (restarting the NN with that edit log would fail). 
> What HDFS-6825 does to fix this issue is detect whether fileY is already 
> deleted by checking the ancestor dirs on its path: if any of them doesn't 
> exist, then fileY is already deleted, and OP_CLOSE is not written to the edit 
> log for the file.
> For this new edit log corruption, what I found was that the client first 
> deleted dirX recursively, then created another dir with exactly the same name 
> as dirX right away. Because HDFS-6825 relies on the namespace check (whether 
> dirX exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this check, so an OP_CLOSE for the already deleted 
> file gets into the edit log due to delayed block removal.
> We need a more robust way to detect whether a file has been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7709) Fix Findbug Warnings

2015-02-02 Thread Rakesh R (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh R updated HDFS-7709:
---
Attachment: HDFS-7709.patch

> Fix Findbug Warnings
> 
>
> Key: HDFS-7709
> URL: https://issues.apache.org/jira/browse/HDFS-7709
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Rakesh R
>Assignee: Rakesh R
> Attachments: HDFS-7709.patch
>
>
> There are many Findbugs warnings related to the following warning types: 
> - DM_DEFAULT_ENCODING, 
> - RCN_REDUNDANT_NULLCHECK_OF_NONNULL_VALUE,
> - RCN_REDUNDANT_NULLCHECK_WOULD_HAVE_BEEN_A_NPE
> https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs-httpfs.html
> https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-rumen.html
> https://builds.apache.org/job/PreCommit-HADOOP-Build/5542//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-7707:

Attachment: HDFS-7707.002.patch

> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7707.001.patch, HDFS-7707.002.patch, 
> reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to the HDFS-6825 fix, if dirX was deleted recursively, an OP_CLOSE could 
> get into the edit log for fileY under dirX, thus corrupting the edit log 
> (restarting the NN with that edit log would fail). 
> What HDFS-6825 does to fix this issue is detect whether fileY is already 
> deleted by checking the ancestor dirs on its path: if any of them doesn't 
> exist, then fileY is already deleted, and OP_CLOSE is not written to the edit 
> log for the file.
> For this new edit log corruption, what I found was that the client first 
> deleted dirX recursively, then created another dir with exactly the same name 
> as dirX right away. Because HDFS-6825 relies on the namespace check (whether 
> dirX exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this check, so an OP_CLOSE for the already deleted 
> file gets into the edit log due to delayed block removal.
> We need a more robust way to detect whether a file has been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7630) TestConnCache hardcode block size without considering native OS

2015-02-02 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302799#comment-14302799
 ] 

Arpit Agarwal commented on HDFS-7630:
-

Hi Sam, I understand your concern but I think for these tests it is not an 
issue. The tests should have no dependency on OS page size.

I believe the choice of 4096 as the block size was arbitrary and it being the 
same as the default page size on x86/x64 is just a coincidence.



> TestConnCache hardcode block size without considering native OS
> ---
>
> Key: HDFS-7630
> URL: https://issues.apache.org/jira/browse/HDFS-7630
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: sam liu
>Assignee: sam liu
> Attachments: HDFS-7630.001.patch, HDFS-7630.002.patch
>
>
> TestConnCache hardcodes the block size with 'BLOCK_SIZE = 4096'; however, this 
> is incorrect on some platforms. For example, on the Power platform, the 
> correct value is 65536.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7729) Add logic to DFSOutputStream to support writing a file in striping layout

2015-02-02 Thread Li Bo (JIRA)
Li Bo created HDFS-7729:
---

 Summary: Add logic to DFSOutputStream to support writing a file in 
striping layout 
 Key: HDFS-7729
 URL: https://issues.apache.org/jira/browse/HDFS-7729
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Li Bo
Assignee: Li Bo


If a client wants to directly write a file in striping layout, we need to add 
some logic to DFSOutputStream. DFSOutputStream needs multiple DataStreamers, 
each writing one cell of a stripe to a remote datanode. 
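
As a rough illustration of the idea (hypothetical names only, not the actual 
DFSOutputStream/DataStreamer API), the client-side logic could hand each cell 
of a stripe to one of several streamer queues in round-robin order:

{code:title=Cell dispatch to multiple streamers (illustrative sketch)|borderStyle=solid}
// Hypothetical sketch; the real logic would live inside DFSOutputStream.
import java.util.List;
import java.util.concurrent.BlockingQueue;

class StripedCellDispatcher {
  private final List<BlockingQueue<byte[]>> streamerQueues; // one queue per streamer/datanode
  private int nextStreamer = 0;

  StripedCellDispatcher(List<BlockingQueue<byte[]>> streamerQueues) {
    this.streamerQueues = streamerQueues;
  }

  /** Hands one full cell to the next streamer in round-robin order. */
  void enqueueCell(byte[] cell) throws InterruptedException {
    streamerQueues.get(nextStreamer).put(cell);
    nextStreamer = (nextStreamer + 1) % streamerQueues.size();
  }
}
{code}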



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6753) When one of the disks is full and all the configured volumes are unhealthy, the Datanode does not consider it a failure and the datanode process does not shut down.

2015-02-02 Thread Srikanth Upputuri (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302784#comment-14302784
 ] 

Srikanth Upputuri commented on HDFS-6753:
-

A write request to the DN first checks for a disk volume with available space 
and then proceeds to create an rbw file on it. The 'check disk error' is 
triggered when the rbw file cannot be created. But if a volume with sufficient 
space cannot be found, the request just throws an exception without initiating 
'check disk error'. This is reasonable, because if there is no space available 
on any volume, the DN may still be able to serve read requests, so 'not enough 
space' is not a sufficient condition for DN shutdown. However, if all the 
volumes then become faulty, a subsequent read request will detect this 
condition and shut down the DN anyway. Therefore there is no need to fix this 
behavior.
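
A self-contained illustration of the decision described above (hypothetical 
types, not the real DataNode classes): running out of space throws without a 
disk check, while an I/O failure creating the rbw file triggers the check that 
can eventually shut the DN down.

{code:title=Write-path volume selection (illustrative sketch)|borderStyle=solid}
import java.io.IOException;
import java.util.List;

class VolumeFlowSketch {
  static class Volume { long available; boolean healthy; }
  static class OutOfSpaceException extends IOException {}

  static Volume chooseVolume(List<Volume> volumes, long blockSize)
      throws OutOfSpaceException {
    for (Volume v : volumes) {
      if (v.healthy && v.available >= blockSize) {
        return v;
      }
    }
    // No disk check here: the DN keeps running and can still serve reads.
    throw new OutOfSpaceException();
  }

  static void writeBlock(List<Volume> volumes, long blockSize) throws IOException {
    Volume v = chooseVolume(volumes, blockSize);
    try {
      createRbwFile(v);
    } catch (IOException e) {
      // Only a failure to create the rbw file triggers the disk check, which is
      // what can detect that all volumes have failed and shut the DN down.
      checkDiskError();
      throw e;
    }
  }

  static void createRbwFile(Volume v) throws IOException { /* create the block file on v */ }

  static void checkDiskError() { /* scan volumes; shut down if too many have failed */ }
}
{code}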

> When one of the disks is full and all the configured volumes are unhealthy, 
> the Datanode does not consider it a failure and the datanode process does not 
> shut down.
> ---
>
> Key: HDFS-6753
> URL: https://issues.apache.org/jira/browse/HDFS-6753
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: J.Andreina
>Assignee: Srikanth Upputuri
>
> Env Details :
> =
> Cluster has 3 Datanode
> Cluster installed with "Rex" user
> dfs.datanode.failed.volumes.tolerated  = 3
> dfs.blockreport.intervalMsec  = 18000
> dfs.datanode.directoryscan.interval = 120
> DN_XX1.XX1.XX1.XX1 data dir = 
> /mnt/tmp_Datanode,/home/REX/data/dfs1/data,/home/REX/data/dfs2/data,/opt/REX/dfs/data
>  
>  
> /home/REX/data/dfs1/data,/home/REX/data/dfs2/data,/opt/REX/dfs/data - 
> permission is denied ( hence DN considered the volume as failed )
>  
> Expected behavior is observed when disk is not full:
> 
>  
> Step 1: Change the permissions of /mnt/tmp_Datanode to root
>  
> Step 2: Perform write operations ( DN detects that all configured volumes 
> have failed and shuts down )
>  
> Scenario 1: 
> ===
>  
> Step 1 : Make the /mnt/tmp_Datanode disk full and change the permissions to root
> Step 2 : Perform client write operations ( a disk full exception is thrown, 
> but the Datanode does not shut down, even though all the configured volumes 
> have failed)
>  
> {noformat}
>  
> 2014-07-21 14:10:52,814 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> XX1.XX1.XX1.XX1:50010:DataXceiver error processing WRITE_BLOCK operation  
> src: /XX2.XX2.XX2.XX2:10106 dst: /XX1.XX1.XX1.XX1:50010
>  
> org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Out of space: The 
> volume with the most available space (=4096 B) is less than the block size 
> (=134217728 B).
>  
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy.chooseVolume(RoundRobinVolumeChoosingPolicy.java:60)
>  
> {noformat}
>  
> Observations :
> ==
> 1. Write operations do not shut down the Datanode, even though all the 
> configured volumes have failed ( when one of the disks is full and permission 
> is denied for all the other disks)
>  
> 2. Directory scanning fails, but the DN still does not shut down
>  
>  
>  
> {noformat}
>  
> 2014-07-21 14:13:00,180 WARN 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: Exception occured 
> while compiling report: 
>  
> java.io.IOException: Invalid directory or I/O error occurred for dir: 
> /mnt/tmp_Datanode/current/BP-1384489961-XX2.XX2.XX2.XX2-845784615183/current/finalized
>  
> at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1164)
>  
> at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:596)
>  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7630) TestConnCache hardcode block size without considering native OS

2015-02-02 Thread sam liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302777#comment-14302777
 ] 

sam liu commented on HDFS-7630:
---

Hi Arpit, they pass on the Power platform so far, but it would be better if we 
could account for operating-system page-size differences in the relevant 
tests.
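
If we do go that way, a minimal sketch of deriving the test block size from the 
OS page size could look like the following. It assumes 
NativeIO.POSIX.getCacheManipulator().getOperatingSystemPageSize() is usable in 
the test environment; that is an assumption for illustration, not part of the 
current patch.

{code:title=Page-size-aware block size (illustrative sketch)|borderStyle=solid}
// Sketch only: derive the test BLOCK_SIZE from the OS page size instead of
// hard-coding 4096.
import org.apache.hadoop.io.nativeio.NativeIO;

public class PageAlignedBlockSize {
  /** Returns a block size that is at least 4096 and a multiple of the OS page size. */
  static int getTestBlockSize() {
    long pageSize = NativeIO.POSIX.getCacheManipulator().getOperatingSystemPageSize();
    // Typically 4096 on x86/x64 kernels and 65536 on Power with 64K pages.
    return (int) Math.max(4096L, pageSize);
  }
}
{code}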

> TestConnCache hardcode block size without considering native OS
> ---
>
> Key: HDFS-7630
> URL: https://issues.apache.org/jira/browse/HDFS-7630
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: sam liu
>Assignee: sam liu
> Attachments: HDFS-7630.001.patch, HDFS-7630.002.patch
>
>
> TestConnCache hardcodes the block size with 'BLOCK_SIZE = 4096'; however, this 
> is incorrect on some platforms. For example, on the Power platform, the 
> correct value is 65536.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7137) HDFS Federation -- Adding a new Namenode to an existing HDFS cluster Document Has an Error

2015-02-02 Thread Kiran Kumar M R (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kiran Kumar M R resolved HDFS-7137.
---
   Resolution: Duplicate
Fix Version/s: 3.0.0

Closing as resolved, since the patch given in HDFS-7667 fixes this issue.

> HDFS Federation  -- Adding a new Namenode to an existing HDFS cluster 
> Document Has an Error 
> 
>
> Key: HDFS-7137
> URL: https://issues.apache.org/jira/browse/HDFS-7137
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Reporter: zhangyubiao
>Assignee: Kiran Kumar M R
>Priority: Minor
>  Labels: documentation
> Fix For: 3.0.0
>
>
> In Document 
> HDFS Federation  -- Adding a new Namenode to an existing HDFS cluster
> > $HADOOP_PREFIX_HOME/bin/hdfs dfadmin -refreshNameNode 
> > : 
> should be 
> > $HADOOP_PREFIX_HOME/bin/hdfs dfsadmin -refreshNameNode 
> > : 
> It is just missing the 's' in dfadmin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7137) HDFS Federation -- Adding a new Namenode to an existing HDFS cluster Document Has an Error

2015-02-02 Thread Kiran Kumar M R (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302761#comment-14302761
 ] 

Kiran Kumar M R commented on HDFS-7137:
---

The fix for this issue is already given in HDFS-7667

> HDFS Federation  -- Adding a new Namenode to an existing HDFS cluster 
> Document Has an Error 
> 
>
> Key: HDFS-7137
> URL: https://issues.apache.org/jira/browse/HDFS-7137
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Reporter: zhangyubiao
>Assignee: Kiran Kumar M R
>Priority: Minor
>  Labels: documentation
>
> In Document 
> HDFS Federation  -- Adding a new Namenode to an existing HDFS cluster
> > $HADOOP_PREFIX_HOME/bin/hdfs dfadmin -refreshNameNode 
> > : 
> should be 
> > $HADOOP_PREFIX_HOME/bin/hdfs dfsadmin -refreshNameNode 
> > : 
> It is just missing the 's' in dfadmin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HDFS-7137) HDFS Federation -- Adding a new Namenode to an existing HDFS cluster Document Has an Error

2015-02-02 Thread Kiran Kumar M R (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kiran Kumar M R reassigned HDFS-7137:
-

Assignee: Kiran Kumar M R

> HDFS Federation  -- Adding a new Namenode to an existing HDFS cluster 
> Document Has an Error 
> 
>
> Key: HDFS-7137
> URL: https://issues.apache.org/jira/browse/HDFS-7137
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Reporter: zhangyubiao
>Assignee: Kiran Kumar M R
>Priority: Minor
>  Labels: documentation
>
> In Document 
> HDFS Federation  -- Adding a new Namenode to an existing HDFS cluster
> > $HADOOP_PREFIX_HOME/bin/hdfs dfadmin -refreshNameNode 
> > : 
> should be 
> > $HADOOP_PREFIX_HOME/bin/hdfs dfsadmin -refreshNameNode 
> > : 
> It is just missing the 's' in dfadmin. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302737#comment-14302737
 ] 

Hadoop QA commented on HDFS-7707:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12696056/HDFS-7707.001.patch
  against trunk revision 8cb4731.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.TestCommitBlockSynchronization

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9406//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9406//console

This message is automatically generated.

> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7707.001.patch, reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to the HDFS-6825 fix, if dirX was deleted recursively, an OP_CLOSE could 
> get into the edit log for fileY under dirX, thus corrupting the edit log 
> (restarting the NN with that edit log would fail). 
> What HDFS-6825 does to fix this issue is detect whether fileY is already 
> deleted by checking the ancestor dirs on its path: if any of them doesn't 
> exist, then fileY is already deleted, and OP_CLOSE is not written to the edit 
> log for the file.
> For this new edit log corruption, what I found was that the client first 
> deleted dirX recursively, then created another dir with exactly the same name 
> as dirX right away. Because HDFS-6825 relies on the namespace check (whether 
> dirX exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this check, so an OP_CLOSE for the already deleted 
> file gets into the edit log due to delayed block removal.
> We need a more robust way to detect whether a file has been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7018) Implement C interface for libhdfs3

2015-02-02 Thread Zhanwei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302724#comment-14302724
 ] 

Zhanwei Wang commented on HDFS-7018:


Added a new patch:
1) revert the API change of {{hdfsOpenFile}} in {{hdfs.h}}
2) improve the comment of {{hdfsGetLastError}}
3) rename {{Strdup}} to {{CopyString}}

> Implement C interface for libhdfs3
> --
>
> Key: HDFS-7018
> URL: https://issues.apache.org/jira/browse/HDFS-7018
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Reporter: Zhanwei Wang
>Assignee: Zhanwei Wang
> Attachments: HDFS-7018-pnative.002.patch, 
> HDFS-7018-pnative.003.patch, HDFS-7018-pnative.004.patch, 
> HDFS-7018-pnative.005.patch, HDFS-7018.patch
>
>
> Implement C interface for libhdfs3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7018) Implement C interface for libhdfs3

2015-02-02 Thread Zhanwei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanwei Wang updated HDFS-7018:
---
Attachment: HDFS-7018-pnative.005.patch

> Implement C interface for libhdfs3
> --
>
> Key: HDFS-7018
> URL: https://issues.apache.org/jira/browse/HDFS-7018
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Reporter: Zhanwei Wang
>Assignee: Zhanwei Wang
> Attachments: HDFS-7018-pnative.002.patch, 
> HDFS-7018-pnative.003.patch, HDFS-7018-pnative.004.patch, 
> HDFS-7018-pnative.005.patch, HDFS-7018.patch
>
>
> Implement C interface for libhdfs3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7728) Avoid updating quota usage while loading edits

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302707#comment-14302707
 ] 

Hadoop QA commented on HDFS-7728:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12696062/HDFS-7728.000.patch
  against trunk revision 8cb4731.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
  org.apache.hadoop.fs.TestHDFSFileContextMainOperations

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestDistributedFileSystem

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9407//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9407//console

This message is automatically generated.

> Avoid updating quota usage while loading edits
> --
>
> Key: HDFS-7728
> URL: https://issues.apache.org/jira/browse/HDFS-7728
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jing Zhao
>Assignee: Jing Zhao
> Attachments: HDFS-7728.000.patch
>
>
> Per the discussion 
> [here|https://issues.apache.org/jira/browse/HDFS-7611?focusedCommentId=14292454&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14292454],
>  we currently call {{INode#addSpaceConsumed}} during file/dir/snapshot 
> deletion, even if we are still in the edits loading process. This is 
> unnecessary and can cause issues like HDFS-7611. We should collect the quota 
> change and call {{FSDirectory#updateCount}} at the end of the operation.
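
To illustrate the general pattern only (hypothetical types, not the real 
FSDirectory/INode code): deltas are collected while the operation proceeds and 
applied in a single update at the end, instead of updating quota usage for 
every removed node while loading edits.

{code:title=Deferred quota update (illustrative sketch)|borderStyle=solid}
class QuotaDelta {
  long namespace;      // change in file/dir count
  long storagespace;   // change in consumed bytes

  void add(long ns, long ss) {
    namespace += ns;
    storagespace += ss;
  }
}

class DeletionSketch {
  /** Collects the quota change of a whole delete and applies it once. */
  void deleteSubtree(Iterable<long[]> removedNodes /* {ns, ss} per node */) {
    QuotaDelta delta = new QuotaDelta();
    for (long[] n : removedNodes) {
      delta.add(-n[0], -n[1]);   // collect; do not update usage per node
    }
    applyToAncestors(delta);     // single update at the end of the operation
  }

  void applyToAncestors(QuotaDelta delta) {
    // In the real code this would correspond to the FSDirectory#updateCount call.
  }
}
{code}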



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7630) TestConnCache hardcode block size without considering native OS

2015-02-02 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302674#comment-14302674
 ] 

Arpit Agarwal commented on HDFS-7630:
-

Hi Sam, thank you for the clarification. Let's fix only those tests which are 
confirmed to fail due to OS page size assumptions. 

Do any of the tests in this patch fail on power platform without the change?

> TestConnCache hardcode block size without considering native OS
> ---
>
> Key: HDFS-7630
> URL: https://issues.apache.org/jira/browse/HDFS-7630
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: sam liu
>Assignee: sam liu
> Attachments: HDFS-7630.001.patch, HDFS-7630.002.patch
>
>
> TestConnCache hardcodes the block size with 'BLOCK_SIZE = 4096'; however, this 
> is incorrect on some platforms. For example, on the Power platform, the 
> correct value is 65536.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7712) Switch blockStateChangeLog to use slf4j

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302672#comment-14302672
 ] 

Hadoop QA commented on HDFS-7712:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12696031/hdfs-7712.004.patch
  against trunk revision 8acc5e9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1251 javac 
compiler warnings (more than the trunk's current 1190 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot
  
org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9402//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9402//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9402//console

This message is automatically generated.

> Switch blockStateChangeLog to use slf4j
> ---
>
> Key: HDFS-7712
> URL: https://issues.apache.org/jira/browse/HDFS-7712
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Andrew Wang
>Assignee: Andrew Wang
>Priority: Minor
> Attachments: hdfs-7712.001.patch, hdfs-7712.002.patch, 
> hdfs-7712.003.patch, hdfs-7712.004.patch
>
>
> As pointed out in HDFS-7706, updating blockStateChangeLog to use slf4j will 
> save a lot of string construction costs.
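
For illustration only (not the actual blockStateChangeLog code): the saving 
comes from slf4j's parameterized logging, which defers message construction 
until the log-level check passes.

{code:title=Parameterized logging (illustrative)|borderStyle=solid}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BlockStateLogExample {
  private static final Logger LOG =
      LoggerFactory.getLogger(BlockStateLogExample.class);

  void logStateChange(String block, String state) {
    // Old style: the argument string is concatenated even when DEBUG is disabled.
    //   LOG.debug("BLOCK* " + block + " moved to state " + state);

    // slf4j style: the message is only formatted if DEBUG is actually enabled.
    LOG.debug("BLOCK* {} moved to state {}", block, state);
  }
}
{code}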



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7721) The HDFS BlockScanner may run fast during the first hour

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302671#comment-14302671
 ] 

Hadoop QA commented on HDFS-7721:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12696032/HDFS-7721.001.patch
  against trunk revision 8acc5e9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1253 javac 
compiler warnings (more than the trunk's current 1190 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.namenode.TestFileTruncate
  
org.apache.hadoop.hdfs.server.namenode.ha.TestDFSZKFailoverController

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.cli.TestCacheAdminCLI

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9403//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9403//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9403//console

This message is automatically generated.

> The HDFS BlockScanner may run fast during the first hour
> 
>
> Key: HDFS-7721
> URL: https://issues.apache.org/jira/browse/HDFS-7721
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Colin Patrick McCabe
> Attachments: HDFS-7721.001.patch
>
>
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9375//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9365//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> {code}
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestBlockScanner.testScanRateLimit(TestBlockScanner.java:439)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7715) Implement the Hitchhiker erasure coding algorithm

2015-02-02 Thread dandantu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302663#comment-14302663
 ] 

dandantu commented on HDFS-7715:


Ok, Thank you

> Implement the Hitchhiker erasure coding algorithm
> -
>
> Key: HDFS-7715
> URL: https://issues.apache.org/jira/browse/HDFS-7715
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Zhe Zhang
>Assignee: dandantu
>
> [Hitchhiker | 
> http://www.eecs.berkeley.edu/~nihar/publications/Hitchhiker_SIGCOMM14.pdf] is 
> a new erasure coding algorithm developed as a research project at UC 
> Berkeley. It has been shown to reduce network traffic and disk I/O by 25% and 
> 45% during data reconstruction. This JIRA aims to introduce Hitchhiker to the 
> HDFS-EC framework, as one of the pluggable codec algorithms.
> The existing implementation is based on HDFS-RAID. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7726) Parse and check the configuration settings of edit log to prevent runtime errors

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302655#comment-14302655
 ] 

Hadoop QA commented on HDFS-7726:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12696023/check_config_val_EditLogTailer.patch.1
  against trunk revision 8acc5e9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestHAStateTransitions

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9401//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9401//console

This message is automatically generated.

> Parse and check the configuration settings of edit log to prevent runtime 
> errors
> 
>
> Key: HDFS-7726
> URL: https://issues.apache.org/jira/browse/HDFS-7726
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Tianyin Xu
>Priority: Minor
> Attachments: check_config_val_EditLogTailer.patch.1
>
>
> 
> Problem
> -
> Similar to the following two issues addressed in 2.7.0,
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> The edit-log-related configuration settings should be checked in the 
> constructor rather than being applied directly at runtime; otherwise, wrong 
> values cause runtime failures.
> Take "dfs.ha.tail-edits.period" as an example. Currently in 
> EditLogTailer.java, its value is not checked but directly used in doWork(), 
> as in the following code snippet. Any negative value causes an 
> IllegalArgumentException (which is not caught) and impairs the component. 
> {code:title=EditLogTailer.java|borderStyle=solid}
> private void doWork() {
>     ......
>     Thread.sleep(sleepTimeMs);
>     ......
> }
> {code}
> Another example is "dfs.ha.log-roll.rpc.timeout". Right now, we use getInt() 
> to parse the value at runtime in the getActiveNodeProxy() function, which is 
> called by doWork(), as shown below. Any erroneous setting (e.g., an 
> ill-formatted integer) causes an exception.
> {code:title=EditLogTailer.java|borderStyle=solid}
> private NamenodeProtocol getActiveNodeProxy() throws IOException {
>     ......
>     int rpcTimeout = conf.getInt(
>         DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
>         DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
>     ......
> }
> {code}
> 
> Solution (the attached patch)
> -
> The idea of the attached patch is to move the parsing and checking logic 
> into the constructor so that errors are exposed at initialization rather 
> than remaining latent at runtime (same as YARN-2165 and YARN-2166).
> I'm not familiar with the 2.7.0 implementation. It seems there are checking 
> utilities such as the validatePositiveNonZero function in YARN-2165. If so, 
> we could use that one to make the checking more systematic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7613) Block placement policy for erasure coding groups

2015-02-02 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302643#comment-14302643
 ] 

Zhe Zhang commented on HDFS-7613:
-

[~Ditta] Thanks again for your interest. As [~zhangyongxyz] has already started 
working on this issue, please share your thoughts under this JIRA.

> Block placement policy for erasure coding groups
> 
>
> Key: HDFS-7613
> URL: https://issues.apache.org/jira/browse/HDFS-7613
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Zhe Zhang
>Assignee: Yong Zhang
>
> Blocks in an erasure coding group should be placed in different failure 
> domains -- different DataNodes at the minimum, and different racks ideally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HDFS-7613) Block placement policy for erasure coding groups

2015-02-02 Thread Yong Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yong Zhang reassigned HDFS-7613:


Assignee: Yong Zhang  (was: Zhe Zhang)

> Block placement policy for erasure coding groups
> 
>
> Key: HDFS-7613
> URL: https://issues.apache.org/jira/browse/HDFS-7613
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Zhe Zhang
>Assignee: Yong Zhang
>
> Blocks in an erasure coding group should be placed in different failure 
> domains -- different DataNodes at the minimum, and different racks ideally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7718) DFSClient objects created by AbstractFileSystem objects created by FileContext are not closed and results in thread leakage

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302627#comment-14302627
 ] 

Hadoop QA commented on HDFS-7718:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12696019/HDFS-7718.2.patch
  against trunk revision 8acc5e9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9400//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9400//console

This message is automatically generated.

> DFSClient objects created by AbstractFileSystem objects created by 
> FileContext are not closed and results in thread leakage
> ---
>
> Key: HDFS-7718
> URL: https://issues.apache.org/jira/browse/HDFS-7718
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7718.1.patch, HDFS-7718.2.patch
>
>
> Currently, the {{FileContext}} class used by clients (e.g. 
> {{YARNRunner}}) creates a new {{AbstractFileSystem}} object on 
> initialization, which creates a new {{DFSClient}} object, which in turn 
> creates a KeyProvider object. If encryption is turned on and https is 
> turned on, the KeyProvider implementation (the {{KMSClientProvider}}) will 
> create a {{ReloadingX509TrustManager}} thread per instance; these threads are 
> never killed and can lead to a thread leak.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7726) Parse and check the configuration settings of edit log to prevent runtime errors

2015-02-02 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302616#comment-14302616
 ] 

Zhe Zhang commented on HDFS-7726:
-

Good catch [~tianyin]. It makes sense to catch config errors at initialization time.

# I think we should check {{rpcTimeout}} is positive as well
# Shouldn't we use {{Preconditions.checkArgument}}?

Nits:
# We usually should keep each line below 80 chars
# Lines 7~8 of the patch are unnecessary
# It'd be nice if the patch is suffixed '.patch' :)

> Parse and check the configuration settings of edit log to prevent runtime 
> errors
> 
>
> Key: HDFS-7726
> URL: https://issues.apache.org/jira/browse/HDFS-7726
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Tianyin Xu
>Priority: Minor
> Attachments: check_config_val_EditLogTailer.patch.1
>
>
> 
> Problem
> -
> Similar to the following two issues addressed in 2.7.0,
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> The edit-log-related configuration settings should be checked in the 
> constructor rather than being applied directly at runtime; otherwise, wrong 
> values cause runtime failures.
> Take "dfs.ha.tail-edits.period" as an example. Currently in 
> EditLogTailer.java, its value is not checked but directly used in doWork(), 
> as in the following code snippet. Any negative value causes an 
> IllegalArgumentException (which is not caught) and impairs the component. 
> {code:title=EditLogTailer.java|borderStyle=solid}
> private void doWork() {
>     ......
>     Thread.sleep(sleepTimeMs);
>     ......
> }
> {code}
> Another example is "dfs.ha.log-roll.rpc.timeout". Right now, we use getInt() 
> to parse the value at runtime in the getActiveNodeProxy() function, which is 
> called by doWork(), as shown below. Any erroneous setting (e.g., an 
> ill-formatted integer) causes an exception.
> {code:title=EditLogTailer.java|borderStyle=solid}
> private NamenodeProtocol getActiveNodeProxy() throws IOException {
>     ......
>     int rpcTimeout = conf.getInt(
>         DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
>         DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
>     ......
> }
> {code}
> 
> Solution (the attached patch)
> -
> The idea of the attached patch is to move the parsing and checking logic 
> into the constructor so that errors are exposed at initialization rather 
> than remaining latent at runtime (same as YARN-2165 and YARN-2166).
> I'm not familiar with the 2.7.0 implementation. It seems there are checking 
> utilities such as the validatePositiveNonZero function in YARN-2165. If so, 
> we could use that one to make the checking more systematic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302609#comment-14302609
 ] 

Yi Liu commented on HDFS-7285:
--

{quote}
# Once the client accumulates 6*64KB data, it does not flush the data to the 
DNs. The client buffers the data and starts buffering the next 6*64KB stripe.
# Once the client accumulates 1024 / 64 = 16 stripes – that is 1MB for each DN 
– it flushes out the data to DNs.
# Once the data flushed to each DN reaches 128MB – that is 128MB * 6 = 768MB 
data overall – it allocates a new block group from NN.
{quote}
Yes, it makes sense now. Thanks.
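
As a quick sanity check of the arithmetic quoted above (illustrative constants 
only, not actual HDFS-EC code):

{code:title=Striping buffer arithmetic|borderStyle=solid}
public class StripingMath {
  public static void main(String[] args) {
    final long CELL = 64 * 1024;                         // 64KB cell per DataNode
    final int DATA_BLOCKS = 6;                           // 6 data blocks per stripe
    final long STRIPE = DATA_BLOCKS * CELL;              // 384KB buffered per stripe
    final long FLUSH_PER_DN = 1024 * 1024;               // flush once 1MB is queued per DN
    final long STRIPES_PER_FLUSH = FLUSH_PER_DN / CELL;  // = 16 stripes
    final long BLOCK = 128L * 1024 * 1024;               // 128MB block per DN
    final long GROUP = DATA_BLOCKS * BLOCK;              // = 768MB data per block group
    System.out.printf("stripe=%dKB, stripes/flush=%d, group=%dMB%n",
        STRIPE / 1024, STRIPES_PER_FLUSH, GROUP / (1024 * 1024));
  }
}
{code}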

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce storage overhead without sacrificing 
> data reliability, compared to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed-Solomon coding, we can tolerate the loss of 4 
> blocks, with a storage overhead of only 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contributed packages in HDFS but has been removed since Hadoop 2.0 
> for maintenance reasons. The drawbacks are: 1) it sits on top of HDFS and 
> depends on MapReduce to do encoding and decoding tasks; 2) it can only be 
> used for cold files that are not intended to be appended anymore; 3) the pure 
> Java EC coding implementation is extremely slow in practical use. Due to 
> these, it might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, making it self-contained and 
> independently maintained. This design lays the EC feature on top of storage 
> type support and aims to be compatible with existing HDFS features like 
> caching, snapshots, encryption, and high availability. The design will also 
> support different EC coding schemes, implementations, and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. the 
> Intel ISA-L library), an implementation can greatly improve the performance 
> of EC encoding/decoding and make the EC solution even more attractive. We 
> will post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302602#comment-14302602
 ] 

Kai Zheng commented on HDFS-7285:
-

I'm not sure a storage policy can cover all the cases and forms we're going to 
support, considering stripe support. I guess an EC zone might not hurt. You're 
right about the restriction for an EC zone: yes, a file in a zone should not be 
moved outside it without the necessary transformation first.

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without sacrifice 
> of data reliability, comparing to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed Solomon coding, we can allow loss of 4 blocks, 
> with storage overhead only being 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contribute packages in HDFS but had been removed since Hadoop 2.0 
> for maintain reason. The drawbacks are: 1) it is on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are intended not to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, makes it self-contained and 
> independently maintained. This design lays the EC feature on the storage type 
> support and considers compatible with existing HDFS features like caching, 
> snapshot, encryption, high availability and etc. This design will also 
> support different EC coding schemes, implementations and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. Intel 
> ISA-L library), an implementation can greatly improve the performance of EC 
> encoding/decoding and makes the EC solution even more attractive. We will 
> post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302598#comment-14302598
 ] 

Yi Liu commented on HDFS-7285:
--

{quote}
Yes we have EC zones, each zone actually represents a folder path and 
associates with an EC schema. Using the schema all the files in the zone will 
be in the form defined by it.
{quote}
OK, I see. That's fine with me. It's similar to storage policies for directories 
and files, so I don't think we need the concept of a zone here. What struck me is 
that a zone carries restrictions; for example, files in an encryption zone cannot 
be renamed to folders outside the zone, and so on.

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without sacrifice 
> of data reliability, comparing to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed Solomon coding, we can allow loss of 4 blocks, 
> with storage overhead only being 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contribute packages in HDFS but had been removed since Hadoop 2.0 
> for maintain reason. The drawbacks are: 1) it is on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are intended not to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, makes it self-contained and 
> independently maintained. This design lays the EC feature on the storage type 
> support and considers compatible with existing HDFS features like caching, 
> snapshot, encryption, high availability and etc. This design will also 
> support different EC coding schemes, implementations and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. Intel 
> ISA-L library), an implementation can greatly improve the performance of EC 
> encoding/decoding and makes the EC solution even more attractive. We will 
> post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302591#comment-14302591
 ] 

Zhe Zhang commented on HDFS-7285:
-

bq. I think it's incorrect. For example, suppose we have a file whose length is 
128MB. If we use a 6+3 schema and the EC stripe cell size is 64KB, then we need 
(128*1024K) / (6*64K) = 342 block groups. 
Aah, I see where the confusion came from. Sorry that the design doc didn't clearly 
explain the different parameters. When the client writes to a striped file, the 
following three events happen:
# Once the client accumulates 6*64KB data, it does _not_ flush the data to the 
DNs. The client buffers the data and starts buffering the next 6*64KB stripe.
# Once the client accumulates {{1024 / 64 = 16}} stripes -- that is 1MB for 
each DN -- it flushes out the data to DNs.
# Once the data flushed to each DN reaches 128MB -- that is {{128MB * 6 = 
768MB}} data overall -- it allocates a *new block group* from NN.

Section 2.1 of the QFS [paper | 
http://www.vldb.org/pvldb/vol6/p1092-ovsiannikov.pdf] has a pretty detailed 
explanation too. 
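
To make the three thresholds above concrete, here is a quick arithmetic sketch 
(illustrative only; it assumes the 6-data-block schema, 64KB cell, 1MB per-DN 
flush and 128MB block size discussed in this thread, and the class and constant 
names are made up for the example):
{code}
// Illustrative arithmetic only -- not client code.
public class StripedWriteMath {
  public static void main(String[] args) {
    final long CELL = 64 * 1024;                 // 64KB stripe cell (assumed)
    final int  DATA_BLOCKS = 6;                  // 6+3 schema, 6 data blocks
    final long FLUSH_PER_DN = 1024 * 1024;       // 1MB flushed to each DN
    final long BLOCK_SIZE = 128L * 1024 * 1024;  // 128MB block on each DN

    long stripe = DATA_BLOCKS * CELL;            // 6 * 64KB = 384KB buffered per stripe
    long stripesPerFlush = FLUSH_PER_DN / CELL;  // 1024 / 64 = 16 stripes per flush
    long blockGroup = DATA_BLOCKS * BLOCK_SIZE;  // 128MB * 6 = 768MB per block group

    System.out.printf("stripe = %d KB, flush every %d stripes, block group = %d MB%n",
        stripe / 1024, stripesPerFlush, blockGroup / (1024 * 1024));
  }
}
{code}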

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without sacrifice 
> of data reliability, comparing to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed Solomon coding, we can allow loss of 4 blocks, 
> with storage overhead only being 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contribute packages in HDFS but had been removed since Hadoop 2.0 
> for maintain reason. The drawbacks are: 1) it is on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are intended not to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, makes it self-contained and 
> independently maintained. This design lays the EC feature on the storage type 
> support and considers compatible with existing HDFS features like caching, 
> snapshot, encryption, high availability and etc. This design will also 
> support different EC coding schemes, implementations and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. Intel 
> ISA-L library), an implementation can greatly improve the performance of EC 
> encoding/decoding and makes the EC solution even more attractive. We will 
> post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7728) Avoid updating quota usage while loading edits

2015-02-02 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-7728:

Attachment: HDFS-7728.000.patch

> Avoid updating quota usage while loading edits
> --
>
> Key: HDFS-7728
> URL: https://issues.apache.org/jira/browse/HDFS-7728
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jing Zhao
>Assignee: Jing Zhao
> Attachments: HDFS-7728.000.patch
>
>
> Per the discussion 
> [here|https://issues.apache.org/jira/browse/HDFS-7611?focusedCommentId=14292454&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14292454],
>  currently we call {{INode#addSpaceConsumed}} while file/dir/snapshot 
> deletion, even if this is still in the edits loading process. This is 
> unnecessary and can cause issue like HDFS-7611. We should collect quota 
> change and call {{FSDirectory#updateCount}} at the end of the operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7728) Avoid updating quota usage while loading edits

2015-02-02 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-7728:

Status: Patch Available  (was: Open)

> Avoid updating quota usage while loading edits
> --
>
> Key: HDFS-7728
> URL: https://issues.apache.org/jira/browse/HDFS-7728
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jing Zhao
>Assignee: Jing Zhao
> Attachments: HDFS-7728.000.patch
>
>
> Per the discussion 
> [here|https://issues.apache.org/jira/browse/HDFS-7611?focusedCommentId=14292454&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14292454],
>  currently we call {{INode#addSpaceConsumed}} while file/dir/snapshot 
> deletion, even if this is still in the edits loading process. This is 
> unnecessary and can cause issue like HDFS-7611. We should collect quota 
> change and call {{FSDirectory#updateCount}} at the end of the operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7520) checknative should display a nicer error message when openssl support is not compiled in

2015-02-02 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302587#comment-14302587
 ] 

Anu Engineer commented on HDFS-7520:


-1 tests included - Since this moves a BUILD variable definition to its proper 
location, no new tests are really required. This was tested by running native 
builds with this change.

> checknative should display a nicer error message when openssl support is not 
> compiled in
> 
>
> Key: HDFS-7520
> URL: https://issues.apache.org/jira/browse/HDFS-7520
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Colin Patrick McCabe
>Assignee: Anu Engineer
> Attachments: HDFS-7520.001.patch
>
>
> checknative should display a nicer error message when openssl support is not 
> compiled in.  Currently, it displays this:
> {code}
> [cmccabe@keter hadoop]$ hadoop checknative
> 14/12/12 14:08:43 INFO bzip2.Bzip2Factory: Successfully loaded & initialized 
> native-bzip2 library system-native
> 14/12/12 14:08:43 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> Native library checking:
> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib:true /lib64/libz.so.1
> snappy:  true /usr/lib64/libsnappy.so.1
> lz4: true revision:99
> bzip2:   true /lib64/libbz2.so.1
> openssl: false org.apache.hadoop.crypto.OpensslCipher.initIDs()V
> {code}
> Instead, we should display something like this, if openssl is not supported 
> by the current build:
> {code}
> openssl: false Hadoop was built without openssl support.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7728) Avoid updating quota usage while loading edits

2015-02-02 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-7728:

Attachment: (was: HDFS-7728.000.patch)

> Avoid updating quota usage while loading edits
> --
>
> Key: HDFS-7728
> URL: https://issues.apache.org/jira/browse/HDFS-7728
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jing Zhao
>Assignee: Jing Zhao
> Attachments: HDFS-7728.000.patch
>
>
> Per the discussion 
> [here|https://issues.apache.org/jira/browse/HDFS-7611?focusedCommentId=14292454&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14292454],
>  currently we call {{INode#addSpaceConsumed}} while file/dir/snapshot 
> deletion, even if this is still in the edits loading process. This is 
> unnecessary and can cause issue like HDFS-7611. We should collect quota 
> change and call {{FSDirectory#updateCount}} at the end of the operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7349) Support DFS command for the EC encoding

2015-02-02 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302584#comment-14302584
 ] 

Kai Zheng commented on HDFS-7349:
-

Sorry I missed this. Will look at it today.

> Support DFS command for the EC encoding
> ---
>
> Key: HDFS-7349
> URL: https://issues.apache.org/jira/browse/HDFS-7349
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Vinayakumar B
>Assignee: Vinayakumar B
> Attachments: HDFS-7349-001.patch, HDFS-7349-002.patch
>
>
> Support implementation of the following commands
> *hdfs dfs -convertToEC *
>: Converts all blocks under this path to EC form (if not already in 
> EC form, and if can be coded).
> *hdfs dfs -convertToRep *
>: Converts all blocks under this path to be replicated form.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7411) Refactor and improve decommissioning logic into DecommissionManager

2015-02-02 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302585#comment-14302585
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7411:
---

> I think it's pretty common for us to change the behavior of the system when 
> the behavior change is a strict improvement. Keeping around inferior behavior 
> just for the purpose of consistency seems rather pointless.

First of all, we have never shown that the new behavior is strictly better.  It 
is a hypothesis.  No?

> For example, it used to be the case that fsimage transfers ...

The fsimage transfer change is an internal protocol change.  The http interface 
is not a public API.  However, the conf property discussed here is public.

> Similarly, when we find ways that CPU performance or memory usage in the NN 
> can be improved, ...

The change here is not like that.  It changes the scheme from node-based to 
block-based.  It is not making the node-based decommission faster.

> As Andrew Wang has already described, the new behavior should be both more 
> performant and more predictable. ...

As mentioned previously, it is a hypothesis. That is why Andrew described it as 
"should be".

> Refactor and improve decommissioning logic into DecommissionManager
> ---
>
> Key: HDFS-7411
> URL: https://issues.apache.org/jira/browse/HDFS-7411
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.5.1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, 
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, 
> hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, 
> hdfs-7411.009.patch, hdfs-7411.010.patch
>
>
> Would be nice to split out decommission logic from DatanodeManager to 
> DecommissionManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302583#comment-14302583
 ] 

Kai Zheng commented on HDFS-7285:
-

Yes, we have EC zones; each zone actually represents a folder path and is 
associated with an EC schema. All the files in the zone will be stored in the 
form defined by that schema.

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without sacrifice 
> of data reliability, comparing to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed Solomon coding, we can allow loss of 4 blocks, 
> with storage overhead only being 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contribute packages in HDFS but had been removed since Hadoop 2.0 
> for maintain reason. The drawbacks are: 1) it is on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are intended not to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, makes it self-contained and 
> independently maintained. This design lays the EC feature on the storage type 
> support and considers compatible with existing HDFS features like caching, 
> snapshot, encryption, high availability and etc. This design will also 
> support different EC coding schemes, implementations and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. Intel 
> ISA-L library), an implementation can greatly improve the performance of EC 
> encoding/decoding and makes the EC solution even more attractive. We will 
> post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6651) Deletion failure can leak inodes permanently

2015-02-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302576#comment-14302576
 ] 

Hudson commented on HDFS-6651:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6985 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6985/])
HDFS-6651. Deletion failure can leak inodes permanently. Contributed by Jing 
Zhao. (wheat9: rev 8cb473124c1cf1c6f68ead7bde06558ebf7ce47e)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestXAttrWithSnapshot.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/DirectoryWithSnapshotFeature.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestRenameWithSnapshots.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/DirectorySnapshottableFeature.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestAclWithSnapshot.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestSnapshotDeletion.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/FileWithSnapshotFeature.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeSymlink.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirRenameOp.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirDeleteOp.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INode.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeWithAdditionalFields.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeFile.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeDirectory.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeMap.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/snapshot/TestNestedSnapshots.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/INodeReference.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/FileDiffList.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/AbstractINodeDiffList.java


> Deletion failure can leak inodes permanently
> 
>
> Key: HDFS-6651
> URL: https://issues.apache.org/jira/browse/HDFS-6651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Jing Zhao
>Priority: Critical
> Fix For: 2.7.0
>
> Attachments: HDFS-6651.000.patch, HDFS-6651.001.patch, 
> HDFS-6651.002.patch
>
>
> As discussed in HDFS-6618, if a deletion of tree fails in the middle, any 
> collected inodes and blocks will not be removed from {{INodeMap}} and 
> {{BlocksMap}}. 
> Since fsimage is saved by iterating over {{INodeMap}}, the leak will persist 
> across name node restart. Although blanked out inodes will not have reference 
> to blocks, blocks will still refer to the inode as {{BlockCollection}}. As 
> long as it is not null, blocks will live on. The leaked blocks from blanked 
> out inodes will go away after restart.
> Options (when delete fails in the middle)
> - Complete the partial delete: edit log the partial delete and remove inodes 
> and blocks. 
> - Somehow undo the partial delete.
> - Check quota for snapshot diff beforehand for the whole subtree.
> - Ignore quota check during delete even if snapshot is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302560#comment-14302560
 ] 

Yi Liu edited comment on HDFS-7285 at 2/3/15 12:39 AM:
---

{quote}
Small strip cell size for small files in a zone, and large strip cell size for 
large files in another zone
{quote}
Right, for large files, using a large stripe cell size can decrease NN memory 
consumption; otherwise the EC feature will cause a big NN memory issue.
BTW, I have one thing that is not very clear to me: do we need the concept of a 
"zone"? What is the definition of "zone"?


was (Author: hitliuyi):
{quote}
Small strip cell size for small files in a zone, and large strip cell size for 
large files in another zone
{quote}
Right, for large files, using a large stripe cell size can decrease NN memory 
consumption; otherwise the EC feature will cause a big NN memory issue.
BTW, we have one thing that is not very clear: do we need the concept of a 
"zone"? What is the definition of "zone"?

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without sacrifice 
> of data reliability, comparing to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed Solomon coding, we can allow loss of 4 blocks, 
> with storage overhead only being 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contribute packages in HDFS but had been removed since Hadoop 2.0 
> for maintain reason. The drawbacks are: 1) it is on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are intended not to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, makes it self-contained and 
> independently maintained. This design lays the EC feature on the storage type 
> support and considers compatible with existing HDFS features like caching, 
> snapshot, encryption, high availability and etc. This design will also 
> support different EC coding schemes, implementations and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. Intel 
> ISA-L library), an implementation can greatly improve the performance of EC 
> encoding/decoding and makes the EC solution even more attractive. We will 
> post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302550#comment-14302550
 ] 

Yi Liu edited comment on HDFS-7285 at 2/3/15 12:37 AM:
---

{quote}
The number of block groups is actually unrelated to the cell size (e.g. 64KB). 
For example, under a 6+3 schema, any file smaller than 9 blocks will have 1 
block group.
A smaller cell size better handles small files. But data locality is degraded – 
for example, it might be hard to fit MapReduce records into 64KB cells.
{quote}
I think it's incorrect for a normal file. For example, suppose we have a file 
whose length is 128MB. If we use a 6+3 schema and the EC stripe cell size is 
64KB, then we need (128*1024K) / (6*64K) = 342 block groups. But if the EC stripe 
cell size is 8MB, then we need 128 / (6*8) = 3 block groups. 
Obviously, a small stripe cell size will cost much more NN memory for normal/big 
files, even if we only store the first EC block of each EC block group in the NN. 


was (Author: hitliuyi):
{quote}
The number of block groups is actually unrelated to the cell size (e.g. 64KB). 
For example, under a 6+3 schema, any file smaller than 9 blocks will have 1 
block group.
A smaller cell size better handles small files. But data locality is degraded – 
for example, it might be hard to fit MapReduce records into 64KB cells.
{quote}
I think it's incorrect. For example, suppose we have a file whose length is 
128MB. If we use a 6+3 schema and the EC stripe cell size is 64KB, then we need 
(128*1024K) / (6*64K) = 342 block groups. But if the EC stripe cell size is 8MB, 
then we need 128 / (6*8) = 3 block groups. 
Obviously, a small stripe cell size will cost much more NN memory, even if we 
only store the first EC block of each EC block group in the NN. 

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without sacrifice 
> of data reliability, comparing to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed Solomon coding, we can allow loss of 4 blocks, 
> with storage overhead only being 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contribute packages in HDFS but had been removed since Hadoop 2.0 
> for maintain reason. The drawbacks are: 1) it is on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are intended not to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, makes it self-contained and 
> independently maintained. This design lays the EC feature on the storage type 
> support and considers compatible with existing HDFS features like caching, 
> snapshot, encryption, high availability and etc. This design will also 
> support different EC coding schemes, implementations and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. Intel 
> ISA-L library), an implementation can greatly improve the performance of EC 
> encoding/decoding and makes the EC solution even more attractive. We will 
> post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6651) Deletion failure can leak inodes permanently

2015-02-02 Thread Haohui Mai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haohui Mai updated HDFS-6651:
-
Hadoop Flags: Incompatible change,Reviewed

> Deletion failure can leak inodes permanently
> 
>
> Key: HDFS-6651
> URL: https://issues.apache.org/jira/browse/HDFS-6651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Jing Zhao
>Priority: Critical
> Fix For: 2.7.0
>
> Attachments: HDFS-6651.000.patch, HDFS-6651.001.patch, 
> HDFS-6651.002.patch
>
>
> As discussed in HDFS-6618, if a deletion of tree fails in the middle, any 
> collected inodes and blocks will not be removed from {{INodeMap}} and 
> {{BlocksMap}}. 
> Since fsimage is saved by iterating over {{INodeMap}}, the leak will persist 
> across name node restart. Although blanked out inodes will not have reference 
> to blocks, blocks will still refer to the inode as {{BlockCollection}}. As 
> long as it is not null, blocks will live on. The leaked blocks from blanked 
> out inodes will go away after restart.
> Options (when delete fails in the middle)
> - Complete the partial delete: edit log the partial delete and remove inodes 
> and blocks. 
> - Somehow undo the partial delete.
> - Check quota for snapshot diff beforehand for the whole subtree.
> - Ignore quota check during delete even if snapshot is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6651) Deletion failure can leak inodes permanently

2015-02-02 Thread Haohui Mai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haohui Mai updated HDFS-6651:
-
   Resolution: Fixed
Fix Version/s: 2.7.0
   Status: Resolved  (was: Patch Available)

I've committed the patch to trunk and branch-2. Thanks [~jingzhao] for the 
contribution.

> Deletion failure can leak inodes permanently
> 
>
> Key: HDFS-6651
> URL: https://issues.apache.org/jira/browse/HDFS-6651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Jing Zhao
>Priority: Critical
> Fix For: 2.7.0
>
> Attachments: HDFS-6651.000.patch, HDFS-6651.001.patch, 
> HDFS-6651.002.patch
>
>
> As discussed in HDFS-6618, if a deletion of tree fails in the middle, any 
> collected inodes and blocks will not be removed from {{INodeMap}} and 
> {{BlocksMap}}. 
> Since fsimage is saved by iterating over {{INodeMap}}, the leak will persist 
> across name node restart. Although blanked out inodes will not have reference 
> to blocks, blocks will still refer to the inode as {{BlockCollection}}. As 
> long as it is not null, blocks will live on. The leaked blocks from blanked 
> out inodes will go away after restart.
> Options (when delete fails in the middle)
> - Complete the partial delete: edit log the partial delete and remove inodes 
> and blocks. 
> - Somehow undo the partial delete.
> - Check quota for snapshot diff beforehand for the whole subtree.
> - Ignore quota check during delete even if snapshot is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302560#comment-14302560
 ] 

Yi Liu commented on HDFS-7285:
--

{quote}
Small strip cell size for small files in a zone, and large strip cell size for 
large files in another zone
{quote}
Right, for large files, using a large stripe cell size can decrease NN memory 
consumption; otherwise the EC feature will cause a big NN memory issue.
BTW, we have one thing that is not very clear: do we need the concept of a 
"zone"? What is the definition of "zone"?

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without sacrifice 
> of data reliability, comparing to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed Solomon coding, we can allow loss of 4 blocks, 
> with storage overhead only being 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contribute packages in HDFS but had been removed since Hadoop 2.0 
> for maintain reason. The drawbacks are: 1) it is on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are intended not to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, makes it self-contained and 
> independently maintained. This design lays the EC feature on the storage type 
> support and considers compatible with existing HDFS features like caching, 
> snapshot, encryption, high availability and etc. This design will also 
> support different EC coding schemes, implementations and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. Intel 
> ISA-L library), an implementation can greatly improve the performance of EC 
> encoding/decoding and makes the EC solution even more attractive. We will 
> post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-7707:

Attachment: HDFS-7707.001.patch

> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7707.001.patch, reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to HDFS-6825 fix, if dirX is deleted recursively, an OP_CLOSE can get 
> into edit log for the fileY under dirX, thus corrupting the edit log 
> (restarting NN with the edit log would fail). 
> What HDFS-6825 does to fix this issue is, to detect whether fileY is already 
> deleted by checking the ancestor dirs on it's path, if any of them doesn't 
> exist, then fileY is already deleted, and don't put OP_CLOSE to edit log for 
> the file.
> For this new edit log corruption, what I found was, the client first deleted 
> dirX recursively, then create another dir with exactly the same name as dirX 
> right away.  Because HDFS-6825 count on the namespace checking (whether dirX 
> exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this checking, thus OP_CLOSE for the already 
> deleted file gets into the edit log, due to delayed block removal.
> What we need to do is to have a more robust way to detect whether a file has 
> been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7728) Avoid updating quota usage while loading edits

2015-02-02 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-7728:

Attachment: HDFS-7728.000.patch

The initial patch for review.

The main challenge here is that for an INodeReference instance (along with the 
subtree underneath) we have to double count its quota usage. Thus the patch 
defines a QuotaDelta class which contains the quota usage updates along all the 
necessary paths, and passes it to {{INode#cleanSubtree}} and 
{{INode#destroyAndCollectBlocks}}. 
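
For readers following along, the shape of the idea is roughly the following 
sketch (this is not the actual patch; the class and method signatures here are 
illustrative assumptions, with {{FSDirectory#updateCount}} from the issue 
description as the final application point):
{code}
// Sketch: collect quota changes during the operation and apply them once at the
// end, instead of calling INode#addSpaceConsumed for every removed inode.
import java.util.HashMap;
import java.util.Map;

class QuotaDeltaSketch {
  /** path -> {namespace delta, diskspace delta}; the real code keys on inodes. */
  private final Map<String, long[]> deltas = new HashMap<>();

  void add(String path, long nsDelta, long dsDelta) {
    long[] d = deltas.computeIfAbsent(path, p -> new long[2]);
    d[0] += nsDelta;
    d[1] += dsDelta;
  }

  /** Applied once at the end of the operation (e.g. via FSDirectory#updateCount). */
  void applyTo(QuotaUpdater updater) {
    for (Map.Entry<String, long[]> e : deltas.entrySet()) {
      updater.updateCount(e.getKey(), e.getValue()[0], e.getValue()[1]);
    }
  }

  /** Stand-in for the FSDirectory#updateCount call site. */
  interface QuotaUpdater {
    void updateCount(String path, long nsDelta, long dsDelta);
  }
}
{code}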

> Avoid updating quota usage while loading edits
> --
>
> Key: HDFS-7728
> URL: https://issues.apache.org/jira/browse/HDFS-7728
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jing Zhao
>Assignee: Jing Zhao
> Attachments: HDFS-7728.000.patch
>
>
> Per the discussion 
> [here|https://issues.apache.org/jira/browse/HDFS-7611?focusedCommentId=14292454&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14292454],
>  currently we call {{INode#addSpaceConsumed}} while file/dir/snapshot 
> deletion, even if this is still in the edits loading process. This is 
> unnecessary and can cause issue like HDFS-7611. We should collect quota 
> change and call {{FSDirectory#updateCount}} at the end of the operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6651) Deletion failure can leak inodes permanently

2015-02-02 Thread Haohui Mai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haohui Mai updated HDFS-6651:
-
Summary: Deletion failure can leak inodes permanently  (was: Deletion 
failure can leak inodes permanently.)

> Deletion failure can leak inodes permanently
> 
>
> Key: HDFS-6651
> URL: https://issues.apache.org/jira/browse/HDFS-6651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Jing Zhao
>Priority: Critical
> Attachments: HDFS-6651.000.patch, HDFS-6651.001.patch, 
> HDFS-6651.002.patch
>
>
> As discussed in HDFS-6618, if a deletion of tree fails in the middle, any 
> collected inodes and blocks will not be removed from {{INodeMap}} and 
> {{BlocksMap}}. 
> Since fsimage is saved by iterating over {{INodeMap}}, the leak will persist 
> across name node restart. Although blanked out inodes will not have reference 
> to blocks, blocks will still refer to the inode as {{BlockCollection}}. As 
> long as it is not null, blocks will live on. The leaked blocks from blanked 
> out inodes will go away after restart.
> Options (when delete fails in the middle)
> - Complete the partial delete: edit log the partial delete and remove inodes 
> and blocks. 
> - Somehow undo the partial delete.
> - Check quota for snapshot diff beforehand for the whole subtree.
> - Ignore quota check during delete even if snapshot is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-7707:

Attachment: (was: HDFS-7707.001.patch)

> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to HDFS-6825 fix, if dirX is deleted recursively, an OP_CLOSE can get 
> into edit log for the fileY under dirX, thus corrupting the edit log 
> (restarting NN with the edit log would fail). 
> What HDFS-6825 does to fix this issue is, to detect whether fileY is already 
> deleted by checking the ancestor dirs on it's path, if any of them doesn't 
> exist, then fileY is already deleted, and don't put OP_CLOSE to edit log for 
> the file.
> For this new edit log corruption, what I found was, the client first deleted 
> dirX recursively, then create another dir with exactly the same name as dirX 
> right away.  Because HDFS-6825 count on the namespace checking (whether dirX 
> exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this checking, thus OP_CLOSE for the already 
> deleted file gets into the edit log, due to delayed block removal.
> What we need to do is to have a more robust way to detect whether a file has 
> been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6651) Deletion failure can leak inodes permanently.

2015-02-02 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302553#comment-14302553
 ] 

Haohui Mai commented on HDFS-6651:
--

+1. I'll commit it shortly.

> Deletion failure can leak inodes permanently.
> -
>
> Key: HDFS-6651
> URL: https://issues.apache.org/jira/browse/HDFS-6651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Jing Zhao
>Priority: Critical
> Attachments: HDFS-6651.000.patch, HDFS-6651.001.patch, 
> HDFS-6651.002.patch
>
>
> As discussed in HDFS-6618, if a deletion of tree fails in the middle, any 
> collected inodes and blocks will not be removed from {{INodeMap}} and 
> {{BlocksMap}}. 
> Since fsimage is saved by iterating over {{INodeMap}}, the leak will persist 
> across name node restart. Although blanked out inodes will not have reference 
> to blocks, blocks will still refer to the inode as {{BlockCollection}}. As 
> long as it is not null, blocks will live on. The leaked blocks from blanked 
> out inodes will go away after restart.
> Options (when delete fails in the middle)
> - Complete the partial delete: edit log the partial delete and remove inodes 
> and blocks. 
> - Somehow undo the partial delete.
> - Check quota for snapshot diff beforehand for the whole subtree.
> - Ignore quota check during delete even if snapshot is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302551#comment-14302551
 ] 

Hadoop QA commented on HDFS-7707:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12696055/HDFS-7707.001.patch
  against trunk revision 8acc5e9.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9405//console

This message is automatically generated.

> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7707.001.patch, reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to HDFS-6825 fix, if dirX is deleted recursively, an OP_CLOSE can get 
> into edit log for the fileY under dirX, thus corrupting the edit log 
> (restarting NN with the edit log would fail). 
> What HDFS-6825 does to fix this issue is, to detect whether fileY is already 
> deleted by checking the ancestor dirs on it's path, if any of them doesn't 
> exist, then fileY is already deleted, and don't put OP_CLOSE to edit log for 
> the file.
> For this new edit log corruption, what I found was, the client first deleted 
> dirX recursively, then create another dir with exactly the same name as dirX 
> right away.  Because HDFS-6825 count on the namespace checking (whether dirX 
> exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this checking, thus OP_CLOSE for the already 
> deleted file gets into the edit log, due to delayed block removal.
> What we need to do is to have a more robust way to detect whether a file has 
> been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302550#comment-14302550
 ] 

Yi Liu commented on HDFS-7285:
--

{quote}
The number of block groups is actually unrelated to the cell size (e.g. 64KB). 
For example, under a 6+3 schema, any file smaller than 9 blocks will have 1 
block group.
A smaller cell size better handles small files. But data locality is degraded – 
for example, it might be hard to fit MapReduce records into 64KB cells.
{quote}
I think it's incorrect. For example, suppose we have a file whose length is 
128MB. If we use a 6+3 schema and the EC stripe cell size is 64KB, then we need 
(128*1024K) / (6*64K) = 342 block groups. But if the EC stripe cell size is 8MB, 
then we need 128 / (6*8) = 3 block groups. 
Obviously, a small stripe cell size will cost much more NN memory, even if we 
only store the first EC block of each EC block group in the NN. 

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without sacrifice 
> of data reliability, comparing to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed Solomon coding, we can allow loss of 4 blocks, 
> with storage overhead only being 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contribute packages in HDFS but had been removed since Hadoop 2.0 
> for maintain reason. The drawbacks are: 1) it is on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are intended not to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, makes it self-contained and 
> independently maintained. This design lays the EC feature on the storage type 
> support and considers compatible with existing HDFS features like caching, 
> snapshot, encryption, high availability and etc. This design will also 
> support different EC coding schemes, implementations and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. Intel 
> ISA-L library), an implementation can greatly improve the performance of EC 
> encoding/decoding and makes the EC solution even more attractive. We will 
> post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302552#comment-14302552
 ] 

Yongjun Zhang commented on HDFS-7707:
-

Hi Kihwal,

I submitted patch rev 001 per the solution described in my last comment. Would 
you please help take a look? Thanks a lot!



> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7707.001.patch, reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to HDFS-6825 fix, if dirX is deleted recursively, an OP_CLOSE can get 
> into edit log for the fileY under dirX, thus corrupting the edit log 
> (restarting NN with the edit log would fail). 
> What HDFS-6825 does to fix this issue is, to detect whether fileY is already 
> deleted by checking the ancestor dirs on it's path, if any of them doesn't 
> exist, then fileY is already deleted, and don't put OP_CLOSE to edit log for 
> the file.
> For this new edit log corruption, what I found was, the client first deleted 
> dirX recursively, then create another dir with exactly the same name as dirX 
> right away.  Because HDFS-6825 count on the namespace checking (whether dirX 
> exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this checking, thus OP_CLOSE for the already 
> deleted file gets into the edit log, due to delayed block removal.
> What we need to do is to have a more robust way to detect whether a file has 
> been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-7707:

Status: Patch Available  (was: Open)

> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7707.001.patch, reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to HDFS-6825 fix, if dirX is deleted recursively, an OP_CLOSE can get 
> into edit log for the fileY under dirX, thus corrupting the edit log 
> (restarting NN with the edit log would fail). 
> What HDFS-6825 does to fix this issue is, to detect whether fileY is already 
> deleted by checking the ancestor dirs on it's path, if any of them doesn't 
> exist, then fileY is already deleted, and don't put OP_CLOSE to edit log for 
> the file.
> For this new edit log corruption, what I found was, the client first deleted 
> dirX recursively, then create another dir with exactly the same name as dirX 
> right away.  Because HDFS-6825 count on the namespace checking (whether dirX 
> exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this checking, thus OP_CLOSE for the already 
> deleted file gets into the edit log, due to delayed block removal.
> What we need to do is to have a more robust way to detect whether a file has 
> been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7728) Avoid updating quota usage while loading edits

2015-02-02 Thread Jing Zhao (JIRA)
Jing Zhao created HDFS-7728:
---

 Summary: Avoid updating quota usage while loading edits
 Key: HDFS-7728
 URL: https://issues.apache.org/jira/browse/HDFS-7728
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Jing Zhao
Assignee: Jing Zhao


Per the discussion 
[here|https://issues.apache.org/jira/browse/HDFS-7611?focusedCommentId=14292454&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14292454],
 currently we call {{INode#addSpaceConsumed}} while file/dir/snapshot deletion, 
even if this is still in the edits loading process. This is unnecessary and can 
cause issue like HDFS-7611. We should collect quota change and call 
{{FSDirectory#updateCount}} at the end of the operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-7707:

Attachment: HDFS-7707.001.patch

> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: HDFS-7707.001.patch, reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to HDFS-6825 fix, if dirX is deleted recursively, an OP_CLOSE can get 
> into edit log for the fileY under dirX, thus corrupting the edit log 
> (restarting NN with the edit log would fail). 
> What HDFS-6825 does to fix this issue is, to detect whether fileY is already 
> deleted by checking the ancestor dirs on it's path, if any of them doesn't 
> exist, then fileY is already deleted, and don't put OP_CLOSE to edit log for 
> the file.
> For this new edit log corruption, what I found was, the client first deleted 
> dirX recursively, then create another dir with exactly the same name as dirX 
> right away.  Because HDFS-6825 count on the namespace checking (whether dirX 
> exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this checking, thus OP_CLOSE for the already 
> deleted file gets into the edit log, due to delayed block removal.
> What we need to do is to have a more robust way to detect whether a file has 
> been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7727) Check and verify the auto-fence settings to prevent failures of auto-failover

2015-02-02 Thread Tianyin Xu (JIRA)
Tianyin Xu created HDFS-7727:


 Summary: Check and verify the auto-fence settings to prevent 
failures of auto-failover
 Key: HDFS-7727
 URL: https://issues.apache.org/jira/browse/HDFS-7727
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: auto-failover
Affects Versions: 2.5.1, 2.6.0, 2.4.1
Reporter: Tianyin Xu


Sorry for reporting similar problems, but they reside in different components, 
and this one has a more severe consequence (well, this is my last report of this 
type of problem). 


Problem
-
The problem is similar to the following issues resolved in YARN,
https://issues.apache.org/jira/browse/YARN-2165
https://issues.apache.org/jira/browse/YARN-2166
and reported (by me) in HDFS EditLogTailer,
https://issues.apache.org/jira/browse/HDFS-7726

Basically, the configuration settings are not checked and verified at 
initialization but are parsed and applied directly at runtime. Any configuration 
errors would impair the corresponding components (since the exceptions are not 
caught). 

In this case, the values are used in auto-failover, so you won't notice the 
errors until one of the NameNodes fails and triggers the fencing procedure in the 
auto-failover process.


Parameters
-

In SSHFence, there are two configuration parameters defined in 
SshFenceByTcpPort.java
"dfs.ha.fencing.ssh.connect-timeout";
"dfs.ha.fencing.ssh.private-key-files"

They are used in the tryFence() function for auto-fencing. 

Any erroneous settings of these two parameters would result in uncaught 
exceptions that prevent the fencing and impair auto-failover. We verified this 
by setting up a two-NameNode auto-failover cluster and manually killing the 
active NameNode: the passive NameNode could not take over successfully. 

For "dfs.ha.fencing.ssh.connect-timeout", erroneous settings include 
ill-formatted integers and negative integers (the value is passed to 
Thread.join()).

For "dfs.ha.fencing.ssh.private-key-files", erroneous settings include a 
non-existent private-key file path or wrong file permissions, which make 
jsch.addIdentity() fail in the createSession() method.

I think actively checking the settings in the constructor of the class (in the 
same way as YARN-2165, YARN-2166, HDFS-7726) should be able to fix the problems.
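
For illustration only, a rough sketch of what such constructor-time checks could 
look like (the method and the default value here are hypothetical; the two 
configuration keys are the real ones named above):

{code}
// Hypothetical sketch of constructor-time validation; method name and the
// assumed default value are illustrative, not the actual SshFenceByTcpPort code.
void checkFencingConf(org.apache.hadoop.conf.Configuration conf) {
  // dfs.ha.fencing.ssh.connect-timeout must be a non-negative integer,
  // since it is later passed to Thread.join().
  int timeout;
  try {
    timeout = conf.getInt("dfs.ha.fencing.ssh.connect-timeout", 30000);
  } catch (NumberFormatException e) {
    throw new IllegalArgumentException(
        "dfs.ha.fencing.ssh.connect-timeout is not a valid integer", e);
  }
  if (timeout < 0) {
    throw new IllegalArgumentException(
        "dfs.ha.fencing.ssh.connect-timeout must not be negative: " + timeout);
  }

  // dfs.ha.fencing.ssh.private-key-files must point to readable files,
  // otherwise jsch.addIdentity() fails later in createSession().
  String keys = conf.get("dfs.ha.fencing.ssh.private-key-files");
  if (keys != null) {
    for (String path : keys.split(",")) {
      java.io.File f = new java.io.File(path.trim());
      if (!f.isFile() || !f.canRead()) {
        throw new IllegalArgumentException(
            "Private key file is missing or unreadable: " + path);
      }
    }
  }
}
{code}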

Thanks! 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7520) checknative should display a nicer error message when openssl support is not compiled in

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302542#comment-14302542
 ] 

Hadoop QA commented on HDFS-7520:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12696037/HDFS-7520.001.patch
  against trunk revision 8acc5e9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9404//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9404//console

This message is automatically generated.

> checknative should display a nicer error message when openssl support is not 
> compiled in
> 
>
> Key: HDFS-7520
> URL: https://issues.apache.org/jira/browse/HDFS-7520
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Colin Patrick McCabe
>Assignee: Anu Engineer
> Attachments: HDFS-7520.001.patch
>
>
> checknative should display a nicer error message when openssl support is not 
> compiled in.  Currently, it displays this:
> {code}
> [cmccabe@keter hadoop]$ hadoop checknative
> 14/12/12 14:08:43 INFO bzip2.Bzip2Factory: Successfully loaded & initialized 
> native-bzip2 library system-native
> 14/12/12 14:08:43 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> Native library checking:
> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib:true /lib64/libz.so.1
> snappy:  true /usr/lib64/libsnappy.so.1
> lz4: true revision:99
> bzip2:   true /lib64/libbz2.so.1
> openssl: false org.apache.hadoop.crypto.OpensslCipher.initIDs()V
> {code}
> Instead, we should display something like this, if openssl is not supported 
> by the current build:
> {code}
> openssl: false Hadoop was built without openssl support.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6651) Deletion failure can leak inodes permanently.

2015-02-02 Thread Jing Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302540#comment-14302540
 ] 

Jing Zhao commented on HDFS-6651:
-

The current patch takes the #5 solution. This simplifies the quota calculation 
and avoids the inode leak during deletion. Since the current snapshot solution 
uses COW semantics, the diffs may not contribute much to the NS quota usage and 
NN memory usage. Thus I think the incompatibility here may not be an issue.

> Deletion failure can leak inodes permanently.
> -
>
> Key: HDFS-6651
> URL: https://issues.apache.org/jira/browse/HDFS-6651
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Jing Zhao
>Priority: Critical
> Attachments: HDFS-6651.000.patch, HDFS-6651.001.patch, 
> HDFS-6651.002.patch
>
>
> As discussed in HDFS-6618, if a deletion of tree fails in the middle, any 
> collected inodes and blocks will not be removed from {{INodeMap}} and 
> {{BlocksMap}}. 
> Since fsimage is saved by iterating over {{INodeMap}}, the leak will persist 
> across name node restart. Although blanked out inodes will not have reference 
> to blocks, blocks will still refer to the inode as {{BlockCollection}}. As 
> long as it is not null, blocks will live on. The leaked blocks from blanked 
> out inodes will go away after restart.
> Options (when delete fails in the middle)
> - Complete the partial delete: edit log the partial delete and remove inodes 
> and blocks. 
> - Somehow undo the partial delete.
> - Check quota for snapshot diff beforehand for the whole subtree.
> - Ignore quota check during delete even if snapshot is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7411) Refactor and improve decommissioning logic into DecommissionManager

2015-02-02 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302513#comment-14302513
 ] 

Aaron T. Myers commented on HDFS-7411:
--

bq. Users are familiar with the existing behavior so that we have to keep the 
old code, deprecate them first and then remove them later. It is a standard 
procedure for changing the behavior of a feature. No?

I think it's pretty common for us to change the behavior of the system when the 
behavior change is a strict improvement. Keeping around inferior behavior just 
for the purpose of consistency seems rather pointless.

For example, it used to be the case that fsimage transfers from the secondary 
or standby NN happened with a GET request from the standby to the active, which 
then triggered another GET request back from the active to the standby. When we 
did the work to make the transfer work with just a single POST request from the 
standby to the active, we didn't bother keeping around the old GET/GET behavior 
just because people were used to it.

Similarly, when we find ways that CPU performance or memory usage in the NN can 
be improved, we don't keep around the old ways of doing things just because 
operators might be used to the NN being slow or using a lot of memory.

bq. -1 on the patch.

As [~andrew.wang] has already described, the new behavior should be both more 
performant and more predictable. I don't see why we'd want to keep around 
decommissioning behavior that can be both erratic and slower.

Andrew already quite graciously amended the patch to preserve the backward 
compatibility of {{dfs.namenode.decommission.nodes.per.interval}} per your 
request, [~szetszwo]. I would hope that would have satisfied your feedback. 
Please consider withdrawing your -1.

> Refactor and improve decommissioning logic into DecommissionManager
> ---
>
> Key: HDFS-7411
> URL: https://issues.apache.org/jira/browse/HDFS-7411
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.5.1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, 
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, 
> hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, 
> hdfs-7411.009.patch, hdfs-7411.010.patch
>
>
> Would be nice to split out decommission logic from DatanodeManager to 
> DecommissionManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7411) Refactor and improve decommissioning logic into DecommissionManager

2015-02-02 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302510#comment-14302510
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7411:
---

> ... the old limiting scheme is seriously flawed. ...
> ...  The old logic is seriously flawed...

Sure, you see it this way, but some users may feel that the old code works just 
fine.  They may not have time to deal with the new behavior.  We cannot force 
them to do so.

> There is no benefit to keeping around multiple broken implementations of 
> things to do the same job. ...

We are not keeping multiple implementations.  The old implementation will be 
removed in the future.

> ...  It's ready to go in, and I think it should. +1. Let's commit this today 
> if there are no other comments about the patch.

Let me clarify my -1.  The patch changes an existing conf property to a 
different behavior.  Instead, we should keep the existing behavior, deprecate 
the conf property first and then remove it later.

> Refactor and improve decommissioning logic into DecommissionManager
> ---
>
> Key: HDFS-7411
> URL: https://issues.apache.org/jira/browse/HDFS-7411
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.5.1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, 
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, 
> hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, 
> hdfs-7411.009.patch, hdfs-7411.010.patch
>
>
> Would be nice to split out decommission logic from DatanodeManager to 
> DecommissionManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HDFS-7122) Use of ThreadLocal results in poor block placement

2015-02-02 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang reassigned HDFS-7122:
-

Assignee: Andrew Wang  (was: Jonathan Lawlor)

> Use of ThreadLocal results in poor block placement
> --
>
> Key: HDFS-7122
> URL: https://issues.apache.org/jira/browse/HDFS-7122
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.3.0
> Environment: medium-large environments with 100's to 1000's of DNs 
> will be most affected, but potentially all environments.
>Reporter: Jeff Buell
>Assignee: Andrew Wang
>Priority: Blocker
>  Labels: performance
> Fix For: 2.6.0
>
> Attachments: copies_per_slave.jpg, 
> hdfs-7122-cdh5.1.2-testing.001.patch, hdfs-7122.001.patch
>
>
> Summary:
> Since HDFS-6268, the distribution of replica block copies across the 
> DataNodes (replicas 2,3,... as distinguished from the first "primary" 
> replica) is extremely poor, to the point that TeraGen slows down by as much 
> as 3X for certain configurations.  This is almost certainly due to the 
> introduction of Thread Local Random in HDFS-6268.  The mechanism appears to 
> be that this change causes all the random numbers in the threads to be 
> correlated, thus preventing a truly random choice of DN for each replica copy.
> Testing details:
> 1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 
> 256MB block size.  This results in 6 "primary" blocks on each DN.  With 
> replication=3, there will be on average 12 more copies on each DN that are 
> copies of blocks from other DNs.  Because of the random selection of DNs, 
> exactly 12 copies are not expected, but I found that about 160 DNs (1/4 of 
> all DNs!) received absolutely no copies, while one DN received over 100 
> copies, and the elapsed time increased by about 3X from a pre-HDFS-6268 
> distro.  There was no pattern to which DNs didn't receive copies, nor was the 
> set of such DNs repeatable run-to-run. In addition to the performance 
> problem, there could be capacity problems due to one or a few DNs running out 
> of space. Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I 
> don't see a significant difference from the Apache Hadoop source in this 
> regard. The workaround to recover the previous behavior is to set 
> dfs.namenode.handler.count=1 but of course this has scaling implications for 
> large clusters.
> I recommend that the ThreadLocal Random part of HDFS-6268 be reverted until a 
> better algorithm can be implemented and tested.  Testing should include a 
> case with many DNs and a small number of blocks on each.
> It should also be noted that even pre-HDFS-6268, the random choice of DN 
> algorithm produces a rather non-uniform distribution of copies.  This is not 
> due to any bug, but purely a case of random distributions being much less 
> uniform than one might intuitively expect. In the above case, pre-HDFS-6268 
> yields something like a range of 3 to 25 block copies on each DN. 
> Surprisingly, the performance penalty of this non-uniformity is not as big as 
> might be expected (maybe only 10-20%), but HDFS should do better, and in any 
> case the capacity issue remains.  Round-robin choice of DN?  Better awareness 
> of which DNs currently store fewer blocks? It's not sufficient that the total 
> number of blocks is similar on each DN at the end, but that at each point in 
> time no individual DN receives a disproportionate number of blocks at once 
> (which could be a danger of a RR algorithm).
> Probably should limit this jira to tracking the ThreadLocal issue, and track 
> the random choice issue in another one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-5796) The file system browser in the namenode UI requires SPNEGO.

2015-02-02 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302490#comment-14302490
 ] 

Haohui Mai commented on HDFS-5796:
--

bq. Since this would be a real valid user, hdfs admin can apply normal access 
grants / restrictions on this user..

I don't quite follow. Does the user need to be able to read all files in the HDFS 
cluster in order for the UI to work? What kinds of access controls do you plan 
to apply to that particular user?

From a security perspective, I think it is a no-go if users coming through the 
browser and users going through the standard RPC interfaces are treated 
differently -- it can easily lead to misconfiguration and security 
vulnerabilities.


> The file system browser in the namenode UI requires SPNEGO.
> ---
>
> Key: HDFS-5796
> URL: https://issues.apache.org/jira/browse/HDFS-5796
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.5.0
>Reporter: Kihwal Lee
>Assignee: Arun Suresh
> Attachments: HDFS-5796.1.patch, HDFS-5796.1.patch, HDFS-5796.2.patch, 
> HDFS-5796.3.patch, HDFS-5796.3.patch
>
>
> After HDFS-5382, the browser makes webhdfs REST calls directly, requiring 
> SPNEGO to work between user's browser and namenode.  This won't work if the 
> cluster's security infrastructure is isolated from the regular network.  
> Moreover, SPNEGO is not supposed to be required for user-facing web pages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7719) BlockPoolSliceStorage could not remove storageDir.

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302347#comment-14302347
 ] 

Hadoop QA commented on HDFS-7719:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12695998/HDFS-7719.002.patch
  against trunk revision 5f9a0dd.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestLeaseRecovery2
  org.apache.hadoop.hdfs.server.balancer.TestBalancer

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9399//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9399//console

This message is automatically generated.

> BlockPoolSliceStorage could not remove storageDir.
> --
>
> Key: HDFS-7719
> URL: https://issues.apache.org/jira/browse/HDFS-7719
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Attachments: HDFS-7719.000.patch, HDFS-7719.001.patch, 
> HDFS-7719.002.patch
>
>
> The parameter of {{BlockPoolSliceStorage#removeVolumes()}} is a set of 
> volume-level directories, so {{BlockPoolSliceStorage}} could not directly 
> compare its own {{StorageDirs}} with these volume-level directories. As a 
> result, {{BlockPoolSliceStorage}} did not actually remove the targeted 
> {{StorageDirectory}}. 
> This causes a failure when a volume is removed and a volume with the same 
> mount point is immediately added back.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-02-02 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302345#comment-14302345
 ] 

Ming Ma commented on HDFS-7725:
---

Thanks, [~zhz]. Yes, the logic of "don't modify nn stats if the node is dead" 
can be moved to HeartbeatManager.

For the trunk version, we can wait for HDFS-7411. If HDFS-7411 isn't going to 
be in branch-2 anytime soon, then we will need some quick fix. Overall, can 
anyone find any correctness issue with the current patch?

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7725.patch
>
>
> One of our clusters sometimes couldn't allocate blocks from any DNs. 
> BlockPlacementPolicyDefault complains with the following messages for all DNs.
> {noformat}
> the node is too busy (load:x > y)
> {noformat}
> It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
> incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
> * Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
> intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of 
> event without HDFS-7374.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == -1
> * However, HDFS-7374 introduces another inconsistency when recomm is involved.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == 0
> ** Recomm the node. nodesInService == 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HDFS-7122) Use of ThreadLocal results in poor block placement

2015-02-02 Thread Jonathan Lawlor (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Lawlor reassigned HDFS-7122:
-

Assignee: Jonathan Lawlor  (was: Andrew Wang)

> Use of ThreadLocal results in poor block placement
> --
>
> Key: HDFS-7122
> URL: https://issues.apache.org/jira/browse/HDFS-7122
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.3.0
> Environment: medium-large environments with 100's to 1000's of DNs 
> will be most affected, but potentially all environments.
>Reporter: Jeff Buell
>Assignee: Jonathan Lawlor
>Priority: Blocker
>  Labels: performance
> Fix For: 2.6.0
>
> Attachments: copies_per_slave.jpg, 
> hdfs-7122-cdh5.1.2-testing.001.patch, hdfs-7122.001.patch
>
>
> Summary:
> Since HDFS-6268, the distribution of replica block copies across the 
> DataNodes (replicas 2,3,... as distinguished from the first "primary" 
> replica) is extremely poor, to the point that TeraGen slows down by as much 
> as 3X for certain configurations.  This is almost certainly due to the 
> introduction of Thread Local Random in HDFS-6268.  The mechanism appears to 
> be that this change causes all the random numbers in the threads to be 
> correlated, thus preventing a truly random choice of DN for each replica copy.
> Testing details:
> 1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 
> 256MB block size.  This results in 6 "primary" blocks on each DN.  With 
> replication=3, there will be on average 12 more copies on each DN that are 
> copies of blocks from other DNs.  Because of the random selection of DNs, 
> exactly 12 copies are not expected, but I found that about 160 DNs (1/4 of 
> all DNs!) received absolutely no copies, while one DN received over 100 
> copies, and the elapsed time increased by about 3X from a pre-HDFS-6268 
> distro.  There was no pattern to which DNs didn't receive copies, nor was the 
> set of such DNs repeatable run-to-run. In addition to the performance 
> problem, there could be capacity problems due to one or a few DNs running out 
> of space. Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I 
> don't see a significant difference from the Apache Hadoop source in this 
> regard. The workaround to recover the previous behavior is to set 
> dfs.namenode.handler.count=1 but of course this has scaling implications for 
> large clusters.
> I recommend that the ThreadLocal Random part of HDFS-6268 be reverted until a 
> better algorithm can be implemented and tested.  Testing should include a 
> case with many DNs and a small number of blocks on each.
> It should also be noted that even pre-HDFS-6268, the random choice of DN 
> algorithm produces a rather non-uniform distribution of copies.  This is not 
> due to any bug, but purely a case of random distributions being much less 
> uniform than one might intuitively expect. In the above case, pre-HDFS-6268 
> yields something like a range of 3 to 25 block copies on each DN. 
> Surprisingly, the performance penalty of this non-uniformity is not as big as 
> might be expected (maybe only 10-20%), but HDFS should do better, and in any 
> case the capacity issue remains.  Round-robin choice of DN?  Better awareness 
> of which DNs currently store fewer blocks? It's not sufficient that the total 
> number of blocks is similar on each DN at the end, but that at each point in 
> time no individual DN receives a disproportionate number of blocks at once 
> (which could be a danger of a RR algorithm).
> Probably should limit this jira to tracking the ThreadLocal issue, and track 
> the random choice issue in another one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7411) Refactor and improve decommissioning logic into DecommissionManager

2015-02-02 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302322#comment-14302322
 ] 

Arpit Agarwal commented on HDFS-7411:
-

bq.  Ming Ma, Arpit Agarwal, and myself have reviewed this.
I have not reviewed the patch. I just commented on a change to a single 
function.

> Refactor and improve decommissioning logic into DecommissionManager
> ---
>
> Key: HDFS-7411
> URL: https://issues.apache.org/jira/browse/HDFS-7411
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.5.1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, 
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, 
> hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, 
> hdfs-7411.009.patch, hdfs-7411.010.patch
>
>
> Would be nice to split out decommission logic from DatanodeManager to 
> DecommissionManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7411) Refactor and improve decommissioning logic into DecommissionManager

2015-02-02 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302319#comment-14302319
 ] 

Colin Patrick McCabe commented on HDFS-7411:


I don't see any benefit to splitting the patch further.  The old logic is 
seriously flawed... it's not effective at rate limiting and often goes way too 
fast or too slow, because it is based on number of nodes rather than number of 
blocks.

I think it's great that [~andrew.wang] took on this task, which has been a 
maintenance problem for us for a while.  This is a great example of someone who 
really cares about the project making things better by working on something 
which is "boring" (it will never make it into a list of new features or 
exciting research talks) but essential to our users.

There is no benefit to keeping around multiple broken implementations of things 
to do the same job.  Hadoop has enough dead and obsolete code as is... more 
than enough.  Things like the {{RemoteBlockReader}} / {{RemoteBlockReader2}} 
split increase our maintenance burden and confuse users and potential 
contributors.  I only allowed the {{BlockReaderLocalLegacy}} / 
{{BlockReaderLocal}} split because we didn't have platform support for file 
descriptor passing on Windows.  But since there are no platform support issues 
here, there is no reason to increase our maintenance burden.

If we are concerned about stability, we can let this soak in trunk for a while.

It has been through three months of review.  [~mingma], [~arpitagarwal], and 
myself have reviewed this.   It's ready to go in, and I think it should.  +1.  
Let's commit this today if there are no other comments about the patch.

> Refactor and improve decommissioning logic into DecommissionManager
> ---
>
> Key: HDFS-7411
> URL: https://issues.apache.org/jira/browse/HDFS-7411
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.5.1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, 
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, 
> hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, 
> hdfs-7411.009.patch, hdfs-7411.010.patch
>
>
> Would be nice to split out decommission logic from DatanodeManager to 
> DecommissionManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7520) checknative should display a nicer error message when openssl support is not compiled in

2015-02-02 Thread Anu Engineer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anu Engineer updated HDFS-7520:
---
Status: Patch Available  (was: Open)

> checknative should display a nicer error message when openssl support is not 
> compiled in
> 
>
> Key: HDFS-7520
> URL: https://issues.apache.org/jira/browse/HDFS-7520
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Colin Patrick McCabe
>Assignee: Anu Engineer
> Attachments: HDFS-7520.001.patch
>
>
> checknative should display a nicer error message when openssl support is not 
> compiled in.  Currently, it displays this:
> {code}
> [cmccabe@keter hadoop]$ hadoop checknative
> 14/12/12 14:08:43 INFO bzip2.Bzip2Factory: Successfully loaded & initialized 
> native-bzip2 library system-native
> 14/12/12 14:08:43 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> Native library checking:
> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib:true /lib64/libz.so.1
> snappy:  true /usr/lib64/libsnappy.so.1
> lz4: true revision:99
> bzip2:   true /lib64/libbz2.so.1
> openssl: false org.apache.hadoop.crypto.OpensslCipher.initIDs()V
> {code}
> Instead, we should display something like this, if openssl is not supported 
> by the current build:
> {code}
> openssl: false Hadoop was built without openssl support.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7411) Refactor and improve decommissioning logic into DecommissionManager

2015-02-02 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302316#comment-14302316
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7411:
---

> Because of this, I do not see any advantage to keeping this old code around. 
> ...

Users are familiar with the existing behavior, so we have to keep the old 
code, deprecate it first, and then remove it later.  That is the standard 
procedure for changing the behavior of a feature.  No?

> I still plan to commit tomorrow.

-1 on the patch.

> Refactor and improve decommissioning logic into DecommissionManager
> ---
>
> Key: HDFS-7411
> URL: https://issues.apache.org/jira/browse/HDFS-7411
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.5.1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, 
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, 
> hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, 
> hdfs-7411.009.patch, hdfs-7411.010.patch
>
>
> Would be nice to split out decommission logic from DatanodeManager to 
> DecommissionManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7520) checknative should display a nicer error message when openssl support is not compiled in

2015-02-02 Thread Anu Engineer (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anu Engineer updated HDFS-7520:
---
Attachment: HDFS-7520.001.patch

> checknative should display a nicer error message when openssl support is not 
> compiled in
> 
>
> Key: HDFS-7520
> URL: https://issues.apache.org/jira/browse/HDFS-7520
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Colin Patrick McCabe
>Assignee: Anu Engineer
> Attachments: HDFS-7520.001.patch
>
>
> checknative should display a nicer error message when openssl support is not 
> compiled in.  Currently, it displays this:
> {code}
> [cmccabe@keter hadoop]$ hadoop checknative
> 14/12/12 14:08:43 INFO bzip2.Bzip2Factory: Successfully loaded & initialized 
> native-bzip2 library system-native
> 14/12/12 14:08:43 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> Native library checking:
> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib:true /lib64/libz.so.1
> snappy:  true /usr/lib64/libsnappy.so.1
> lz4: true revision:99
> bzip2:   true /lib64/libbz2.so.1
> openssl: false org.apache.hadoop.crypto.OpensslCipher.initIDs()V
> {code}
> Instead, we should display something like this, if openssl is not supported 
> by the current build:
> {code}
> openssl: false Hadoop was built without openssl support.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7520) checknative should display a nicer error message when openssl support is not compiled in

2015-02-02 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302310#comment-14302310
 ] 

Anu Engineer commented on HDFS-7520:


This patch defines HADOOP_OPENSSL_LIBRARY if and only if we find a 
USABLE_OPENSSL. Based on Chris's analysis, it is possible for us to find an 
invalid OpenSSL installation but still have the _buildSupportsOpenssl() function 
return true, since HADOOP_OPENSSL_LIBRARY gets defined in the CMake file.
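
To make the intent concrete, here is a small hypothetical sketch of how the 
report line could distinguish "built without support" from "failed to load" 
(illustrative only, not the actual NativeLibraryChecker change):

{code}
// Hypothetical sketch only; not the actual NativeLibraryChecker change.
static String opensslStatusLine(boolean buildSupportsOpenssl,
                                String loadedLibraryName,
                                String loadFailureReason) {
  if (!buildSupportsOpenssl) {
    // The build never compiled in OpenSSL support (e.g. CMake found no usable
    // OpenSSL), so say that directly instead of printing a raw link error.
    return "openssl: false Hadoop was built without openssl support.";
  }
  if (loadedLibraryName != null) {
    return "openssl: true " + loadedLibraryName;
  }
  // Built with support, but loading failed at runtime (broken installation).
  return "openssl: false " + loadFailureReason;
}
{code}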

> checknative should display a nicer error message when openssl support is not 
> compiled in
> 
>
> Key: HDFS-7520
> URL: https://issues.apache.org/jira/browse/HDFS-7520
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Colin Patrick McCabe
>Assignee: Anu Engineer
>
> checknative should display a nicer error message when openssl support is not 
> compiled in.  Currently, it displays this:
> {code}
> [cmccabe@keter hadoop]$ hadoop checknative
> 14/12/12 14:08:43 INFO bzip2.Bzip2Factory: Successfully loaded & initialized 
> native-bzip2 library system-native
> 14/12/12 14:08:43 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> Native library checking:
> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib:true /lib64/libz.so.1
> snappy:  true /usr/lib64/libsnappy.so.1
> lz4: true revision:99
> bzip2:   true /lib64/libbz2.so.1
> openssl: false org.apache.hadoop.crypto.OpensslCipher.initIDs()V
> {code}
> Instead, we should display something like this, if openssl is not supported 
> by the current build:
> {code}
> openssl: false Hadoop was built without openssl support.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-02-02 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302299#comment-14302299
 ] 

Zhe Zhang commented on HDFS-7725:
-

bq. However, HDFS-7374 introduces another inconsistency when recomm is involved.
The second sequence in the JIRA description looks correct?

By reading the HDFS-7374 patch I see the potential issue is that 
{{HeartbeatManager}} is bypassed when decommissioning a dead node:
{code}
+if (!node.isDecommissionInProgress()) {
+  if (!node.isAlive) {
+LOG.info("Dead node " + node + " is decommissioned immediately.");
+node.setDecommissioned();
+  } else if (!node.isDecommissioned()) {
+for (DatanodeStorageInfo storage : node.getStorageInfos()) {
+  LOG.info("Start Decommissioning " + node + " " + storage
+  + " with " + storage.numBlocks() + " blocks");
+}
+heartbeatManager.startDecommission(node);
{code}

It seems {{DatanodeManager}} should still route the call to 
{{HeartbeatManager}}, and {{HeartbeatManager#startDecommission}} should handle 
the dead node logic. 
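
For example, a simplified model of that invariant (hypothetical classes, not the 
real {{HeartbeatManager}} code) would keep all {{nodesInService}} bookkeeping in 
one place and skip it for dead nodes:

{code}
// Simplified model only; not the actual HeartbeatManager/DatanodeManager code.
class Node {
  boolean alive;
  boolean decommissioned;
  boolean decommissionInProgress;
}

class HeartbeatStats {
  int nodesInService;

  void startDecommission(Node node) {
    if (node.decommissionInProgress || node.decommissioned) {
      return;
    }
    if (node.alive) {
      nodesInService--;                    // only live nodes leave "in service"
      node.decommissionInProgress = true;
    } else {
      node.decommissioned = true;          // dead node: no stats adjustment
    }
  }

  void stopDecommission(Node node) {
    boolean wasDecomm = node.decommissionInProgress || node.decommissioned;
    node.decommissionInProgress = false;
    node.decommissioned = false;
    if (wasDecomm && node.alive) {
      nodesInService++;                    // recommission only re-adds live nodes
    }
  }
}
{code}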

Maybe we should wait for HDFS-7411 to be committed and revisit the change? 

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7725.patch
>
>
> One of our clusters sometimes couldn't allocate blocks from any DNs. 
> BlockPlacementPolicyDefault complains with the following messages for all DNs.
> {noformat}
> the node is too busy (load:x > y)
> {noformat}
> It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
> incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
> * Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
> intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of 
> event without HDFS-7374.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == -1
> * However, HDFS-7374 introduces another inconsistency when recomm is involved.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == 0
> ** Recomm the node. nodesInService == 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7721) The HDFS BlockScanner may run fast during the first hour

2015-02-02 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302293#comment-14302293
 ] 

Andrew Wang commented on HDFS-7721:
---

+1 pending, looks good

> The HDFS BlockScanner may run fast during the first hour
> 
>
> Key: HDFS-7721
> URL: https://issues.apache.org/jira/browse/HDFS-7721
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Colin Patrick McCabe
> Attachments: HDFS-7721.001.patch
>
>
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9375//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9365//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> {code}
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestBlockScanner.testScanRateLimit(TestBlockScanner.java:439)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7721) The HDFS BlockScanner may run fast during the first hour

2015-02-02 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7721:
---
Summary: The HDFS BlockScanner may run fast during the first hour  (was: 
TestBlockScanner.testScanRateLimit may fail)

> The HDFS BlockScanner may run fast during the first hour
> 
>
> Key: HDFS-7721
> URL: https://issues.apache.org/jira/browse/HDFS-7721
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Colin Patrick McCabe
> Attachments: HDFS-7721.001.patch
>
>
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9375//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9365//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> {code}
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestBlockScanner.testScanRateLimit(TestBlockScanner.java:439)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7721) TestBlockScanner.testScanRateLimit may fail

2015-02-02 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302288#comment-14302288
 ] 

Colin Patrick McCabe commented on HDFS-7721:


* Avoid running the scanner fast in the first hour after the VolumeScanner 
starts.  It was running fast because we were looking at the average I/O over 
the last hour and treating the minutes before the daemon started up as minutes 
with no I/O.  Instead, we should average only over the time that the daemon has 
actually been up during the first hour (a small sketch of this follows below).
* Add a more helpful error message to the {{testScanRateLimit}} test.
* Make the {{testScanRateLimit}} test more consistent by waiting for some 
blocks to be scanned before waiting and then checking the effective rate.  This 
will ensure that the volume scanner has started.
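
A minimal sketch of the averaging change mentioned in the first bullet 
(hypothetical names, not the patch itself): divide the bytes scanned by the time 
the scanner has actually been up, capped at one hour.

{code}
// Hypothetical sketch of the averaging fix; names are illustrative only.
static long bytesScannedPerSecond(long bytesScannedInWindow,
                                  long nowMs, long scannerStartMs) {
  final long HOUR_MS = 60L * 60L * 1000L;
  // Average only over the time the scanner has actually been running
  // (capped at one hour), instead of counting pre-startup minutes as idle.
  long windowMs = Math.min(nowMs - scannerStartMs, HOUR_MS);
  if (windowMs <= 0) {
    return 0;
  }
  return bytesScannedInWindow * 1000L / windowMs;
}
{code}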

> TestBlockScanner.testScanRateLimit may fail
> ---
>
> Key: HDFS-7721
> URL: https://issues.apache.org/jira/browse/HDFS-7721
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
> Attachments: HDFS-7721.001.patch
>
>
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9375//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9365//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> {code}
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestBlockScanner.testScanRateLimit(TestBlockScanner.java:439)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7721) TestBlockScanner.testScanRateLimit may fail

2015-02-02 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7721:
---
Attachment: HDFS-7721.001.patch

> TestBlockScanner.testScanRateLimit may fail
> ---
>
> Key: HDFS-7721
> URL: https://issues.apache.org/jira/browse/HDFS-7721
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
> Attachments: HDFS-7721.001.patch
>
>
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9375//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9365//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> {code}
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestBlockScanner.testScanRateLimit(TestBlockScanner.java:439)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7721) TestBlockScanner.testScanRateLimit may fail

2015-02-02 Thread Colin Patrick McCabe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7721:
---
Assignee: Colin Patrick McCabe
  Status: Patch Available  (was: Open)

> TestBlockScanner.testScanRateLimit may fail
> ---
>
> Key: HDFS-7721
> URL: https://issues.apache.org/jira/browse/HDFS-7721
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Colin Patrick McCabe
> Attachments: HDFS-7721.001.patch
>
>
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9375//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> - 
> https://builds.apache.org/job/PreCommit-HDFS-Build/9365//testReport/org.apache.hadoop.hdfs.server.datanode/TestBlockScanner/testScanRateLimit/
> {code}
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestBlockScanner.testScanRateLimit(TestBlockScanner.java:439)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7712) Switch blockStateChangeLog to use slf4j

2015-02-02 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-7712:
--
Attachment: hdfs-7712.004.patch

Sigh, these failed tests didn't show up in the previous run; here's a new rev to 
address them. Sorry for all the noise.

> Switch blockStateChangeLog to use slf4j
> ---
>
> Key: HDFS-7712
> URL: https://issues.apache.org/jira/browse/HDFS-7712
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Andrew Wang
>Assignee: Andrew Wang
>Priority: Minor
> Attachments: hdfs-7712.001.patch, hdfs-7712.002.patch, 
> hdfs-7712.003.patch, hdfs-7712.004.patch
>
>
> As pointed out in HDFS-7706, updating blockStateChangeLog to use slf4j will 
> save a lot of string construction costs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7719) BlockPoolSliceStorage could not remove storageDir.

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302281#comment-14302281
 ] 

Hadoop QA commented on HDFS-7719:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12695990/HDFS-7719.001.patch
  against trunk revision 8004a00.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9397//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9397//console

This message is automatically generated.

> BlockPoolSliceStorage could not remove storageDir.
> --
>
> Key: HDFS-7719
> URL: https://issues.apache.org/jira/browse/HDFS-7719
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Lei (Eddy) Xu
>Assignee: Lei (Eddy) Xu
> Attachments: HDFS-7719.000.patch, HDFS-7719.001.patch, 
> HDFS-7719.002.patch
>
>
> The parameter of {{BlockPoolSliceStorage#removeVolumes()}} is a set of 
> volume-level directories, so {{BlockPoolSliceStorage}} could not directly 
> compare its own {{StorageDirs}} with these volume-level directories. As a 
> result, {{BlockPoolSliceStorage}} did not actually remove the targeted 
> {{StorageDirectory}}. 
> This causes a failure when a volume is removed and a volume with the same 
> mount point is immediately added back.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7707) Edit log corruption due to delayed block removal again

2015-02-02 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302277#comment-14302277
 ] 

Yongjun Zhang commented on HDFS-7707:
-

Hi Kihwal,

Inspired by your comment 
https://issues.apache.org/jira/browse/HDFS-7707?focusedCommentId=14299106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14299106

I think I have a better solution now. That is, instead of checking the name 
string, check the inode id. The inode id recorded for the deleted file/dir will 
not match the id of a newly created inode, which lets us detect that the 
file/dir was deleted.
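
A minimal sketch of that check (hypothetical helper, not the actual patch): 
compare the inode ids recorded for the file and its ancestor dirs with the ids 
of whatever inodes now sit at those path positions; any mismatch means the 
original was deleted, even if a new dir with the same name exists.

{code}
// Hypothetical helper; not the actual HDFS-7707 patch.
// recordedIds are the inode ids captured for the file and its ancestor dirs;
// currentIds are the ids of whatever inodes are now at those path positions
// (or null if some component no longer exists).
static boolean isFileDeleted(long[] recordedIds, long[] currentIds) {
  if (currentIds == null || currentIds.length != recordedIds.length) {
    return true;                 // some path component is gone entirely
  }
  for (int i = 0; i < recordedIds.length; i++) {
    if (recordedIds[i] != currentIds[i]) {
      // Same name but a different inode id: the original was deleted and
      // something new was created in its place.
      return true;
    }
  }
  return false;
}
{code}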

Thanks.




> Edit log corruption due to delayed block removal again
> --
>
> Key: HDFS-7707
> URL: https://issues.apache.org/jira/browse/HDFS-7707
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
> Attachments: reproduceHDFS-7707.patch
>
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to the HDFS-6825 fix, if dirX is deleted recursively, an OP_CLOSE can get 
> into the edit log for fileY under dirX, thus corrupting the edit log 
> (restarting the NN with the edit log would fail). 
> What HDFS-6825 does to fix this issue is to detect whether fileY is already 
> deleted by checking the ancestor dirs on its path: if any of them doesn't 
> exist, then fileY is already deleted, and OP_CLOSE is not put into the edit 
> log for the file.
> For this new edit log corruption, what I found was that the client first 
> deleted dirX recursively, then created another dir with exactly the same name 
> as dirX right away.  Because HDFS-6825 counts on the namespace check (whether 
> dirX exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this check, so OP_CLOSE for the already deleted 
> file gets into the edit log, due to delayed block removal.
> What we need to do is to have a more robust way to detect whether a file has 
> been deleted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7411) Refactor and improve decommissioning logic into DecommissionManager

2015-02-02 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302092#comment-14302092
 ] 

Andrew Wang commented on HDFS-7411:
---

As discussed above, the old limiting scheme is seriously flawed. The amount of 
time spent is highly variable, since it's # nodes rather than # blocks, and the 
size of each node is variable. It also counts both decommissioning and 
non-decommissioning nodes towards the limit.

That nodes can vary in # of blocks is really an argument for *not* using # 
nodes as a limit. # of blocks is superior. The 100k was chosen as a 
conservative number that will not lead to overly long wake-up times, which is 
the point of this limit. In fact, with this patch we should see far more 
predictable pause times for decommission work even with the old config. In 
addition, it'll also result in an improvement in overall decommission speed 
because of the incremental scan logic.
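
Roughly, the per-wake-up limit amounts to counting blocks instead of nodes; a 
hypothetical sketch (not the actual DecommissionManager code) of one scan 
interval:

{code}
// Hypothetical sketch of a block-count limit per wake-up; not the actual code.
static int scanOneInterval(java.util.List<java.util.Iterator<Long>> blockItersPerNode,
                           int blocksPerInterval) {
  int processed = 0;
  for (java.util.Iterator<Long> blocks : blockItersPerNode) {
    while (blocks.hasNext() && processed < blocksPerInterval) {
      blocks.next();             // here: check if the block is sufficiently replicated
      processed++;               // count blocks, not nodes, so pause time stays bounded
    }
    if (processed >= blocksPerInterval) {
      break;                     // resume from this point at the next wake-up
    }
  }
  return processed;
}
{code}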

Because of this, I do not see any advantage to keeping this old code around. 
The old code is worse in terms of predictable pause times and overall 
decommissioning speed. It also has other flaws that are corrected by this 
patch. The new code is compatible with the old configuration. Splitting out the 
refactoring would also require a lot of work.

I still plan to commit tomorrow.

> Refactor and improve decommissioning logic into DecommissionManager
> ---
>
> Key: HDFS-7411
> URL: https://issues.apache.org/jira/browse/HDFS-7411
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.5.1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, 
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, 
> hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, 
> hdfs-7411.009.patch, hdfs-7411.010.patch
>
>
> Would be nice to split out decommission logic from DatanodeManager to 
> DecommissionManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Kai Zheng (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302086#comment-14302086
 ] 

Kai Zheng commented on HDFS-7285:
-

bq.I think we need to allow dynamic stripe cell size depends on the file size.
Good idea. Small stripe cell size for small files in one zone, and large stripe 
cell size for large files in another zone. For MR or data-locality-sensitive 
files, use a larger cell size. As we're going to support various striping and EC 
forms via configurable schemas and file system zones, different striping cell 
sizes should be possible, I guess. 
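
Purely as a toy illustration of such a policy (entirely hypothetical, not a 
proposed HDFS API): pick the cell size from the file size or the zone's schema.

{code}
// Toy illustration only; entirely hypothetical, not a proposed HDFS API.
static int chooseCellSizeBytes(long fileSizeBytes, boolean localitySensitive) {
  final int SMALL_CELL = 64 * 1024;      // small files: small cells limit padding waste
  final int LARGE_CELL = 1024 * 1024;    // large or locality-sensitive files: bigger cells
  if (localitySensitive || fileSizeBytes >= 1L << 30) {
    return LARGE_CELL;
  }
  return SMALL_CELL;
}
{code}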

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce the storage overhead without sacrificing 
> data reliability, compared to the existing HDFS 3-replica approach. For 
> example, if we use a 10+4 Reed Solomon coding, we can allow loss of 4 blocks, 
> with storage overhead only being 40%. This makes EC a quite attractive 
> alternative for big data storage, particularly for cold data. 
> Facebook had a related open source project called HDFS-RAID. It used to be 
> one of the contributed packages in HDFS but has been removed since Hadoop 2.0 
> for maintenance reasons. The drawbacks are: 1) it is on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that are intended not to be appended anymore; 3) the pure Java EC 
> coding implementation is extremely slow in practical use. Due to these, it 
> might not be a good idea to just bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies, makes it self-contained and 
> independently maintained. This design lays the EC feature on the storage type 
> support and considers compatible with existing HDFS features like caching, 
> snapshot, encryption, high availability and etc. This design will also 
> support different EC coding schemes, implementations and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. Intel 
> ISA-L library), an implementation can greatly improve the performance of EC 
> encoding/decoding and makes the EC solution even more attractive. We will 
> post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7726) Parse and check the configuration settings of edit log to prevent runtime errors

2015-02-02 Thread Tianyin Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianyin Xu updated HDFS-7726:
-
Status: Patch Available  (was: Open)

> Parse and check the configuration settings of edit log to prevent runtime 
> errors
> 
>
> Key: HDFS-7726
> URL: https://issues.apache.org/jira/browse/HDFS-7726
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Tianyin Xu
>Priority: Minor
> Attachments: check_config_val_EditLogTailer.patch.1
>
>
> 
> Problem
> -
> Similar to the following two issues addressed in 2.7.0:
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> The edit-log-related configuration settings should be checked in the 
> constructor rather than being applied directly at runtime; applying them 
> unchecked causes runtime failures if the values are wrong.
> Take "dfs.ha.tail-edits.period" as an example. Currently, in 
> EditLogTailer.java, its value is not checked but used directly in doWork(), 
> as in the following code snippet. Any negative value causes an 
> IllegalArgumentException (which is not caught) and impairs the component. 
> {code:title=EditLogTailer.java|borderStyle=solid}
> private void doWork() {
>   ...
>   Thread.sleep(sleepTimeMs);
>   ...
> }
> {code}
> Another example is "dfs.ha.log-roll.rpc.timeout". Right now, we use getInt() 
> to parse the value at runtime in the getActiveNodeProxy() function, which is 
> called by doWork(), as shown below. Any erroneous setting (e.g., an 
> ill-formatted integer) causes an exception.
> {code:title=EditLogTailer.java|borderStyle=solid}
> private NamenodeProtocol getActiveNodeProxy() throws IOException {
>   ...
>   int rpcTimeout = conf.getInt(
>       DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
>       DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
>   ...
> }
> {code}
> 
> Solution (the attached patch)
> -
> Basically, the idea of the attached patch is to move the parsing and checking 
> logic into the constructor to expose errors at initialization, so that they 
> do not lie latent until runtime (same as YARN-2165 and YARN-2166).
> I'm not familiar with the 2.7.0 implementation. It seems there are checking 
> utilities, such as the validatePositiveNonZero function in YARN-2165; if so, 
> we can use them to make the checking more systematic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7726) Parse and check the configuration settings of edit log to prevent runtime errors

2015-02-02 Thread Tianyin Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianyin Xu updated HDFS-7726:
-
Attachment: check_config_val_EditLogTailer.patch.1

> Parse and check the configuration settings of edit log to prevent runtime 
> errors
> 
>
> Key: HDFS-7726
> URL: https://issues.apache.org/jira/browse/HDFS-7726
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Tianyin Xu
>Priority: Minor
> Attachments: check_config_val_EditLogTailer.patch.1
>
>
> 
> Problem
> -
> Similar to the following two issues addressed in 2.7.0:
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> The edit-log-related configuration settings should be checked in the 
> constructor rather than being applied directly at runtime; applying them 
> unchecked causes runtime failures if the values are wrong.
> Take "dfs.ha.tail-edits.period" as an example. Currently, in 
> EditLogTailer.java, its value is not checked but used directly in doWork(), 
> as in the following code snippet. Any negative value causes an 
> IllegalArgumentException (which is not caught) and impairs the component. 
> {code:title=EditLogTailer.java|borderStyle=solid}
> private void doWork() {
>   ...
>   Thread.sleep(sleepTimeMs);
>   ...
> }
> {code}
> Another example is "dfs.ha.log-roll.rpc.timeout". Right now, we use getInt() 
> to parse the value at runtime in the getActiveNodeProxy() function, which is 
> called by doWork(), as shown below. Any erroneous setting (e.g., an 
> ill-formatted integer) causes an exception.
> {code:title=EditLogTailer.java|borderStyle=solid}
> private NamenodeProtocol getActiveNodeProxy() throws IOException {
>   ...
>   int rpcTimeout = conf.getInt(
>       DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
>       DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
>   ...
> }
> {code}
> 
> Solution (the attached patch)
> -
> Basically, the idea of the attached patch is to move the parsing and checking 
> logic into the constructor to expose errors at initialization, so that they 
> do not lie latent until runtime (same as YARN-2165 and YARN-2166).
> I'm not familiar with the 2.7.0 implementation. It seems there are checking 
> utilities, such as the validatePositiveNonZero function in YARN-2165; if so, 
> we can use them to make the checking more systematic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7726) Parse and check the configuration settings of edit log to prevent runtime errors

2015-02-02 Thread Tianyin Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianyin Xu updated HDFS-7726:
-
Status: Patch Available  (was: Open)

> Parse and check the configuration settings of edit log to prevent runtime 
> errors
> 
>
> Key: HDFS-7726
> URL: https://issues.apache.org/jira/browse/HDFS-7726
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Tianyin Xu
>Priority: Minor
>
> 
> Problem
> -
> Similar to the following two issues addressed in 2.7.0:
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> The edit-log-related configuration settings should be checked in the 
> constructor rather than being applied directly at runtime; applying them 
> unchecked causes runtime failures if the values are wrong.
> Take "dfs.ha.tail-edits.period" as an example. Currently, in 
> EditLogTailer.java, its value is not checked but used directly in doWork(), 
> as in the following code snippet. Any negative value causes an 
> IllegalArgumentException (which is not caught) and impairs the component. 
> {code:title=EditLogTailer.java|borderStyle=solid}
> private void doWork() {
>   ...
>   Thread.sleep(sleepTimeMs);
>   ...
> }
> {code}
> Another example is "dfs.ha.log-roll.rpc.timeout". Right now, we use getInt() 
> to parse the value at runtime in the getActiveNodeProxy() function, which is 
> called by doWork(), as shown below. Any erroneous setting (e.g., an 
> ill-formatted integer) causes an exception.
> {code:title=EditLogTailer.java|borderStyle=solid}
> private NamenodeProtocol getActiveNodeProxy() throws IOException {
>   ...
>   int rpcTimeout = conf.getInt(
>       DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
>       DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
>   ...
> }
> {code}
> 
> Solution (the attached patch)
> -
> Basically, the idea of the attached patch is to move the parsing and checking 
> logic into the constructor to expose errors at initialization, so that they 
> do not lie latent until runtime (same as YARN-2165 and YARN-2166).
> I'm not familiar with the 2.7.0 implementation. It seems there are checking 
> utilities, such as the validatePositiveNonZero function in YARN-2165; if so, 
> we can use them to make the checking more systematic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7726) Parse and check the configuration settings of edit log to prevent runtime errors

2015-02-02 Thread Tianyin Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianyin Xu updated HDFS-7726:
-
Status: Open  (was: Patch Available)

> Parse and check the configuration settings of edit log to prevent runtime 
> errors
> 
>
> Key: HDFS-7726
> URL: https://issues.apache.org/jira/browse/HDFS-7726
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Tianyin Xu
>Priority: Minor
>
> 
> Problem
> -
> Similar to the following two issues addressed in 2.7.0:
> https://issues.apache.org/jira/browse/YARN-2165
> https://issues.apache.org/jira/browse/YARN-2166
> The edit-log-related configuration settings should be checked in the 
> constructor rather than being applied directly at runtime; applying them 
> unchecked causes runtime failures if the values are wrong.
> Take "dfs.ha.tail-edits.period" as an example. Currently, in 
> EditLogTailer.java, its value is not checked but used directly in doWork(), 
> as in the following code snippet. Any negative value causes an 
> IllegalArgumentException (which is not caught) and impairs the component. 
> {code:title=EditLogTailer.java|borderStyle=solid}
> private void doWork() {
>   ...
>   Thread.sleep(sleepTimeMs);
>   ...
> }
> {code}
> Another example is "dfs.ha.log-roll.rpc.timeout". Right now, we use getInt() 
> to parse the value at runtime in the getActiveNodeProxy() function, which is 
> called by doWork(), as shown below. Any erroneous setting (e.g., an 
> ill-formatted integer) causes an exception.
> {code:title=EditLogTailer.java|borderStyle=solid}
> private NamenodeProtocol getActiveNodeProxy() throws IOException {
>   ...
>   int rpcTimeout = conf.getInt(
>       DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
>       DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
>   ...
> }
> {code}
> 
> Solution (the attached patch)
> -
> Basically, the idea of the attached patch is to move the parsing and checking 
> logic into the constructor to expose errors at initialization, so that they 
> do not lie latent until runtime (same as YARN-2165 and YARN-2166).
> I'm not familiar with the 2.7.0 implementation. It seems there are checking 
> utilities, such as the validatePositiveNonZero function in YARN-2165; if so, 
> we can use them to make the checking more systematic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7714) Simultaneous restart of HA NameNodes and DataNode can cause DataNode to register successfully with only one NameNode.

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302077#comment-14302077
 ] 

Hadoop QA commented on HDFS-7714:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12695806/HDFS-7714-001.patch
  against trunk revision 3472e3b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 
release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations
  org.apache.hadoop.hdfs.TestDFSRollback
  org.apache.hadoop.hdfs.TestDFSUpgrade
  org.apache.hadoop.hdfs.server.datanode.TestBPOfferService
  org.apache.hadoop.hdfs.TestRollingUpgrade
  org.apache.hadoop.net.TestNetworkTopology
  org.apache.hadoop.hdfs.TestDFSStartupVersions

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9394//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9394//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9394//console

This message is automatically generated.

> Simultaneous restart of HA NameNodes and DataNode can cause DataNode to 
> register successfully with only one NameNode.
> -
>
> Key: HDFS-7714
> URL: https://issues.apache.org/jira/browse/HDFS-7714
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0
>Reporter: Chris Nauroth
>Assignee: Vinayakumar B
> Attachments: HDFS-7714-001.patch
>
>
> In an HA deployment, DataNodes must register with both NameNodes and send 
> periodic heartbeats and block reports to both.  However, if NameNodes and 
> DataNodes are restarted simultaneously, then this can trigger a race 
> condition in registration.  The end result is that the {{BPServiceActor}} for 
> one NameNode terminates, but the {{BPServiceActor}} for the other NameNode 
> remains alive.  The DataNode process is then in a "half-alive" state where it 
> only heartbeats and sends block reports to one of the NameNodes.  This could 
> cause a loss of storage capacity after an HA failover.  The DataNode process 
> would have to be restarted to resolve this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7726) Parse and check the configuration settings of edit log to prevent runtime errors

2015-02-02 Thread Tianyin Xu (JIRA)
Tianyin Xu created HDFS-7726:


 Summary: Parse and check the configuration settings of edit log to 
prevent runtime errors
 Key: HDFS-7726
 URL: https://issues.apache.org/jira/browse/HDFS-7726
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Tianyin Xu
Priority: Minor



Problem
-

Similar to the following two issues addressed in 2.7.0:
https://issues.apache.org/jira/browse/YARN-2165
https://issues.apache.org/jira/browse/YARN-2166

The edit-log-related configuration settings should be checked in the 
constructor rather than being applied directly at runtime; applying them 
unchecked causes runtime failures if the values are wrong.

Take "dfs.ha.tail-edits.period" as an example. Currently, in EditLogTailer.java, 
its value is not checked but used directly in doWork(), as in the following code 
snippet. Any negative value causes an IllegalArgumentException (which is not 
caught) and impairs the component. 

{code:title=EditLogTailer.java|borderStyle=solid}
private void doWork() {
  ...
  Thread.sleep(sleepTimeMs);
  ...
}
{code}

Another example is "dfs.ha.log-roll.rpc.timeout". Right now, we use getInt() to 
parse the value at runtime in the getActiveNodeProxy() function, which is called 
by doWork(), as shown below. Any erroneous setting (e.g., an ill-formatted 
integer) causes an exception.

{code:title=EditLogTailer.java|borderStyle=solid}
private NamenodeProtocol getActiveNodeProxy() throws IOException {
  ...
  int rpcTimeout = conf.getInt(
      DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_KEY,
      DFSConfigKeys.DFS_HA_LOGROLL_RPC_TIMEOUT_DEFAULT);
  ...
}
{code}


Solution (the attached patch)
-

Basically, the idea of the attached patch is to move the parsing and checking 
logic into the constructor to expose errors at initialization, so that they do 
not lie latent until runtime (same as YARN-2165 and YARN-2166).

I'm not familiar with the 2.7.0 implementation. It seems there are checking 
utilities, such as the validatePositiveNonZero function in YARN-2165; if so, we 
can use them to make the checking more systematic.
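
As a rough sketch of the idea (this is not the attached patch; the class name, 
literal default values, and the use of Guava Preconditions are assumptions made 
for illustration), the checks could fail fast in the constructor instead of 
surfacing later inside doWork():

{code}
import org.apache.hadoop.conf.Configuration;

import com.google.common.base.Preconditions;

// Illustrative sketch only -- not the attached patch. It shows parsing and
// validating the edit-log settings once, at construction time.
class EditLogTailerConfigSketch {
  private final long sleepTimeMs;
  private final int rpcTimeout;

  EditLogTailerConfigSketch(Configuration conf) {
    // A negative period now fails here, not as an uncaught exception in doWork().
    sleepTimeMs = conf.getLong("dfs.ha.tail-edits.period", 60) * 1000;
    Preconditions.checkArgument(sleepTimeMs > 0,
        "dfs.ha.tail-edits.period must be positive, but got %s ms", sleepTimeMs);

    // getInt() throws NumberFormatException here, at initialization, rather
    // than later inside getActiveNodeProxy().
    rpcTimeout = conf.getInt("dfs.ha.log-roll.rpc.timeout", 20000);
    Preconditions.checkArgument(rpcTimeout > 0,
        "dfs.ha.log-roll.rpc.timeout must be positive, but got %s", rpcTimeout);
  }
}
{code}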





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7704) DN heartbeat to Active NN may be blocked and expire if connection to Standby NN continues to time out.

2015-02-02 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302067#comment-14302067
 ] 

Kihwal Lee commented on HDFS-7704:
--

Adding to what Charles said: if {{BPServiceActorAction}} is specifically for 
reporting to the namenode, you could add something like a 
{{reportTo(DatanodeProtocolClientSideTranslatorPB bpNamenode)}} method and have 
the implementation in each subclass do its own thing. Then, except at the point 
where one is created, the rest of the code won't have to know the specific type 
of the instance.  It will, however, make batching of bad block reporting 
difficult.

In the current patch, the way {{bpThreadQueue}} is synchronized will block 
{{bpThreadEnqueue()}} if an RPC call blocks.  Instead, you could create a new 
collection containing the contents of the queue inside a synchronized block, and 
then call the report() method of each entry outside the synchronized block.
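
As a minimal standalone sketch of that pattern (the {{Action}} interface and the 
method names below are simplified stand-ins for the discussion above, not the 
actual patch):

{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only. The point: snapshot the queue while holding the
// lock, then run the slow report RPCs outside the synchronized block so that
// enqueue() is never blocked by a stuck namenode.
class ActionQueueSketch {
  interface Action {
    void reportTo();  // stand-in for reportTo(DatanodeProtocolClientSideTranslatorPB)
  }

  private final List<Action> bpThreadQueue = new ArrayList<>();

  void enqueue(Action action) {
    synchronized (bpThreadQueue) {
      bpThreadQueue.add(action);                // cheap, never waits on an RPC
    }
  }

  void processQueue() {
    List<Action> pending;
    synchronized (bpThreadQueue) {
      pending = new ArrayList<>(bpThreadQueue); // copy under the lock
      bpThreadQueue.clear();
    }
    for (Action action : pending) {
      action.reportTo();                        // slow RPC, outside the lock
    }
  }
}
{code}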

If you want to make batching of bad block reporting work, it may not be worth 
trying to introduce the unified {{BPServiceActorAction}} concept.  For bad 
blocks, "{{enqueue()}}" or "{{add()}}" can put things directly into an ArrayList, 
which simplifies the aggregation at reporting time.  If you believe the 
{{BPServiceActorAction}}-based abstraction will provide more value in the 
future, giving up on batched bad block reporting is okay. After all, the 
datanode is not doing it today.

> DN heartbeat to Active NN may be blocked and expire if connection to Standby 
> NN continues to time out. 
> ---
>
> Key: HDFS-7704
> URL: https://issues.apache.org/jira/browse/HDFS-7704
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.5.0
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
> Attachments: HDFS-7704-v2.patch, HDFS-7704.patch
>
>
> There are couple of synchronous calls in BPOfferservice (i.e reportBadBlocks 
> and trySendErrorReport) which will wait for both of the actor threads to 
> process this calls.
> This calls are made with writeLock acquired.
> When reportBadBlocks() is blocked at the RPC layer due to unreachable NN, 
> subsequent heartbeat response processing has to wait for the write lock. It 
> eventually gets through, but takes too long and it blocks the next heartbeat.
> In our HA cluster setup, the standby namenode was taking a long time to 
> process the request.
> Requesting improvement in datanode to make the above calls asynchronous since 
> these reports don't have any specific
> deadlines, so extra few seconds of delay should be acceptable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7285) Erasure Coding Support inside HDFS

2015-02-02 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302065#comment-14302065
 ] 

Zhe Zhang commented on HDFS-7285:
-

bq. If we only use a small fixed value, for example 64KB, as the stripe cell 
size, then for a large file we need many more EC block groups to store the 
entire file than the number of blocks we would need with replication.
The number of block groups is actually unrelated to the cell size (e.g. 64KB). 
For example, under a 6+3 schema, any file smaller than 9 blocks will have 1 
block group.

A smaller cell size handles small files better, but data locality is degraded; 
for example, it might be hard to fit MapReduce records into 64KB cells.
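
A back-of-the-envelope illustration of why the group count depends on the 
schema and block size rather than the cell size (the 128 MB block size, the 
counting in data blocks per group, and the class name are assumptions made for 
this example only):

{code}
// Illustrative arithmetic only -- not HDFS code.
public class BlockGroupCount {
  // One block group holds up to dataBlocksPerGroup full data blocks (plus its
  // parity blocks), regardless of how small the stripe cell size is.
  static long blockGroups(long fileSize, long blockSize, int dataBlocksPerGroup) {
    long groupCapacity = blockSize * dataBlocksPerGroup;
    return Math.max(1, (fileSize + groupCapacity - 1) / groupCapacity); // ceiling
  }

  public static void main(String[] args) {
    long blockSize = 128L << 20;  // assume 128 MB blocks
    // Under a 6+3 schema (6 data + 3 parity blocks per group):
    System.out.println(blockGroups(5 * blockSize, blockSize, 6));   // 1 group
    System.out.println(blockGroups(100 * blockSize, blockSize, 6)); // 17 groups
  }
}
{code}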

> Erasure Coding Support inside HDFS
> --
>
> Key: HDFS-7285
> URL: https://issues.apache.org/jira/browse/HDFS-7285
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Weihua Jiang
>Assignee: Zhe Zhang
> Attachments: ECAnalyzer.py, ECParser.py, 
> HDFSErasureCodingDesign-20141028.pdf, HDFSErasureCodingDesign-20141217.pdf, 
> fsimage-analysis-20150105.pdf
>
>
> Erasure Coding (EC) can greatly reduce storage overhead without sacrificing 
> data reliability, compared to the existing HDFS 3-replica approach. For 
> example, with a 10+4 Reed-Solomon code we can tolerate the loss of 4 blocks 
> with only 40% storage overhead. This makes EC a very attractive alternative 
> for big data storage, particularly for cold data. 
> Facebook had a related open-source project called HDFS-RAID. It used to be 
> one of the contributed packages in HDFS but was removed as of Hadoop 2.0 for 
> maintenance reasons. Its drawbacks are: 1) it sits on top of HDFS and depends 
> on MapReduce to do encoding and decoding tasks; 2) it can only be used for 
> cold files that will no longer be appended to; 3) its pure-Java EC coding 
> implementation is extremely slow in practical use. For these reasons, it 
> might not be a good idea to simply bring HDFS-RAID back.
> We (Intel and Cloudera) are working on a design to build EC into HDFS that 
> gets rid of any external dependencies and makes it self-contained and 
> independently maintained. This design lays the EC feature on top of the 
> storage-type support and aims to be compatible with existing HDFS features 
> such as caching, snapshots, encryption, and high availability. It will also 
> support different EC coding schemes, implementations, and policies for 
> different deployment scenarios. By utilizing advanced libraries (e.g. the 
> Intel ISA-L library), an implementation can greatly improve the performance 
> of EC encoding/decoding and make the EC solution even more attractive. We 
> will post the design document soon. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7270) Add congestion signaling capability to DataNode write protocol

2015-02-02 Thread Arpit Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302053#comment-14302053
 ] 

Arpit Agarwal commented on HDFS-7270:
-

+1 from me on the v002 patch, except I don't think we need the nonce. Network 
routers use a nonce since they can't trust either endpoint and cannot maintain 
state per-stream.

> Add congestion signaling capability to DataNode write protocol
> --
>
> Key: HDFS-7270
> URL: https://issues.apache.org/jira/browse/HDFS-7270
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haohui Mai
>Assignee: Haohui Mai
> Attachments: HDFS-7270.000.patch, HDFS-7270.001.patch, 
> HDFS-7270.002.patch
>
>
> When a client writes to HDFS faster than the disk bandwidth of the DNs, it 
> saturates the disk bandwidth and renders the DNs unresponsive. The client only 
> backs off by aborting / recovering the pipeline, which leads to failed writes 
> and unnecessary pipeline recovery.
> This jira proposes to add explicit congestion control mechanisms in the 
> writing pipeline. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-5796) The file system browser in the namenode UI requires SPNEGO.

2015-02-02 Thread Arun Suresh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302047#comment-14302047
 ] 

Arun Suresh commented on HDFS-5796:
---

ping [~wheat9] .. does the above suggestion work ?

> The file system browser in the namenode UI requires SPNEGO.
> ---
>
> Key: HDFS-5796
> URL: https://issues.apache.org/jira/browse/HDFS-5796
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.5.0
>Reporter: Kihwal Lee
>Assignee: Arun Suresh
> Attachments: HDFS-5796.1.patch, HDFS-5796.1.patch, HDFS-5796.2.patch, 
> HDFS-5796.3.patch, HDFS-5796.3.patch
>
>
> After HDFS-5382, the browser makes webhdfs REST calls directly, requiring 
> SPNEGO to work between the user's browser and the namenode.  This won't work 
> if the 
> cluster's security infrastructure is isolated from the regular network.  
> Moreover, SPNEGO is not supposed to be required for user-facing web pages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7718) DFSClient objects created by AbstractFileSystem objects created by FileContext are not closed and results in thread leakage

2015-02-02 Thread Arun Suresh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Suresh updated HDFS-7718:
--
Attachment: HDFS-7718.2.patch

Thanks for the review, [~cmccabe].

Updating the patch with your suggestions:

bq. Should we be comparing the URIs as strings instead? I'm not that familiar 
with URI#equals, what features of it make it the right thing here?
Hmmm.. We need to verify that the String URI is actually a proper URI, for which 
I create a URI object in any case, which is why I decided to use it as the key 
itself.
I checked the {{URI#equals()}} method; it looks like it does a good job of 
comparing URIs. I've added a few tests to verify it as well. 
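
For reference, a tiny standalone illustration of what keying a cache on 
{{java.net.URI}} buys (and does not buy); the map, keys, and values here are 
hypothetical and only demonstrate {{URI#equals}}/{{hashCode}} behavior:

{code}
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: shows how URI#equals/hashCode behave as a map key.
public class UriKeyDemo {
  public static void main(String[] args) {
    Map<URI, String> cache = new HashMap<>();
    cache.put(URI.create("hdfs://NameNode:8020/"), "client-1");

    // Scheme and host are compared case-insensitively, so this lookup hits:
    System.out.println(cache.get(URI.create("HDFS://namenode:8020/"))); // client-1

    // But URI#equals does no other normalization; the missing slash misses:
    System.out.println(cache.get(URI.create("hdfs://namenode:8020")));  // null
  }
}
{code}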

> DFSClient objects created by AbstractFileSystem objects created by 
> FileContext are not closed and results in thread leakage
> ---
>
> Key: HDFS-7718
> URL: https://issues.apache.org/jira/browse/HDFS-7718
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: HDFS-7718.1.patch, HDFS-7718.2.patch
>
>
> Currently, the {{FileContext}} class used by clients (such as 
> {{YARNRunner}}) creates a new {{AbstractFileSystem}} object on 
> initialization, which creates a new {{DFSClient}} object, which in turn 
> creates a KeyProvider object. If encryption is turned on, and https is 
> turned on, the KeyProvider implementation (the {{KMSClientProvider}}) will 
> create a {{ReloadingX509TrustManager}} thread per instance; these threads are 
> never killed and can lead to a thread leak.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302037#comment-14302037
 ] 

Hadoop QA commented on HDFS-7725:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12695976/HDFS-7725.patch
  against trunk revision f33c99b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.balancer.TestBalancer

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9396//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9396//console

This message is automatically generated.

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7725.patch
>
>
> One of our clusters sometimes couldn't allocate blocks from any DNs. 
> BlockPlacementPolicyDefault complains with the following messages for all DNs.
> {noformat}
> the node is too busy (load:x > y)
> {noformat}
> It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
> incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
> * Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
> intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of 
> events without HDFS-7374.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == -1
> * However, HDFS-7374 introduces another inconsistency when recomm is involved.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == 0
> ** Recomm the node. nodesInService == 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7712) Switch blockStateChangeLog to use slf4j

2015-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302036#comment-14302036
 ] 

Hadoop QA commented on HDFS-7712:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12695978/hdfs-7712.003.patch
  against trunk revision f33c99b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 16 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes
  org.apache.hadoop.fs.TestSymlinkHdfsFileContext
  org.apache.hadoop.fs.TestSymlinkHdfsFileSystem
  
org.apache.hadoop.hdfs.server.namenode.TestNNThroughputBenchmark
  org.apache.hadoop.hdfs.TestPipelines

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9395//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9395//console

This message is automatically generated.

> Switch blockStateChangeLog to use slf4j
> ---
>
> Key: HDFS-7712
> URL: https://issues.apache.org/jira/browse/HDFS-7712
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Andrew Wang
>Assignee: Andrew Wang
>Priority: Minor
> Attachments: hdfs-7712.001.patch, hdfs-7712.002.patch, 
> hdfs-7712.003.patch
>
>
> As pointed out in HDFS-7706, updating blockStateChangeLog to use slf4j will 
> save a lot of string construction costs.
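
For illustration (a generic slf4j example, not code from the patch; the logger 
and message are made up), the saving comes from parameterized logging deferring 
message construction until the level check passes:

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only: why slf4j parameterized logging avoids string construction.
public class Slf4jCostDemo {
  private static final Logger LOG = LoggerFactory.getLogger(Slf4jCostDemo.class);

  public static void main(String[] args) {
    String block = "blk_1073741825", node = "dn-1:50010";

    // Concatenation builds the message even when DEBUG is off, unless guarded:
    if (LOG.isDebugEnabled()) {
      LOG.debug("Adding " + block + " to " + node);
    }

    // Parameterized logging defers formatting until the level is enabled,
    // so no guard (and no string building) is needed on the common path:
    LOG.debug("Adding {} to {}", block, node);
  }
}
{code}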



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7411) Refactor and improve decommissioning logic into DecommissionManager

2015-02-02 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302030#comment-14302030
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7411:
---

I would also like to get the work here committed faster.  So, I suggest:
# Create another JIRA for a pure code refactoring.  Move all the existing code 
to DecommissionManager.  No logic change.
# Change the patch here to add the new decommission code but NOT remove the 
existing code, so that the old code is used if the old conf is set and the new 
code if the new conf is set.

Sound good?
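
If the second step is taken, the gating could be as simple as a single boolean 
conf key; a minimal sketch, assuming a hypothetical key and method names (none 
of these are from the patch):

{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch only; the conf key and method names are hypothetical.
class DecommGateSketch {
  void start(Configuration conf) {
    boolean useNewDecomm =
        conf.getBoolean("dfs.namenode.decommission.monitor.enabled", false);
    if (useNewDecomm) {
      startNewMonitor();    // new DecommissionManager-based code path
    } else {
      startLegacyChecks();  // existing code path, left untouched
    }
  }

  private void startNewMonitor()   { /* placeholder */ }
  private void startLegacyChecks() { /* placeholder */ }
}
{code}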

> Refactor and improve decommissioning logic into DecommissionManager
> ---
>
> Key: HDFS-7411
> URL: https://issues.apache.org/jira/browse/HDFS-7411
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.5.1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: hdfs-7411.001.patch, hdfs-7411.002.patch, 
> hdfs-7411.003.patch, hdfs-7411.004.patch, hdfs-7411.005.patch, 
> hdfs-7411.006.patch, hdfs-7411.007.patch, hdfs-7411.008.patch, 
> hdfs-7411.009.patch, hdfs-7411.010.patch
>
>
> Would be nice to split out decommission logic from DatanodeManager to 
> DecommissionManager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-7604) Track and display failed DataNode storage locations in NameNode.

2015-02-02 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302025#comment-14302025
 ] 

Chris Nauroth edited comment on HDFS-7604 at 2/2/15 10:05 PM:
--

I've done another mock-up of the UI.  This version avoids adding clutter to the 
existing Datanodes page and instead moves failure information to its own 
dedicated page.

Just like in the existing screenshot 3, there is a new field on the summary for 
Total Failed Volumes.  I also intend to display lost capacity in parentheses 
next to it.  However, unlike last time, the existing Datanodes page is 
unchanged.  Instead, the volume failure information is on a new Datanode 
Volumes page, shown in new screenshot 4.  This is hyperlinked from both the 
Total Failed Volumes field in the summary and a new tab in the top nav.

The new page has a table displaying only the DataNodes that have volume 
failures.  For each one, it displays the address, seconds since last contact, 
time of last volume failure, number of failed volumes, estimated capacity lost 
due to these volume failures, and a list of every failed storage location's 
path.  I say that the capacity lost is an estimate, because there are going to 
be some edge cases that could prevent us from displaying accurate information 
here.  For example, if a volume has an I/O error before we get a chance to 
check its capacity, then it's unknown how much storage is available on that 
volume.

The end user workflow I imagine for this is that an admin first checks the 
summary information and notices a non-zero count for failed volumes.  Then, the 
admin navigates to the Datanode Volumes page to get a list of volume failures 
across the cluster.  This view lists only the DataNodes with volume failures, 
so the admin won't need to scan through the master list looking for individual 
nodes with a non-zero volume failure count.  This can act as a sort of work 
queue for the admin recovering or replacing disks.

I have not updated the patch.  I need to rework the heartbeat information to 
provide this data for the UI.  Meanwhile, Last Failure Time and Estimated 
Capacity Lost are displayed as TODO in the screenshot.  Further feedback is 
welcome while I continue coding a new patch.


was (Author: cnauroth):
I've done another mock-up of the UI.  This version avoids adding clutter to the 
existing Datanodes page and instead moves failure information to its own 
dedicated page.

Just like in the existing screenshot 3, there is a new field on the summary for 
Total Failed Volumes.  I also intend to display lost capacity in parentheses 
next to it.  However, unlike last time, the existing Datanodes page is 
unchanged.  Instead, the volume failure information is on a new Datanode 
Volumes page.  This is hyperlinked from both the Total Failed Volumes field in 
the summary and a new tab in the top nav.

The new page has a table displaying only the DataNodes that have volume 
failures.  For each one, it displays the address, seconds since last contact, 
time of last volume failure, number of failed volumes, estimated capacity lost 
due to these volume failures, and a list of every failed storage location's 
path.  I say that the capacity lost is an estimate, because there are going to 
be some edge cases that could prevent us from displaying accurate information 
here.  For example, if a volume has an I/O error before we get a chance to 
check its capacity, then it's unknown how much storage is available on that 
volume.

The end user workflow I imagine for this is that an admin first checks the 
summary information and notices a non-zero count for failed volumes.  Then, the 
admin navigates to the Datanode Volumes page to get a list of volume failures 
across the cluster.  This view lists only the DataNodes with volume failures, 
so the admin won't need to scan through the master list looking for individual 
nodes with a non-zero volume failure count.  This can act as a sort of work 
queue for the admin recovering or replacing disks.

I have not updated the patch.  I need to rework the heartbeat information to 
provide this data for the UI.  Meanwhile, Last Failure Time and Estimated 
Capacity Lost are displayed as TODO in the screenshot.  Further feedback is 
welcome while I continue coding a new patch.

> Track and display failed DataNode storage locations in NameNode.
> 
>
> Key: HDFS-7604
> URL: https://issues.apache.org/jira/browse/HDFS-7604
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png, 
> HDFS-7604-screenshot-3.png, HDFS-7604-screenshot-4.png, HDFS-7604.001.patch, 
>

[jira] [Updated] (HDFS-7604) Track and display failed DataNode storage locations in NameNode.

2015-02-02 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7604:

Attachment: HDFS-7604-screenshot-4.png

I've done another mock-up of the UI.  This version avoids adding clutter to the 
existing Datanodes page and instead moves failure information to its own 
dedicated page.

Just like in the existing screenshot 3, there is a new field on the summary for 
Total Failed Volumes.  I also intend to display lost capacity in parentheses 
next to it.  However, unlike last time, the existing Datanodes page is 
unchanged.  Instead, the volume failure information is on a new Datanode 
Volumes page.  This is hyperlinked from both the Total Failed Volumes field in 
the summary and a new tab in the top nav.

The new page has a table displaying only the DataNodes that have volume 
failures.  For each one, it displays the address, seconds since last contact, 
time of last volume failure, number of failed volumes, estimated capacity lost 
due to these volume failures, and a list of every failed storage location's 
path.  I say that the capacity lost is an estimate, because there are going to 
be some edge cases that could prevent us from displaying accurate information 
here.  For example, if a volume has an I/O error before we get a chance to 
check its capacity, then it's unknown how much storage is available on that 
volume.

The end user workflow I imagine for this is that an admin first checks the 
summary information and notices a non-zero count for failed volumes.  Then, the 
admin navigates to the Datanode Volumes page to get a list of volume failures 
across the cluster.  This view lists only the DataNodes with volume failures, 
so the admin won't need to scan through the master list looking for individual 
nodes with a non-zero volume failure count.  This can act as a sort of work 
queue for the admin recovering or replacing disks.

I have not updated the patch.  I need to rework the heartbeat information to 
provide this data for the UI.  Meanwhile, Last Failure Time and Estimated 
Capacity Lost are displayed as TODO in the screenshot.  Further feedback is 
welcome while I continue coding a new patch.

> Track and display failed DataNode storage locations in NameNode.
> 
>
> Key: HDFS-7604
> URL: https://issues.apache.org/jira/browse/HDFS-7604
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png, 
> HDFS-7604-screenshot-3.png, HDFS-7604-screenshot-4.png, HDFS-7604.001.patch, 
> HDFS-7604.prototype.patch
>
>
> During heartbeats, the DataNode can report a list of its storage locations 
> that have been taken out of service due to failure (such as due to a bad disk 
> or a permissions problem).  The NameNode can track these failed storage 
> locations and then report them in JMX and the NameNode web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

