[jira] [Commented] (HDFS-7270) Implementing congestion control in writing pipeline

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212009#comment-14212009
 ] 

Hadoop QA commented on HDFS-7270:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681462/HDFS-7270.000.patch
  against trunk revision d005404.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestCrcCorruption

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestClientProtocolForPipelineRecovery

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8737//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8737//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8737//console

This message is automatically generated.

 Implementing congestion control in writing pipeline
 ---

 Key: HDFS-7270
 URL: https://issues.apache.org/jira/browse/HDFS-7270
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-7270.000.patch


 When a client writes to HDFS faster than the disk bandwidth of the DNs, it 
 saturates the disk bandwidth and leaves the DNs unresponsive. The client only 
 backs off by aborting / recovering the pipeline, which leads to failed writes 
 and unnecessary pipeline recovery.
 This jira proposes adding explicit congestion control mechanisms to the 
 writing pipeline.
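 A minimal sketch of one possible client-side approach is shown below: an AIMD-style 
 send window driven by a congestion flag piggybacked on packet acks. The class and 
 field names are illustrative assumptions and are not taken from HDFS-7270.000.patch.
 {code}
// Illustrative only -- not from HDFS-7270.000.patch. A busy DN could set a
// "congested" flag on packet acks; the client shrinks or grows its send
// window accordingly instead of saturating the pipeline.
public class WritePacer {
  private static final double MIN_WINDOW = 1;
  private static final double MAX_WINDOW = 64;
  private double windowPackets = 8;              // current send window, in packets

  /** Called for every packet ack. */
  public synchronized void onAck(boolean congested) {
    if (congested) {
      windowPackets = Math.max(MIN_WINDOW, windowPackets / 2);  // multiplicative decrease
    } else {
      windowPackets = Math.min(MAX_WINDOW, windowPackets + 1);  // additive increase
    }
  }

  /** Whether the client may send another packet given the ones in flight. */
  public synchronized boolean maySend(int packetsInFlight) {
    return packetsInFlight < windowPackets;
  }
}
 {code}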



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212025#comment-14212025
 ] 

Hadoop QA commented on HDFS-7374:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681495/HDFS-7374-002.patch
  against trunk revision d005404.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 3 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestDecommission

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8738//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8738//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8738//console

This message is automatically generated.

 Allow decommissioning of dead DataNodes
 ---

 Key: HDFS-7374
 URL: https://issues.apache.org/jira/browse/HDFS-7374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch


 We have seen the use case of decommissioning DataNodes that are already dead 
 or unresponsive and are not expected to rejoin the cluster.
 The logic introduced by HDFS-6791 will mark those nodes as 
 {{DECOMMISSION_INPROGRESS}}, in the hope that they can come back and finish 
 the decommission work. If an upper-layer application is monitoring the 
 decommissioning progress, it will hang forever.
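 Sketched below, as a self-contained illustration with assumed names (not the actual 
 Hadoop types and not HDFS-7374-002.patch), is the intended behaviour: a node that is 
 already dead is moved straight to DECOMMISSIONED rather than being parked in 
 DECOMMISSION_INPROGRESS.
 {code}
// Hypothetical, self-contained sketch; the enum, class and method names are
// illustrative and are not the Hadoop API.
enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

class Node {
  boolean alive;
  AdminState state = AdminState.NORMAL;
}

class DecommissionManager {
  /** Dead nodes are finished immediately so monitors do not wait forever. */
  void startDecommission(Node node) {
    if (!node.alive) {
      node.state = AdminState.DECOMMISSIONED;          // the node cannot finish the work itself
    } else {
      node.state = AdminState.DECOMMISSION_INPROGRESS; // normal path: wait for replication
    }
  }
}
 {code}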



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7056) Snapshot support for truncate

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212052#comment-14212052
 ] 

Hadoop QA commented on HDFS-7056:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12681500/HDFS-3107-HDFS-7056-combined.patch
  against trunk revision d005404.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 10 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8739//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8739//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8739//console

This message is automatically generated.

 Snapshot support for truncate
 -

 Key: HDFS-7056
 URL: https://issues.apache.org/jira/browse/HDFS-7056
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: 3.0.0
Reporter: Konstantin Shvachko
Assignee: Plamen Jeliazkov
 Attachments: HDFS-3107-HDFS-7056-combined.patch, 
 HDFS-3107-HDFS-7056-combined.patch, HDFS-3107-HDFS-7056-combined.patch, 
 HDFS-3107-HDFS-7056-combined.patch, HDFS-3107-HDFS-7056-combined.patch, 
 HDFS-7056.patch, HDFS-7056.patch, HDFS-7056.patch, HDFS-7056.patch, 
 HDFS-7056.patch, HDFS-7056.patch, HDFSSnapshotWithTruncateDesign.docx


 The implementation of truncate in HDFS-3107 does not allow truncating files 
 that are in a snapshot. It is desirable to be able to truncate a file and 
 still keep its old state in the snapshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7398) Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212080#comment-14212080
 ] 

Hadoop QA commented on HDFS-7398:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681503/HDFS-7398.v01.patch
  against trunk revision d005404.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8740//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8740//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8740//console

This message is automatically generated.

 Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit
 --

 Key: HDFS-7398
 URL: https://issues.apache.org/jira/browse/HDFS-7398
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.6.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: HDFS-7398.v01.patch


 This is a follow-up on HDFS-7385.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7392) org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever

2014-11-14 Thread Frantisek Vacek (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frantisek Vacek updated HDFS-7392:
--
Description: 
In some specific circumstances, 
org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out 
and lasts forever.

The specific circumstances are:
1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) should point 
to a valid IP address, but with no name node service running on it.
2) There should be at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.example.com
Server: 127.0.1.1
Address:127.0.1.1#53

share.example.com canonical name = 
internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.223
Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() 
sometimes returns true (even if the address didn't actually change, see img. 1) 
and the timeoutFailures counter is reset to 0 (see img. 2). The 
maxRetriesOnSocketTimeouts limit (45) is never reached and the connection 
attempt is repeated forever.

  was:
In some specific circumstances, 
org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out 
and lasts forever.

The specific circumstances are:
1) The HDFS URI (hdfs://share.merck.com:8020/someDir/someFile.txt) should point 
to a valid IP address, but with no name node service running on it.
2) There should be at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.merck.com
Server: 127.0.1.1
Address:127.0.1.1#53

share.merck.com canonical name = 
internal-gicprg-share-merck-com-1538706884.us-east-1.elb.amazonaws.com.
Name:   internal-gicprg-share-merck-com-1538706884.us-east-1.elb.amazonaws.com
Address: 54.40.29.223
Name:   internal-gicprg-share-merck-com-1538706884.us-east-1.elb.amazonaws.com
Address: 54.40.29.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() 
sometimes returns true (even if the address didn't actually change, see img. 1) 
and the timeoutFailures counter is reset to 0 (see img. 2). The 
maxRetriesOnSocketTimeouts limit (45) is never reached and the connection 
attempt is repeated forever.


 org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
 -

 Key: HDFS-7392
 URL: https://issues.apache.org/jira/browse/HDFS-7392
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Frantisek Vacek
Priority: Critical
 Attachments: 1.png, 2.png


 In some specific circumstances, 
 org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out 
 and lasts forever.
 The specific circumstances are:
 1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) should point 
 to a valid IP address, but with no name node service running on it.
 2) There should be at least 2 IP addresses for such a URI. See the output below:
 {quote}
 [~/proj/quickbox]$ nslookup share.example.com
 Server: 127.0.1.1
 Address:127.0.1.1#53
 share.example.com canonical name = 
 internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
 Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
 Address: 54.40.29.223
 Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
 Address: 54.40.29.65
 {quote}
 In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() 
 sometimes returns true (even if the address didn't actually change, see img. 1) 
 and the timeoutFailures counter is reset to 0 (see img. 2). The 
 maxRetriesOnSocketTimeouts limit (45) is never reached and the connection 
 attempt is repeated forever.
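 A self-contained model of the retry logic described above is sketched below; the 
 names mirror those in org.apache.hadoop.ipc.Client.Connection, but the code is an 
 illustration of the reported behaviour, not the actual Hadoop implementation.
 {code}
import java.io.IOException;

// Illustrative model only. With a round-robin DNS entry, re-resolving the
// host can look like an address change, which resets the failure counter and
// keeps the retry limit from ever being reached.
class RetryModel {
  private final int maxRetriesOnSocketTimeouts = 45;
  private int timeoutFailures = 0;

  /** Stand-in for Connection.updateAddress(): re-resolves the server name. */
  private boolean updateAddress() {
    return true;   // spurious "address changed" result described in the report
  }

  void handleSocketTimeout() throws IOException {
    if (updateAddress()) {
      timeoutFailures = 0;   // counter reset: the check below can never trip
    }
    if (++timeoutFailures > maxRetriesOnSocketTimeouts) {
      throw new IOException("Exceeded " + maxRetriesOnSocketTimeouts
          + " retries on socket timeouts");
    }
  }
}
 {code}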



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7392) org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever

2014-11-14 Thread Frantisek Vacek (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frantisek Vacek updated HDFS-7392:
--
Description: 
In some specific circumstances, 
org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out 
and lasts forever.

The specific circumstances are:
1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) should point 
to a valid IP address, but with no name node service running on it.
2) There should be at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.example.com
Server: 127.0.1.1
Address:127.0.1.1#53

share.example.com canonical name = 
internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 192.168.1.223
Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 192.168.1.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() 
sometimes returns true (even if the address didn't actually change, see img. 1) 
and the timeoutFailures counter is reset to 0 (see img. 2). The 
maxRetriesOnSocketTimeouts limit (45) is never reached and the connection 
attempt is repeated forever.

  was:
In some specific circumstances, 
org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out 
and lasts forever.

The specific circumstances are:
1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) should point 
to a valid IP address, but with no name node service running on it.
2) There should be at least 2 IP addresses for such a URI. See the output below:
{quote}
[~/proj/quickbox]$ nslookup share.example.com
Server: 127.0.1.1
Address:127.0.1.1#53

share.example.com canonical name = 
internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.223
Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
Address: 54.40.29.65
{quote}
In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() 
sometimes returns true (even if the address didn't actually change, see img. 1) 
and the timeoutFailures counter is reset to 0 (see img. 2). The 
maxRetriesOnSocketTimeouts limit (45) is never reached and the connection 
attempt is repeated forever.


 org.apache.hadoop.hdfs.DistributedFileSystem open invalid URI forever
 -

 Key: HDFS-7392
 URL: https://issues.apache.org/jira/browse/HDFS-7392
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Frantisek Vacek
Priority: Critical
 Attachments: 1.png, 2.png


 In some specific circumstances, 
 org.apache.hadoop.hdfs.DistributedFileSystem.open(invalid URI) never times out 
 and lasts forever.
 The specific circumstances are:
 1) The HDFS URI (hdfs://share.example.com:8020/someDir/someFile.txt) should point 
 to a valid IP address, but with no name node service running on it.
 2) There should be at least 2 IP addresses for such a URI. See the output below:
 {quote}
 [~/proj/quickbox]$ nslookup share.example.com
 Server: 127.0.1.1
 Address:127.0.1.1#53
 share.example.com canonical name = 
 internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com.
 Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
 Address: 192.168.1.223
 Name:   internal-realm-share-example-com-1234.us-east-1.elb.amazonaws.com
 Address: 192.168.1.65
 {quote}
 In such a case, org.apache.hadoop.ipc.Client.Connection.updateAddress() 
 sometimes returns true (even if the address didn't actually change, see img. 1) 
 and the timeoutFailures counter is reset to 0 (see img. 2). The 
 maxRetriesOnSocketTimeouts limit (45) is never reached and the connection 
 attempt is repeated forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212159#comment-14212159
 ] 

Hudson commented on HDFS-7385:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/])
HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess 
up. Contributed by jiangyu. (cnauroth: rev 
b0a41de68c5b08f534ca231293de053c0b0cbd5d)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java


 ThreadLocal used in FSEditLog class causes FSImage permission mess up
 -

 Key: HDFS-7385
 URL: https://issues.apache.org/jira/browse/HDFS-7385
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0, 2.5.0
Reporter: jiangyu
Assignee: jiangyu
Priority: Blocker
 Fix For: 2.6.0

 Attachments: HDFS-7385.2.patch, HDFS-7385.patch


   We migrated our NameNodes from low-configuration to high-configuration 
 machines last week. First, we imported the current directory, including the 
 fsimage and editlog files, from the original ActiveNameNode to the new 
 ActiveNameNode and started the new NameNode; then we changed the 
 configuration of all DataNodes and restarted them, so they sent block 
 reports to the new NameNodes at once and heartbeats after that.
    Everything seemed perfect, but after we restarted the ResourceManager, 
 most of the users complained that their jobs couldn't be executed because of 
 permission problems.
   We had applied ACLs in our clusters, and after the migration we found that 
 most of the directories and files which had no ACLs set before now carried 
 ACL properties. That is the reason why users could not execute their jobs, 
 so we had to change most file permissions to a+r and directory permissions 
 to a+rx to make sure the jobs could be executed.
 After investigating this problem for some days, I found there is a bug in 
 FSEditLog.java. The ThreadLocal op cache in FSEditLog does not set the 
 proper values in the logMkdir and logOpenFile functions. Here is the code of 
 logMkDir:
 {code}
 public void logMkDir(String path, INode newNode) {
   PermissionStatus permissions = newNode.getPermissionStatus();
   MkdirOp op = MkdirOp.getInstance(cache.get())
     .setInodeId(newNode.getId())
     .setPath(path)
     .setTimestamp(newNode.getModificationTime())
     .setPermissionStatus(permissions);
   AclFeature f = newNode.getAclFeature();
   if (f != null) {
     op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
   }
   logEdit(op);
 }
 {code}
   For example, if we mkdir with ACLs through one handler (thread), we set 
 the AclEntries on the op taken from the cache. After that, if we mkdir 
 without any ACLs through the same handler, the AclEntries from the cache 
 are still the ones from the last call that set ACLs, and because the new 
 inode has no AclFeature, we never get a chance to change them. The editlog 
 is then wrong and records the wrong ACLs. After the Standby loads the 
 editlogs from the JournalNodes, applies them to memory in the SNN, saves 
 the namespace and transfers the wrong fsimage to the ANN, all the fsimages 
 are wrong. The only solution is to save the namespace from the ANN; then 
 you can get the right fsimage.
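 One way to close the gap described above is to overwrite the reused op's ACL 
 entries unconditionally, clearing them when the inode has none. The sketch below 
 is illustrative only and is not the committed HDFS-7385 patch.
 {code}
// Illustrative fix sketch (not the committed patch): always overwrite the
// cached op's ACL entries so a value left over from the previous use of the
// same thread-local op cannot leak into this edit.
public void logMkDir(String path, INode newNode) {
  PermissionStatus permissions = newNode.getPermissionStatus();
  MkdirOp op = MkdirOp.getInstance(cache.get())
    .setInodeId(newNode.getId())
    .setPath(path)
    .setTimestamp(newNode.getModificationTime())
    .setPermissionStatus(permissions);
  AclFeature f = newNode.getAclFeature();
  op.setAclEntries(f != null
      ? AclStorage.readINodeLogicalAcl(newNode)
      : null);                    // explicitly reset stale entries in the cached op
  logEdit(op);
}
 {code}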



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212165#comment-14212165
 ] 

Hudson commented on HDFS-7358:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/])
HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. 
(szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java


 Clients may get stuck waiting when using ByteArrayManager
 -

 Key: HDFS-7358
 URL: https://issues.apache.org/jira/browse/HDFS-7358
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Fix For: 2.7.0

 Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, 
 h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, 
 h7358_20141108.patch


 [~stack] reported that clients might get stuck waiting when using 
 ByteArrayManager; see [his 
 comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212161#comment-14212161
 ] 

Hudson commented on HDFS-7395:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #5 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/5/])
HDFS-7395. BlockIdManager#clear() bails out when resetting the 
GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 
1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
 --

 Key: HDFS-7395
 URL: https://issues.apache.org/jira/browse/HDFS-7395
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Yongjun Zhang
Assignee: Haohui Mai
 Fix For: 2.7.0

 Attachments: HDFS-7395.000.patch


 In the latest Jenkins jobs
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
 but not in
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
 the following test failed in the same way:
 {code}
 Failed
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
 Failing for the past 2 builds (Since Failed#1931 )
 Took 0.54 sec.
 Stacktrace
 java.lang.IllegalStateException: null
   at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:129)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:763)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:747)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
   at org.apache.hadoop.hdfs.MiniDFSCluster.init(MiniDFSCluster.java:465)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
 {code}
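 The trace points at a precondition inside setGenerationStampV1Limit firing when 
 clear() tries to reset the limit. A self-contained model of that failure mode, with 
 assumed field names, is sketched below; assigning the field directly in clear() 
 avoids re-entering the guarded setter. This is illustrative, not HDFS-7395.000.patch.
 {code}
// Illustrative, self-contained model of the failure; field and method names
// are assumptions, not the actual BlockIdManager code.
class BlockIdManagerModel {
  static final long GRANDFATHER_GENERATION_STAMP = 0;
  private long generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;

  void setGenerationStampV1Limit(long stamp) {
    // The setter requires that the limit has not been set yet.
    if (generationStampV1Limit != GRANDFATHER_GENERATION_STAMP) {
      throw new IllegalStateException();
    }
    generationStampV1Limit = stamp;
  }

  void clear() {
    // Routing the reset through the guarded setter throws once a limit
    // exists (the IllegalStateException in the trace above); assigning the
    // field directly sidesteps the precondition.
    generationStampV1Limit = GRANDFATHER_GENERATION_STAMP;
  }
}
 {code}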



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212190#comment-14212190
 ] 

Hudson commented on HDFS-7395:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/743/])
HDFS-7395. BlockIdManager#clear() bails out when resetting the 
GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 
1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
 --

 Key: HDFS-7395
 URL: https://issues.apache.org/jira/browse/HDFS-7395
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Yongjun Zhang
Assignee: Haohui Mai
 Fix For: 2.7.0

 Attachments: HDFS-7395.000.patch


 In the latest Jenkins jobs
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
 but not in
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
 the following test failed in the same way:
 {code}
 Failed
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
 Failing for the past 2 builds (Since Failed#1931 )
 Took 0.54 sec.
 Stacktrace
 java.lang.IllegalStateException: null
   at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:129)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:763)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:747)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
   at org.apache.hadoop.hdfs.MiniDFSCluster.init(MiniDFSCluster.java:465)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212194#comment-14212194
 ] 

Hudson commented on HDFS-7358:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/743/])
HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. 
(szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java


 Clients may get stuck waiting when using ByteArrayManager
 -

 Key: HDFS-7358
 URL: https://issues.apache.org/jira/browse/HDFS-7358
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Fix For: 2.7.0

 Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, 
 h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, 
 h7358_20141108.patch


 [~stack] reported that clients might get stuck waiting when using 
 ByteArrayManager; see [his 
 comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212188#comment-14212188
 ] 

Hudson commented on HDFS-7385:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #743 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/743/])
HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess 
up. Contributed by jiangyu. (cnauroth: rev 
b0a41de68c5b08f534ca231293de053c0b0cbd5d)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java


 ThreadLocal used in FSEditLog class causes FSImage permission mess up
 -

 Key: HDFS-7385
 URL: https://issues.apache.org/jira/browse/HDFS-7385
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0, 2.5.0
Reporter: jiangyu
Assignee: jiangyu
Priority: Blocker
 Fix For: 2.6.0

 Attachments: HDFS-7385.2.patch, HDFS-7385.patch


   We migrated our NameNodes from low-configuration to high-configuration 
 machines last week. First, we imported the current directory, including the 
 fsimage and editlog files, from the original ActiveNameNode to the new 
 ActiveNameNode and started the new NameNode; then we changed the 
 configuration of all DataNodes and restarted them, so they sent block 
 reports to the new NameNodes at once and heartbeats after that.
    Everything seemed perfect, but after we restarted the ResourceManager, 
 most of the users complained that their jobs couldn't be executed because of 
 permission problems.
   We had applied ACLs in our clusters, and after the migration we found that 
 most of the directories and files which had no ACLs set before now carried 
 ACL properties. That is the reason why users could not execute their jobs, 
 so we had to change most file permissions to a+r and directory permissions 
 to a+rx to make sure the jobs could be executed.
 After investigating this problem for some days, I found there is a bug in 
 FSEditLog.java. The ThreadLocal op cache in FSEditLog does not set the 
 proper values in the logMkdir and logOpenFile functions. Here is the code of 
 logMkDir:
 {code}
 public void logMkDir(String path, INode newNode) {
   PermissionStatus permissions = newNode.getPermissionStatus();
   MkdirOp op = MkdirOp.getInstance(cache.get())
     .setInodeId(newNode.getId())
     .setPath(path)
     .setTimestamp(newNode.getModificationTime())
     .setPermissionStatus(permissions);
   AclFeature f = newNode.getAclFeature();
   if (f != null) {
     op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
   }
   logEdit(op);
 }
 {code}
   For example, if we mkdir with ACLs through one handler (thread), we set 
 the AclEntries on the op taken from the cache. After that, if we mkdir 
 without any ACLs through the same handler, the AclEntries from the cache 
 are still the ones from the last call that set ACLs, and because the new 
 inode has no AclFeature, we never get a chance to change them. The editlog 
 is then wrong and records the wrong ACLs. After the Standby loads the 
 editlogs from the JournalNodes, applies them to memory in the SNN, saves 
 the namespace and transfers the wrong fsimage to the ANN, all the fsimages 
 are wrong. The only solution is to save the namespace from the ANN; then 
 you can get the right fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212280#comment-14212280
 ] 

Hudson commented on HDFS-7385:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/])
HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess 
up. Contributed by jiangyu. (cnauroth: rev 
b0a41de68c5b08f534ca231293de053c0b0cbd5d)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 ThreadLocal used in FSEditLog class causes FSImage permission mess up
 -

 Key: HDFS-7385
 URL: https://issues.apache.org/jira/browse/HDFS-7385
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0, 2.5.0
Reporter: jiangyu
Assignee: jiangyu
Priority: Blocker
 Fix For: 2.6.0

 Attachments: HDFS-7385.2.patch, HDFS-7385.patch


   We migrated our NameNodes from low-configuration to high-configuration 
 machines last week. First, we imported the current directory, including the 
 fsimage and editlog files, from the original ActiveNameNode to the new 
 ActiveNameNode and started the new NameNode; then we changed the 
 configuration of all DataNodes and restarted them, so they sent block 
 reports to the new NameNodes at once and heartbeats after that.
    Everything seemed perfect, but after we restarted the ResourceManager, 
 most of the users complained that their jobs couldn't be executed because of 
 permission problems.
   We had applied ACLs in our clusters, and after the migration we found that 
 most of the directories and files which had no ACLs set before now carried 
 ACL properties. That is the reason why users could not execute their jobs, 
 so we had to change most file permissions to a+r and directory permissions 
 to a+rx to make sure the jobs could be executed.
 After investigating this problem for some days, I found there is a bug in 
 FSEditLog.java. The ThreadLocal op cache in FSEditLog does not set the 
 proper values in the logMkdir and logOpenFile functions. Here is the code of 
 logMkDir:
 {code}
 public void logMkDir(String path, INode newNode) {
   PermissionStatus permissions = newNode.getPermissionStatus();
   MkdirOp op = MkdirOp.getInstance(cache.get())
     .setInodeId(newNode.getId())
     .setPath(path)
     .setTimestamp(newNode.getModificationTime())
     .setPermissionStatus(permissions);
   AclFeature f = newNode.getAclFeature();
   if (f != null) {
     op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
   }
   logEdit(op);
 }
 {code}
   For example, if we mkdir with ACLs through one handler (thread), we set 
 the AclEntries on the op taken from the cache. After that, if we mkdir 
 without any ACLs through the same handler, the AclEntries from the cache 
 are still the ones from the last call that set ACLs, and because the new 
 inode has no AclFeature, we never get a chance to change them. The editlog 
 is then wrong and records the wrong ACLs. After the Standby loads the 
 editlogs from the JournalNodes, applies them to memory in the SNN, saves 
 the namespace and transfers the wrong fsimage to the ANN, all the fsimages 
 are wrong. The only solution is to save the namespace from the ANN; then 
 you can get the right fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212282#comment-14212282
 ] 

Hudson commented on HDFS-7395:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/])
HDFS-7395. BlockIdManager#clear() bails out when resetting the 
GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 
1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
 --

 Key: HDFS-7395
 URL: https://issues.apache.org/jira/browse/HDFS-7395
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Yongjun Zhang
Assignee: Haohui Mai
 Fix For: 2.7.0

 Attachments: HDFS-7395.000.patch


 In the latest Jenkins jobs
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
 but not in
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
 the following test failed in the same way:
 {code}
 Failed
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
 Failing for the past 2 builds (Since Failed#1931 )
 Took 0.54 sec.
 Stacktrace
 java.lang.IllegalStateException: null
   at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:129)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:763)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:747)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
   at org.apache.hadoop.hdfs.MiniDFSCluster.init(MiniDFSCluster.java:465)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212286#comment-14212286
 ] 

Hudson commented on HDFS-7358:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1933 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1933/])
HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. 
(szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java


 Clients may get stuck waiting when using ByteArrayManager
 -

 Key: HDFS-7358
 URL: https://issues.apache.org/jira/browse/HDFS-7358
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Fix For: 2.7.0

 Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, 
 h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, 
 h7358_20141108.patch


 [~stack] reported that clients might get stuck waiting when using 
 ByteArrayManager; see [his 
 comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212299#comment-14212299
 ] 

Hudson commented on HDFS-7358:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/])
HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. 
(szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java


 Clients may get stuck waiting when using ByteArrayManager
 -

 Key: HDFS-7358
 URL: https://issues.apache.org/jira/browse/HDFS-7358
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Fix For: 2.7.0

 Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, 
 h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, 
 h7358_20141108.patch


 [~stack] reported that clients might get stuck waiting when using 
 ByteArrayManager; see [his 
 comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212295#comment-14212295
 ] 

Hudson commented on HDFS-7395:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/])
HDFS-7395. BlockIdManager#clear() bails out when resetting the 
GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 
1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
 --

 Key: HDFS-7395
 URL: https://issues.apache.org/jira/browse/HDFS-7395
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Yongjun Zhang
Assignee: Haohui Mai
 Fix For: 2.7.0

 Attachments: HDFS-7395.000.patch


 In the latest Jenkins jobs
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
 but not in
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
 the following test failed in the same way:
 {code}
 Failed
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
 Failing for the past 2 builds (Since Failed#1931 )
 Took 0.54 sec.
 Stacktrace
 java.lang.IllegalStateException: null
   at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:129)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:763)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:747)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
   at org.apache.hadoop.hdfs.MiniDFSCluster.init(MiniDFSCluster.java:465)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212293#comment-14212293
 ] 

Hudson commented on HDFS-7385:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #5 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/5/])
HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess 
up. Contributed by jiangyu. (cnauroth: rev 
b0a41de68c5b08f534ca231293de053c0b0cbd5d)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java


 ThreadLocal used in FSEditLog class causes FSImage permission mess up
 -

 Key: HDFS-7385
 URL: https://issues.apache.org/jira/browse/HDFS-7385
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0, 2.5.0
Reporter: jiangyu
Assignee: jiangyu
Priority: Blocker
 Fix For: 2.6.0

 Attachments: HDFS-7385.2.patch, HDFS-7385.patch


   We migrated our NameNodes from low-configuration to high-configuration 
 machines last week. First, we imported the current directory, including the 
 fsimage and editlog files, from the original ActiveNameNode to the new 
 ActiveNameNode and started the new NameNode; then we changed the 
 configuration of all DataNodes and restarted them, so they sent block 
 reports to the new NameNodes at once and heartbeats after that.
    Everything seemed perfect, but after we restarted the ResourceManager, 
 most of the users complained that their jobs couldn't be executed because of 
 permission problems.
   We had applied ACLs in our clusters, and after the migration we found that 
 most of the directories and files which had no ACLs set before now carried 
 ACL properties. That is the reason why users could not execute their jobs, 
 so we had to change most file permissions to a+r and directory permissions 
 to a+rx to make sure the jobs could be executed.
 After investigating this problem for some days, I found there is a bug in 
 FSEditLog.java. The ThreadLocal op cache in FSEditLog does not set the 
 proper values in the logMkdir and logOpenFile functions. Here is the code of 
 logMkDir:
 {code}
 public void logMkDir(String path, INode newNode) {
   PermissionStatus permissions = newNode.getPermissionStatus();
   MkdirOp op = MkdirOp.getInstance(cache.get())
     .setInodeId(newNode.getId())
     .setPath(path)
     .setTimestamp(newNode.getModificationTime())
     .setPermissionStatus(permissions);
   AclFeature f = newNode.getAclFeature();
   if (f != null) {
     op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
   }
   logEdit(op);
 }
 {code}
   For example, if we mkdir with ACLs through one handler (thread), we set 
 the AclEntries on the op taken from the cache. After that, if we mkdir 
 without any ACLs through the same handler, the AclEntries from the cache 
 are still the ones from the last call that set ACLs, and because the new 
 inode has no AclFeature, we never get a chance to change them. The editlog 
 is then wrong and records the wrong ACLs. After the Standby loads the 
 editlogs from the JournalNodes, applies them to memory in the SNN, saves 
 the namespace and transfers the wrong fsimage to the ANN, all the fsimages 
 are wrong. The only solution is to save the namespace from the ANN; then 
 you can get the right fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212359#comment-14212359
 ] 

Hudson commented on HDFS-7395:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/])
HDFS-7395. BlockIdManager#clear() bails out when resetting the 
GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 
1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
 --

 Key: HDFS-7395
 URL: https://issues.apache.org/jira/browse/HDFS-7395
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Yongjun Zhang
Assignee: Haohui Mai
 Fix For: 2.7.0

 Attachments: HDFS-7395.000.patch


 In the latest Jenkins jobs
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
 but not in
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
 the following test failed in the same way:
 {code}
 Failed
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
 Failing for the past 2 builds (Since Failed#1931 )
 Took 0.54 sec.
 Stacktrace
 java.lang.IllegalStateException: null
   at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:129)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:763)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:747)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
   at org.apache.hadoop.hdfs.MiniDFSCluster.init(MiniDFSCluster.java:465)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212357#comment-14212357
 ] 

Hudson commented on HDFS-7385:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/])
HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess 
up. Contributed by jiangyu. (cnauroth: rev 
b0a41de68c5b08f534ca231293de053c0b0cbd5d)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 ThreadLocal used in FSEditLog class causes FSImage permission mess up
 -

 Key: HDFS-7385
 URL: https://issues.apache.org/jira/browse/HDFS-7385
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0, 2.5.0
Reporter: jiangyu
Assignee: jiangyu
Priority: Blocker
 Fix For: 2.6.0

 Attachments: HDFS-7385.2.patch, HDFS-7385.patch


   We migrated our NameNodes from low-configuration to high-configuration 
 machines last week. First, we imported the current directory, including the 
 fsimage and editlog files, from the original ActiveNameNode to the new 
 ActiveNameNode and started the new NameNode; then we changed the configuration 
 of all datanodes and restarted them, so they block-reported to the new NameNodes 
 at once and sent heartbeats after that.
    Everything seemed perfect, but after we restarted the ResourceManager, most 
 of the users complained that their jobs couldn't be executed because of 
 permission problems.
    We applied ACLs in our clusters, and after the migration we found that most 
 of the directories and files which had no ACLs set before now carried ACL 
 properties. That is why users could not execute their jobs, so we had to change 
 most file permissions to a+r and directory permissions to a+rx to make sure the 
 jobs could be executed.
 After looking into this problem for some days, I found there is a bug in 
 FSEditLog.java. The ThreadLocal op cache in FSEditLog is not reset to the proper 
 value in the logMkDir and logOpenFile functions. Here is the code of logMkDir:
   public void logMkDir(String path, INode newNode) {
 PermissionStatus permissions = newNode.getPermissionStatus();
 MkdirOp op = MkdirOp.getInstance(cache.get())
   .setInodeId(newNode.getId())
   .setPath(path)
   .setTimestamp(newNode.getModificationTime())
   .setPermissionStatus(permissions);
 AclFeature f = newNode.getAclFeature();
 if (f != null) {
   op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
 }
 logEdit(op);
   }
   For example, if we mkdir with ACLs through one handler (a thread, in fact), 
 we set the AclEntries on the op obtained from the cache. After that, if we mkdir 
 without any ACLs through the same handler, the AclEntries on the cached op are 
 still the ones from the previous call that set ACLs, and because the new node has 
 no AclFeature we never get a chance to clear them. The editlog is then wrong and 
 records the wrong ACLs. After the Standby loads the editlogs from the JournalNodes, 
 applies them to memory in the SNN, saves the namespace, and transfers the wrong 
 fsimage to the ANN, all the fsimages end up wrong. The only solution is to save 
 the namespace from the ANN, which yields a correct fsimage.
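 A minimal sketch, assuming the cached op can simply have its ACL entries cleared 
 before reuse (the explicit reset below is illustrative, not the committed fix):
 {code}
 // Hedged sketch: clear any ACL entries left over from a previous use of the
 // cached MkdirOp before logging the current mkdir. Passing null as a reset is
 // an assumption for illustration only.
 AclFeature f = newNode.getAclFeature();
 if (f != null) {
   op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
 } else {
   op.setAclEntries(null);   // drop the stale entries from the previous caller
 }
 logEdit(op);
 {code}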



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212363#comment-14212363
 ] 

Hudson commented on HDFS-7358:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1957 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1957/])
HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. 
(szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 Clients may get stuck waiting when using ByteArrayManager
 -

 Key: HDFS-7358
 URL: https://issues.apache.org/jira/browse/HDFS-7358
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Fix For: 2.7.0

 Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, 
 h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, 
 h7358_20141108.patch


 [~stack] reported that clients might get stuck waiting when using 
 ByteArrayManager; see [his 
 comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7358) Clients may get stuck waiting when using ByteArrayManager

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212388#comment-14212388
 ] 

Hudson commented on HDFS-7358:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/])
HDFS-7358. Clients may get stuck waiting when using ByteArrayManager. 
(szetszwo: rev 394ba94c5d2801fbc5d95c7872de28eed1eb)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/ByteArrayManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestHFlush.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/util/TestByteArrayManager.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java


 Clients may get stuck waiting when using ByteArrayManager
 -

 Key: HDFS-7358
 URL: https://issues.apache.org/jira/browse/HDFS-7358
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
 Fix For: 2.7.0

 Attachments: h7358_20141104.patch, h7358_20141104_wait_timeout.patch, 
 h7358_20141105.patch, h7358_20141106.patch, h7358_20141107.patch, 
 h7358_20141108.patch


 [~stack] reported that clients might get stuck waiting when using 
 ByteArrayManager; see [his 
 comments|https://issues.apache.org/jira/browse/HDFS-7276?focusedCommentId=14197036page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14197036].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7385) ThreadLocal used in FSEditLog class causes FSImage permission mess up

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212382#comment-14212382
 ] 

Hudson commented on HDFS-7385:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/])
HDFS-7385. ThreadLocal used in FSEditLog class causes FSImage permission mess 
up. Contributed by jiangyu. (cnauroth: rev 
b0a41de68c5b08f534ca231293de053c0b0cbd5d)
* 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestEditLog.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogOp.java


 ThreadLocal used in FSEditLog class causes FSImage permission mess up
 -

 Key: HDFS-7385
 URL: https://issues.apache.org/jira/browse/HDFS-7385
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.4.0, 2.5.0
Reporter: jiangyu
Assignee: jiangyu
Priority: Blocker
 Fix For: 2.6.0

 Attachments: HDFS-7385.2.patch, HDFS-7385.patch


   We migrated our NameNodes from low-configuration to high-configuration 
 machines last week. First, we imported the current directory, including the 
 fsimage and editlog files, from the original ActiveNameNode to the new 
 ActiveNameNode and started the new NameNode; then we changed the configuration 
 of all datanodes and restarted them, so they block-reported to the new NameNodes 
 at once and sent heartbeats after that.
    Everything seemed perfect, but after we restarted the ResourceManager, most 
 of the users complained that their jobs couldn't be executed because of 
 permission problems.
    We applied ACLs in our clusters, and after the migration we found that most 
 of the directories and files which had no ACLs set before now carried ACL 
 properties. That is why users could not execute their jobs, so we had to change 
 most file permissions to a+r and directory permissions to a+rx to make sure the 
 jobs could be executed.
 After looking into this problem for some days, I found there is a bug in 
 FSEditLog.java. The ThreadLocal op cache in FSEditLog is not reset to the proper 
 value in the logMkDir and logOpenFile functions. Here is the code of logMkDir:
   public void logMkDir(String path, INode newNode) {
 PermissionStatus permissions = newNode.getPermissionStatus();
 MkdirOp op = MkdirOp.getInstance(cache.get())
   .setInodeId(newNode.getId())
   .setPath(path)
   .setTimestamp(newNode.getModificationTime())
   .setPermissionStatus(permissions);
 AclFeature f = newNode.getAclFeature();
 if (f != null) {
   op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
 }
 logEdit(op);
   }
   For example, if we mkdir with ACLs through one handler (a thread, in fact), 
 we set the AclEntries on the op obtained from the cache. After that, if we mkdir 
 without any ACLs through the same handler, the AclEntries on the cached op are 
 still the ones from the previous call that set ACLs, and because the new node has 
 no AclFeature we never get a chance to clear them. The editlog is then wrong and 
 records the wrong ACLs. After the Standby loads the editlogs from the JournalNodes, 
 applies them to memory in the SNN, saves the namespace, and transfers the wrong 
 fsimage to the ANN, all the fsimages end up wrong. The only solution is to save 
 the namespace from the ANN, which yields a correct fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7396) Revisit synchronization in Namenode

2014-11-14 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212394#comment-14212394
 ] 

Kihwal Lee commented on HDFS-7396:
--

This is a general comment regarding reducing lock contention and increasing 
concurrency in namenode. Whenever we experimented with improving concurrency, 
the limiting factor was the garbage collection overhead. This has gotten worse 
after the conversion to protobuf.  Under a given load, locking improvements 
will certainly give better and more predictable response times. But if pushed 
beyond what the NN was capable of before, we will soon run into the existing 
inefficiencies. [~daryn] has found some of them and I hope he shares them with 
us soon.

As [~tlipcon] mentioned in HDFS-2206, we need locking rules defined, documented 
and enforced if possible. In addition to the interactions between different 
locks, the role and scope of each lock need to be clearly defined. A lock's 
definition should include what it protects and the expected data consistency 
and visibility during and after it is held. At a minimum, we can come up with a 
comment template for this.
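For illustration only, one possible shape for such a template (the wording and the 
example lock below are just a sketch, not an agreed convention):
{code}
/**
 * Lock: FSNamesystem.fsLock (example entry following the proposed template)
 * Protects: namespace tree and block map mutations reachable through FSDirectory.
 * Ordering: acquire before any BlockManager-internal locks, never after them.
 * Consistency: readers holding the read lock observe a fully applied edit or
 *              none; no visibility guarantees are implied once the lock is
 *              released.
 */
{code}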

 Revisit synchronization in Namenode
 ---

 Key: HDFS-7396
 URL: https://issues.apache.org/jira/browse/HDFS-7396
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze

 HDFS-2106 separated block management to a new package from namenode.  As part 
 of it, some code was refactored to new classes such as DatanodeManager, 
 HeartbeatManager, etc.  There are opportunities to improve locking in the 
 namenode, while currently the synchronization in the namenode is mainly done by 
 a single global FSNamesystem lock. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7395) BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212384#comment-14212384
 ] 

Hudson commented on HDFS-7395:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #5 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/5/])
HDFS-7395. BlockIdManager#clear() bails out when resetting the 
GenerationStampV1Limit. Contributed by Haohui Mai. (wheat9: rev 
1a2e5cbc4dbed527fdbefc09abc1faaacf3da285)
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockIdManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


 BlockIdManager#clear() bails out when resetting the GenerationStampV1Limit
 --

 Key: HDFS-7395
 URL: https://issues.apache.org/jira/browse/HDFS-7395
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Yongjun Zhang
Assignee: Haohui Mai
 Fix For: 2.7.0

 Attachments: HDFS-7395.000.patch


 In latest jenkins job
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1932/
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1931/
 but not 
 https://builds.apache.org/job/Hadoop-Hdfs-trunk/1930/
 The following test failed the same way:
 {code}
 Failed
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image
 Failing for the past 2 builds (Since Failed#1931 )
 Took 0.54 sec.
 Stacktrace
 java.lang.IllegalStateException: null
   at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:129)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.setGenerationStampV1Limit(BlockIdManager.java:85)
   at 
 org.apache.hadoop.hdfs.server.blockmanagement.BlockIdManager.clear(BlockIdManager.java:206)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.clear(FSNamesystem.java:622)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:667)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:376)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:268)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:991)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:763)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:747)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1443)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1104)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:975)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:804)
   at org.apache.hadoop.hdfs.MiniDFSCluster.init(MiniDFSCluster.java:465)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:424)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.upgradeAndVerify(TestDFSUpgradeFromImage.java:582)
   at 
 org.apache.hadoop.hdfs.TestDFSUpgradeFromImage.testUpgradeFromCorruptRel22Image(TestDFSUpgradeFromImage.java:318)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7396) Revisit synchronization in Namenode

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212469#comment-14212469
 ] 

Chris Nauroth commented on HDFS-7396:
-

bq. Whenever we experimented with improving concurrency, the limiting factor 
was the garbage collection overhead.

I also would be interested in seeing more information on this.  We've been 
updating our recommendations for garbage collection tuning recently.  It would 
be interesting for us to compare notes.

I'm also curious if you've tried any experiments running with the G1 collector. 
 I haven't tried it in several years.  When I tried it, it was still very 
experimental, so I ended up hitting too many bugs to run it in production.  
Perhaps it has stabilized by now.

 Revisit synchronization in Namenode
 ---

 Key: HDFS-7396
 URL: https://issues.apache.org/jira/browse/HDFS-7396
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze

 HDFS-2106 separated block management to a new package from namenode.  As part 
 of it, some code was refactored to new classes such as DatanodeManager, 
 HeartbeatManager, etc.  There are opportunities to improve locking in the 
 namenode, while currently the synchronization in the namenode is mainly done by 
 a single global FSNamesystem lock. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7398) Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit

2014-11-14 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212525#comment-14212525
 ] 

Gera Shegalov commented on HDFS-7398:
-

Regarding the findbug warning:
{quote}
Inconsistent synchronization of 
org.apache.hadoop.hdfs.DFSOutputStream$Packet.dataPos; locked 83% of time
{quote}
It's obviously unrelated. 

 Reset cached thread-local FSEditLogOp's on every FSEditLog#logEdit
 --

 Key: HDFS-7398
 URL: https://issues.apache.org/jira/browse/HDFS-7398
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.6.0
Reporter: Gera Shegalov
Assignee: Gera Shegalov
 Attachments: HDFS-7398.v01.patch


 This is a follow-up on HDFS-7385.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212548#comment-14212548
 ] 

Chris Nauroth commented on HDFS-7384:
-

Hi, [~vinayrpet].  The current behavior of {{getAclStatus}} is an intentional 
design choice, but the history behind that choice is a bit convoluted.  Let me 
see if I can reconstruct it here.

It starts with HADOOP-10220, which added an ACL indicator bit to 
{{FsPermission}}.  This was provided as an optimization so that clients could 
quickly identify if a file has an ACL, without needing an additional RPC.

Later, objections were raised against the ACL bit in HDFS-5923 and HDFS-5932.  
We made a decision to roll back the HADOOP-10220 changes, and instead require 
callers to use {{getAclStatus}} to identify the presence of an ACL.  Prior to 
this, early implementations of {{getAclStatus}} would always return a non-empty 
list.  For an inode with no ACL, it would return the minimal ACL containing 
the 3 entries that correspond to basic POSIX permissions.  However, at this 
point, it became helpful to change {{getAclStatus}} so that it would return an 
empty list if there is no ACL.  This was seen as easier for clients than trying 
to check the entries for no ACL/minimal ACL.  It was also seen as a cleaner 
logical separation, since the client likely already has the {{FsPermission}} 
prior to calling {{getAclStatus}}, and therefore it would not be helpful to 
return redundant ACL entries.
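As a rough client-side sketch of that flow (assuming the 2.x {{FsPermission#getAclBit}} 
indicator, with {{fs}} and {{path}} being a FileSystem and Path already in scope), a 
caller can skip the extra RPC when no ACL is present:
{code}
// Hedged sketch: use the ACL indicator bit to avoid calling getAclStatus()
// for inodes without an extended ACL; otherwise fetch only the extended
// entries and combine them with the FsPermission already in hand.
FileStatus stat = fs.getFileStatus(path);
if (stat.getPermission().getAclBit()) {
  AclStatus acl = fs.getAclStatus(path);   // extended entries only
  // merge acl.getEntries() with stat.getPermission() for the full view
}
{code}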

Finally, HDFS-6326 identified that our implementation choice was 
backwards-incompatible for webhdfs, and generally a performance bottleneck for 
shell users.  To solve this, we reinstated the ACL bit, in a slightly different 
implementation, but the behavior of {{getAclStatus}} remained the same.

You've definitely identified a weakness in the current API design, and I raised 
similar objections at the time.  It's a trade-off.  I think there is good 
logical separation right now, but as a side effect, it does mean that callers 
may need some extra client-side logic to piece all of the information together, 
such as if someone wanted to write a custom GUI consuming WebHDFS to display 
ACL information.

At this point, we can't change the behavior of {{getAclStatus}} on the 2.x line 
for compatibility reasons.  Suppose a 2.6.0 deployment of the shell called 
{{getAclStatus}} on a 2.7.0 NameNode, and it had been changed to return the 
complete ACL.  This would cause {{getfacl}} to display duplicate entries, 
because the 2.6.0 logic of {{GetfaclCommand}} and 
{{AclUtil#getAclFromPermAndEntries}} will combine the output of 
{{getAclStatus}} with the {{FsPermission}}, resulting in 3 duplicate entries.

Where does that leave us for this jira?  I can see the following options:
# Resolve as won't fix, based on the above rationale.
# Target 3.0 for a backwards-incompatible change.
# Add a new RPC, named {{getFullAcl}} or similar, with the behavior that you 
proposed.  However, I'd prefer not to increase the API footprint unless there 
is a really strong use case.

Hope this helps.  Let me know your thoughts.  Thanks!

 'getfacl' command and 'getAclStatus' output should be in sync
 -

 Key: HDFS-7384
 URL: https://issues.apache.org/jira/browse/HDFS-7384
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinayakumar B
Assignee: Vinayakumar B

 The *getfacl* command prints all the entries, including basic and extended 
 entries, mask entries and effective permissions.
 But the *getAclStatus* FileSystem API returns only the extended ACL entries set 
 by the user; it includes neither the mask entry nor the effective permissions.
 To benefit clients using the API, it would be better to include the 'mask' entry 
 and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7177) Add an option to include minimal ACL in getAclStatus return

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212555#comment-14212555
 ] 

Chris Nauroth commented on HDFS-7177:
-

Hi, [~zhz].  I just realized too late that HDFS-7384 is reporting basically the 
same thing as this.  I just entered a huge comment on HDFS-7384 about it, so 
I'd prefer to resolve this one as duplicate, even though it really came first.  
I'll add all of the watchers over to HDFS-7384 so that they can still be 
involved in the conversation.  Thanks!

 Add an option to include minimal ACL in getAclStatus return
 ---

 Key: HDFS-7177
 URL: https://issues.apache.org/jira/browse/HDFS-7177
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Zhe Zhang
Assignee: Zhe Zhang
Priority: Minor

 Currently the 3 minimal ACL entries are not included in the returned value of 
 getAclStatus. {{FsShell}} gets them separately ({{FsPermission perm = 
 item.stat.getPermission();}}). It'd be useful to make it optional to include 
 them, so that external programs can get a complete view of the permissions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-7177) Add an option to include minimal ACL in getAclStatus return

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-7177.
-
Resolution: Duplicate

 Add an option to include minimal ACL in getAclStatus return
 ---

 Key: HDFS-7177
 URL: https://issues.apache.org/jira/browse/HDFS-7177
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Zhe Zhang
Assignee: Zhe Zhang
Priority: Minor

 Currently the 3 minimal ACL entries are not included in the returned value of 
 getAclStatus. {{FsShell}} gets them separately ({{FsPermission perm = 
 item.stat.getPermission();}}). It'd be useful to make it optional to include 
 them, so that external programs can get a complete view of the permissions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7056) Snapshot support for truncate

2014-11-14 Thread Plamen Jeliazkov (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212588#comment-14212588
 ] 

Plamen Jeliazkov commented on HDFS-7056:


The FindBugs warning appears to be unrelated.

The new FindBugs warning points to inconsistent synchronization in 
org.apache.hadoop.hdfs.DFSOutputStream, a class we don't touch in this work.

 Snapshot support for truncate
 -

 Key: HDFS-7056
 URL: https://issues.apache.org/jira/browse/HDFS-7056
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Affects Versions: 3.0.0
Reporter: Konstantin Shvachko
Assignee: Plamen Jeliazkov
 Attachments: HDFS-3107-HDFS-7056-combined.patch, 
 HDFS-3107-HDFS-7056-combined.patch, HDFS-3107-HDFS-7056-combined.patch, 
 HDFS-3107-HDFS-7056-combined.patch, HDFS-3107-HDFS-7056-combined.patch, 
 HDFS-7056.patch, HDFS-7056.patch, HDFS-7056.patch, HDFS-7056.patch, 
 HDFS-7056.patch, HDFS-7056.patch, HDFSSnapshotWithTruncateDesign.docx


 Implementation of truncate in HDFS-3107 does not allow truncating files which 
 are in a snapshot. It is desirable to be able to truncate and still keep the 
 old file state of the file in the snapshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6962) ACLs inheritance conflict with umaskmode

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-6962:

Target Version/s: 2.7.0  (was: 2.4.1)

Hello, [~Alexandre LINTE].  Thank you for filing this issue.  I tested the same 
scenario against a Linux local file system, and I confirmed that HDFS is 
showing different behavior, just like you described.

I also confirmed that this is a divergence from the POSIX ACL specs.  Here is a 
quote of the relevant section:

{quote}
The permissions of inherited access ACLs are further modified by the mode 
parameter that each system call creating file system objects has. The mode 
parameter contains nine permission bits that stand for the permissions of the 
owner, group, and other class permissions. The effective permissions of each 
class are set to the intersection of the permissions defined for this class in 
the ACL and specified in the mode parameter.

If the parent directory has no default ACL, the permissions of the new file are 
determined as defined in POSIX.1. The effective permissions are set to the 
permissions defined in the mode parameter, minus the permissions set in the 
current umask.

The umask has no effect if a default ACL exists.
{quote}

Changing this behavior is going to be somewhat challenging.  Note the 
distinction made in the spec between mode and umask.  When creating a new child 
(file or directory) of a directory with a default ACL, the mode influences the 
inherited access ACL entries, but the umask has no effect.  Unfortunately, our 
current implementation intersects mode and umask on the client side before 
passing them to the NameNode in the RPC.  This happens in {{DFSClient#mkdirs}} 
and {{DFSClient#create}}:

{code}
  public boolean mkdirs(String src, FsPermission permission,
  boolean createParent) throws IOException {
if (permission == null) {
  permission = FsPermission.getDefault();
}
FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
{code}

{code}
  public DFSOutputStream create(String src, 
 FsPermission permission,
  EnumSet<CreateFlag> flag, 
 boolean createParent,
 short replication,
 long blockSize,
 Progressable progress,
 int buffersize,
 ChecksumOpt checksumOpt,
 InetSocketAddress[] favoredNodes) throws 
IOException {
checkOpen();
if (permission == null) {
  permission = FsPermission.getFileDefault();
}
FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
{code}

On the NameNode side, when it copies the default ACL from parent to child, 
we've lost the information.  We just have a single piece of permissions data, 
with no knowledge of what was the mode vs. the umask on the client side.

A potential solution is to push both mode and umask explicitly to the NameNode 
in the RPC requests for {{MkdirsRequestProto}} and {{CreateRequestProto}}.  
Those messages already contain an instance of {{FsPermissionProto}}.  We could 
add a second optional instance.  If both instances are defined, then the 
NameNode would interpret one as being mode and the other as being umask.  There 
would still be a possibility of an older client still passing just one 
instance, and in that case, we'd have to fall back to the current behavior.  
It's a bit messy, but it could work.
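As a rough sketch of that idea on the client side (the call below passing both mode 
and umask is hypothetical; today's RPC takes a single masked permission):
{code}
// Hedged sketch: send the caller-supplied mode and the client umask separately
// instead of intersecting them in DFSClient. The extra "umask" argument is an
// assumption for illustration; it would correspond to a second optional
// FsPermissionProto in MkdirsRequestProto / CreateRequestProto.
FsPermission mode = (permission != null) ? permission : FsPermission.getDefault();
FsPermission umask = dfsClientConf.uMask;
namenode.mkdirs(src, mode, umask, createParent);   // hypothetical signature
{code}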

We also have one additional problem specific to the shell for files (not 
directories).  The implementation of copyFromLocal breaks down into 2 separate 
RPCs: creating the file, followed by a separate chmod call.  The NameNode has 
no way of knowing if that chmod call is part of a copyFromLocal or not though.  
It's too late to enforce the mode vs. umask distinction.

I'm tentatively targeting this to 2.7.0.  I think this will need more 
investigation to make sure there are no compatibility issues with the solution. 
 If there is an unavoidable compatibility problem, then it might require 
pushing out to 3.x.  We won't know for sure until someone starts coding.

Thank you again for the very detailed bug report.

 ACLs inheritance conflict with umaskmode
 

 Key: HDFS-6962
 URL: https://issues.apache.org/jira/browse/HDFS-6962
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: security
Affects Versions: 2.4.1
 Environment: CentOS release 6.5 (Final)
Reporter: LINTE
  Labels: hadoop, security

 In hdfs-site.xml 
  <property>
    <name>dfs.umaskmode</name>
    <value>027</value>
  </property>
 1/ Create a directory as superuser
 bash# hdfs dfs -mkdir  /tmp/ACLS
 2/ set default ACLs on this directory rwx access for group readwrite and user 
 

[jira] [Resolved] (HDFS-7393) TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk

2014-11-14 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang resolved HDFS-7393.
---
Resolution: Duplicate

I think this is a dupe of HDFS-7395, has the same Precondition stacktrace in 
BlockIdManager. Please reopen if it's still the case.

 TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk
 ---

 Key: HDFS-7393
 URL: https://issues.apache.org/jira/browse/HDFS-7393
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Reporter: Ted Yu

 The following is reproducible:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path

2014-11-14 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212670#comment-14212670
 ] 

Kihwal Lee commented on HDFS-6133:
--

This might be outside the scope of this jira, but I think we need to think 
about this before going further.  If a node with a pinned block is temporarily 
unavailable, the namenode will try to replicate the block as it is 
under-replicated.  When the node recovers, the block is over-replicated and a 
replica will be invalidated.  How do we make sure it is not removed from the 
favored node?  I think this scenario can happen during start-up or transient 
infra/network issues.  

Daryn and I had a brief discussion about this. It might be possible to include 
pinning info in block reports and remember it in block manager. This will 
enable NN to make the right decision on over-replicated cases. A bit more 
complicated logic will be needed when a pinned block gets corrupted on a 
favored node. The usual replicate + invalidate strategy won't be ideal here. 

 Make Balancer support exclude specified path
 

 Key: HDFS-6133
 URL: https://issues.apache.org/jira/browse/HDFS-6133
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer & mover, namenode
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
 Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
 HDFS-6133.patch


  Currently, running the Balancer will destroy the Regionserver's data locality.
  If getBlocks could exclude blocks belonging to files that have a specific path 
  prefix, like /hbase, then we could run the Balancer without destroying the 
  Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212680#comment-14212680
 ] 

Vinayakumar B commented on HDFS-7384:
-

Thanks [~cnauroth] for the detailed explanation.

bq. At this point, we can't change the behavior of getAclStatus on the 2.x line 
for compatibility reasons. Suppose a 2.6.0 deployment of the shell called 
getAclStatus on a 2.7.0 NameNode
Here we can implement this without breaking compatibility.
For example, the returned {{AclStatus}} could carry the basic permissions in the 
form of an {{FsPermission}} object itself, which would be an optional field in 
protobuf.
So we can keep the {{getAclEntries()}} return value as is, but add one more field, 
'effective action', to {{AclEntry}}; it could either be calculated on the client 
side, based on the FsPermission object in AclStatus, or be an optional field set 
on the NN side.

My basic intention is to avoid the extra client-side logic that users currently 
need in order to find out the effective permission for an ACL entry.

If {{AclStatus}} carries the {{FsPermission}} value, then we can produce the same 
output as 'getfacl' without one more RPC to the NN. This also keeps the existing 
behavior of returning an empty entry list for basic permissions, which was decided 
after much discussion. 

Any thoughts?
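For what it's worth, a tiny sketch of the client-side calculation (assuming the 
mask entry is available alongside the named entry):
{code}
// Hedged sketch: the effective action of a named entry is its permission ANDed
// with the mask entry's permission, mirroring the rule getfacl prints as
// "#effective:".
FsAction effective = entry.getPermission().and(mask.getPermission());
{code}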

 'getfacl' command and 'getAclStatus' output should be in sync
 -

 Key: HDFS-7384
 URL: https://issues.apache.org/jira/browse/HDFS-7384
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinayakumar B
Assignee: Vinayakumar B

 The *getfacl* command prints all the entries, including basic and extended 
 entries, mask entries and effective permissions.
 But the *getAclStatus* FileSystem API returns only the extended ACL entries set 
 by the user; it includes neither the mask entry nor the effective permissions.
 To benefit clients using the API, it would be better to include the 'mask' entry 
 and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes

2014-11-14 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212699#comment-14212699
 ] 

Andrew Wang commented on HDFS-7374:
---

The patch looks good, findbugs looks unrelated, but the TestDecommission 
failure is worrying and also failed for me locally. Could you take a look?

 Allow decommissioning of dead DataNodes
 ---

 Key: HDFS-7374
 URL: https://issues.apache.org/jira/browse/HDFS-7374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch


 We have seen the use case of decommissioning DataNodes that are already dead 
 or unresponsive, and not expected to rejoin the cluster.
 The logic introduced by HDFS-6791 will mark those nodes as 
 {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish 
 the decommission work. If an upper layer application is monitoring the 
 decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-11-14 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang resolved HDFS-4239.
-
Resolution: Duplicate

Hi Stack,

This issue turned out to be a duplicate of HDFS-1362, which is resolved now.

I'm closing this jira as a duplicate. Please re-open if you think there is an 
additional issue to be addressed there.

Thanks.


 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Yongjun Zhang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the total datanode.  If the datanode is carrying 6 or 12 
 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- 
 the rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low latency serving: e.g. hosting an hbase 
 cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (You 
 can't unmount the disk while it is in use).  This latter is better in that 
 only the bad disk's data is rereplicated, not all datanode data.
 Is it possible to do better, say, send the datanode a signal to tell it stop 
 using a disk an operator has designated 'bad'.  This would be like option #2 
 above minus the need to stop and restart the datanode.  Ideally the disk 
 would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a disk 
 after its been replaced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212734#comment-14212734
 ] 

Chris Nauroth commented on HDFS-7384:
-

Yes, what you described makes sense.  An older client simply wouldn't consume 
the new protobuf field.

I'd prefer not to add the effective action directly to {{AclEntry}}, since the 
effective action is something that only makes sense when the entry is 
considered against some other object (the mask).

Overall, it sounds good.  Thanks for thinking this through and putting out the 
proposal!

 'getfacl' command and 'getAclStatus' output should be in sync
 -

 Key: HDFS-7384
 URL: https://issues.apache.org/jira/browse/HDFS-7384
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinayakumar B
Assignee: Vinayakumar B

 The *getfacl* command prints all the entries, including basic and extended 
 entries, mask entries and effective permissions.
 But the *getAclStatus* FileSystem API returns only the extended ACL entries set 
 by the user; it includes neither the mask entry nor the effective permissions.
 To benefit clients using the API, it would be better to include the 'mask' entry 
 and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-6711) FSNamesystem#getAclStatus does not write to the audit log.

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-6711.
-
Resolution: Duplicate

 FSNamesystem#getAclStatus does not write to the audit log.
 --

 Key: HDFS-6711
 URL: https://issues.apache.org/jira/browse/HDFS-6711
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0, 2.4.0
Reporter: Chris Nauroth
Priority: Minor

 Consider writing an event to the audit log for the {{getAclStatus}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6711) FSNamesystem#getAclStatus does not write to the audit log.

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212736#comment-14212736
 ] 

Chris Nauroth commented on HDFS-6711:
-

This was fixed in HDFS-7218, so I'm resolving this as duplicate.

 FSNamesystem#getAclStatus does not write to the audit log.
 --

 Key: HDFS-6711
 URL: https://issues.apache.org/jira/browse/HDFS-6711
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0, 2.4.0
Reporter: Chris Nauroth
Priority: Minor

 Consider writing an event to the audit log for the {{getAclStatus}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path

2014-11-14 Thread Chris Trezzo (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212745#comment-14212745
 ] 

Chris Trezzo commented on HDFS-6133:


I am slightly late to the party on this one, but at Twitter we have a similar 
need for a slightly different use case. We use federation and each block pool 
has a different workload. For some of these workloads it does not make sense to 
run the balancer. For example, we have a block pool associated with a tmp 
namespace. Ideally, we would never want to balance blocks in that block pool 
because they will be deleted shortly anyway. One design approach we were 
contemplating is to make the balancer block pool-aware. You could then run the 
balancer on a per-block-pool basis and have pluggable balancing strategies for 
each pool (i.e. the balancer policy in the block pool associated with the tmp 
namespace is a no-op). This keeps the balancer decoupled from the namespace, 
needing to know only about the block pool (we can still separate the BM at a 
later point).

The above might work for this use case as well. The balancer policy for the 
block pool containing blocks in hbase would be a no-op. Let me know what you 
guys think. I can see the block pool design being orthogonal to this JIRA, so 
let me know if I should open up a separate JIRA for this effort. We could 
potentially use the pinning strategy for our use case as well, but I hesitate 
for the same reasons that [~kihwal] mentioned above with respect to 
corrupt/unavailable blocks.
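To make the idea concrete, a very rough interface sketch (the names are made up 
for illustration, not a proposal for the actual API):
{code}
// Hedged sketch: a pluggable per-block-pool balancing policy. A no-op policy
// would be configured for pools such as a tmp namespace or an HBase pool.
public interface BlockPoolBalancingPolicy {
  /** @return true if the balancer should move blocks for this block pool. */
  boolean shouldBalance(String blockPoolId);
}
{code}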

 Make Balancer support exclude specified path
 

 Key: HDFS-6133
 URL: https://issues.apache.org/jira/browse/HDFS-6133
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer & mover, namenode
Reporter: zhaoyunjiong
Assignee: zhaoyunjiong
 Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, 
 HDFS-6133.patch


  Currently, running the Balancer will destroy the Regionserver's data locality.
  If getBlocks could exclude blocks belonging to files that have a specific path 
  prefix, like /hbase, then we could run the Balancer without destroying the 
  Regionserver's data locality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4239) Means of telling the datanode to stop using a sick disk

2014-11-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212757#comment-14212757
 ] 

Yongjun Zhang commented on HDFS-4239:
-

HI [~qwertymaniac],

My bad that I did not notice your earlier comment 
{quote}
I just noticed Steve's comment referring the same - should've gone through 
properly before spending google cycles. I feel HDFS-1362 implemented would 
solve half of this - and the other half would be to make the removals 
automatic. Right now the checkDiskError does not eject if its slow - as long as 
its succeed, which would have to be done via this JIRA I think. The re-add 
would be possible via HDFS-1362.
{quote}
until now. So we need to use the functionality provided by HDFS-1362 to 
automatically remove a sick disk. It seems the original goal of HDFS-4239 is 
the same as HDFS-1362 (right?), and we can create a new jira for  automatically 
removing a sick disk?

Thanks.


 Means of telling the datanode to stop using a sick disk
 ---

 Key: HDFS-4239
 URL: https://issues.apache.org/jira/browse/HDFS-4239
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: stack
Assignee: Yongjun Zhang
 Attachments: hdfs-4239.patch, hdfs-4239_v2.patch, hdfs-4239_v3.patch, 
 hdfs-4239_v4.patch, hdfs-4239_v5.patch


 If a disk has been deemed 'sick' -- i.e. not dead but wounded, failing 
 occasionally, or just exhibiting high latency -- your choices are:
 1. Decommission the total datanode.  If the datanode is carrying 6 or 12 
 disks of data, especially on a cluster that is smallish -- 5 to 20 nodes -- 
 the rereplication of the downed datanode's data can be pretty disruptive, 
 especially if the cluster is doing low latency serving: e.g. hosting an hbase 
 cluster.
 2. Stop the datanode, unmount the bad disk, and restart the datanode (You 
 can't unmount the disk while it is in use).  This latter is better in that 
 only the bad disk's data is rereplicated, not all datanode data.
 Is it possible to do better, say, send the datanode a signal to tell it stop 
 using a disk an operator has designated 'bad'.  This would be like option #2 
 above minus the need to stop and restart the datanode.  Ideally the disk 
 would become unmountable after a while.
 Nice to have would be being able to tell the datanode to restart using a disk 
 after its been replaced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes

2014-11-14 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212766#comment-14212766
 ] 

Andrew Wang commented on HDFS-7374:
---

Hey [~mingma], I was looking a bit more at decom, and I see that we have this 
if statement at the end of {{isReplicationInProgress}}:

{code}
if (!status && !srcNode.isAlive) {
  LOG.warn("srcNode " + srcNode + " is dead " +
      "when decommission is in progress. Continue to mark " +
      "it as decommission in progress. In that way, when it rejoins the " +
      "cluster it can continue the decommission process.");
  status = true;
}
{code}

Logically, a (DEAD, DECOM_IN_PROGRESS) should be able to go to (DEAD, DECOMMED) 
if all of its blocks are fully replicated, but this if statement prevents 
{{isReplicationInProgress}} from ever returning false for a dead node. It seems 
like we can loosen this requirement?
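A minimal sketch of the loosened check (purely illustrative; it keeps the warning 
but no longer forces {{status}} to true for dead nodes):
{code}
// Hedged sketch: only log for a dead node and let the normal under-replication
// check decide whether decommission can complete, instead of pinning the node
// in DECOMMISSION_INPROGRESS forever.
if (!status && !srcNode.isAlive) {
  LOG.warn("srcNode " + srcNode + " is dead while decommission is in progress; " +
      "it can transition to DECOMMISSIONED once no blocks remain under-replicated.");
}
return status;
{code}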

 Allow decommissioning of dead DataNodes
 ---

 Key: HDFS-7374
 URL: https://issues.apache.org/jira/browse/HDFS-7374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch


 We have seen the use case of decommissioning DataNodes that are already dead 
 or unresponsive, and not expected to rejoin the cluster.
 The logic introduced by HDFS-6791 will mark those nodes as 
 {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish 
 the decommission work. If an upper layer application is monitoring the 
 decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work stopped] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-7394 stopped by Keith Pak.
---
 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor

 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work stopped] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-7394 stopped by Keith Pak.
---
 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor

 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-7394 started by Keith Pak.
---
 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor

 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Pak reassigned HDFS-7394:
---

Assignee: Keith Pak

 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor

 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-7394 started by Keith Pak.
---
 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor

 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Pak updated HDFS-7394:

Status: Patch Available  (was: Open)

 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor

 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Pak updated HDFS-7394:

Attachment: HDFS-7394.patch

Attached patch

 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor
 Attachments: HDFS-7394.patch


 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes

2014-11-14 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212794#comment-14212794
 ] 

Ming Ma commented on HDFS-7374:
---

[~andrew.wang], after a node is dead, all its blocks will be removed from the 
blockmap. So if the node never rejoins the cluster, it isn't clear how you 
can tell whether all its blocks are fully replicated unless we track those blocks.

Another way to cover all these scenarios could be to get rid of {{DEAD, 
DECOM_IN_PROGRESS}} state. After the node is dead during decommission, 
transition to {{DEAD, DECOMMED}}. When the node rejoins the cluster, transition 
it to {{LIVE, DECOM_IN_PROGRESS}}.

 Allow decommissioning of dead DataNodes
 ---

 Key: HDFS-7374
 URL: https://issues.apache.org/jira/browse/HDFS-7374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch


 We have seen the use case of decommissioning DataNodes that are already dead 
 or unresponsive, and not expected to rejoin the cluster.
 The logic introduced by HDFS-6791 will mark those nodes as 
 {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish 
 the decommission work. If an upper layer application is monitoring the 
 decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-14 Thread Brandon Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212800#comment-14212800
 ] 

Brandon Li commented on HDFS-7146:
--

The patch looks nice. Some comments:
1. it doesn’t seem to be necessary to introduce defaultStaticIdMappingFile
2. Do we need checkSupportedPlatform()? We don’t have to limit the platform 
to only Linux and Mac. Some other UNIX flavors might also be able to run the 
NFS server. We could do the following (a sketch follows below):
if (Shell.MAC) {
  // mac command
} else {
  // linux command for everything else
}
3. do we still need constructFullMapAtInit since it’s always false?
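
As a hedged sketch of the branching suggested in comment 2 above (Shell.MAC is a 
real Hadoop constant, but the command strings are placeholders, not necessarily 
what the patch uses):
{code}
import org.apache.hadoop.util.Shell;

class UidLookupCommandSketch {
  /** Command used to resolve a single UID instead of enumerating all users. */
  static String[] uidLookupCommand(int uid) {
    if (Shell.MAC) {
      // mac command (placeholder)
      return new String[] {"dscl", ".", "-search", "/Users", "UniqueID",
          String.valueOf(uid)};
    } else {
      // linux command for everything else (placeholder)
      return new String[] {"getent", "passwd", String.valueOf(uid)};
    }
  }
}
{code}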


 NFS ID/Group lookup requires SSSD enumeration on the server
 ---

 Key: HDFS-7146
 URL: https://issues.apache.org/jira/browse/HDFS-7146
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
 HDFS-7146.003.patch, HDFS-7146.004.patch


 The current implementation of the NFS UID and GID lookup works by running 
 'getent passwd' with an assumption that it will return the entire list of 
 users available on the OS, local and remote (AD/etc.).
 This behaviour of the command is advised to be and is prevented by 
 administrators in most secure setups to avoid excessive load to the ADs 
 involved, as the # of users to be listed may be too large, and the repeated 
 requests of ALL users not present in the cache would be too much for the AD 
 infrastructure to bear.
 The NFS server should likely do lookups based on a specific UID request, via 
 'getent passwd UID', if the UID does not match a cached value. This reduces 
 load on the LDAP backed infrastructure.
 Thanks [~qwertymaniac] for reporting the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods

2014-11-14 Thread Haohui Mai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haohui Mai updated HDFS-7279:
-
Attachment: HDFS-7279.011.patch

 Use netty to implement DatanodeWebHdfsMethods
 -

 Key: HDFS-7279
 URL: https://issues.apache.org/jira/browse/HDFS-7279
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, webhdfs
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, 
 HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, 
 HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, 
 HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, 
 HDFS-7279.011.patch


 Currently the DN implements all related webhdfs functionality using jetty. As 
 the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer 
 and connection management, the DN often suffers from long latency and OOM when 
 its webhdfs component is under sustained heavy load.
 This jira proposes to implement the webhdfs component in the DN using netty, 
 which can be more efficient and allow finer-grained control over webhdfs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users

2014-11-14 Thread Maysam Yabandeh (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212825#comment-14212825
 ] 

Maysam Yabandeh commented on HDFS-6982:
---

[~andrew.wang] I do not see any relation between the patch and the findbug 
warnings as well as the test failures.

 nntop: top-like tool for name node users
 -

 Key: HDFS-6982
 URL: https://issues.apache.org/jira/browse/HDFS-6982
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh
 Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, 
 HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, 
 HDFS-6982.v7.patch, HDFS-6982.v8.patch, nntop-design-v1.pdf


 In this jira we motivate the need for nntop, a tool that, similarly to what 
 top does in Linux, gives the list of top users of the HDFS name node and 
 gives insight about which users are sending majority of each traffic type to 
 the name node. This information turns out to be the most critical when the 
 name node is under pressure and the HDFS admin needs to know which user is 
 hammering the name node and with what kind of requests. Here we present the 
 design of nntop which has been in production at Twitter in the past 10 
 months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K 
 nodes), low memory footprint (less than a few MB), and quite efficient for 
 the write path (only two hash lookups for updating a metric).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212844#comment-14212844
 ] 

Yongjun Zhang commented on HDFS-7146:
-

HI [~brandonli],

Thanks a lot for the review and comments!

I have a few questions to clarify:
{quote}
1. it doesn’t seem to be necessary to introduce defaultStaticIdMappingFile
{quote}
The defaultStaticIdMappingFile was introduced in HADOOP-11195, and I 
actually removed it in rev 004. Would you please indicate the place you 
were looking at?

{quote}
2. Do we need checkSupportedPlatform()? We don’t have to limit the platform 
to only Linux and Mac. Some other UNIX flavors might also be able to run the 
NFS server. We could do the following:
if (Shell.MAC) { // mac command }
else { // linux command for everything else }
{quote}

About checkSupportedPlatform, I simply followed the existing implementation 
({{if (!OS.startsWith("Linux") && !OS.startsWith("Mac"))}}), which says only Mac 
and Linux are supported. Relaxing the platform support is a different issue to 
solve and it seems to deserve a separate jira, what do you think?

{quote}
3. do we still need constructFullMapAtInit since it’s always false?
{quote}
I introduced this for testing purposes. If you look at the new test I 
introduced, it's first called with true to create a reference (refIdMapping). 
That's why I tagged the constructor with @VisibleForTesting. Does this sound OK to 
you?

Thanks again!


 NFS ID/Group lookup requires SSSD enumeration on the server
 ---

 Key: HDFS-7146
 URL: https://issues.apache.org/jira/browse/HDFS-7146
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
 HDFS-7146.003.patch, HDFS-7146.004.patch


 The current implementation of the NFS UID and GID lookup works by running 
 'getent passwd' with an assumption that it will return the entire list of 
 users available on the OS, local and remote (AD/etc.).
 This behaviour of the command is advised to be and is prevented by 
 administrators in most secure setups to avoid excessive load to the ADs 
 involved, as the # of users to be listed may be too large, and the repeated 
 requests of ALL users not present in the cache would be too much for the AD 
 infrastructure to bear.
 The NFS server should likely do lookups based on a specific UID request, via 
 'getent passwd UID', if the UID does not match a cached value. This reduces 
 load on the LDAP backed infrastructure.
 Thanks [~qwertymaniac] for reporting the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes

2014-11-14 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212846#comment-14212846
 ] 

Andrew Wang commented on HDFS-7374:
---

Hmm, so one situation we've seen is that the cluster is 100% healthy (no 
under-rep blocks) and dead DNs still get stuck in the D_I_P state. We can 
safely transition even dead nodes to DECOMMED in this situation.

Going backwards from (DEAD, DECOMMED) back to (LIVE, D_I_P) feels a little 
weird. IMO DECOMMED should mean that a node can safely be removed from the 
cluster, even for dead nodes. That won't necessarily be true in this case.

 Allow decommissioning of dead DataNodes
 ---

 Key: HDFS-7374
 URL: https://issues.apache.org/jira/browse/HDFS-7374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch


 We have seen the use case of decommissioning DataNodes that are already dead 
 or unresponsive, and not expected to rejoin the cluster.
 The logic introduced by HDFS-6791 will mark those nodes as 
 {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish 
 the decommission work. If an upper layer application is monitoring the 
 decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7399) Lack of synchronization in DFSOutputStream#Packet#getLastByteOffsetBlock()

2014-11-14 Thread Ted Yu (JIRA)
Ted Yu created HDFS-7399:


 Summary: Lack of synchronization in 
DFSOutputStream#Packet#getLastByteOffsetBlock()
 Key: HDFS-7399
 URL: https://issues.apache.org/jira/browse/HDFS-7399
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Ted Yu
Priority: Minor


{code}
long getLastByteOffsetBlock() {
  return offsetInBlock + dataPos - dataStart;
}
{code}
Access to fields of Packet.this should be protected by synchronization as done 
in other methods such as writeTo().
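
For illustration, a hedged sketch of the kind of fix being suggested; the field 
names follow the snippet above, but this is not the actual DFSOutputStream.Packet 
class:
{code}
class PacketSketch {
  private long offsetInBlock;
  private int dataStart;
  private int dataPos;

  // Synchronize on the packet instance, as writeTo() reportedly does, so the
  // arithmetic sees a consistent view of offsetInBlock, dataPos and dataStart.
  synchronized long getLastByteOffsetBlock() {
    return offsetInBlock + dataPos - dataStart;
  }

  // The writer side updates the same fields under the same lock.
  synchronized void advanceDataPos(int numBytes) {
    dataPos += numBytes;
  }
}
{code}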



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7393) TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk

2014-11-14 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212962#comment-14212962
 ] 

Konstantin Shvachko commented on HDFS-7393:
---

Indeed. Good that it is fixed. Thanks. 

 TestDFSUpgradeFromImage#testUpgradeFromCorruptRel22Image fails in trunk
 ---

 Key: HDFS-7393
 URL: https://issues.apache.org/jira/browse/HDFS-7393
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Reporter: Ted Yu

 The following is reproducible:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-3806) Assertion failed in TestStandbyCheckpoints.testBothNodesInStandbyState

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-3806.
-
Resolution: Duplicate

I'm resolving this as a duplicate of HDFS-3519.

 Assertion failed in TestStandbyCheckpoints.testBothNodesInStandbyState
 --

 Key: HDFS-3806
 URL: https://issues.apache.org/jira/browse/HDFS-3806
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
 Environment: Jenkins
Reporter: Trevor Robinson
Priority: Minor

 Failed in Jenkins build for unrelated issue (HDFS-3804): 
 https://builds.apache.org/job/PreCommit-HDFS-Build/3011/testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestStandbyCheckpoints/testBothNodesInStandbyState/
 {noformat}
 java.lang.AssertionError: Expected non-empty 
 /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/trunk/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/name1/current/fsimage_012
   at org.junit.Assert.fail(Assert.java:91)
   at org.junit.Assert.assertTrue(Assert.java:43)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSImageTestUtil.assertNNHasCheckpoints(FSImageTestUtil.java:467)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.HATestUtil.waitForCheckpoint(HATestUtil.java:213)
   at 
 org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints.testBothNodesInStandbyState(TestStandbyCheckpoints.java:133)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-14 Thread Brandon Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212985#comment-14212985
 ] 

Brandon Li commented on HDFS-7146:
--

{quote}
The defaulyStaticIdMappingFile was introduced in the HADOOP-11195, and I 
actually have removed it in rev 004. Would you please indicate the place you 
were looking at?{quote}
My bad. I looked into the wrong side of the diff.
{quote}Relaxing the platform support is a different issue to solve and it seems 
to deserve a separate jira, what do you think?{quote}
I am ok with either fixing it here or in a different JIRA.
{quote}I introduced this for testing purposes.{quote}
Please add a Javadoc comment for it. Also, it would be nice to describe the 
solution in the class Javadoc.


 NFS ID/Group lookup requires SSSD enumeration on the server
 ---

 Key: HDFS-7146
 URL: https://issues.apache.org/jira/browse/HDFS-7146
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
 HDFS-7146.003.patch, HDFS-7146.004.patch


 The current implementation of the NFS UID and GID lookup works by running 
 'getent passwd' with an assumption that it will return the entire list of 
 users available on the OS, local and remote (AD/etc.).
 This behaviour of the command is advised to be and is prevented by 
 administrators in most secure setups to avoid excessive load to the ADs 
 involved, as the # of users to be listed may be too large, and the repeated 
 requests of ALL users not present in the cache would be too much for the AD 
 infrastructure to bear.
 The NFS server should likely do lookups based on a specific UID request, via 
 'getent passwd UID', if the UID does not match a cached value. This reduces 
 load on the LDAP backed infrastructure.
 Thanks [~qwertymaniac] for reporting the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212998#comment-14212998
 ] 

Hadoop QA commented on HDFS-7394:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681607/HDFS-7394.patch
  against trunk revision 1a1dcce.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1218 javac 
compiler warnings (more than the trunk's current 1217 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.ipc.TestRPCCallBenchmark

  The test build failed in 
hadoop-hdfs-project/hadoop-hdfs 

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8741//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8741//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Javac warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8741//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8741//console

This message is automatically generated.

 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor
 Attachments: HDFS-7394.patch


 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212999#comment-14212999
 ] 

Hadoop QA commented on HDFS-7279:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681609/HDFS-7279.011.patch
  against trunk revision f2fe8a8.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The test build failed in 
hadoop-hdfs-project/hadoop-hdfs 

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8742//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8742//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8742//console

This message is automatically generated.

 Use netty to implement DatanodeWebHdfsMethods
 -

 Key: HDFS-7279
 URL: https://issues.apache.org/jira/browse/HDFS-7279
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, webhdfs
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, 
 HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, 
 HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, 
 HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, 
 HDFS-7279.011.patch


 Currently the DN implements all related webhdfs functionality using jetty. As 
 the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer 
 and connection management, the DN often suffers from long latency and OOM when 
 its webhdfs component is under sustained heavy load.
 This jira proposes to implement the webhdfs component in the DN using netty, 
 which can be more efficient and allow finer-grained control over webhdfs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods

2014-11-14 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213010#comment-14213010
 ] 

Haohui Mai commented on HDFS-7279:
--

The findbugs warning is unrelated.

 Use netty to implement DatanodeWebHdfsMethods
 -

 Key: HDFS-7279
 URL: https://issues.apache.org/jira/browse/HDFS-7279
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, webhdfs
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, 
 HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, 
 HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, 
 HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, 
 HDFS-7279.011.patch


 Currently the DN implements all related webhdfs functionality using jetty. As 
 the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer 
 and connection management, the DN often suffers from long latency and OOM when 
 its webhdfs component is under sustained heavy load.
 This jira proposes to implement the webhdfs component in the DN using netty, 
 which can be more efficient and allow finer-grained control over webhdfs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Pak updated HDFS-7394:

Status: Patch Available  (was: Open)

 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor
 Attachments: HDFS-7394.patch


 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Pak updated HDFS-7394:

Attachment: HDFS-7394.patch

 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor
 Attachments: HDFS-7394.patch


 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Pak updated HDFS-7394:

Status: Open  (was: Patch Available)

 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor
 Attachments: HDFS-7394.patch


 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Keith Pak (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Pak updated HDFS-7394:

Attachment: (was: HDFS-7394.patch)

 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor
 Attachments: HDFS-7394.patch


 For long running clients, getting an {{InvalidToken}} exception is expected 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It will be 
 better if they are also made to log at INFO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop

2014-11-14 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213109#comment-14213109
 ] 

Ravi Prakash commented on HDFS-4882:


These test errors are valid. They are happening because pollFirst() retrieves 
*and removes* the first element. Sorry for the oversight. I will upload a new 
patch soon.

 Namenode LeaseManager checkLeases() runs into infinite loop
 ---

 Key: HDFS-4882
 URL: https://issues.apache.org/jira/browse/HDFS-4882
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client, namenode
Affects Versions: 2.0.0-alpha, 2.5.1
Reporter: Zesheng Wu
Assignee: Ravi Prakash
Priority: Critical
 Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, 
 HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.patch


 Scenario:
 1. cluster with 4 DNs
 2. the size of the file to be written is a little more than one block
 3. write the first block to 3 DNs, DN1-DN2-DN3
 4. all the data packets of the first block are successfully acked and the client 
 sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
 5. DN2 and DN3 are down
 6. client recovers the pipeline, but no new DN is added to the pipeline 
 because the current pipeline stage is PIPELINE_CLOSE
 7. client continues writing the last block, and tries to close the file after 
 writing all the data
 8. NN finds that the penultimate block doesn't have enough replicas (our 
 dfs.namenode.replication.min=2), and the client's close runs into an indefinite 
 loop (HDFS-2936), and at the same time, NN sets the last block's state to 
 COMPLETE
 9. shutdown the client
 10. the file's lease exceeds its hard limit
 11. LeaseManager realizes that and begins lease recovery by calling 
 fsnamesystem.internalReleaseLease()
 12. but the last block's state is COMPLETE, and this triggers lease manager's 
 infinite loop and prints massive logs like this:
 {noformat}
 2013-06-05,17:42:25,695 INFO 
 org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: 
 DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard
  limit
 2013-06-05,17:42:25,695 INFO 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. 
  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=
 /user/h_wuzesheng/test.dat
 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
 NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block 
 blk_-7028017402720175688_1202597,
 lastBLockState=COMPLETE
 2013-06-05,17:42:25,695 INFO 
 org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery 
 for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONM
 APREDUCE_-1252656407_1, pendingcreates: 1]
 {noformat}
 (the 3rd line log is a debug log added by us)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop

2014-11-14 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated HDFS-4882:
---
Attachment: HDFS-4882.4.patch

Here's a patch which goes back to using sortedLeases.first() .
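
To make the first()/pollFirst() distinction concrete, a simplified sketch 
(leases reduced to their last-update timestamps; not the actual LeaseManager 
code):
{code}
import java.util.SortedSet;

class LeaseCheckSketch {
  // first() only peeks; the lease is removed only after it has actually been
  // released, whereas pollFirst() removes it up front and can silently drop
  // leases that still need recovery.
  static void checkLeases(SortedSet<Long> sortedLeases, long now, long hardLimit) {
    while (!sortedLeases.isEmpty() && now - sortedLeases.first() > hardLimit) {
      Long oldest = sortedLeases.first();
      if (!tryRecover(oldest)) {
        break;  // cannot make progress on this lease right now; avoid spinning
      }
      sortedLeases.remove(oldest);
    }
  }

  private static boolean tryRecover(long leaseTimestamp) {
    return true;  // placeholder for internalReleaseLease()
  }
}
{code}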

 Namenode LeaseManager checkLeases() runs into infinite loop
 ---

 Key: HDFS-4882
 URL: https://issues.apache.org/jira/browse/HDFS-4882
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client, namenode
Affects Versions: 2.0.0-alpha, 2.5.1
Reporter: Zesheng Wu
Assignee: Ravi Prakash
Priority: Critical
 Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, 
 HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch


 Scenario:
 1. cluster with 4 DNs
 2. the size of the file to be written is a little more than one block
 3. write the first block to 3 DNs, DN1-DN2-DN3
 4. all the data packets of the first block are successfully acked and the client 
 sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
 5. DN2 and DN3 are down
 6. client recovers the pipeline, but no new DN is added to the pipeline 
 because the current pipeline stage is PIPELINE_CLOSE
 7. client continues writing the last block, and tries to close the file after 
 writing all the data
 8. NN finds that the penultimate block doesn't have enough replicas (our 
 dfs.namenode.replication.min=2), and the client's close runs into an indefinite 
 loop (HDFS-2936), and at the same time, NN sets the last block's state to 
 COMPLETE
 9. shutdown the client
 10. the file's lease exceeds its hard limit
 11. LeaseManager realizes that and begins lease recovery by calling 
 fsnamesystem.internalReleaseLease()
 12. but the last block's state is COMPLETE, and this triggers lease manager's 
 infinite loop and prints massive logs like this:
 {noformat}
 2013-06-05,17:42:25,695 INFO 
 org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: 
 DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard
  limit
 2013-06-05,17:42:25,695 INFO 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. 
  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=
 /user/h_wuzesheng/test.dat
 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
 NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block 
 blk_-7028017402720175688_1202597,
 lastBLockState=COMPLETE
 2013-06-05,17:42:25,695 INFO 
 org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery 
 for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONM
 APREDUCE_-1252656407_1, pendingcreates: 1]
 {noformat}
 (the 3rd line log is a debug log added by us)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7400) More reliable namenode health check to detect OS/HW issues

2014-11-14 Thread Ming Ma (JIRA)
Ming Ma created HDFS-7400:
-

 Summary: More reliable namenode health check to detect OS/HW issues
 Key: HDFS-7400
 URL: https://issues.apache.org/jira/browse/HDFS-7400
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Ming Ma


We had this scenario on an active NN machine.

* Disk array controller firmware has a bug. So disks stop working.
* ZKFC and NN still considered the node healthy; Communications between ZKFC 
and ZK as well as ZKFC and NN are good.
* The machine can be pinged.
* The machine can't be sshed.

So all clients and DNs can't use the NN. But ZKFC and NN still consider the 
node healthy.

The question is how we can have ZKFC and NN detect such OS/HW specific issues 
quickly? Some ideas we discussed briefly,

* Have other machines help to make the decision whether the NN is actually 
healthy. Then you have to figure out to make the decision accurate in the case 
of network issue, etc.
* Run OS/HW health check script external to ZKFC/NN on the same machine. If it 
detects disk or other issues, it can reboot the machine for example.
* Run OS/HW health check script inside ZKFC/NN. For example NN's 
HAServiceProtocol#monitorHealth can be modified to call such a health check 
script (a sketch follows below).

Thoughts?
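
For the third idea, a hedged sketch of what shelling out to an admin-provided 
script from the health check might look like (the script path and timeout are 
assumptions, not existing Hadoop settings):
{code}
import java.io.IOException;
import org.apache.hadoop.util.Shell;
import org.apache.hadoop.util.Shell.ShellCommandExecutor;

class HostHealthCheckSketch {
  private final String script;  // e.g. an admin-provided nn-host-health.sh (assumed)

  HostHealthCheckSketch(String script) {
    this.script = script;
  }

  // Something like HAServiceProtocol#monitorHealth could call this; a non-zero
  // exit code or a hang marks the host unhealthy so ZKFC can fail over.
  void checkHostHealth() throws IOException {
    ShellCommandExecutor exec =
        new ShellCommandExecutor(new String[] {script}, null, null, 30000L);
    try {
      exec.execute();
    } catch (Shell.ExitCodeException e) {
      throw new IOException("Host health check script failed: " + e.getMessage(), e);
    }
  }
}
{code}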



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-14 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-7146:

Attachment: HDFS-7146.005.patch

 NFS ID/Group lookup requires SSSD enumeration on the server
 ---

 Key: HDFS-7146
 URL: https://issues.apache.org/jira/browse/HDFS-7146
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
 HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch


 The current implementation of the NFS UID and GID lookup works by running 
 'getent passwd' with an assumption that it will return the entire list of 
 users available on the OS, local and remote (AD/etc.).
 This behaviour of the command is advised to be and is prevented by 
 administrators in most secure setups to avoid excessive load to the ADs 
 involved, as the # of users to be listed may be too large, and the repeated 
 requests of ALL users not present in the cache would be too much for the AD 
 infrastructure to bear.
 The NFS server should likely do lookups based on a specific UID request, via 
 'getent passwd UID', if the UID does not match a cached value. This reduces 
 load on the LDAP backed infrastructure.
 Thanks [~qwertymaniac] for reporting the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213165#comment-14213165
 ] 

Yongjun Zhang commented on HDFS-7146:
-

HI [~brandonli],

Nice idea to add some Javadoc describing the solution; I just uploaded 
005 to include that.

Thanks for your flexibility, I will create a separate jira for the platform 
coverage issue, 'cause I think that may involve looking into multiple places 
for platform differences.

Thanks for taking a further look.


 NFS ID/Group lookup requires SSSD enumeration on the server
 ---

 Key: HDFS-7146
 URL: https://issues.apache.org/jira/browse/HDFS-7146
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
 HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch


 The current implementation of the NFS UID and GID lookup works by running 
 'getent passwd' with an assumption that it will return the entire list of 
 users available on the OS, local and remote (AD/etc.).
 This behaviour of the command is advised to be and is prevented by 
 administrators in most secure setups to avoid excessive load to the ADs 
 involved, as the # of users to be listed may be too large, and the repeated 
 requests of ALL users not present in the cache would be too much for the AD 
 infrastructure to bear.
 The NFS server should likely do lookups based on a specific UID request, via 
 'getent passwd UID', if the UID does not match a cached value. This reduces 
 load on the LDAP backed infrastructure.
 Thanks [~qwertymaniac] for reporting the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users

2014-11-14 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213177#comment-14213177
 ] 

Andrew Wang commented on HDFS-6982:
---

Hi Maysam, took a look at the latest patch, I think we're almost there :) Just 
minor comments. Hopefully Jenkins behaves with the next rev too, I agree it 
looks unrelated or garbled.

DFSConfigKeys / TopConf:
* Need to rename the DFSConfigKeys variable names to reflect new config names
* Seems like I gave bad advice about getInts, since it doesn't have a way of 
taking a default, so right now if we try to turn it off, it'll set the default. 
Reverting to what you had is cool, though adding a getInts that takes a default 
would be appreciated.

RWManager:
* Could we add explanatory text to the Precondition checks?

AuditLogger:
* Rather than injecting it into the conf (kinda brittle), what I had in mind 
was in FSNamesystem#initAuditLoggers, we could tack it on the end after adding 
the ones from the conf. No need for reflection :) (see the sketch after this comment)
* Related to this, it'd be good to have a unit test that disables nntop and 
then checks that the audit logger isn't added and that metrics aren't 
published. Feel free to add a @VisibleForTesting getter if it helps.

Nits:
* Unused import in NameNode

This is just minor stuff though, I'm +1 pending the above review comments.
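
A minimal sketch of the initAuditLoggers suggestion above; the class and config 
key names are assumptions for illustration, not necessarily the names used in 
the patch:
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;

class AuditLoggerWiringSketch {
  /** Stand-in for the HDFS AuditLogger interface. */
  interface AuditLogger { }
  static class TopAuditLoggerStub implements AuditLogger { }

  static List<AuditLogger> initAuditLoggers(Configuration conf,
                                            List<AuditLogger> fromConf) {
    List<AuditLogger> loggers = new ArrayList<AuditLogger>(fromConf);
    // Tack the nntop logger on at the end instead of injecting it via the conf,
    // and skip it entirely when nntop is disabled (key name assumed).
    if (conf.getBoolean("dfs.namenode.top.enabled", true)) {
      loggers.add(new TopAuditLoggerStub());
    }
    return loggers;
  }
}
{code}
A unit test along the lines suggested above could build the conf with the flag 
set to false and assert that no nntop logger ends up in the returned list.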

 nntop: top-like tool for name node users
 -

 Key: HDFS-6982
 URL: https://issues.apache.org/jira/browse/HDFS-6982
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh
 Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, 
 HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, 
 HDFS-6982.v7.patch, HDFS-6982.v8.patch, nntop-design-v1.pdf


 In this jira we motivate the need for nntop, a tool that, similarly to what 
 top does in Linux, gives the list of top users of the HDFS name node and 
 gives insight about which users are sending majority of each traffic type to 
 the name node. This information turns out to be the most critical when the 
 name node is under pressure and the HDFS admin needs to know which user is 
 hammering the name node and with what kind of requests. Here we present the 
 design of nntop which has been in production at Twitter in the past 10 
 months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K 
 nodes), low memory footprint (less than a few MB), and quite efficient for 
 the write path (only two hash lookups for updating a metric).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7386) Replace check port number < 1024 with shared isPrivilegedPort method

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7386:

 Component/s: security
  datanode
Target Version/s: 2.7.0
Hadoop Flags: Reviewed

+1 for the patch.  I agree that the test failures are unrelated.  I saw the 
same thing that you saw when I reran locally.  I'll commit this.

 Replace check port number < 1024 with shared isPrivilegedPort method 
 ---

 Key: HDFS-7386
 URL: https://issues.apache.org/jira/browse/HDFS-7386
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, security
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
Priority: Trivial
 Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch


 Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace 
 check port number < 1024 with shared isPrivilegedPort method.
 Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there.
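
 The shared helper is presumably a one-line check along these lines (a sketch, 
 not necessarily the exact SecurityUtil code):
{code}
class PrivilegedPortSketch {
  /** Shared helper replacing scattered "port < 1024" checks. */
  static boolean isPrivilegedPort(final int port) {
    return port < 1024;
  }
}
{code}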



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-14 Thread Brandon Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213188#comment-14213188
 ] 

Brandon Li commented on HDFS-7146:
--

+1. Pending Jenkins.

 NFS ID/Group lookup requires SSSD enumeration on the server
 ---

 Key: HDFS-7146
 URL: https://issues.apache.org/jira/browse/HDFS-7146
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
 HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch


 The current implementation of the NFS UID and GID lookup works by running 
 'getent passwd' with an assumption that it will return the entire list of 
 users available on the OS, local and remote (AD/etc.).
 This behaviour of the command is advised to be and is prevented by 
 administrators in most secure setups to avoid excessive load to the ADs 
 involved, as the # of users to be listed may be too large, and the repeated 
 requests of ALL users not present in the cache would be too much for the AD 
 infrastructure to bear.
 The NFS server should likely do lookups based on a specific UID request, via 
 'getent passwd UID', if the UID does not match a cached value. This reduces 
 load on the LDAP backed infrastructure.
 Thanks [~qwertymaniac] for reporting the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7386) Replace check port number < 1024 with shared isPrivilegedPort method

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-7386:

   Resolution: Fixed
Fix Version/s: 2.7.0
   Status: Resolved  (was: Patch Available)

I committed this to trunk and branch-2.  Yongjun, thank you for improving this 
part of the code.

 Replace check port number < 1024 with shared isPrivilegedPort method 
 ---

 Key: HDFS-7386
 URL: https://issues.apache.org/jira/browse/HDFS-7386
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, security
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
Priority: Trivial
 Fix For: 2.7.0

 Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch


 Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace 
 check port number < 1024 with shared isPrivilegedPort method.
 Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-3749) Disable check for jsvc on windows

2014-11-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved HDFS-3749.
-
Resolution: Won't Fix

This is no longer required, because HDFS-2856 has been implemented, providing 
SASL as a means to authenticate the DataNode instead of jsvc/privileged ports.  
I'm resolving this as Won't Fix.

 Disable check for jsvc on windows
 -

 Key: HDFS-3749
 URL: https://issues.apache.org/jira/browse/HDFS-3749
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Owen O'Malley
Assignee: Owen O'Malley
 Attachments: hdfs-3749-trunk.patch, hdfs-3749.patch, hdfs-3749.patch


 Jsvc doesn't make sense on windows and thus we should not require the 
 datanode to start up under it on that platform.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7386) Replace check port number < 1024 with shared isPrivilegedPort method

2014-11-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213193#comment-14213193
 ] 

Hudson commented on HDFS-7386:
--

FAILURE: Integrated in Hadoop-trunk-Commit #6552 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/6552/])
HDFS-7386. Replace check port number < 1024 with shared isPrivilegedPort 
method. Contributed by Yongjun Zhang. (cnauroth: rev 
1925e2a4ae78ef4178393848b4d1d71b0f4a4709)
* 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/SecurityUtil.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/SecureDataNodeStarter.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/sasl/SaslDataTransferServer.java
* 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/sasl/SaslDataTransferClient.java


 Replace check port number < 1024 with shared isPrivilegedPort method 
 ---

 Key: HDFS-7386
 URL: https://issues.apache.org/jira/browse/HDFS-7386
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, security
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
Priority: Trivial
 Fix For: 2.7.0

 Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch


 Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace 
 check port number < 1024 with shared isPrivilegedPort method.
 Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7401) Add block info to DFSInputStream' WARN message when it adds node to deadNodes

2014-11-14 Thread Ming Ma (JIRA)
Ming Ma created HDFS-7401:
-

 Summary: Add block info to DFSInputStream' WARN message when it 
adds node to deadNodes
 Key: HDFS-7401
 URL: https://issues.apache.org/jira/browse/HDFS-7401
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Ming Ma
Priority: Minor


Block info is missing in the below message

{noformat}
2014-11-14 03:59:00,386 WARN org.apache.hadoop.hdfs.DFSClient: Failed to 
connect to /xx.xx.xx.xxx:50010 for block, add to deadNodes and continue. 
java.io.IOException: Got error for OP_READ_BLOCK
{noformat}

The code
{noformat}
DFSInputStream.java
  DFSClient.LOG.warn("Failed to connect to " + targetAddr + " for block"
      + ", add to deadNodes and continue. " + ex, ex);
{noformat}
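
A hedged sketch of the requested change, with the block appended to the message 
(the {{block}} variable name is an assumption about the surrounding method):
{noformat}
  DFSClient.LOG.warn("Failed to connect to " + targetAddr + " for block "
      + block.getBlock() + ", add to deadNodes and continue. " + ex, ex);
{noformat}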




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues

2014-11-14 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213215#comment-14213215
 ] 

Andrew Wang commented on HDFS-7400:
---

So in {{monitorHealth}} we do a basic check just to see if the NN has free disk 
space. I'd be okay extending this to other checks related to disk health.

Out of curiosity, did your failure condition result in a situation where {{df}} 
worked, but the disk was otherwise non-functional? I guess with no SSH it's a 
little hard to check, but I wonder what we could add to {{monitorHealth}} to 
detect this failure condition.

 More reliable namenode health check to detect OS/HW issues
 --

 Key: HDFS-7400
 URL: https://issues.apache.org/jira/browse/HDFS-7400
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Ming Ma

 We had this scenario on an active NN machine.
 * Disk array controller firmware has a bug. So disks stop working.
 * ZKFC and NN still considered the node healthy; Communications between ZKFC 
 and ZK as well as ZKFC and NN are good.
 * The machine can be pinged.
 * The machine can't be sshed.
 So all clients and DNs can't use the NN. But ZKFC and NN still consider the 
 node healthy.
 The question is how we can have ZKFC and NN detect such OS/HW specific issues 
 quickly? Some ideas we discussed briefly,
 * Have other machines help to make the decision whether the NN is actually 
 healthy. Then you have to figure out to make the decision accurate in the 
 case of network issue, etc.
 * Run OS/HW health check script external to ZKFC/NN on the same machine. If 
 it detects disk or other issues, it can reboot the machine for example.
 * Run OS/HW health check script inside ZKFC/NN. For example NN's 
 HAServiceProtocol#monitorHealth can be modified to call such health check 
 script.
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7386) Replace check port number < 1024 with shared isPrivilegedPort method

2014-11-14 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213224#comment-14213224
 ] 

Yongjun Zhang commented on HDFS-7386:
-

Thank you so much Chris!


 Replace check port number < 1024 with shared isPrivilegedPort method 
 ---

 Key: HDFS-7386
 URL: https://issues.apache.org/jira/browse/HDFS-7386
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, security
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
Priority: Trivial
 Fix For: 2.7.0

 Attachments: HDFS-7386.001.patch, HDFS-7386.002.patch


 Per discussion in HDFS-7382, I'm filing this jira as a follow-up, to replace 
 check port number < 1024 with shared isPrivilegedPort method.
 Thanks [~cnauroth] for the work on HDFS-7382 and suggestion there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes

2014-11-14 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213244#comment-14213244
 ] 

Ming Ma commented on HDFS-7374:
---

So maybe we can use "all blocks in the whole cluster are fully replicated" 
instead of "all blocks of that dead node are fully replicated" as the 
criterion to move that dead node to the decommed state?
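
A minimal sketch of that criterion (illustrative only, not actual 
DecommissionManager code):
{code}
class DeadNodeDecomCriterionSketch {
  // A dead node still marked "in progress" can be flipped to decommissioned once
  // the whole cluster reports zero under-replicated blocks, since its data is
  // then covered by live replicas elsewhere.
  static boolean canMarkDecommissioned(boolean nodeAlive,
                                       boolean decomInProgress,
                                       long underReplicatedBlocksInCluster) {
    return !nodeAlive && decomInProgress && underReplicatedBlocksInCluster == 0;
  }
}
{code}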

 Allow decommissioning of dead DataNodes
 ---

 Key: HDFS-7374
 URL: https://issues.apache.org/jira/browse/HDFS-7374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch


 We have seen the use case of decommissioning DataNodes that are already dead 
 or unresponsive, and not expected to rejoin the cluster.
 The logic introduced by HDFS-6791 will mark those nodes as 
 {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish 
 the decommission work. If an upper layer application is monitoring the 
 decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes

2014-11-14 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213247#comment-14213247
 ] 

Andrew Wang commented on HDFS-7374:
---

Yea, precisely :) I don't know how realistic this is in an active cluster with 
lots of failing disks, but it'd fix it for some users at least.

 Allow decommissioning of dead DataNodes
 ---

 Key: HDFS-7374
 URL: https://issues.apache.org/jira/browse/HDFS-7374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch


 We have seen the use case of decommissioning DataNodes that are already dead 
 or unresponsive, and not expected to rejoin the cluster.
 The logic introduced by HDFS-6791 will mark those nodes as 
 {{DECOMMISSION_INPROGRESS}}, with a hope that they can come back and finish 
 the decommission work. If an upper layer application is monitoring the 
 decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues

2014-11-14 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213246#comment-14213246
 ] 

Allen Wittenauer commented on HDFS-7400:


bq. Disk array controller firmware has a bug. So disks stop working.
...
bq. The machine can be pinged.
bq. The machine can't be sshed.

Was ssh actually opening the socket and just not completing the login process?

On the surface, this sounds like typical Linux IO weirdisms, but I want to make 
sure. 

bq. Out of curiosity, did your failure condition result in a situation where df 
worked, but the disk was otherwise non-functional? 

I keep thinking about the situation where there are two controllers but only 
one went belly up. Doing things like df or even a write+read combo might not be 
sufficient unless we do it across all devices.  I suspect:

bq. Have other machines help to make the decision whether the NN is actually 
healthy. 

... might be the only truly viable solution under various failure modes.

 More reliable namenode health check to detect OS/HW issues
 --

 Key: HDFS-7400
 URL: https://issues.apache.org/jira/browse/HDFS-7400
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Ming Ma

 We had this scenario on an active NN machine.
 * Disk array controller firmware has a bug. So disks stop working.
 * ZKFC and NN still considered the node healthy; Communications between ZKFC 
 and ZK as well as ZKFC and NN are good.
 * The machine can be pinged.
 * The machine can't be sshed.
 So all clients and DNs can't use the NN. But ZKFC and NN still consider the 
 node healthy.
 The question is how we can have ZKFC and NN detect such OS/HW specific issues 
 quickly? Some ideas we discussed briefly,
 * Have other machines help to make the decision whether the NN is actually 
 healthy. Then you have to figure out to make the decision accurate in the 
 case of network issue, etc.
 * Run OS/HW health check script external to ZKFC/NN on the same machine. If 
 it detects disk or other issues, it can reboot the machine for example.
 * Run OS/HW health check script inside ZKFC/NN. For example NN's 
 HAServiceProtocol#monitorHealth can be modified to call such health check 
 script.
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7146) NFS ID/Group lookup requires SSSD enumeration on the server

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213254#comment-14213254
 ] 

Hadoop QA commented on HDFS-7146:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681667/HDFS-7146.005.patch
  against trunk revision 4fb96db.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs-nfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8746//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8746//console

This message is automatically generated.

 NFS ID/Group lookup requires SSSD enumeration on the server
 ---

 Key: HDFS-7146
 URL: https://issues.apache.org/jira/browse/HDFS-7146
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
Reporter: Yongjun Zhang
Assignee: Yongjun Zhang
 Attachments: HDFS-7146.001.patch, HDFS-7146.002.allIncremental.patch, 
 HDFS-7146.003.patch, HDFS-7146.004.patch, HDFS-7146.005.patch


 The current implementation of the NFS UID and GID lookup works by running 
 'getent passwd' with the assumption that it will return the entire list of 
 users available on the OS, local and remote (AD/etc.).
 Such full enumeration is advised against, and is prevented by administrators in 
 most secure setups, to avoid excessive load on the ADs involved: the number of 
 users to be listed may be too large, and repeatedly requesting ALL users not 
 present in the cache would be too much for the AD infrastructure to bear.
 The NFS server should likely do lookups based on a specific UID request, via 
 'getent passwd UID', if the UID does not match a cached value. This reduces 
 load on the LDAP-backed infrastructure.
 Thanks [~qwertymaniac] for reporting the issue.
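A rough sketch of the per-UID lookup idea described above. The class, cache layout and parsing below are assumptions made for illustration only, not the shape of the actual patch.

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative per-UID lookup: ask the OS for one entry instead of enumerating all users. */
public class UidResolver {
  private final Map<Integer, String> uidToName = new ConcurrentHashMap<>();

  /** Returns the account name for a UID, consulting the cache first. */
  public String getName(int uid) throws IOException {
    String cached = uidToName.get(uid);
    if (cached != null) {
      return cached;
    }
    // 'getent passwd <uid>' resolves a single entry via NSS (files, LDAP/AD via SSSD, ...).
    Process p = new ProcessBuilder("getent", "passwd", String.valueOf(uid)).start();
    try (BufferedReader r = new BufferedReader(
        new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
      String line = r.readLine();            // e.g. "alice:x:1001:1001::/home/alice:/bin/bash"
      if (line == null) {
        return null;                         // unknown UID; caller decides on a fallback name
      }
      String name = line.split(":", 2)[0];
      uidToName.put(uid, name);
      return name;
    }
  }
}
{code}

A real implementation would also bound and expire the cache, check the 'getent' exit status, and handle GID lookups the same way via 'getent group GID'.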



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7400) More reliable namenode health check to detect OS/HW issues

2014-11-14 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213281#comment-14213281
 ] 

Ming Ma commented on HDFS-7400:
---

Thanks, [~andrew.wang] and [~aw], for the comments. Here is the info I have so 
far; I will provide more once I gather additional details from our admins and HW 
engineers.

1. We couldn't access the machine except to reboot it via IPMI, so there was no 
chance to run df.
2. We didn't check how far ssh got. But given that all DNs couldn't connect to 
this NN at that point, it looks like a socket-level issue.

 More reliable namenode health check to detect OS/HW issues
 --

 Key: HDFS-7400
 URL: https://issues.apache.org/jira/browse/HDFS-7400
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Ming Ma

 We had this scenario on an active NN machine.
 * Disk array controller firmware has a bug, so the disks stop working.
 * ZKFC and NN still consider the node healthy; communications between ZKFC 
 and ZK, as well as between ZKFC and NN, are good.
 * The machine can be pinged.
 * The machine can't be reached over ssh.
 So all clients and DNs can't use the NN, but ZKFC and NN still consider the 
 node healthy.
 The question is how we can have ZKFC and NN detect such OS/HW-specific issues 
 quickly. Some ideas we discussed briefly:
 * Have other machines help decide whether the NN is actually healthy. Then you 
 have to figure out how to keep the decision accurate in the case of network 
 issues, etc.
 * Run an OS/HW health check script external to ZKFC/NN on the same machine. If 
 it detects disk or other issues, it can reboot the machine, for example.
 * Run an OS/HW health check script inside ZKFC/NN. For example, the NN's 
 HAServiceProtocol#monitorHealth can be modified to call such a health check 
 script.
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7394) Log at INFO level when InvalidToken is seen in ShortCircuitCache

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213286#comment-14213286
 ] 

Hadoop QA commented on HDFS-7394:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681654/HDFS-7394.patch
  against trunk revision 4fb96db.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestParallelShortCircuitReadUnCached

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8744//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8744//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8744//console

This message is automatically generated.

 Log at INFO level when InvalidToken is seen in ShortCircuitCache
 

 Key: HDFS-7394
 URL: https://issues.apache.org/jira/browse/HDFS-7394
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kihwal Lee
Assignee: Keith Pak
Priority: Minor
 Attachments: HDFS-7394.patch


 For long-running clients, getting an {{InvalidToken}} exception is expected, 
 and the client refetches a block token when it happens.  The related events 
 are logged at INFO except the ones in {{ShortCircuitCache}}.  It would be 
 better if those were also logged at INFO.
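The change itself is just a log-level choice for an expected, recoverable condition. A schematic, self-contained illustration (the exception and fetcher types below are simplified stand-ins, not the real ShortCircuitCache classes):

{code:java}
import java.util.logging.Level;
import java.util.logging.Logger;

/** Schematic of logging an expected token expiry at INFO rather than WARN/ERROR. */
public class ShortCircuitFetchExample {
  private static final Logger LOG = Logger.getLogger(ShortCircuitFetchExample.class.getName());

  /** Stand-in for the real InvalidToken exception thrown when a block token has expired. */
  static class InvalidTokenException extends Exception {
    InvalidTokenException(String msg) { super(msg); }
  }

  interface SlotFetcher { void fetch(long blockId) throws InvalidTokenException; }

  /** Expected, recoverable condition: log at INFO and let the caller refetch the token. */
  static boolean tryFetch(SlotFetcher fetcher, long blockId) {
    try {
      fetcher.fetch(blockId);
      return true;
    } catch (InvalidTokenException e) {
      LOG.log(Level.INFO, "Block token expired for block " + blockId
          + "; client will refetch the token and retry", e);
      return false;
    }
  }
}
{code}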



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7270) Implementing congestion control in writing pipeline

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213287#comment-14213287
 ] 

Hadoop QA commented on HDFS-7270:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681462/HDFS-7270.000.patch
  against trunk revision 49c3889.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.server.datanode.TestDataNodeMetrics
  org.apache.hadoop.hdfs.TestCrcCorruption

  The following test timeouts occurred in 
hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestParallelShortCircuitReadUnCached

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8743//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8743//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8743//console

This message is automatically generated.

 Implementing congestion control in writing pipeline
 ---

 Key: HDFS-7270
 URL: https://issues.apache.org/jira/browse/HDFS-7270
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-7270.000.patch


 When a client writes to HDFS faster than the disk bandwidth of the DNs, it 
 saturates the disk bandwidth and renders the DNs unresponsive. The client only 
 backs off by aborting / recovering the pipeline, which leads to failed writes 
 and unnecessary pipeline recovery.
 This jira proposes to add explicit congestion control mechanisms to the 
 writing pipeline.
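As one way to picture "explicit congestion control", here is a minimal client-side backoff sketch, assuming a hypothetical congestion flag piggybacked on pipeline acks. It is only an illustration of the general idea, not the mechanism in HDFS-7270.000.patch.

{code:java}
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative client-side backoff driven by a hypothetical congestion signal in the ack. */
public class CongestionBackoff {
  private static final long BASE_MS = 50;
  private static final long MAX_MS = 5_000;

  private long currentDelayMs = 0;

  /** Called per pipeline ack; 'congested' would be a flag piggybacked by a DN. */
  public void onAck(boolean congested) throws InterruptedException {
    if (congested) {
      // Multiplicative increase of the pause, with jitter to avoid synchronized clients.
      currentDelayMs = Math.min(MAX_MS, Math.max(BASE_MS, currentDelayMs * 2));
      long jitter = ThreadLocalRandom.current().nextLong(currentDelayMs / 2 + 1);
      Thread.sleep(currentDelayMs + jitter);
    } else {
      // Additive decrease once the DNs report they have caught up.
      currentDelayMs = Math.max(0, currentDelayMs - BASE_MS);
    }
  }
}
{code}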



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213324#comment-14213324
 ] 

Hadoop QA commented on HDFS-4882:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681664/HDFS-4882.4.patch
  against trunk revision 4fb96db.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8745//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8745//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8745//console

This message is automatically generated.

 Namenode LeaseManager checkLeases() runs into infinite loop
 ---

 Key: HDFS-4882
 URL: https://issues.apache.org/jira/browse/HDFS-4882
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client, namenode
Affects Versions: 2.0.0-alpha, 2.5.1
Reporter: Zesheng Wu
Assignee: Ravi Prakash
Priority: Critical
 Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, 
 HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch


 Scenario:
 1. cluster with 4 DNs
 2. the size of the file to be written is a little more than one block
 3. write the first block to 3 DNs, DN1-DN2-DN3
 4. all the data packets of the first block are successfully acked and the client 
 sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
 5. DN2 and DN3 go down
 6. the client recovers the pipeline, but no new DN is added to the pipeline 
 because the current pipeline stage is PIPELINE_CLOSE
 7. the client continues writing the last block, and tries to close the file after 
 writing all the data
 8. the NN finds that the penultimate block doesn't have enough replicas (our 
 dfs.namenode.replication.min=2), the client's close runs into an indefinite 
 loop (HDFS-2936), and at the same time, the NN sets the last block's state to 
 COMPLETE
 9. shut down the client
 10. the file's lease exceeds the hard limit
 11. LeaseManager realizes that and begins lease recovery by calling 
 fsnamesystem.internalReleaseLease()
 12. but the last block's state is COMPLETE, and this triggers the lease manager's 
 infinite loop and prints massive logs like this:
 {noformat}
 2013-06-05,17:42:25,695 INFO 
 org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: 
 DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard
  limit
 2013-06-05,17:42:25,695 INFO 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. 
  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=
 /user/h_wuzesheng/test.dat
 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
 NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block 
 blk_-7028017402720175688_1202597,
 lastBLockState=COMPLETE
 2013-06-05,17:42:25,695 INFO 
 org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery 
 for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONM
 APREDUCE_-1252656407_1, pendingcreates: 1]
 {noformat}
 (the 3rd line log is a debug log added by us; a sketch of a possible loop 
 guard follows below)
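For illustration, a simplified model of the expired-lease sweep with a guard that prevents re-picking the same unreleasable lease forever. The types and method names are made up; the real LeaseManager/FSNamesystem code is more involved.

{code:java}
import java.util.SortedSet;

/** Simplified model of the expired-lease sweep, with a guard against spinning in place. */
public class LeaseSweepExample {
  interface Lease extends Comparable<Lease> {
    boolean expiredHardLimit();
    /** Returns true if the lease was fully released (so it can leave the sorted set). */
    boolean tryRelease();
  }

  /** Keeps sweeping while the oldest lease is expired, but stops if no progress is made. */
  static void checkLeases(SortedSet<Lease> sortedLeases) {
    while (!sortedLeases.isEmpty() && sortedLeases.first().expiredHardLimit()) {
      Lease oldest = sortedLeases.first();
      if (oldest.tryRelease()) {
        sortedLeases.remove(oldest);       // normal case: the lease goes away
      } else {
        // The lease could not be released (e.g. the last block is already COMPLETE but
        // the penultimate block is under-replicated). Without this break the loop would
        // pick the same lease again forever, which is the reported infinite loop.
        break;
      }
    }
  }
}
{code}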



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213326#comment-14213326
 ] 

Vinayakumar B commented on HDFS-7384:
-

Thanks Chris.
For the effective action, maybe we can have a separate method, without affecting 
the current fields.
It would just be an alternative way for the client to get the effective action, 
instead of calculating it on its own.

I will upload a patch soon.

 'getfacl' command and 'getAclStatus' output should be in sync
 -

 Key: HDFS-7384
 URL: https://issues.apache.org/jira/browse/HDFS-7384
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinayakumar B
Assignee: Vinayakumar B

 The *getfacl* command prints all the entries, including basic and extended 
 entries, mask entries and effective permissions.
 But the *getAclStatus* FileSystem API returns only the extended ACL entries set 
 by the user; it does not include the mask entry or the effective permissions.
 To benefit clients using the API, it would be better to include the 'mask' entry 
 and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-6982) nntop: top-like tool for name node users

2014-11-14 Thread Maysam Yabandeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maysam Yabandeh updated HDFS-6982:
--
Attachment: HDFS-6982.v9.patch

Attaching the new patch. [~andrew.wang], I ended up moving the TopMetrics 
initialization to FSNamesystem, where I register the TopAuditLogger with the 
auditLoggers.
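As a toy illustration of the audit-logger hook (not the TopMetrics/TopAuditLogger code in this patch), the sketch below counts (user, command) pairs with two hash lookups per audit event, as the design description mentions. The SimpleAuditLogger interface is a simplified stand-in for the NameNode's real audit callback.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Toy per-user, per-op counter in the spirit of nntop's audit-logger hook. */
public class TopCountingLogger {

  /** Simplified stand-in for the NameNode audit callback. */
  public interface SimpleAuditLogger {
    void logAuditEvent(boolean succeeded, String user, String cmd, String src);
  }

  /** Counts (user, cmd) pairs with two hash lookups per event. */
  public static class CountingAuditLogger implements SimpleAuditLogger {
    private final Map<String, Map<String, LongAdder>> counts = new ConcurrentHashMap<>();

    @Override
    public void logAuditEvent(boolean succeeded, String user, String cmd, String src) {
      counts.computeIfAbsent(user, u -> new ConcurrentHashMap<>())   // lookup 1: user
            .computeIfAbsent(cmd, c -> new LongAdder())              // lookup 2: command
            .increment();
    }

    /** Snapshot used by a hypothetical "top users" report. */
    public Map<String, Map<String, LongAdder>> snapshot() {
      return counts;
    }
  }
}
{code}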

 nntop: top-like tool for name node users
 -

 Key: HDFS-6982
 URL: https://issues.apache.org/jira/browse/HDFS-6982
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh
 Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, 
 HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, 
 HDFS-6982.v7.patch, HDFS-6982.v8.patch, HDFS-6982.v9.patch, 
 nntop-design-v1.pdf


 In this jira we motivate the need for nntop, a tool that, similarly to what 
 top does in Linux, gives the list of top users of the HDFS name node and 
 gives insight into which users are sending the majority of each traffic type to 
 the name node. This information turns out to be most critical when the 
 name node is under pressure and the HDFS admin needs to know which user is 
 hammering the name node and with what kind of requests. Here we present the 
 design of nntop, which has been in production at Twitter for the past 10 
 months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K 
 nodes), a low memory footprint (less than a few MB), and to be quite efficient on 
 the write path (only two hash lookups to update a metric).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes

2014-11-14 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213355#comment-14213355
 ] 

Ming Ma commented on HDFS-7374:
---

Yeah, that seems reasonable. How likely you are to find the whole cluster fully 
replicated might depend on how you count it. If it is based on a full scan of 
the blockmap, the chance of hitting the "all blocks fully replicated" condition might 
be low, given it also includes newly added blocks for which not all DNs 
have sent IBRs; in addition, it has to hold the FSNamesystem lock for a longer 
period of time. If it is based on {{BlockManager}}'s 
{{pendingReplicationBlocksCount}} + {{underReplicatedBlocksCount}}, the 
chance might be higher, and it is faster (a rough sketch follows below).

On the note of tracking the blocks of those DECOMM_IN_PROGRESS DNs, it might be 
useful to add that feature later. It also helps another scenario, something 
[~kihwal] and [~daryn] mentioned before: {{isReplicationInProgress}} currently 
rescans all blocks of a given node each time the method is called, which isn't 
efficient as more blocks become fully replicated.

We can keep a separate list of DECOMM_IN_PROGRESS blocks which points to the 
DECOMM_IN_PROGRESS DNs. {{DecommissionManager}} would scan this list regularly. 
Each scan would shrink the list as blocks become fully replicated and recompute 
the latest list of DECOMM_IN_PROGRESS DNs. In normal decomm operations, the number 
of DECOMM_IN_PROGRESS DNs should be much smaller than the total number of DNs in a 
large cluster, so the extra memory overhead might be acceptable.
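A rough sketch of the counter-based check described above, using hypothetical accessor names rather than BlockManager's actual fields.

{code:java}
/** Illustrative "is the cluster fully replicated?" check based on aggregate counters. */
public class ReplicationQuiescenceCheck {

  /** Hypothetical view of the two BlockManager counters mentioned above. */
  public interface BlockCounters {
    long pendingReplicationBlocks();
    long underReplicatedBlocks();
  }

  /**
   * Counter-based check: cheap and lock-light, but optimistic, since blocks whose
   * incremental block reports have not arrived yet are not reflected in the counters.
   */
  public static boolean looksFullyReplicated(BlockCounters counters) {
    return counters.pendingReplicationBlocks() == 0
        && counters.underReplicatedBlocks() == 0;
  }
}
{code}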

 Allow decommissioning of dead DataNodes
 ---

 Key: HDFS-7374
 URL: https://issues.apache.org/jira/browse/HDFS-7374
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Zhe Zhang
Assignee: Zhe Zhang
 Attachments: HDFS-7374-001.patch, HDFS-7374-002.patch


 We have seen the use case of decommissioning DataNodes that are already dead 
 or unresponsive, and not expected to rejoin the cluster.
 The logic introduced by HDFS-6791 will mark those nodes as 
 {{DECOMMISSION_INPROGRESS}}, with the hope that they can come back and finish 
 the decommission work. If an upper-layer application is monitoring the 
 decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods

2014-11-14 Thread Tsz Wo Nicholas Sze (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213392#comment-14213392
 ] 

Tsz Wo Nicholas Sze commented on HDFS-7279:
---

 the throw clause comes from the super class thus it cannot be removed.

It actually can be removed, since removing it narrows the declaration (a small 
illustration follows below).

+1 the new patch looks good other than that.
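For reference, the "narrowing" point is the standard Java rule that an overriding method may declare fewer checked exceptions than the method it overrides. A minimal illustration with made-up names:

{code:java}
import java.io.IOException;

/** An overriding method may drop checked exceptions declared by the overridden one. */
public class ThrowsNarrowingExample {
  static class Base {
    public void handle() throws IOException {        // super declares IOException
      throw new IOException("base may fail");
    }
  }

  static class Derived extends Base {
    @Override
    public void handle() {                            // legal: narrower throws clause
      // no checked exception declared or thrown
    }
  }

  public static void main(String[] args) {
    Base b = new Derived();
    try {
      b.handle();            // callers through Base still handle IOException
    } catch (IOException e) {
      // unreachable with Derived, but required by Base's declared contract
    }
  }
}
{code}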

 Use netty to implement DatanodeWebHdfsMethods
 -

 Key: HDFS-7279
 URL: https://issues.apache.org/jira/browse/HDFS-7279
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, webhdfs
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, 
 HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, 
 HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, 
 HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, 
 HDFS-7279.011.patch


 Currently the DN implements all related webhdfs functionality using jetty. As 
 the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer 
 and connection management, the DN often suffers from long latency and OOM when 
 its webhdfs component is under sustained heavy load.
 This jira proposes to implement the webhdfs component in the DN using netty, 
 which can be more efficient and allows finer-grained control over webhdfs.
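For readers unfamiliar with netty, the sketch below is a generic netty 4 HTTP server bootstrap showing the kind of explicit event-loop and per-connection pipeline control the description refers to. It is not the DN handler chain from this patch; the port and aggregator size are arbitrary.

{code:java}
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.*;

/** Generic netty 4 HTTP server bootstrap with explicit event-loop and pipeline setup. */
public class MiniHttpServer {
  public static void main(String[] args) throws InterruptedException {
    NioEventLoopGroup boss = new NioEventLoopGroup(1);   // accepts connections
    NioEventLoopGroup workers = new NioEventLoopGroup(); // handles I/O; size is tunable
    try {
      ServerBootstrap b = new ServerBootstrap()
          .group(boss, workers)
          .channel(NioServerSocketChannel.class)
          .childHandler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
              ch.pipeline()
                .addLast(new HttpServerCodec())               // HTTP decode/encode
                .addLast(new HttpObjectAggregator(64 * 1024)) // bound per-request buffering
                .addLast(new SimpleChannelInboundHandler<FullHttpRequest>() {
                  @Override
                  protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest req) {
                    FullHttpResponse resp = new DefaultFullHttpResponse(
                        HttpVersion.HTTP_1_1, HttpResponseStatus.OK);
                    ctx.writeAndFlush(resp).addListener(ChannelFutureListener.CLOSE);
                  }
                });
            }
          });
      b.bind(8080).sync().channel().closeFuture().sync();
    } finally {
      boss.shutdownGracefully();
      workers.shutdownGracefully();
    }
  }
}
{code}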



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Vinayakumar B (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinayakumar B updated HDFS-7384:

Attachment: HDFS-7384-001.patch

Attached the patch.
Please review and give your feedback.

 'getfacl' command and 'getAclStatus' output should be in sync
 -

 Key: HDFS-7384
 URL: https://issues.apache.org/jira/browse/HDFS-7384
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinayakumar B
Assignee: Vinayakumar B
 Attachments: HDFS-7384-001.patch


 The *getfacl* command prints all the entries, including basic and extended 
 entries, mask entries and effective permissions.
 But the *getAclStatus* FileSystem API returns only the extended ACL entries set 
 by the user; it does not include the mask entry or the effective permissions.
 To benefit clients using the API, it would be better to include the 'mask' entry 
 and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Vinayakumar B (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinayakumar B updated HDFS-7384:

Status: Patch Available  (was: Open)

 'getfacl' command and 'getAclStatus' output should be in sync
 -

 Key: HDFS-7384
 URL: https://issues.apache.org/jira/browse/HDFS-7384
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinayakumar B
Assignee: Vinayakumar B
 Attachments: HDFS-7384-001.patch


 The *getfacl* command prints all the entries, including basic and extended 
 entries, mask entries and effective permissions.
 But the *getAclStatus* FileSystem API returns only the extended ACL entries set 
 by the user; it does not include the mask entry or the effective permissions.
 To benefit clients using the API, it would be better to include the 'mask' entry 
 and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6982) nntop: top-like tool for name node users

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213436#comment-14213436
 ] 

Hadoop QA commented on HDFS-6982:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681698/HDFS-6982.v9.patch
  against trunk revision 9b86066.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.metrics2.impl.TestMetricsSystemImpl
  org.apache.hadoop.ha.TestZKFailoverControllerStress
  org.apache.hadoop.hdfs.server.namenode.ha.TestBootstrapStandby
  org.apache.hadoop.hdfs.TestDistributedFileSystem

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8747//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8747//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8747//console

This message is automatically generated.

 nntop: top-like tool for name node users
 -

 Key: HDFS-6982
 URL: https://issues.apache.org/jira/browse/HDFS-6982
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Maysam Yabandeh
Assignee: Maysam Yabandeh
 Attachments: HDFS-6982.patch, HDFS-6982.v2.patch, HDFS-6982.v3.patch, 
 HDFS-6982.v4.patch, HDFS-6982.v5.patch, HDFS-6982.v6.patch, 
 HDFS-6982.v7.patch, HDFS-6982.v8.patch, HDFS-6982.v9.patch, 
 nntop-design-v1.pdf


 In this jira we motivate the need for nntop, a tool that, similarly to what 
 top does in Linux, gives the list of top users of the HDFS name node and 
 gives insight into which users are sending the majority of each traffic type to 
 the name node. This information turns out to be most critical when the 
 name node is under pressure and the HDFS admin needs to know which user is 
 hammering the name node and with what kind of requests. Here we present the 
 design of nntop, which has been in production at Twitter for the past 10 
 months. nntop proved to have low cpu overhead (< 2% in a cluster of 4K 
 nodes), a low memory footprint (less than a few MB), and to be quite efficient on 
 the write path (only two hash lookups to update a metric).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213438#comment-14213438
 ] 

Hadoop QA commented on HDFS-7384:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12681715/HDFS-7384-001.patch
  against trunk revision 9b86066.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 eclipse:eclipse{color}.  The patch failed to build with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/8748//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8748//console

This message is automatically generated.

 'getfacl' command and 'getAclStatus' output should be in sync
 -

 Key: HDFS-7384
 URL: https://issues.apache.org/jira/browse/HDFS-7384
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinayakumar B
Assignee: Vinayakumar B
 Attachments: HDFS-7384-001.patch


 The *getfacl* command prints all the entries, including basic and extended 
 entries, mask entries and effective permissions.
 But the *getAclStatus* FileSystem API returns only the extended ACL entries set 
 by the user; it does not include the mask entry or the effective permissions.
 To benefit clients using the API, it would be better to include the 'mask' entry 
 and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7384) 'getfacl' command and 'getAclStatus' output should be in sync

2014-11-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213439#comment-14213439
 ] 

Chris Nauroth commented on HDFS-7384:
-

I haven't reviewed the whole patch yet, but I wanted to state again quickly 
that I'd prefer to keep effective permissions out of {{AclEntry}}.

One problem is that the {{AclEntry}} class is also used in the setter APIs, 
like {{setAcl}}.  In that context, the effective permissions would be ignored.  
This could cause confusion for users of those APIs.

Another problem is that we use the same class for both the public API on the 
client side and the internal in-memory representation in the NameNode.  
Therefore, adding a new member to {{AclEntry}} would have a side effect of 
increasing memory footprint in the NameNode.  Even if we don't populate the 
field when used within the NameNode, there is still the overhead of the 
additional pointer multiplied by every ACL entry.  We could potentially change 
the NameNode to use a different class for its internal implementation, but then 
we'd have a dual-maintenance problem and a need for extra code to translate 
between the two representations.

If {{AclStatus}} could have a new method that does the calculation for an 
entry's effective permissions on demand, instead of requiring a new member in 
{{AclEntry}}, then we wouldn't impact the setter APIs or increase memory 
footprint in the NameNode.
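To make the "compute on demand" option concrete: for POSIX-style ACLs, the effective permission of a mask-constrained entry is the entry's permissions intersected with the mask. A self-contained toy model (the types below are simplified stand-ins, not Hadoop's AclEntry/AclStatus):

{code:java}
import java.util.EnumSet;
import java.util.Set;

/** Toy model of computing effective ACL permissions on demand instead of storing them. */
public class EffectiveAclExample {
  enum Perm { READ, WRITE, EXECUTE }

  /** For mask-constrained entries: effective = entry AND mask. */
  static Set<Perm> effective(Set<Perm> entryPerms, Set<Perm> maskPerms, boolean maskApplies) {
    EnumSet<Perm> result = EnumSet.noneOf(Perm.class);
    result.addAll(entryPerms);
    if (maskApplies) {
      result.retainAll(maskPerms);   // the mask caps what the entry can grant
    }
    return result;
  }

  public static void main(String[] args) {
    Set<Perm> entry = EnumSet.of(Perm.READ, Perm.WRITE);  // named user: rw-
    Set<Perm> mask = EnumSet.of(Perm.READ);               // mask: r--
    System.out.println(effective(entry, mask, true));     // prints [READ]
  }
}
{code}

Exposing something like this from the status object, rather than storing the result per entry, avoids both the setter-API confusion and the extra per-entry pointer in the NameNode.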

 'getfacl' command and 'getAclStatus' output should be in sync
 -

 Key: HDFS-7384
 URL: https://issues.apache.org/jira/browse/HDFS-7384
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Vinayakumar B
Assignee: Vinayakumar B
 Attachments: HDFS-7384-001.patch


 The *getfacl* command prints all the entries, including basic and extended 
 entries, mask entries and effective permissions.
 But the *getAclStatus* FileSystem API returns only the extended ACL entries set 
 by the user; it does not include the mask entry or the effective permissions.
 To benefit clients using the API, it would be better to include the 'mask' entry 
 and effective permissions in the returned list of entries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods

2014-11-14 Thread Haohui Mai (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213442#comment-14213442
 ] 

Haohui Mai commented on HDFS-7279:
--

The v12 patch removes the excessive throws clauses.

 Use netty to implement DatanodeWebHdfsMethods
 -

 Key: HDFS-7279
 URL: https://issues.apache.org/jira/browse/HDFS-7279
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, webhdfs
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, 
 HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, 
 HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, 
 HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, 
 HDFS-7279.011.patch, HDFS-7279.012.patch


 Currently the DN implements all related webhdfs functionality using jetty. As 
 the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer 
 and connection management, the DN often suffers from long latency and OOM when 
 its webhdfs component is under sustained heavy load.
 This jira proposes to implement the webhdfs component in the DN using netty, 
 which can be more efficient and allows finer-grained control over webhdfs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7279) Use netty to implement DatanodeWebHdfsMethods

2014-11-14 Thread Haohui Mai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haohui Mai updated HDFS-7279:
-
Attachment: HDFS-7279.012.patch

 Use netty to implement DatanodeWebHdfsMethods
 -

 Key: HDFS-7279
 URL: https://issues.apache.org/jira/browse/HDFS-7279
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, webhdfs
Reporter: Haohui Mai
Assignee: Haohui Mai
 Attachments: HDFS-7279.000.patch, HDFS-7279.001.patch, 
 HDFS-7279.002.patch, HDFS-7279.003.patch, HDFS-7279.004.patch, 
 HDFS-7279.005.patch, HDFS-7279.006.patch, HDFS-7279.007.patch, 
 HDFS-7279.008.patch, HDFS-7279.009.patch, HDFS-7279.010.patch, 
 HDFS-7279.011.patch, HDFS-7279.012.patch


 Currently the DN implements all related webhdfs functionality using jetty. As 
 the jetty version the DN currently uses (jetty 6) lacks fine-grained buffer 
 and connection management, the DN often suffers from long latency and OOM when 
 its webhdfs component is under sustained heavy load.
 This jira proposes to implement the webhdfs component in the DN using netty, 
 which can be more efficient and allows finer-grained control over webhdfs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

