[jira] [Created] (HDFS-11039) Expose more configuration properties to hdfs-default.xml

2016-10-19 Thread Yi Liu (JIRA)
Yi Liu created HDFS-11039:
-

 Summary: Expose more configuration properties to hdfs-default.xml
 Key: HDFS-11039
 URL: https://issues.apache.org/jira/browse/HDFS-11039
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: documentation, newbie
Reporter: Yi Liu
Assignee: Jennica Pounds
Priority: Minor


There are some HDFS configuration properties that have not yet been exposed in 
hdfs-default.xml.

It would be convenient for Hadoop users and admins if we added them to 
hdfs-default.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11020) Add more doc for HDFS transparent encryption

2016-10-17 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-11020:
--
Assignee: Jennica Pounds  (was: Yi Liu)

> Add more doc for HDFS transparent encryption
> 
>
> Key: HDFS-11020
> URL: https://issues.apache.org/jira/browse/HDFS-11020
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: documentation, encryption, fs
>Reporter: Yi Liu
>Assignee: Jennica Pounds
>Priority: Minor
>
> We need a correct version of OpenSSL that supports hardware acceleration of 
> AES-CTR.
> Let's add more documentation about how to configure the correct OpenSSL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11020) Add more doc for HDFS transparent encryption

2016-10-17 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-11020:
--
Summary: Add more doc for HDFS transparent encryption  (was: Enable HDFS 
transparent encryption doc)

> Add more doc for HDFS transparent encryption
> 
>
> Key: HDFS-11020
> URL: https://issues.apache.org/jira/browse/HDFS-11020
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: documentation, encryption, fs
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Minor
>
> We need a correct version of OpenSSL that supports hardware acceleration of 
> AES-CTR.
> Let's add more documentation about how to configure the correct OpenSSL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11020) Enable HDFS transparent encryption doc

2016-10-17 Thread Yi Liu (JIRA)
Yi Liu created HDFS-11020:
-

 Summary: Enable HDFS transparent encryption doc
 Key: HDFS-11020
 URL: https://issues.apache.org/jira/browse/HDFS-11020
 Project: Hadoop HDFS
  Issue Type: Task
  Components: documentation, encryption, fs
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Minor


We need a correct version of OpenSSL that supports hardware acceleration of AES-CTR.
Let's add more documentation about how to configure the correct OpenSSL.
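
As a hedged illustration (not part of the original report; API names are from the Hadoop common crypto package), the doc could point readers at a quick programmatic check of whether the native OpenSSL AES-CTR cipher is available, which is essentially what {{hadoop checknative}} reports:

{code}
// Minimal sketch: verify that the OpenSSL-backed cipher loaded successfully.
String failure = org.apache.hadoop.crypto.OpensslCipher.getLoadingFailureReason();
if (failure != null) {
  System.out.println("OpenSSL cipher not available: " + failure);
} else {
  System.out.println("OpenSSL cipher available, library: "
      + org.apache.hadoop.crypto.OpensslCipher.getLibraryName());
}
{code}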



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2016-04-27 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261467#comment-15261467
 ] 

Yi Liu commented on HDFS-9276:
--

More than one person has told me they hit this issue in a real cluster and asked 
me to help push the fix forward.
From my point of view, the approach in this patch is generally OK, but it may 
still need some refinement. 

[~daryn], [~ste...@apache.org], [~cnauroth], could you help to check too?



> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, HDFS-9276.04.patch, HDFS-9276.05.patch, 
> HDFS-9276.06.patch, HDFS-9276.07.patch, HDFS-9276.08.patch, 
> HDFS-9276.09.patch, HDFS-9276.10.patch, HDFS-9276.11.patch, 
> HDFS-9276.12.patch, HDFS-9276.13.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long-running applications. 
> The HDFS client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause the tokens to expire.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> null
>   }
> })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration on the NameNode:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 3min
> {code}
> The bug will occur after 3 minutes.
> The stacktrace is:
> {code}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 330156 for test) is expired
>   at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> 

[jira] [Updated] (HDFS-8786) Erasure coding: DataNode should transfer striped blocks before being decommissioned

2016-02-14 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-8786:
-
Assignee: Rakesh R  (was: Yi Liu)

> Erasure coding: DataNode should transfer striped blocks before being 
> decommissioned
> ---
>
> Key: HDFS-8786
> URL: https://issues.apache.org/jira/browse/HDFS-8786
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Zhe Zhang
>Assignee: Rakesh R
>
> Per [discussion | 
> https://issues.apache.org/jira/browse/HDFS-8697?focusedCommentId=14609004=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14609004]
>  under HDFS-8697, it's too expensive to reconstruct block groups for decomm 
> purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9713) DataXceiver#copyBlock should return if block is pinned

2016-02-04 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133453#comment-15133453
 ] 

Yi Liu commented on HDFS-9713:
--

+1, good catch. Thanks Uma.
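
For reference, a minimal sketch of the suggested control flow (based on the snippet quoted below; an illustration, not the attached patch):

{code}
if (datanode.data.getPinning(block)) {
  String msg = "Not able to copy block " + block.getBlockId() + " " +
      "to " + peer.getRemoteAddressString() + " because it's pinned ";
  LOG.info(msg);
  sendResponse(ERROR, msg);
  return;  // stop here; the ERROR response has already been sent
}
{code}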

> DataXceiver#copyBlock should return if block is pinned
> --
>
> Key: HDFS-9713
> URL: https://issues.apache.org/jira/browse/HDFS-9713
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.2
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-9713-00.patch
>
>
> in DataXceiver#copyBlock
> {code}
>   if (datanode.data.getPinning(block)) {
>   String msg = "Not able to copy block " + block.getBlockId() + " " +
>   "to " + peer.getRemoteAddressString() + " because it's pinned ";
>   LOG.info(msg);
>   sendResponse(ERROR, msg);
> }
> {code}
> I think we should return instead of proceeding to send the block, as we 
> already sent ERROR here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9582) TestLeaseRecoveryStriped file missing Apache License header and not well formatted

2015-12-19 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15065602#comment-15065602
 ] 

Yi Liu commented on HDFS-9582:
--

+1 pending Jenkins.
Thanks Uma.

> TestLeaseRecoveryStriped file missing Apache License header and not well 
> formatted
> --
>
> Key: HDFS-9582
> URL: https://issues.apache.org/jira/browse/HDFS-9582
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Minor
> Attachments: HDFS-9582-Trunk.00.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (HDFS-8562) HDFS Performance is impacted by FileInputStream Finalizer

2015-12-13 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-8562:
-
Comment: was deleted

(was: GitHub user hash-X opened a pull request:

https://github.com/apache/hadoop/pull/42

AltFileInputStream.java replace FileInputStream.java in apache/hadoop/HDFS



A brief description:
Long stop-the-world GC pauses due to Final Reference processing are 
observed.
So, where do those Final References come from?

1 : `Finalizer`
2 : `FileInputStream`

How do we solve this problem?

The detailed description and a proposed solution are here:
https://issues.apache.org/jira/browse/HDFS-8562

FileInputStream has a finalize method, and it can cause long GC pauses (G1 was 
the collector in our tests). So AltFileInputStream has no finalizer; it is a new 
input stream design usable on both Windows and non-Windows platforms.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hash-X/hadoop AltFileInputStream

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hadoop/pull/42.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #42


commit 8d64ef0feb8c8d8f5d5823ccaa428a1b58f6fd04
Author: zhangminglei 
Date:   2015-07-19T09:50:19Z

Add some code.

commit 3ccf4c70c40cf1ba921d76b949317b5fd6752e3c
Author: zhangminglei 
Date:   2015-07-19T09:56:49Z

I cannot replace FileInputStream with NewFileInputStream in a casual way, 
because the change can damage other parts of HDFS. For example, when I test my 
code using a single-node (pseudo-distributed) cluster, "Failed to load an 
FSImage file." happens when I start the HDFS daemons. At first, I replaced 
many FileInputStream occurrences used as arguments or in constructors with 
NewFileInputStream, but that seems wrong, so I have to do this in another 
way.

commit 4da55130586ee9803a09162f7e2482b533aa12d9
Author: zhangminglei 
Date:   2015-07-19T10:30:11Z

Replacing FIS with NFIS (NewFileInputStream) everywhere is not recommended, I 
think, although Alan Bateman suggested that in 
https://bugs.openjdk.java.net/browse/JDK-8080225
Testing shows it is not good; some problems may happen, and these tests 
take a long time. Every time I change the source code, I need to
build the whole project (maybe it is not needed). But since I install the new
version of Hadoop on my computer, building the whole project is needed. There
should be a better way to do it, I think.

commit 06b1509e0ad6dd74cf7c903e6ed6f2ec74d9b341
Author: zhangminglei 
Date:   2015-07-19T11:06:37Z

Replace FIS with NFIS; if the tests succeed, just do these first. It is not as
simple as that.

commit 2a79cd9c3b012556af7db5bdbf96663a1c30dcc4
Author: zhangminglei 
Date:   2015-07-20T02:36:55Z

Add a LOG info in DataXceiver for test.

commit 436c998ae21b3fe843b2d5ba6506e37ff2a34ab2
Author: zhangminglei 
Date:   2015-07-20T06:01:41Z

Rename NewFileInputStream to AltFileInputStream.

commit 14de2788ea2407c6ee252a69cfd3b4f6132c6faa
Author: zhangminglei 
Date:   2015-07-20T06:16:32Z

replace License header to Apache.

commit 387f7624a96716abef2062986f05523199e1927e
Author: zhangminglei 
Date:   2015-07-20T07:16:25Z

Remove open method in AltFileInputStream.java.

commit 52b029fac56bc054add1eac836e6cf71a0735304
Author: zhangminglei 
Date:   2015-07-20T10:14:09Z

Performance comparison between AltFileInputStream and FileInputStream is not 
done in this commit. The important question, I think, is whether 
AltFileInputStream can be converted
to FileInputStream safely. I defined a frame plan to do it, but I don't know 
whether this is correct for the problem. In the HDFS code, forced conversion to
FileInputStream happens everywhere.

commit e76d5eb4bf0145a4b28c581ecec07dcee7bae4e5
Author: zhangminglei 
Date:   2015-07-20T13:11:24Z

I think the forced conversion is safe, because AltFileInputStream is
a subclass of InputStream. In the previous version of HDFS, the
conversions to FileInputStream I see are safe because those methods return
InputStream, which is the superclass of FileInputStream. In my version
of HDFS, InputStream is also the superclass of
AltFileInputStream, so an AltFileInputStream is an InputStream just like
a FileInputStream is an InputStream too. So I think it is
safe. Does everyone agree? If not, please give your opinion and tell me what's
wrong with that. Thank you.

commit 

[jira] [Commented] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-11-06 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993370#comment-14993370
 ] 

Yi Liu commented on HDFS-9276:
--

In the test, you get token1/token2 for the current login user, suppose it's 
UserA. When you use the tokens elsewhere, you should doAs that user 
"UserA", right? You doAs a different user, "Test" (a remote login user), which 
means the service (the Spark executor) will use UserA's delegation tokens to access 
HDFS on behalf of the "Test" user, is that right?
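
For illustration, a minimal hedged sketch (user names, keytab path, and the renewer string are hypothetical) of keeping the token owner and the doAs user consistent:

{code}
// Fetch delegation tokens while logged in as the keytab principal ("UserA")...
final Credentials creds = new Credentials();
UserGroupInformation userA =
    UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab);
userA.doAs(new PrivilegedExceptionAction<Void>() {
  @Override
  public Void run() throws Exception {
    FileSystem.get(new Configuration()).addDelegationTokens("renewer", creds);
    return null;
  }
});

// ...and use them under a UGI for that same user, not an unrelated remote user.
UserGroupInformation sameUser =
    UserGroupInformation.createRemoteUser(userA.getShortUserName());
sameUser.addCredentials(creds);
{code}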


> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, HDFS-9276.04.patch, HDFS-9276.05.patch, 
> HDFS-9276.06.patch, HDFS-9276.07.patch, HDFS-9276.08.patch, 
> HDFS-9276.09.patch, HDFS-9276.10.patch, HDFS-9276.11.patch, 
> HDFS-9276.12.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long-running applications. 
> The HDFS client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause the tokens to expire.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> null
>   }
> })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration on the NameNode:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 3min
> {code}
> The bug will occur after 3 minutes.
> The stacktrace is:
> {code}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 330156 for test) is expired
>   at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991615#comment-14991615
 ] 

Yi Liu commented on HDFS-4937:
--

This time it's correct. The logic of the current patch is straightforward. 

+1 for the {{v1}} patch, thanks Kihwal.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v1.patch, 
> HDFS-4937.v2.patch, HDFS-4937.v3.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the 
> network topology is updated. If the refresh happens at the right moment, the 
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}. 
> This is because the cached cluster size is used in the terminal condition 
> check of the loop. This usually happens when a block with a high replication 
> factor is being processed. Since replicas/rack is also calculated beforehand, 
> no node choice may satisfy the goodness criteria if refreshing removed racks. 
> All nodes will end up in the excluded list, but the size will still be less 
> than the cached cluster size, so it will loop infinitely. This was observed 
> in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-11-05 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991686#comment-14991686
 ] 

Yi Liu commented on HDFS-9276:
--

Yes, even if security is not enabled, if there is a token in the user's 
credentials, SASL is used, so the delegation token will be used for authentication.

Further comments on the tests:
*1.* Currently we get token1 and token2 for the current user, but create a "test" 
user to list files. We should make these consistent: either use doAs of 
"Test" to get the tokens, or don't create the "Test" user at all and just add 
the tokens to the current user's UGI.
*2.* There is no need to modify {{getDelegationToken}}; we can get token1/token2 and 
add them to the UGI directly, and we also don't need creds1/creds2.
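
A minimal sketch of that suggestion (hedged; the renewer string is illustrative):

{code}
// Get the HDFS delegation token directly and attach it to the current UGI,
// without extra Credentials objects or changes to getDelegationToken.
FileSystem fs = FileSystem.get(new Configuration());
Token<?> token = fs.getDelegationToken("renewer");
UserGroupInformation.getCurrentUser().addToken(token);
{code}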

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, HDFS-9276.04.patch, HDFS-9276.05.patch, 
> HDFS-9276.06.patch, HDFS-9276.07.patch, HDFS-9276.08.patch, 
> HDFS-9276.09.patch, HDFS-9276.10.patch, HDFS-9276.11.patch, debug1.PNG, 
> debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long-running applications. 
> The HDFS client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause the tokens to expire.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> null
>   }
> })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration on the NameNode:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 3min
> {code}
> The bug will occur after 3 minutes.
> The stacktrace is:
> {code}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 330156 for test) is expired
>   at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> 

[jira] [Commented] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-11-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986402#comment-14986402
 ] 

Yi Liu commented on HDFS-9276:
--

Liangliang, thanks for the update. See the following comments. (I forgot to add 
#1 in my previous comments.)
*1.*
{code}
tokenMap.put(alias, t);

// update private token
...
{code}

We only need to update the private token when the token already exists; see the following:
{code}
if (tokenMap.put(alias, t) != null) {
  // update private token
  ...
}
{code}

*2.*
{code}
@Override
public int hashCode() {
  return super.hashCode() ^ publicService.hashCode();
}
{code}
Could be:
{code}
@Override
public int hashCode() {
  return super.hashCode();
}
{code}

*3.*
I looked at {{testCancelAndUpdateDelegationTokens}} again, and it doesn't work 
as expected. You added credentials for "Test", but use another user to call 
listFiles. Also, setting only 
{{DFSConfigKeys.DFS_NAMENODE_DELEGATION_TOKEN_ALWAYS_USE_KEY}} doesn't mean 
security is enabled; the authentication is still simple.

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, HDFS-9276.04.patch, HDFS-9276.05.patch, 
> HDFS-9276.06.patch, HDFS-9276.07.patch, HDFS-9276.08.patch, 
> HDFS-9276.09.patch, HDFS-9276.10.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long-running applications. 
> The HDFS client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause the tokens to expire.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> null
>   }
> })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration on the NameNode:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 3min
> {code}
> The bug will occur after 3 minutes.
> The stacktrace is:
> {code}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 330156 for test) is expired
>   at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at 

[jira] [Commented] (HDFS-9275) Wait previous ErasureCodingWork to finish before schedule another one

2015-11-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986424#comment-14986424
 ] 

Yi Liu commented on HDFS-9275:
--

Looks good now, +1.

> Wait previous ErasureCodingWork to finish before schedule another one
> -
>
> Key: HDFS-9275
> URL: https://issues.apache.org/jira/browse/HDFS-9275
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Walter Su
>Assignee: Walter Su
> Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch, 
> HDFS-9275.03.patch, HDFS-9275.04.patch, HDFS-9275.05.patch
>
>
> In {{ErasureCodingWorker}}, for the same block group, one task doesn't know 
> which internal blocks are being recovered by other tasks. We could end up 
> recovering 2 identical blocks with the same index. So, {{ReplicationMonitor}} 
> should wait for the previous work to finish before scheduling another one.
> This is related to the occasional failure of {{TestRecoverStripedFile}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9275) Wait previous ErasureCodingWork to finish before schedule another one

2015-11-02 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9275:
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Committed to trunk.

> Wait previous ErasureCodingWork to finish before schedule another one
> -
>
> Key: HDFS-9275
> URL: https://issues.apache.org/jira/browse/HDFS-9275
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Walter Su
>Assignee: Walter Su
> Fix For: 3.0.0
>
> Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch, 
> HDFS-9275.03.patch, HDFS-9275.04.patch, HDFS-9275.05.patch
>
>
> In {{ErasureCodingWorker}}, for the same block group, one task doesn't know 
> which internal blocks are being recovered by other tasks. We could end up 
> recovering 2 identical blocks with the same index. So, {{ReplicationMonitor}} 
> should wait for the previous work to finish before scheduling another one.
> This is related to the occasional failure of {{TestRecoverStripedFile}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9275) Wait previous ErasureCodingWork to finish before schedule another one

2015-11-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986431#comment-14986431
 ] 

Yi Liu commented on HDFS-9275:
--

Will commit shortly.

> Wait previous ErasureCodingWork to finish before schedule another one
> -
>
> Key: HDFS-9275
> URL: https://issues.apache.org/jira/browse/HDFS-9275
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Walter Su
>Assignee: Walter Su
> Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch, 
> HDFS-9275.03.patch, HDFS-9275.04.patch, HDFS-9275.05.patch
>
>
> In {{ErasureCodingWorker}}, for the same block group, one task doesn't know 
> which internal blocks are being recovered by other tasks. We could end up 
> recovering 2 identical blocks with the same index. So, {{ReplicationMonitor}} 
> should wait for the previous work to finish before scheduling another one.
> This is related to the occasional failure of {{TestRecoverStripedFile}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9348) DFS GetErasureCodingPolicy API on a non-existent file should be handled properly

2015-11-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986476#comment-14986476
 ] 

Yi Liu commented on HDFS-9348:
--

Thanks Uma for pinging me, and Rakesh for the work. I think it makes sense to 
throw FileNotFoundException for a non-existent file.
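
For illustration, a hedged client-side sketch (the path is hypothetical) of the behavior expected after the fix:

{code}
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
try {
  // Asking for the EC policy of a path that does not exist...
  dfs.getErasureCodingPolicy(new Path("/no/such/file"));
} catch (FileNotFoundException e) {
  // ...should fail fast instead of returning policy info.
}
{code}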

> DFS GetErasureCodingPolicy API on a non-existent file should be handled 
> properly
> 
>
> Key: HDFS-9348
> URL: https://issues.apache.org/jira/browse/HDFS-9348
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Rakesh R
>Assignee: Rakesh R
>Priority: Minor
> Attachments: HDFS-9348-00.patch
>
>
> Presently, calling {{dfs#getErasureCodingPolicy()}} on a non-existent file 
> returns the ErasureCodingPolicy info. As per the 
> [discussion|https://issues.apache.org/jira/browse/HDFS-8777?focusedCommentId=14981077=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14981077]
>  it should validate the path and throw FileNotFoundException.
> Also, the {{dfs#getEncryptionZoneForPath()}} API has the same behavior. We 
> can discuss adding the same file-existence validation for that case as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9348) DFS GetErasureCodingPolicy API on a non-existent file should be handled properly

2015-11-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986476#comment-14986476
 ] 

Yi Liu edited comment on HDFS-9348 at 11/3/15 1:41 AM:
---

Thanks Uma for pinging me, and Rakesh for the work. I agree with Andrew, and I 
think it makes sense to throw FileNotFoundException for a non-existent file.


was (Author: hitliuyi):
Thanks Uma for pinging me, and Rakesh for the work. I think it makes sense to 
throw FileNotFoundException for a non-existent file.

> DFS GetErasureCodingPolicy API on a non-existent file should be handled 
> properly
> 
>
> Key: HDFS-9348
> URL: https://issues.apache.org/jira/browse/HDFS-9348
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Rakesh R
>Assignee: Rakesh R
>Priority: Minor
> Attachments: HDFS-9348-00.patch
>
>
> Presently, calling {{dfs#getErasureCodingPolicy()}} on a non-existent file 
> returns the ErasureCodingPolicy info. As per the 
> [discussion|https://issues.apache.org/jira/browse/HDFS-8777?focusedCommentId=14981077=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14981077]
>  it should validate the path and throw FileNotFoundException.
> Also, the {{dfs#getEncryptionZoneForPath()}} API has the same behavior. We 
> can discuss adding the same file-existence validation for that case as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983905#comment-14983905
 ] 

Yi Liu edited comment on HDFS-4937 at 10/31/15 8:46 AM:


Reverted from trunk, branch-2, and branch-2.7.  Thanks Vinay and Brahma.

I thought the tests passed... but actually the Jenkins run doesn't include the 
test results. 


was (Author: hitliuyi):
Reverted from trunk, branch-2, and branch-2.7.  Thanks Brahma.

I thought the tests passed... but actually the Jenkins run doesn't include the 
test results. 

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the 
> network topology is updated. If the refresh happens at the right moment, the 
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}. 
> This is because the cached cluster size is used in the terminal condition 
> check of the loop. This usually happens when a block with a high replication 
> factor is being processed. Since replicas/rack is also calculated beforehand, 
> no node choice may satisfy the goodness criteria if refreshing removed racks. 
> All nodes will end up in the excluded list, but the size will still be less 
> than the cached cluster size, so it will loop infinitely. This was observed 
> in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu reopened HDFS-4937:
--

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the 
> network topology is updated. If the refresh happens at the right moment, the 
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}. 
> This is because the cached cluster size is used in the terminal condition 
> check of the loop. This usually happens when a block with a high replication 
> factor is being processed. Since replicas/rack is also calculated beforehand, 
> no node choice may satisfy the goodness criteria if refreshing removed racks. 
> All nodes will end up in the excluded list, but the size will still be less 
> than the cached cluster size, so it will loop infinitely. This was observed 
> in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983905#comment-14983905
 ] 

Yi Liu commented on HDFS-4937:
--

Reverted from trunk, branch-2, and branch-2.7.  Thanks Brahma.

I thought the tests passed... but actually the Jenkins run doesn't include the 
test results. 

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the 
> network topology is updated. If the refresh happens at the right moment, the 
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}. 
> This is because the cached cluster size is used in the terminal condition 
> check of the loop. This usually happens when a block with a high replication 
> factor is being processed. Since replicas/rack is also calculated beforehand, 
> no node choice may satisfy the goodness criteria if refreshing removed racks. 
> All nodes will end up in the excluded list, but the size will still be less 
> than the cached cluster size, so it will loop infinitely. This was observed 
> in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983906#comment-14983906
 ] 

Yi Liu edited comment on HDFS-4937 at 10/31/15 8:45 AM:


I did consider the situation you mentioned, but I thought that in a real 
environment the NN could find other racks/DNs once it has gone through enough 
(not necessarily all) nodes.  But I missed the fact that many tests contain 
only a few available DNs, so {{refreshCounter <= excludedNodes.size()}} will be 
true; this can also happen in a real environment if the total number of DNs is 
small.  So the patch is not correct for these cases; revert it.


was (Author: hitliuyi):
I did consider the situation you mentioned, but I thought that in a real 
environment the NN could find other racks/DNs once it has gone through enough 
nodes.  But I missed the fact that many tests contain only a few available DNs, 
so {{refreshCounter <= excludedNodes.size()}} will be true; this can also 
happen in a real environment if the total number of DNs is small.  So the patch 
is not correct for these cases; revert it.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the 
> network topology is updated. If the refresh happens at the right moment, the 
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}. 
> This is because the cached cluster size is used in the terminal condition 
> check of the loop. This usually happens when a block with a high replication 
> factor is being processed. Since replicas/rack is also calculated beforehand, 
> no node choice may satisfy the goodness criteria if refreshing removed racks. 
> All nodes will end up in the excluded list, but the size will still be less 
> than the cached cluster size, so it will loop infinitely. This was observed 
> in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983906#comment-14983906
 ] 

Yi Liu commented on HDFS-4937:
--

I did consider the situation you mentioned, but I thought that in a real 
environment the NN could find other racks/DNs once it has gone through enough 
nodes.  But I missed the fact that many tests contain only a few available DNs, 
so {{refreshCounter <= excludedNodes.size()}} will be true; this can also 
happen in a real environment if the total number of DNs is small.  So the patch 
is not correct for these cases; revert it.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the 
> network topology is updated. If the refresh happens at the right moment, the 
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}. 
> This is because the cached cluster size is used in the terminal condition 
> check of the loop. This usually happens when a block with a high replication 
> factor is being processed. Since replicas/rack is also calculated beforehand, 
> no node choice may satisfy the goodness criteria if refreshing removed racks. 
> All nodes will end up in the excluded list, but the size will still be less 
> than the cached cluster size, so it will loop infinitely. This was observed 
> in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981789#comment-14981789
 ] 

Yi Liu commented on HDFS-4937:
--

+1, thanks Kihwal.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the 
> network topology is updated. If the refresh happens at the right moment, the 
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}. 
> This is because the cached cluster size is used in the terminal condition 
> check of the loop. This usually happens when a block with a high replication 
> factor is being processed. Since replicas/rack is also calculated beforehand, 
> no node choice may satisfy the goodness criteria if refreshing removed racks. 
> All nodes will end up in the excluded list, but the size will still be less 
> than the cached cluster size, so it will loop infinitely. This was observed 
> in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-10-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981710#comment-14981710
 ] 

Yi Liu commented on HDFS-9276:
--

*1.* The code format is not correct. The indent is 2 spaces; if a line is too 
long and needs to be broken, the continuation indent is 4 spaces.
*2.* You should clean up the unused imports and whitespace.
*3.* There is one findbugs warning; it seems you need to override equals in 
PrivateToken, and you can check whether the publicService is the same there 
(see the sketch below).
*4.* You can move your test into {{TestDelegationTokensWithHA}}, and you need 
to check that access succeeds before cancelling token1.
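
A hedged sketch of what the equals override might look like (the publicService field name is taken from this discussion; exact details depend on the patch):

{code}
@Override
public boolean equals(Object o) {
  if (this == o) {
    return true;
  }
  if (o == null || getClass() != o.getClass()) {
    return false;
  }
  // Compare the base Token fields, then the extra publicService field.
  PrivateToken other = (PrivateToken) o;
  return super.equals(other) && publicService.equals(other.publicService);
}
{code}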

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, HDFS-9276.04.patch, HDFS-9276.05.patch, 
> HDFS-9276.06.patch, HDFS-9276.07.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long-running applications. 
> The HDFS client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause the tokens to expire.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> null
>   }
> })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration on the NameNode:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 3min
> {code}
> The bug will occur after 3 minutes.
> The stacktrace is:
> {code}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 330156 for test) is expired
>   at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

[jira] [Commented] (HDFS-9275) Wait previous ErasureCodingWork to finish before schedule another one

2015-10-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981734#comment-14981734
 ] 

Yi Liu commented on HDFS-9275:
--

{code}
+  if (pendingNum > 0) {
+// Wait the previous recovery to finish.
+return false;
+  }
{code}
We also need to do two more things:
1. Also check this in {{scheduleRecovery}} to avoid unnecessarily choosing 
targets.
2. Move the block group to the end of the same-priority queue in 
{{neededReplications}}, otherwise it is chosen first again next time.



> Wait previous ErasureCodingWork to finish before schedule another one
> -
>
> Key: HDFS-9275
> URL: https://issues.apache.org/jira/browse/HDFS-9275
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Walter Su
>Assignee: Walter Su
> Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch, 
> HDFS-9275.03.patch, HDFS-9275.04.patch
>
>
> In {{ErasureCodingWorker}}, for the same block group, one task doesn't know 
> which internal blocks are being recovered by other tasks. We could end up 
> recovering 2 identical blocks with the same index. So, {{ReplicationMonitor}} 
> should wait for the previous work to finish before scheduling another one.
> This is related to the occasional failure of {{TestRecoverStripedFile}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9302) WebHDFS throws NullPointerException if newLength is not provided

2015-10-28 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977821#comment-14977821
 ] 

Yi Liu commented on HDFS-9302:
--

OK, thanks Jagadesh, Akira, will commit shortly.

> WebHDFS throws NullPointerException if newLength is not provided
> 
>
> Key: HDFS-9302
> URL: https://issues.apache.org/jira/browse/HDFS-9302
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
> Environment: Centos6
>Reporter: Karthik Palaniappan
>Assignee: Jagadesh Kiran N
>Priority: Minor
> Attachments: HDFS-9302_00.patch, HDFS-9302_01.patch
>
>
> $ curl -X POST "http://namenode:50070/webhdfs/v1/foo?op=truncate"
> {"RemoteException":{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException","message":null}}
> We should change newLength to be a required parameter in the WebHDFS 
> documentation 
> (https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#New_Length),
>  and throw an IllegalArgumentException if it isn't provided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9302) WebHDFS throws NullPointerException if newLength is not provided

2015-10-28 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9302:
-
Target Version/s: 2.8.0  (was: 2.8.0, 2.7.2)

> WebHDFS throws NullPointerException if newLength is not provided
> 
>
> Key: HDFS-9302
> URL: https://issues.apache.org/jira/browse/HDFS-9302
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
> Environment: Centos6
>Reporter: Karthik Palaniappan
>Assignee: Jagadesh Kiran N
>Priority: Minor
> Attachments: HDFS-9302_00.patch, HDFS-9302_01.patch
>
>
> $ curl -X POST "http://namenode:50070/webhdfs/v1/foo?op=truncate"
> {"RemoteException":{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException","message":null}}
> We should change newLength to be a required parameter in the WebHDFS 
> documentation 
> (https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#New_Length),
>  and throw an IllegalArgumentException if it isn't provided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9302) WebHDFS throws NullPointerException if newLength is not provided

2015-10-28 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9302:
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2.

> WebHDFS throws NullPointerException if newLength is not provided
> 
>
> Key: HDFS-9302
> URL: https://issues.apache.org/jira/browse/HDFS-9302
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
> Environment: Centos6
>Reporter: Karthik Palaniappan
>Assignee: Jagadesh Kiran N
>Priority: Minor
> Fix For: 2.8.0
>
> Attachments: HDFS-9302_00.patch, HDFS-9302_01.patch
>
>
> $ curl -X POST "http://namenode:50070/webhdfs/v1/foo?op=truncate;
> {"RemoteException":{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException","message":null}}
> We should change newLength to be a required parameter in the webhdfs 
> documentation 
> (https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#New_Length),
>  and throw an IllegalArgumentException if isn't provided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-27 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976233#comment-14976233
 ] 

Yi Liu commented on HDFS-9275:
--

Walter, sorry I forgot this JIRA :-) 

For a contiguous block, if n replicas are missing (with 3 total replicas, at most 2 
can be missing, so n < 3), we check the number of replicas already in 
PendingReplicationBlocks to see whether we need to schedule new block 
replication.
For reconstruction of a striped block, ideally we should follow the same idea: for 
any missing striped internal block we only need to reconstruct one copy, so we 
should check whether there is already one in PendingReplicationBlocks. But 
currently we track the block group in the list, so we end up comparing the total 
number of missing striped internal blocks with the number in 
PendingReplicationBlocks; if there are more than two missing striped internal 
blocks and one is reconstructed first, there may be some unnecessary 
reconstruction.  I think we can make a simple improvement for striped blocks: if 
there is already one in PendingReplicationBlocks, don't schedule new 
reconstruction work, instead of comparing against the number of missing striped 
internal blocks.
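
A tiny sketch of the two policies being contrasted (illustrative helper methods 
only, not the actual BlockManager logic):
{code}
class ReconstructionPolicy {
  // Contiguous block: schedule more replication only if the already-pending
  // replicas do not cover everything that is still missing.
  boolean shouldScheduleContiguous(int missingReplicas, int pendingReplicas) {
    return pendingReplicas < missingReplicas;
  }

  // Striped block group (the proposed simplification): if any reconstruction
  // task is already pending for the group, do not schedule another one,
  // regardless of how many internal blocks are missing.
  boolean shouldScheduleStriped(int missingInternalBlocks, int pendingTasks) {
    return pendingTasks == 0 && missingInternalBlocks > 0;
  }
}
{code}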

> Fix TestRecoverStripedFile
> --
>
> Key: HDFS-9275
> URL: https://issues.apache.org/jira/browse/HDFS-9275
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Walter Su
>Assignee: Walter Su
> Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch, 
> HDFS-9275.03.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9302) WebHDFS throws NullPointerException if newLength is not provided

2015-10-27 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977483#comment-14977483
 ] 

Yi Liu commented on HDFS-9302:
--

Looks good, Jagadesh.
Just one small nit; +1 after addressing it:
- please put the check above the code comment "// We treat each rest request as 
a separate client."

> WebHDFS throws NullPointerException if newLength is not provided
> 
>
> Key: HDFS-9302
> URL: https://issues.apache.org/jira/browse/HDFS-9302
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
> Environment: Centos6
>Reporter: Karthik Palaniappan
>Assignee: Jagadesh Kiran N
>Priority: Minor
> Attachments: HDFS-9302_00.patch
>
>
> $ curl -X POST "http://namenode:50070/webhdfs/v1/foo?op=truncate;
> {"RemoteException":{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException","message":null}}
> We should change newLength to be a required parameter in the webhdfs 
> documentation 
> (https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#New_Length),
>  and throw an IllegalArgumentException if isn't provided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-10-27 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976479#comment-14976479
 ] 

Yi Liu commented on HDFS-9276:
--

Looks good overall now.
*1.*
{code}
+
+  /**
+   * Create a private token for HA failover proxy.
+   * @return the private token
+   */
+  public PrivateToken createPrivateToken() {
+    return new PrivateToken(this);
+  }
{code}
Now we can remove this.

*2.*
{code}
-    } else if (right == null || getClass() != right.getClass()) {
+    } else if (right == null || !(right instanceof Token)) {
{code}
Is there any reason to change this? It violates the {{equals}} contract.

*3.*
We should have a test for the real case, maybe like the following (a rough 
skeleton is sketched after this list):
- Build a MiniDFSCluster with HA and security enabled.
- Get a delegation token.
- Use the delegation token to access HDFS.
- Invalidate the delegation token in the NN. Check that the access fails.
- Get a new delegation token and update it through the user's UGI.
- Check the access again; it should succeed.
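
Very roughly, the skeleton could look like this (heavily abridged and hedged: the 
MiniKdc/Kerberos setup, the token invalidation and the assertions are elided, and 
the helper usage is only indicative):
{code}
Configuration conf = new HdfsConfiguration();
// ... enable security here (MiniKdc, hadoop.security.authentication=kerberos, keytabs) ...
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
    .nnTopology(MiniDFSNNTopology.simpleHATopology())
    .numDataNodes(1)
    .build();
try {
  cluster.waitActive();
  cluster.transitionToActive(0);
  FileSystem fs = HATestUtil.configureFailoverFs(cluster, conf);

  // 1. get a delegation token and use it to access HDFS
  Credentials creds = new Credentials();
  fs.addDelegationTokens("test", creds);
  // 2. invalidate the token on the NN and verify the access fails
  // 3. fetch a fresh token, push it into the user's UGI, and verify access succeeds
  UserGroupInformation.getCurrentUser().addCredentials(creds);
} finally {
  cluster.shutdown();
}
{code}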

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, HDFS-9276.04.patch, HDFS-9276.05.patch, debug1.PNG, 
> debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long running applicatons. 
> HDFS Client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause token expired.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> null
>   }
> })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration to Name Node:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 3min
> {code}
> The bug will occure after 3 minutes.
> The stacktrace is:
> {code}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 330156 for test) is expired
>   at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>   at 
> 

[jira] [Comment Edited] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-10-27 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976479#comment-14976479
 ] 

Yi Liu edited comment on HDFS-9276 at 10/27/15 2:31 PM:


Now looks good overall.
*1.*
{code}
+
+  /**
+   * Create a private token for HA failover proxy.
+   * @return the private token
+   */
+  public PrivateToken createPrivateToken() {
+return new PrivateToken(this);
+  }
{code}
Now, we can remove this.

*2.*
{code}
-} else if (right == null || getClass() != right.getClass()) {
+} else if (right == null || !(right instanceof Token)) {
{code}
Any reason to change this? It violates {{equals}} definition.

*3.*
We should have test for real case, maybe like following:
- build a MiniDFSCluster with HA and security enabled.
- Gets delegation token
- Use the delegation token to access HDFS.
- Invalid the delegation token in NN. Check the access failed.
- Get a new delegation token and update through user's UGI.
- Check the access again, should be successful.

You can refer to existing tests about how to enable HA and security.


was (Author: hitliuyi):
Now looks good overall.
*1.*
{code}
+
+  /**
+   * Create a private token for HA failover proxy.
+   * @return the private token
+   */
+  public PrivateToken createPrivateToken() {
+return new PrivateToken(this);
+  }
{code}
Now, we can remove this.

*2.*
{code}
-} else if (right == null || getClass() != right.getClass()) {
+} else if (right == null || !(right instanceof Token)) {
{code}
Any reason to change this? It violates {{equals}} definition.

*3.*
We should have test for real case, maybe like following:
- build a MiniDFSCluster with HA and security enabled.
- Gets delegation token
- Use the delegation token to access HDFS.
- Invalid the delegation token in NN. Check the access failed.
- Get a new delegation token and update through user's UGI.
- Check the access again, should be successful.

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, HDFS-9276.04.patch, HDFS-9276.05.patch, debug1.PNG, 
> debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long running applicatons. 
> HDFS Client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause token expired.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> 

[jira] [Comment Edited] (HDFS-7984) webhdfs:// needs to support provided delegation tokens

2015-10-26 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973802#comment-14973802
 ] 

Yi Liu edited comment on HDFS-7984 at 10/26/15 6:27 AM:


{quote}
there is no way for an end user to create a token file with multiple tokens 
inside it, short of building custom code to do it..
{quote}

No, I think we have.   When using existing 
{{Credentials#writeTokenStorageFile}}, all tokens of the credentials will be 
persisted, and {{Credentials#readTokenStorageStream}} will read all tokens too. 
 So what we need to do is to add different tokens to the Credentials, use your 
example, there are two hdfs, we can get the delegation tokens for each of them, 
the {{service}} field of the two delegation tokens should be different, we can 
add them to one {{Credentials}} or through the UGI api to add them into one 
{{Credentials}}.
Actually even if we have multiple token files which contain only one token in 
each, we can read them separately through 
{{Credentials#writeTokenStorageFile}}, and add them to one {{Credentials}}.

Back to the original purpose of the JIRA, I don't know why we need to specify 
multiple delegation tokens in one webhdfs://,  the delegation token is used in 
some service to access HDFS on behalf of user, so one hdfs only needs one 
delegation token for one user.  For the distcp example you said, I think 
correct behavior is: user specify delegation token in each webhdfs://, and the 
MR task will add the two delegation tokens to the UGI Credentials of that user. 
  I think this is already supported, I have not tried the distcp on two 
different secured hdfs, if there is some bug, the correct fix is as I said, 
it's not to support multiple delegation tokens in one webhdfs://.
We also should not use HADOOP_TOKEN_FILE_LOCATION to solve the problem.


was (Author: hitliuyi):
{quote}
there is no way for an end user to create a token file with multiple tokens 
inside it, short of building custom code to do it..
{quote}

No, I think we have.   When using existing 
{{Credentials#writeTokenStorageFile}}, all tokens of the credentials will be 
persisted, and {{Credentials#readTokenStorageStream}} will read all tokens too. 
 So what we need to do is to add different tokens to the Credentials, use your 
example, there are two hdfs, we can get the delegation tokens for each of them, 
the {{service}} filed of the two delegation tokens should be different, we can 
add them to one {{Credentials}} or through the UGI api to add them into one 
{{Credentials}}.
Actually even if we have multiple token files which contain only one token in 
each, we can read them separately through 
{{Credentials#writeTokenStorageFile}}, and add them to one {{Credentials}}.

Back to the original purpose of the JIRA, I don't know why we need to specify 
multiple delegation tokens in one webhdfs://,  the delegation token is used in 
some service to access HDFS on behalf of user, so one hdfs only needs one 
delegation token for one user.  For the distcp example you said, I think 
correct behavior is: user specify delegation token in each webhdfs://, and the 
MR task will add the two delegation tokens to the UGI Credentials of that user. 
  I think this is already supported, I have not tried the distcp on two 
different secured hdfs, if there is some bug, the correct fix is as I said, 
it's not to support multiple delegation tokens in one webhdfs://.
We also should not use HADOOP_TOKEN_FILE_LOCATION to solve the problem.

> webhdfs:// needs to support provided delegation tokens
> --
>
> Key: HDFS-7984
> URL: https://issues.apache.org/jira/browse/HDFS-7984
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 3.0.0
>Reporter: Allen Wittenauer
>Assignee: HeeSoo Kim
>Priority: Blocker
> Attachments: HDFS-7984.patch
>
>
> When using the webhdfs:// filesystem (especially from distcp), we need the 
> ability to inject a delegation token rather than webhdfs initialize its own.  
> This would allow for cross-authentication-zone file system accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7984) webhdfs:// needs to support provided delegation tokens

2015-10-26 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973802#comment-14973802
 ] 

Yi Liu commented on HDFS-7984:
--

{quote}
there is no way for an end user to create a token file with multiple tokens 
inside it, short of building custom code to do it..
{quote}

No, I think we do.  With the existing {{Credentials#writeTokenStorageFile}}, all 
tokens in the credentials are persisted, and 
{{Credentials#readTokenStorageStream}} reads all of them back too.  So what we 
need to do is add the different tokens to one Credentials object. Using your 
example with two hdfs clusters, we can get a delegation token for each of them; 
the {{service}} field of the two delegation tokens will be different, and we can 
add them to one {{Credentials}} directly or through the UGI API.
Actually, even if we have multiple token files which contain only one token each, 
we can read them separately through {{Credentials#readTokenStorageFile}} and add 
them to one {{Credentials}}.

Back to the original purpose of the JIRA, I don't see why we need to specify 
multiple delegation tokens in one webhdfs:// URI.  A delegation token is used by 
some service to access HDFS on behalf of a user, so one hdfs only needs one 
delegation token per user.  For the distcp example you gave, I think the correct 
behavior is: the user specifies a delegation token for each webhdfs:// URI, and 
the MR task adds the two delegation tokens to the UGI Credentials of that user.  
I think this is already supported; I have not tried distcp across two different 
secured hdfs clusters, but if there is some bug, the correct fix is as I said, 
not to support multiple delegation tokens in one webhdfs://.
We also should not use HADOOP_TOKEN_FILE_LOCATION to solve the problem.
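
For illustration, a minimal sketch of that flow (the cluster URIs, the renewer 
name and the output path are made up):
{code}
Configuration conf = new Configuration();
Credentials creds = new Credentials();

// Collect delegation tokens from two different secured clusters; their
// service fields differ, so both fit in one Credentials object.
FileSystem.get(URI.create("hdfs://clusterA:8020"), conf).addDelegationTokens("renewer", creds);
FileSystem.get(URI.create("hdfs://clusterB:8020"), conf).addDelegationTokens("renewer", creds);

// writeTokenStorageFile persists every token in the credentials ...
creds.writeTokenStorageFile(new Path("/tmp/all-tokens"), conf);

// ... and readTokenStorageFile reads them all back, ready to be merged into a UGI.
Credentials restored = Credentials.readTokenStorageFile(new Path("/tmp/all-tokens"), conf);
UserGroupInformation.createRemoteUser("someUser").addCredentials(restored);
{code}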

> webhdfs:// needs to support provided delegation tokens
> --
>
> Key: HDFS-7984
> URL: https://issues.apache.org/jira/browse/HDFS-7984
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 3.0.0
>Reporter: Allen Wittenauer
>Assignee: HeeSoo Kim
>Priority: Blocker
> Attachments: HDFS-7984.patch
>
>
> When using the webhdfs:// filesystem (especially from distcp), we need the 
> ability to inject a delegation token rather than webhdfs initialize its own.  
> This would allow for cross-authentication-zone file system accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-10-26 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974277#comment-14974277
 ] 

Yi Liu commented on HDFS-9276:
--

So it's a problem with the service name of the token for HA: there are three 
entries for the token of an NN HA pair, with different service names: the fs URI, 
the NN1 URI, and the NN2 URI. The bug only exists in HA mode, so please don't use 
the test in the JIRA description; it has no relationship with HA.

Also, you need to add a real test in the patch that reproduces the issue, which 
only occurs in HA.

For the patch itself, we don't need {{Token#createPrivateToken}}; we can 
construct it directly.
Rename {{copyFrom}} to {{publicService}} and {{clonedToken}} to {{privateToken}}.


> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long running applicatons. 
> HDFS Client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause token expired.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> null
>   }
> })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration to Name Node:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 3min
> {code}
> The bug will occure after 3 minutes.
> The stacktrace is:
> {code}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 330156 for test) is expired
>   at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 

[jira] [Reopened] (HDFS-9293) FSEditLog's 'OpInstanceCache' instance of threadLocal cache exists dirty 'rpcId',which may cause standby NN too busy to communicate

2015-10-25 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu reopened HDFS-9293:
--

> FSEditLog's  'OpInstanceCache' instance of threadLocal cache exists dirty 
> 'rpcId',which may cause standby NN too busy  to communicate 
> --
>
> Key: HDFS-9293
> URL: https://issues.apache.org/jira/browse/HDFS-9293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.2.0, 2.7.1
>Reporter: 邓飞
>Assignee: 邓飞
> Fix For: 2.7.1
>
>
>   In our cluster (hadoop 2.2.0-HA,700+ DN),we found standby NN tail editlog 
> slowly,and hold the fsnamesystem writelock during the work and the DN's 
> heartbeart/blockreport IPC request blocked.Lead to Active NN remove stale DN 
> which can't send heartbeat  because blocking at process Standby NN Regiest 
> common(FIXED at 2.7.1).
>   Below is the standby NN  stack:
> "Edit log tailer" prio=10 tid=0x7f28fcf35800 nid=0x1a7d runnable 
> [0x7f0dd1d76000]
>java.lang.Thread.State: RUNNABLE
>   at java.util.PriorityQueue.remove(PriorityQueue.java:360)
>   at 
> org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:217)
>   at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:270)
>   - locked <0x7f12817714b8> (a org.apache.hadoop.ipc.RetryCache)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:724)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
> When apply editLogOp,if the IPC retryCache is found,need  to remove the 
> previous from priorityQueue(O(N)), The updateblock is don't  need record 
> rpcId on editlog except  'client request updatePipeline',but we found many 
> 'UpdateBlocksOp' has repeat ipcId.
>  
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-9293) FSEditLog's 'OpInstanceCache' instance of threadLocal cache exists dirty 'rpcId',which may cause standby NN too busy to communicate

2015-10-25 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu resolved HDFS-9293.
--
Resolution: Duplicate

> FSEditLog's  'OpInstanceCache' instance of threadLocal cache exists dirty 
> 'rpcId',which may cause standby NN too busy  to communicate 
> --
>
> Key: HDFS-9293
> URL: https://issues.apache.org/jira/browse/HDFS-9293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.2.0, 2.7.1
>Reporter: 邓飞
>Assignee: 邓飞
> Fix For: 2.7.1
>
>
>   In our cluster (hadoop 2.2.0-HA,700+ DN),we found standby NN tail editlog 
> slowly,and hold the fsnamesystem writelock during the work and the DN's 
> heartbeart/blockreport IPC request blocked.Lead to Active NN remove stale DN 
> which can't send heartbeat  because blocking at process Standby NN Regiest 
> common(FIXED at 2.7.1).
>   Below is the standby NN  stack:
> "Edit log tailer" prio=10 tid=0x7f28fcf35800 nid=0x1a7d runnable 
> [0x7f0dd1d76000]
>java.lang.Thread.State: RUNNABLE
>   at java.util.PriorityQueue.remove(PriorityQueue.java:360)
>   at 
> org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:217)
>   at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:270)
>   - locked <0x7f12817714b8> (a org.apache.hadoop.ipc.RetryCache)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:724)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
> When apply editLogOp,if the IPC retryCache is found,need  to remove the 
> previous from priorityQueue(O(N)), The updateblock is don't  need record 
> rpcId on editlog except  'client request updatePipeline',but we found many 
> 'UpdateBlocksOp' has repeat ipcId.
>  
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-10-25 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973758#comment-14973758
 ] 

Yi Liu commented on HDFS-9276:
--

{quote}
To reproduce the bug, please set the following configuration to Name Node:
dfs.namenode.delegation.token.max-lifetime = 10min
dfs.namenode.delegation.key.update-interval = 3min
dfs.namenode.delegation.token.renew-interval = 3min
The bug will occure after 3 minutes.
{quote}

Your test code doesn't prove anything: the error message "token 
(HDFS_DELEGATION_TOKEN token 330156 for test) is expired" appears because you set 
"dfs.namenode.delegation.token.renew-interval" to 3 minutes but never let the 
{{test}} user renew the token.

I see what you want to do now.  Actually the existing Hadoop code is enough for 
this.  If a user client gets a new delegation token and your long running 
application can accept it, you can update the credentials of the user's UGI on 
the server through {{UserGroupInformation#addCredentials}}; it overwrites old 
tokens by default. Of course, the service name of the token must stay the same if 
you want it to be overwritten.

It's not a bug.
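
A minimal sketch of that update path inside the long running service (variable 
names like {{newToken}} and {{userUgi}} are illustrative):
{code}
// The user hands over a freshly obtained delegation token:
Credentials fresh = new Credentials();
fresh.addToken(newToken.getService(), newToken);  // keep the same service name as the old token
// addCredentials overwrites tokens with the same key by default,
// so the expired token in this user's UGI is replaced.
userUgi.addCredentials(fresh);
{code}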

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long running applicatons. 
> HDFS Client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause token expired.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> null
>   }
> })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration to Name Node:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 3min
> {code}
> The bug will occure after 3 minutes.
> The stacktrace is:
> {code}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 330156 for test) is expired
>   at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy9.getFileInfo(Unknown 

[jira] [Comment Edited] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-10-25 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973758#comment-14973758
 ] 

Yi Liu edited comment on HDFS-9276 at 10/26/15 5:39 AM:


{quote}
To reproduce the bug, please set the following configuration to Name Node:
dfs.namenode.delegation.token.max-lifetime = 10min
dfs.namenode.delegation.key.update-interval = 3min
dfs.namenode.delegation.token.renew-interval = 3min
The bug will occure after 3 minutes.
{quote}

Your test code can't say anything,  the error msg of "token 
(HDFS_DELEGATION_TOKEN token 330156 for test) is expired" is because you set 
"dfs.namenode.delegation.token.renew-interval" to 3 min but you don't let 
{{test}} user to renew the token. 

I see what you want to do now, if the same with the later case as I commented 
above.  Actually hadoop code is enough to let you do what you want to do.  If a 
user client get a new delegation token, and your long running application can 
accept it, you can update the credentials of user's UGI on the server through 
{{UserGroupInformation#addCredentials}}, it will overwrite old tokens by 
default, of course you should make the service name of token is the same if you 
want to overwrite it.

It's not a bug.


was (Author: hitliuyi):
{quote}
To reproduce the bug, please set the following configuration to Name Node:
dfs.namenode.delegation.token.max-lifetime = 10min
dfs.namenode.delegation.key.update-interval = 3min
dfs.namenode.delegation.token.renew-interval = 3min
The bug will occure after 3 minutes.
{quote}

Your test code can't say anything,  the error msg of "token 
(HDFS_DELEGATION_TOKEN token 330156 for test) is expired" is because you set 
"dfs.namenode.delegation.token.renew-interval" to 3 min but you don't let 
{{test}} user to renew the token. 

I see what you want to do now.  Actually hadoop code is enough to let you do 
what you want to do.  If a user client get a new delegation token, and your 
long running application can accept it, you can update the credentials of 
user's UGI on the server through {{UserGroupInformation#addCredentials}}, it 
will overwrite old tokens by default, of course you should make the service 
name of token is the same if you want to overwrite it.

It's not a bug.

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long running applicatons. 
> HDFS Client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause token expired.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   

[jira] [Comment Edited] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-10-25 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973758#comment-14973758
 ] 

Yi Liu edited comment on HDFS-9276 at 10/26/15 5:41 AM:


{quote}
To reproduce the bug, please set the following configuration to Name Node:
dfs.namenode.delegation.token.max-lifetime = 10min
dfs.namenode.delegation.key.update-interval = 3min
dfs.namenode.delegation.token.renew-interval = 3min
The bug will occure after 3 minutes.
{quote}

Your test code can't say anything,  the error msg of "token 
(HDFS_DELEGATION_TOKEN token 330156 for test) is expired" is because you set 
"dfs.namenode.delegation.token.renew-interval" to 3 min but you don't let 
{{test}} user to renew the token. 

I see what you want to do now, it's the same with the later case of what I 
commented above.  Actually hadoop code is enough to let you do what you want to 
do.  If a user client get a new delegation token, and your long running 
application can accept it, you can update the credentials of user's UGI on the 
server through {{UserGroupInformation#addCredentials}}, it will overwrite old 
tokens by default, of course you should make the service name of token is the 
same if you want to overwrite it.

It's not a bug.


was (Author: hitliuyi):
{quote}
To reproduce the bug, please set the following configuration to Name Node:
dfs.namenode.delegation.token.max-lifetime = 10min
dfs.namenode.delegation.key.update-interval = 3min
dfs.namenode.delegation.token.renew-interval = 3min
The bug will occure after 3 minutes.
{quote}

Your test code can't say anything,  the error msg of "token 
(HDFS_DELEGATION_TOKEN token 330156 for test) is expired" is because you set 
"dfs.namenode.delegation.token.renew-interval" to 3 min but you don't let 
{{test}} user to renew the token. 

I see what you want to do now, if the same with the later case as I commented 
above.  Actually hadoop code is enough to let you do what you want to do.  If a 
user client get a new delegation token, and your long running application can 
accept it, you can update the credentials of user's UGI on the server through 
{{UserGroupInformation#addCredentials}}, it will overwrite old tokens by 
default, of course you should make the service name of token is the same if you 
want to overwrite it.

It's not a bug.

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long running applicatons. 
> HDFS Client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause token expired.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
>

[jira] [Commented] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-10-25 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973249#comment-14973249
 ] 

Yi Liu commented on HDFS-9276:
--

[~marsishandsome],  I agree with Steve that you need to move this to Hadoop 
common if the patch only contains changes in common.  Before that, I think you 
may have a misunderstanding about how to use delegation tokens.

Do you want to update the delegation token through 
{{FileSystem#addDelegationTokens}}?  It will not get a new delegation token if an 
old one already exists in the credentials; it may also be more complicated than 
you think.
Actually I am curious how you wrote your long running application. Is your 
application just a user client, or running on YARN, or a separate service?  If 
your application is just one user client, i.e. not a service accessed by many 
user clients, then you still need to use Kerberos instead of a delegation token. 
But if your application is a real service which serves user clients, then the 
delegation token is the right one. The delegation token is used in your service 
to access HDFS on behalf of the user; usually your application service can renew 
the delegation token, but the application service itself can't get a new 
delegation token for some specific user.  If your application service runs longer 
than the maximum renewable lifetime of the user's delegation token, one way is 
for the user to get a new delegation token, with your application service 
supporting some mechanism to let the user update the delegation token and then 
refresh the token in that user's UGI credentials.  Another way is to support 
proxy-user privileges in your running application; refer to YARN-2704.   Are you 
on the right track?

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long running applicatons. 
> HDFS Client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause token expired.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> ugi1.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> val fs = FileSystem.get(new Configuration())
> fs.addDelegationTokens("test", creds1)
> null
>   }
> })
> val ugi = UserGroupInformation.createRemoteUser("test")
> ugi.addCredentials(creds1)
> ugi.doAs(new PrivilegedExceptionAction[Void] {
>   // Get a copy of the credentials
>   override def run(): Void = {
> var i = 0
> while (true) {
>   val creds1 = new org.apache.hadoop.security.Credentials()
>   val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
>   ugi1.doAs(new PrivilegedExceptionAction[Void] {
> // Get a copy of the credentials
> override def run(): Void = {
>   val fs = FileSystem.get(new Configuration())
>   fs.addDelegationTokens("test", creds1)
>   null
> }
>   })
>   UserGroupInformation.getCurrentUser.addCredentials(creds1)
>   val fs = FileSystem.get( new Configuration())
>   i += 1
>   println()
>   println(i)
>   println(fs.listFiles(new Path("/user"), false))
>   Thread.sleep(60 * 1000)
> }
> null
>   }
> })
>   }
> }
> {code}
> To reproduce the bug, please set the following configuration to Name Node:
> {code}
> dfs.namenode.delegation.token.max-lifetime = 10min
> dfs.namenode.delegation.key.update-interval = 3min
> dfs.namenode.delegation.token.renew-interval = 

[jira] [Comment Edited] (HDFS-9276) Failed to Update HDFS Delegation Token for long running application in HA mode

2015-10-25 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973249#comment-14973249
 ] 

Yi Liu edited comment on HDFS-9276 at 10/25/15 1:57 PM:


[~marsishandsome],  Agree with Steve you need to move this to Hadoop common if 
the patch only contains change in common.  Before this, I think you may have a 
mistake about how to use delegation token.

Do you want to update the delegation token through 
{{FileSystem#addDelegationTokens}}?  It will not get new delegation token again 
if old one exists in the credentials, also it may be more complicate than what 
you think.
Actually I am curious about how you write your long running application. Is 
your application just an user client or a separate service?  If your 
application is just one user client, I mean it's not a service which is 
accessed by many user clients, then you still need to user Kerberos instead of 
delegationToken, but if your application is a real service which serves user 
clients, then the delegation token is the right one. The delegation token is 
used in your service to access HDFS on behalf the user, usually your 
application service can renew the delegation token, the application service 
itself can't get a new delegation token for some specific user.  If your 
application service runs longer than the maximum renewable date of user's 
delegationToken,  one way is the user gets a new delegation token and your 
application service supports some mechanism to let user to update the 
delegation token and then refresh the token in that user's UGI's credential.  
Another way is to support proxy user privileges in your running application, 
refer to YARN-2704.   Are you in the correct way?


was (Author: hitliuyi):
[~marsishandsome],  Agree with Steve you need to move this to Hadoop common if 
the patch only contains change in common.  Before this, I think you may have a 
mistake about how to use delegation token.

Do you want to update the delegation token through 
{{FileSystem#addDelegationTokens}}?  It will not get new delegation token again 
if old one exists in the credentials, also it may be more complicate than what 
you think.
Actually I am curious about how you write your long running application. Is 
your application just an user client or running on YARN or a separate service?  
If your application is just one user client, I mean it's not a service which is 
acessed by many user clients, then you still need to user Kerberos instead of 
delegationToken, but if your application is a real service which serves user 
clients, then the delegation token is the right one. The delegation token is 
used in your service to access HDFS on behalf the user, usually your 
application service can renew the delegation token, the application service 
itself can't get a new delegation token for some specific user.  If your 
application service runs longer than the maximum renewable date of user's 
delegationToken,  one way is the user gets a new delegation token and your 
application service supports some mechanism to let user to update the 
delegation token and then refresh the token in that user's UGI's credential.  
Another way is to support proxy user privileges in your running application, 
refer to YARN-2704.   Are you in the correct way?

> Failed to Update HDFS Delegation Token for long running application in HA mode
> --
>
> Key: HDFS-9276
> URL: https://issues.apache.org/jira/browse/HDFS-9276
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs, ha, security
>Affects Versions: 2.7.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
> Attachments: HDFS-9276.01.patch, HDFS-9276.02.patch, 
> HDFS-9276.03.patch, debug1.PNG, debug2.PNG
>
>
> The Scenario is as follows:
> 1. NameNode HA is enabled.
> 2. Kerberos is enabled.
> 3. HDFS Delegation Token (not Keytab or TGT) is used to communicate with 
> NameNode.
> 4. We want to update the HDFS Delegation Token for long running applicatons. 
> HDFS Client will generate private tokens for each NameNode. When we update 
> the HDFS Delegation Token, these private tokens will not be updated, which 
> will cause token expired.
> This bug can be reproduced by the following program:
> {code}
> import java.security.PrivilegedExceptionAction
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.hadoop.security.UserGroupInformation
> object HadoopKerberosTest {
>   def main(args: Array[String]): Unit = {
> val keytab = "/path/to/keytab/xxx.keytab"
> val principal = "x...@abc.com"
> val creds1 = new org.apache.hadoop.security.Credentials()
> val ugi1 = 
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
> 

[jira] [Commented] (HDFS-7984) webhdfs:// needs to support provided delegation tokens

2015-10-25 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973256#comment-14973256
 ] 

Yi Liu commented on HDFS-7984:
--

{quote}
String fileLocation = System.getenv(HADOOP_TOKEN_FILE_LOCATION);
...
Credentials cred = Credentials.readTokenStorageFile(
{quote}

The HADOOP_TOKEN_FILE_LOCATION already supports multiple tokens.

> webhdfs:// needs to support provided delegation tokens
> --
>
> Key: HDFS-7984
> URL: https://issues.apache.org/jira/browse/HDFS-7984
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 3.0.0
>Reporter: Allen Wittenauer
>Assignee: HeeSoo Kim
>Priority: Blocker
> Attachments: HDFS-7984.patch
>
>
> When using the webhdfs:// filesystem (especially from distcp), we need the 
> ability to inject a delegation token rather than webhdfs initialize its own.  
> This would allow for cross-authentication-zone file system accesses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9302) WebHDFS throws NullPointerException if newLength is not provided

2015-10-24 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973001#comment-14973001
 ] 

Yi Liu commented on HDFS-9302:
--

[~Karthik Palaniappan],  thanks for reporting this; it's better to return a 
clearer exception and error message to the user if there is no {{newLength}}.
This can be done by checking whether newLength is the null string 
({{NewLengthParam#DEFAULT}}) in {{NamenodeWebHdfsMethods}}; feel free to upload a 
patch to fix it.

In the documentation, newLength is already a required parameter; any optional 
parameter is surrounded by [].

> WebHDFS throws NullPointerException if newLength is not provided
> 
>
> Key: HDFS-9302
> URL: https://issues.apache.org/jira/browse/HDFS-9302
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.7.1
> Environment: Centos6
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> $ curl -X POST "http://namenode:50070/webhdfs/v1/foo?op=truncate"
> {"RemoteException":{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException","message":null}}
> We should change newLength to be a required parameter in the webhdfs 
> documentation 
> (https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#New_Length),
>  and throw an IllegalArgumentException if isn't provided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-22 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968754#comment-14968754
 ] 

Yi Liu commented on HDFS-9053:
--

Thanks for the discussion, Nicholas.

{quote}
BTW, why do we need two arrays for elements and children. I think BTree 
implementation naturally needs only one array.
{quote}
Actually, most implementations I have seen use two arrays. 
One array stores the elements, and another array stores the references to the 
child nodes (childrenSize == elementsSize + 1 for non-leaf nodes). 
Having them in one array means: 1) the array contains two different types, which I 
think is strange; 2) more complicated logic that is not easy to understand.
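
A small sketch (my illustration, not the patch code) of the two-array node layout described above:

{code}
// Each node keeps the sorted elements and the child references in two
// separate arrays; children is null for leaf nodes, and for non-leaf nodes
// childrenSize == elementsSize + 1.
static final class Node {
  Object[] elements;   // the sorted elements stored in this node
  Node[] children;     // references to child nodes, or null for a leaf
  int elementsSize;    // number of valid entries in elements
  int childrenSize;    // number of valid entries in children

  boolean isLeaf() {
    return children == null;
  }
}
{code}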

{quote}
When the #children is small, ArrayList is just fine. Let's use BTree only when 
#children is larger than a threshold
{quote}
Yes, I did try this as you suggested; please see the {{005}} patch, where I added a 
{{SortedCollection}} which wraps a custom array-list implementation and the 
B-Tree. 
- I didn't follow your suggestion of "The field in INodeDirectory is List 
children which may refer to either an ArrayList or a BTreeList. We may replace 
the list at runtime" because it's hard to use a plain List: with an ArrayList we 
search for the index before insert/delete and access by index, but a B-Tree is 
kept in order internally and is not accessed by index. 

I personally think this approach is a bit complex. Do you have further 
suggestions about it, Nicholas?

On the other hand, back to the B-Tree: Jing, I see you are not in favor of BTree 
extending Node. Is it possible to make some change so that you can accept it? Any 
suggestions?

I appreciate you two spending time on the discussion, thanks a lot!

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-22 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968797#comment-14968797
 ] 

Yi Liu commented on HDFS-9275:
--

Walter, please add some description in the JIRA description about which 
issue in the test you want to fix. It's better to include a Jenkins failure link if 
there is one.

I see you replaced {{setDataNodesDead}} and {{readReplica}} with some existing 
methods in other test utils; that's good.

In the tests, I originally restarted the datanodes at the end of the tests because I 
intended to share the same mini cluster to make the tests faster, but I forgot to use 
{{@BeforeClass}} instead of {{@Before}}, which does the restart for each test. 
I am OK with just removing the restart of the shut-down DNs at the end.
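
For illustration, a minimal JUnit 4 sketch (class name and cluster size are placeholders) of the setup style referred to above, where a shared mini cluster is started once per class instead of once per test:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class TestSharedMiniClusterSketch {
  private static MiniDFSCluster cluster;   // shared across all tests in the class

  @BeforeClass
  public static void setup() throws Exception {
    // Started once for the whole class, so individual tests can reuse it.
    cluster = new MiniDFSCluster.Builder(new Configuration())
        .numDataNodes(9).build();
  }

  @AfterClass
  public static void tearDown() {
    if (cluster != null) {
      cluster.shutdown();
    }
  }
}
{code}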

> Fix TestRecoverStripedFile
> --
>
> Key: HDFS-9275
> URL: https://issues.apache.org/jira/browse/HDFS-9275
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Walter Su
>Assignee: Walter Su
> Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7087) Ability to list /.reserved

2015-10-22 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970062#comment-14970062
 ] 

Yi Liu commented on HDFS-7087:
--

I think there is no problem, thanks Andrew. The cherry-pick showed no conflicts, 
and we usually don't check compilation before committing if there are no 
patch conflicts; we fix it if we find a compile error later.

Thanks Xiao Chen and Jason.

> Ability to list /.reserved
> --
>
> Key: HDFS-7087
> URL: https://issues.apache.org/jira/browse/HDFS-7087
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.6.0
>Reporter: Andrew Wang
>Assignee: Xiao Chen
> Fix For: 2.8.0
>
> Attachments: HDFS-7087.001.patch, HDFS-7087.002.patch, 
> HDFS-7087.003.branch-2.patch, HDFS-7087.003.patch, HDFS-7087.draft.patch
>
>
> We have two special paths within /.reserved now, /.reserved/.inodes and 
> /.reserved/raw. It seems like we should be able to list /.reserved to see 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9275) Fix TestRecoverStripedFile

2015-10-22 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970093#comment-14970093
 ] 

Yi Liu commented on HDFS-9275:
--

Thanks Walter for the update. I will review it tomorrow, since I'm OOO today, if 
no one else has reviewed it by then.

> Fix TestRecoverStripedFile
> --
>
> Key: HDFS-9275
> URL: https://issues.apache.org/jira/browse/HDFS-9275
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Walter Su
>Assignee: Walter Su
> Attachments: HDFS-9275.01.patch, HDFS-9275.02.patch, 
> HDFS-9275.03.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-22 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969206#comment-14969206
 ] 

Yi Liu commented on HDFS-9053:
--

{quote}
The advantages of having one array are 1) smaller memory footprint and 2) not 
requiring to maintain the invariant above.
{quote}
Agreed; my concern is that it requires one array to contain two different data types, 
and it makes the code logic more complex. 

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-22 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969202#comment-14969202
 ] 

Yi Liu commented on HDFS-9053:
--

{quote}
Could you give some examples?
{quote}
https://github.com/google/btree
The Google B-Tree implementation, written in Go. 

https://code.google.com/p/cpp-btree/
A B-Tree implementation in C++.

There are also some other B-Tree implementations on GitHub, which may not be as 
widely used.




> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968277#comment-14968277
 ] 

Yi Liu commented on HDFS-9053:
--

Thanks [~jingzhao]!

{quote}
My concern is if the number of directories is limited, then maybe increasing 44 
bytes per directory is fine. E.g., for a big cluster with >100M 
files/directories, if we only have 1M directories, then we only increase the 
heap size by 44 MB. Compared with the total heap size this may be just <0.1% 
increase.
{quote}
The 44-byte increase per directory was for the old implementation.  
In the latest patch, each directory only increases by *8* bytes, so the 
per-directory overhead is quite small. 

I know some users have more than 500M files, and some directories contain quite 
a lot of files. Usually this kind of large directory is accessed most 
frequently, and, just as in the "barrel principle" (the shortest stave limits the 
capacity), it becomes the bottleneck of the NN.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968277#comment-14968277
 ] 

Yi Liu edited comment on HDFS-9053 at 10/22/15 12:51 AM:
-

Thanks [~jingzhao]!

{quote}
My concern is if the number of directories is limited, then maybe increasing 44 
bytes per directory is fine. E.g., for a big cluster with >100M 
files/directories, if we only have 1M directories, then we only increase the 
heap size by 44 MB. Compared with the total heap size this may be just <0.1% 
increase.
{quote}
The 44-byte increase per directory was for the old implementation.  
In the latest patch, each directory only increases by *8* bytes, so the 
per-directory overhead is quite small.  The new approach also uses the B-Tree only; it 
is NOT switching between ArrayList and B-Tree, please look at the {{007}} patch.

I know some users have more than 500M files, and some directories contain quite 
a lot of files. Usually this kind of large directory is accessed most 
frequently, and, just as in the "barrel principle", it becomes the bottleneck of the NN.


was (Author: hitliuyi):
Thanks [~jingzhao]!

{quote}
My concern is if the number of directories is limited, then maybe increasing 44 
bytes per directory is fine. E.g., for a big cluster with >100M 
files/directories, if we only have 1M directories, then we only increase the 
heap size by 44 MB. Compared with the total heap size this may be just <0.1% 
increase.
{quote}
Each directory increasing 44 bytes is the old implementation.  
In the latest patch, each directory only increase *8* bytes.  So it's quite 
small for directory. 

I know some users have more than 500M files, and some directory contains quite 
lots of files, usually this kind of large directory is accessed most 
frequently, just as "barrel principle", this becomes the bottleneck of NN.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7087) Ability to list /.reserved

2015-10-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968396#comment-14968396
 ] 

Yi Liu commented on HDFS-7087:
--

That's right. I will help to revert it from branch-2 and re-commit it shortly.

> Ability to list /.reserved
> --
>
> Key: HDFS-7087
> URL: https://issues.apache.org/jira/browse/HDFS-7087
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.6.0
>Reporter: Andrew Wang
>Assignee: Xiao Chen
> Fix For: 2.8.0
>
> Attachments: HDFS-7087.001.patch, HDFS-7087.002.patch, 
> HDFS-7087.003.patch, HDFS-7087.draft.patch
>
>
> We have two special paths within /.reserved now, /.reserved/.inodes and 
> /.reserved/raw. It seems like we should be able to list /.reserved to see 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7087) Ability to list /.reserved

2015-10-21 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-7087:
-
Attachment: HDFS-7087.003.branch-2.patch

Recommitted to branch-2, and attached the branch-2 patch for tracking.

> Ability to list /.reserved
> --
>
> Key: HDFS-7087
> URL: https://issues.apache.org/jira/browse/HDFS-7087
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.6.0
>Reporter: Andrew Wang
>Assignee: Xiao Chen
> Fix For: 2.8.0
>
> Attachments: HDFS-7087.001.patch, HDFS-7087.002.patch, 
> HDFS-7087.003.branch-2.patch, HDFS-7087.003.patch, HDFS-7087.draft.patch
>
>
> We have two special paths within /.reserved now, /.reserved/.inodes and 
> /.reserved/raw. It seems like we should be able to list /.reserved to see 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968445#comment-14968445
 ] 

Yi Liu commented on HDFS-9053:
--

Got it now. Thanks a lot [~jingzhao]!
Let's recalculate.  For the original B-Tree implementation, the 44-byte 
increase was my estimate, ignoring alignment.  Now that I have looked into the 
real memory layout, it actually increases by (40 BTree + 40 Node) - 40 ArrayList = 40 bytes. 
And I can make a small improvement by removing the {{degree}} variable: 4 bytes + 4 bytes 
alignment/padding gap = 8 bytes saved.
So finally, if we don't let BTree extend Node, it increases by *32 bytes* for a 
directory.

A 32-byte memory increment per directory is fine for me and was my original 
thought.  As in your example, if we have 1M directories, then we only increase 
the heap size by 32 MB.  I also respect Nicholas' comment; if we all think it's OK, 
I am happy to do this :).  


{noformat}
org.apache.hadoop.util.BTree object internals:
 OFFSET  SIZE  TYPE DESCRIPTIONVALUE
  016   (object header)N/A
 16 4   int BTree.degree   N/A
 20 4   int BTree.size N/A
 24 4   int BTree.modCount N/A
 28 4   (alignment/padding gap)N/A
 32 8  Node BTree.root N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
{noformat}

{noformat}
org.apache.hadoop.util.BTree.Node object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total
{noformat}

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968445#comment-14968445
 ] 

Yi Liu edited comment on HDFS-9053 at 10/22/15 3:34 AM:


Got it now. Thanks a lot [~jingzhao]!
Let's recalculate.  For the original B-Tree implementation, the 44-byte 
increase was my estimate, ignoring alignment.  Now that I have looked into the 
real memory layout, it actually increases by (40 BTree + 40 Node) - 40 ArrayList = 40 bytes. 
And I can make a small improvement by removing the {{degree}} variable: 4 bytes + 4 bytes 
alignment/padding gap = 8 bytes saved.
So finally, if we don't let BTree extend Node, it increases by *32 bytes* for a 
directory.

A 32-byte memory increment per directory is fine for me and also matches 
my original thought.  As in your example, if we have 1M directories, then we 
only increase the heap size by 32 MB.  I also respect Nicholas' comment; if we all 
think it's OK, I am happy to do this :).  


{noformat}
org.apache.hadoop.util.BTree object internals:
 OFFSET  SIZE  TYPE DESCRIPTIONVALUE
  016   (object header)N/A
 16 4   int BTree.degree   N/A
 20 4   int BTree.size N/A
 24 4   int BTree.modCount N/A
 28 4   (alignment/padding gap)N/A
 32 8  Node BTree.root N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
{noformat}

{noformat}
org.apache.hadoop.util.BTree.Node object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total
{noformat}


was (Author: hitliuyi):
Got it now. Thanks a lot [~jingzhao]!
Let's recalculate again.  For original btree implementation, it increases 44 
bytes which is my estimation and ignore alignment.  Now I have looked into the 
real memory layout, it actually increases (40 + 40) - 40 arraylist = 40 bytes. 
And I can do small improvement to remove {{degree}} variable 4 bytes + 4 bytes 
alignment/padding gap = 8 bytes.
So finally if we don't let BTree extend Node, it increases *32 bytes* for a 
directly.

32 bytes memory increment for a directory is fine for me and was my original 
thought.  As in your example, if we have 1M directories, then we only increase 
heap size by 32 MB.  I also respect Nicholas' comment, if we all think it's OK, 
I am happy to do this :).  


{noformat}
org.apache.hadoop.util.BTree object internals:
 OFFSET  SIZE  TYPE DESCRIPTIONVALUE
  016   (object header)N/A
 16 4   int BTree.degree   N/A
 20 4   int BTree.size N/A
 24 4   int BTree.modCount N/A
 28 4   (alignment/padding gap)N/A
 32 8  Node BTree.root N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
{noformat}

{noformat}
org.apache.hadoop.util.BTree.Node object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total
{noformat}

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, 

[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-20 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964775#comment-14964775
 ] 

Yi Liu commented on HDFS-9053:
--

The test failures are not related.
I plan to support {{shrinkable}} in a future task.

Hi [~jingzhao], could you help review the new patch again? (The diff is 
small compared with the previous patch you reviewed.)
[~szetszwo], I can change the default degree to 2K if you like.

BTW, I have checked the logic of the B-Tree many times and done a lot of tests, and 
I am confident of its correctness. 

Thanks a lot.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9274) Default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec should be consistent

2015-10-20 Thread Yi Liu (JIRA)
Yi Liu created HDFS-9274:


 Summary: Default value of 
dfs.datanode.directoryscan.throttle.limit.ms.per.sec should be consistent
 Key: HDFS-9274
 URL: https://issues.apache.org/jira/browse/HDFS-9274
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Yi Liu
Assignee: Yi Liu
Priority: Trivial


I always see the following error log while running:
{noformat}
ERROR datanode.DirectoryScanner (DirectoryScanner.java:(430)) - 
dfs.datanode.directoryscan.throttle.limit.ms.per.sec set to value below 1 
ms/sec. Assuming default value of 1000
{noformat}

{code}
<property>
  <name>dfs.datanode.directoryscan.throttle.limit.ms.per.sec</name>
  <value>0</value>
...
{code}
The default value should be 1000 and consistent with 
DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT
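
For illustration, a hedged sketch of how the datanode reads this setting (the constant names in {{DFSConfigKeys}} are assumed from the text above); any configured value below 1 ms/sec triggers the ERROR and the fallback to 1000:

{code}
Configuration conf = new Configuration();
// Read the throttle limit; the code default is 1000 ms/sec.
int throttleLimitMsPerSec = conf.getInt(
    DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_KEY,
    DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT);
if (throttleLimitMsPerSec < 1) {
  // Matches the ERROR above: a value below 1 ms/sec (such as the 0 currently
  // shipped in hdfs-default.xml) is rejected and the default of 1000 is assumed.
  throttleLimitMsPerSec =
      DFSConfigKeys.DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT;
}
{code}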



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9274) Default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec should be consistent

2015-10-20 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9274:
-
Attachment: HDFS-9274.001.patch

> Default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec should 
> be consistent
> --
>
> Key: HDFS-9274
> URL: https://issues.apache.org/jira/browse/HDFS-9274
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Trivial
> Attachments: HDFS-9274.001.patch
>
>
> Always see following error log while running:
> {noformat}
> ERROR datanode.DirectoryScanner (DirectoryScanner.java:(430)) - 
> dfs.datanode.directoryscan.throttle.limit.ms.per.sec set to value below 1 
> ms/sec. Assuming default value of 1000
> {noformat}
> {code}
> 
>   dfs.datanode.directoryscan.throttle.limit.ms.per.sec
>   0
> ...
> {code}
> The default value should be 1000 and consistent with 
> DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9274) Default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec should be consistent

2015-10-20 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9274:
-
Status: Patch Available  (was: Open)

> Default value of dfs.datanode.directoryscan.throttle.limit.ms.per.sec should 
> be consistent
> --
>
> Key: HDFS-9274
> URL: https://issues.apache.org/jira/browse/HDFS-9274
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Trivial
> Attachments: HDFS-9274.001.patch
>
>
> Always see following error log while running:
> {noformat}
> ERROR datanode.DirectoryScanner (DirectoryScanner.java:(430)) - 
> dfs.datanode.directoryscan.throttle.limit.ms.per.sec set to value below 1 
> ms/sec. Assuming default value of 1000
> {noformat}
> {code}
> 
>   dfs.datanode.directoryscan.throttle.limit.ms.per.sec
>   0
> ...
> {code}
> The default value should be 1000 and consistent with 
> DFS_DATANODE_DIRECTORYSCAN_THROTTLE_LIMIT_MS_PER_SEC_DEFAULT



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-19 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9053:
-
Attachment: HDFS-9053.007.patch

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-19 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964433#comment-14964433
 ] 

Yi Liu commented on HDFS-9053:
--

Updated the patch:
# Fix some checkstyle issues and the whitespace.
# Update a few method names in the B-Tree to be more readable.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9208) Disabling atime may fail clients like distCp

2015-10-19 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9208:
-
   Resolution: Fixed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2. Thanks [~kihwal] for the work and [~cnauroth] 
for the review.

> Disabling atime may fail clients like distCp
> 
>
> Key: HDFS-9208
> URL: https://issues.apache.org/jira/browse/HDFS-9208
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
> Fix For: 2.8.0
>
> Attachments: HDFS-9208.patch, HDFS-9208.v2.patch
>
>
> When atime is disabled, {{setTimes()}} throws an exception if the passed-in 
> atime is not -1.  But since atime is not -1, distCp fails when it tries to 
> set the mtime and atime. 
> There are several options:
> 1) make distCp check for 0 atime and call {{setTimes()}} with -1. I am not 
> very enthusiastic about it.
> 2) make NN also accept 0 atime in addition to -1, when the atime support is 
> disabled.
> 3) support setting mtime & atime regardless of the atime support.  The main 
> reason why atime is disabled is to avoid edit logging/syncing during 
> {{getBlockLocations()}} read calls. Explicit setting can be allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7964) Add support for async edit logging

2015-10-18 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962767#comment-14962767
 ] 

Yi Liu commented on HDFS-7964:
--

1. That's right.
2. You can keep it.

> Add support for async edit logging
> --
>
> Key: HDFS-7964
> URL: https://issues.apache.org/jira/browse/HDFS-7964
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.2-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: HDFS-7964.patch, HDFS-7964.patch
>
>
> Edit logging is a major source of contention within the NN.  LogEdit is 
> called within the namespace write log, while logSync is called outside of the 
> lock to allow greater concurrency.  The handler thread remains busy until 
> logSync returns to provide the client with a durability guarantee for the 
> response.
> Write heavy RPC load and/or slow IO causes handlers to stall in logSync.  
> Although the write lock is not held, readers are limited/starved and the call 
> queue fills.  Combining an edit log thread with postponed RPC responses from 
> HADOOP-10300 will provide the same durability guarantee but immediately free 
> up the handlers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8398) Erasure Coding: Correctly calculate last striped block length in DFSStripedInputStream if it's under construction.

2015-10-16 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu resolved HDFS-8398.
--
Resolution: Duplicate

> Erasure Coding: Correctly calculate last striped block length in 
> DFSStripedInputStream if it's under construction.
> --
>
> Key: HDFS-8398
> URL: https://issues.apache.org/jira/browse/HDFS-8398
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Yi Liu
>Assignee: Yi Liu
>
> Currently in DFSStripedInputStream, for continuous block, if it's under 
> construction, we need to read the block replica length from one of datanode 
> and use it as last block length.
> For striped block, we need to read the length of all internal data blocks of 
> the striped group, then add them correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8059) Erasure coding: revisit how to store EC schema and cellSize in NameNode

2015-10-16 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu resolved HDFS-8059.
--
Resolution: Duplicate

> Erasure coding: revisit how to store EC schema and cellSize in NameNode
> ---
>
> Key: HDFS-8059
> URL: https://issues.apache.org/jira/browse/HDFS-8059
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: HDFS-7285
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: HDFS-8059.001.patch
>
>
> Move {{dataBlockNum}} and {{parityBlockNum}} from BlockInfoStriped to 
> INodeFile, and store them in {{FileWithStripedBlocksFeature}}.
> Ideally these two nums are the same for all striped blocks in a file, and 
> store them in BlockInfoStriped will waste NN memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-7964) Add support for async edit logging

2015-10-16 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960114#comment-14960114
 ] 

Yi Liu edited comment on HDFS-7964 at 10/16/15 6:40 AM:


Thanks [~daryn] for the work.

Further comments:

*1.* In FSEditLogAsync#run
{code}
@Override
public void run() {
  try {
    while (true) {

      if (doSync) {
        ...
        logSync(getLastWrittenTxId());
        ...
{code}
I think it's better to pass the txid of the current edit to {{logSync}}; there is no need 
to wait for all written txids. Then it's more efficient and the client can get a 
faster response. 

*2.*
{code}
-log4j.rootLogger=OFF, CONSOLE
+log4j.rootLogger=DEBUG, CONSOLE
{code}
Any reason to change it?

*3.*
{code}
call.abortResponse(syncEx);
{code}
Seems this code is not available?


was (Author: hitliuyi):
Thanks [~daryn] for the work.

Further comments:

*1.* In FSEditLogAsync#run
{code}
@Override
  public void run() {
try {
  while (true) {

if (doSync) {
  ...
logSync(getLastWrittenTxId());
  ...
{code}
I think it's better to pass the txid of current edit to {{logSync}}, not need 
to wait for all txid written. Then it's more efficient and client can get more 
faster response? 

*2.*
{code}
+  editsBatchedInSync = txid - synctxid - 1;
{code}
Isn't it "txid - synctxid"?   The txid is the max txid written, and synctxid is 
the max txid already synced, suppose txid = 20, synctxid = 10, then the 
editsBatchedInSync should be (txid - synctxid) = (20 - 10) = 10.   Also you can 
get it from the existing log message:
{code}
final String msg =
"Could not sync enough journals to persistent storage " +
"due to " + e.getMessage() + ". " +
"Unsynced transactions: " + (txid - synctxid);
{code}

*3.*
{code}
-log4j.rootLogger=OFF, CONSOLE
+log4j.rootLogger=DEBUG, CONSOLE
{code}
Any reason to change it?

*4.*
{code}
call.abortResponse(syncEx);
{code}
Seems this code is not available?

> Add support for async edit logging
> --
>
> Key: HDFS-7964
> URL: https://issues.apache.org/jira/browse/HDFS-7964
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.2-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: HDFS-7964.patch, HDFS-7964.patch
>
>
> Edit logging is a major source of contention within the NN.  LogEdit is 
> called within the namespace write log, while logSync is called outside of the 
> lock to allow greater concurrency.  The handler thread remains busy until 
> logSync returns to provide the client with a durability guarantee for the 
> response.
> Write heavy RPC load and/or slow IO causes handlers to stall in logSync.  
> Although the write lock is not held, readers are limited/starved and the call 
> queue fills.  Combining an edit log thread with postponed RPC responses from 
> HADOOP-10300 will provide the same durability guarantee but immediately free 
> up the handlers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8087) Erasure Coding: Add more EC zone management APIs (get/list EC zone(s))

2015-10-16 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu resolved HDFS-8087.
--
Resolution: Invalid

Now getErasureCodingPolicies and getErasureCodingPolicy are implemented, so I am 
closing this as Invalid.

> Erasure Coding: Add more EC zone management APIs (get/list EC zone(s))
> --
>
> Key: HDFS-8087
> URL: https://issues.apache.org/jira/browse/HDFS-8087
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Yi Liu
>Assignee: Yi Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8376) Erasure Coding: Update last cellsize calculation according to whether the erasure codec has chunk boundary

2015-10-16 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu resolved HDFS-8376.
--
Resolution: Invalid

> Erasure Coding: Update last cellsize calculation according to whether the 
> erasure codec has chunk boundary
> --
>
> Key: HDFS-8376
> URL: https://issues.apache.org/jira/browse/HDFS-8376
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Yi Liu
>Assignee: Yi Liu
>
> Current calculation for last cell size is as following. For parity cell, the 
> last cell size is the same as the first data cell.  But some erasure codec 
> has chunk boundary, then the last cellsize for parity block is the codec 
> chunk size.
> {code}
> private static int lastCellSize(int size, int cellSize, int numDataBlocks,
>   int i) {
> if (i < numDataBlocks) {
>   // parity block size (i.e. i >= numDataBlocks) is the same as 
>   // the first data block size (i.e. i = 0).
>   size -= i*cellSize;
>   if (size < 0) {
> size = 0;
>   }
> }
> return size > cellSize? cellSize: size;
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-15 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959955#comment-14959955
 ] 

Yi Liu edited comment on HDFS-9053 at 10/16/15 12:56 AM:
-

Thanks [~szetszwo] for the comments.

{quote}
>> 24 8 Object[] Node.elements  N/A
>> 32 8 Object[] Node.children  N/A
It only counts the reference but array objects are not counted. So the BTree 
overhead is still a lot more than ArrayList.
{quote}

For a small number of elements (assume # < max degree, which is 2047), {{children}} 
is a null reference, so there is no {{children}} array object here, just 8 
bytes for the null reference. And {{ArrayList}} also has an {{elements}} array. So, 
as described above, {{BTree}} adds 8 bytes compared with {{ArrayList}} for 
a small number of elements.  Am I missing something?


was (Author: hitliuyi):
Thanks [~szetszwo] for the comments.

{quote}
>> 24 8 Object[] Node.elements  N/A
>> 32 8 Object[] Node.children  N/A
It only counts the reference but array objects are not counted. So the BTree 
overhead is still a lot more than ArrayList.
{quote}

For small elements size (assume # < max degree which is 2047), the {{children}} 
is null reference, so there is no array object of {{children}} here,  and for 
{{elements}}, {{ArrayList}} also has it. So as described above, {{BTree}} 
increases 8 bytes compared with {{ArrayList}} for small size elements.  Do I 
miss something?

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7964) Add support for async edit logging

2015-10-15 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-7964:
-
Status: Patch Available  (was: Open)

> Add support for async edit logging
> --
>
> Key: HDFS-7964
> URL: https://issues.apache.org/jira/browse/HDFS-7964
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.2-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: HDFS-7964.patch, HDFS-7964.patch
>
>
> Edit logging is a major source of contention within the NN.  LogEdit is 
> called within the namespace write log, while logSync is called outside of the 
> lock to allow greater concurrency.  The handler thread remains busy until 
> logSync returns to provide the client with a durability guarantee for the 
> response.
> Write heavy RPC load and/or slow IO causes handlers to stall in logSync.  
> Although the write lock is not held, readers are limited/starved and the call 
> queue fills.  Combining an edit log thread with postponed RPC responses from 
> HADOOP-10300 will provide the same durability guarantee but immediately free 
> up the handlers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7964) Add support for async edit logging

2015-10-15 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960114#comment-14960114
 ] 

Yi Liu commented on HDFS-7964:
--

Thanks [~daryn] for the work.

Further comments:

*1.* In FSEditLogAsync#run
{code}
@Override
  public void run() {
try {
  while (true) {

if (doSync) {
  ...
logSync(getLastWrittenTxId());
  ...
{code}
I think it's better to pass the txid of the current edit to {{logSync}}, so there 
is no need to wait for all written txids. That would be more efficient and the 
client could get a faster response?
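
For illustration, a rough sketch of the suggestion (the accessor on the edit 
object is hypothetical, not the actual patch API):
{code}
if (doSync) {
  // current patch: logSync(getLastWrittenTxId());  // waits for all written txids
  logSync(edit.getTxid());  // hypothetical accessor: sync only up to this edit
}
{code}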

*2.*
{code}
+  editsBatchedInSync = txid - synctxid - 1;
{code}
Isn't it "txid - synctxid"?   The txid is the max txid written, and synctxid is 
the max txid already synced, suppose txid = 20, synctxid = 10, then the 
editsBatchedInSync should be (txid - synctxid) = (20 - 10) = 10.   Also you can 
get it from the existing log message:
{code}
final String msg =
"Could not sync enough journals to persistent storage " +
"due to " + e.getMessage() + ". " +
"Unsynced transactions: " + (txid - synctxid);
{code}

*3.*
{code}
-log4j.rootLogger=OFF, CONSOLE
+log4j.rootLogger=DEBUG, CONSOLE
{code}
Any reason to change it?

*4.*
{code}
call.abortResponse(syncEx);
{code}
Seems this code is not available?

> Add support for async edit logging
> --
>
> Key: HDFS-7964
> URL: https://issues.apache.org/jira/browse/HDFS-7964
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.2-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: HDFS-7964.patch, HDFS-7964.patch
>
>
> Edit logging is a major source of contention within the NN.  LogEdit is 
> called within the namespace write log, while logSync is called outside of the 
> lock to allow greater concurrency.  The handler thread remains busy until 
> logSync returns to provide the client with a durability guarantee for the 
> response.
> Write heavy RPC load and/or slow IO causes handlers to stall in logSync.  
> Although the write lock is not held, readers are limited/starved and the call 
> queue fills.  Combining an edit log thread with postponed RPC responses from 
> HADOOP-10300 will provide the same durability guarantee but immediately free 
> up the handlers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-15 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959988#comment-14959988
 ] 

Yi Liu commented on HDFS-9053:
--

Hi Nicholas, sorry, I may have skipped some description here.  For 2047, I meant 
that it is just an example threshold for the small-elements case. The small-elements 
threshold is an assumed value; we can actually set the degree of the BTree to 
any value we want.  If we want 4K as the threshold for small elements, we can set 
the degree of the B-Tree to 2K, and then the max degree is (4K - 1).   (I 
should have made the description clearer..)
Thanks.
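
For illustration, a small sketch of that arithmetic, assuming the conventional 
bound of (2 * degree - 1) elements per node:
{code}
// Sketch only: how the configured degree maps to the "max degree"
// (maximum number of elements per node) mentioned above.
public class BTreeDegreeExample {
  public static void main(String[] args) {
    int degree = 2048;                        // e.g. 2K, configurable
    int maxElementsPerNode = 2 * degree - 1;  // 4095, i.e. (4K - 1)
    System.out.println(maxElementsPerNode);
  }
}
{code}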

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-15 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959955#comment-14959955
 ] 

Yi Liu commented on HDFS-9053:
--

Thanks [~szetszwo] for the comments.

{quote}
>> 24 8 Object[] Node.elements  N/A
>> 32 8 Object[] Node.children  N/A
It only counts the reference but array objects are not counted. So the BTree 
overhead is still a lot more than ArrayList.
{quote}

For small elements size (assume # < max degree which is 2047), the {{children}} 
is null reference, so there is no array object of {{children}} here,  and for 
{{elements}}, {{ArrayList}} also has it. So as described above, {{BTree}} 
increases 8 bytes compared with {{ArrayList}} for small size elements.  Do I 
miss something?

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9221) HdfsServerConstants#ReplicaState#getState should avoid calling values() since it creates a temporary array

2015-10-12 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9221:
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2. Thanks to Staffan for the contribution, and to 
Colin and Daryn for the review.

> HdfsServerConstants#ReplicaState#getState should avoid calling values() since 
> it creates a temporary array
> --
>
> Key: HDFS-9221
> URL: https://issues.apache.org/jira/browse/HDFS-9221
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: performance
>Affects Versions: 2.7.1
>Reporter: Staffan Friberg
>Assignee: Staffan Friberg
> Fix For: 2.8.0
>
> Attachments: HADOOP-9221.001.patch
>
>
> When the BufferDecoder in BlockListAsLongs converts the stored value to a 
> ReplicaState enum, it calls ReplicaState.getState(int); unfortunately this 
> method creates a new ReplicaState[] on each call, since it calls 
> ReplicaState.values().
> This patch creates a cached copy of the values and thus avoids any 
> allocation when doing the conversion.
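
For illustration, a minimal sketch of the caching pattern described above (the 
enum constants follow HDFS's ReplicaState, but this is not the literal patch):
{code}
public enum ReplicaState {
  FINALIZED, RBW, RWR, RUR, TEMPORARY;

  // values() is called once and the result reused, so getState(int) no longer
  // allocates a temporary ReplicaState[] on every call.
  private static final ReplicaState[] CACHED_VALUES = values();

  public static ReplicaState getState(int v) {
    return CACHED_VALUES[v];
  }
}
{code}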



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8988) Use LightWeightHashSet instead of LightWeightLinkedSet in BlockManager#excessReplicateMap

2015-10-12 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-8988:
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2.

> Use LightWeightHashSet instead of LightWeightLinkedSet in 
> BlockManager#excessReplicateMap
> -
>
> Key: HDFS-8988
> URL: https://issues.apache.org/jira/browse/HDFS-8988
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yi Liu
>Assignee: Yi Liu
> Fix For: 2.8.0
>
> Attachments: HDFS-8988.001.patch, HDFS-8988.002.patch
>
>
> {code}
> public final Map excessReplicateMap = 
> new HashMap<>();
> {code}
> {{LightWeightLinkedSet}} extends {{LightWeightHashSet}} and in addition it 
> stores elements in double linked list to ensure ordered traversal. So it 
> requires more memory for each entry (2 references = 8 + 8 bytes = 16 bytes, 
> assume 64-bits system/JVM).  
> I have traversed the source code, and we don't need ordered traversal for 
> excess replicated blocks, so could use  {{LightWeightHashSet}} to save memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8988) Use LightWeightHashSet instead of LightWeightLinkedSet in BlockManager#excessReplicateMap

2015-10-12 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952724#comment-14952724
 ] 

Yi Liu commented on HDFS-8988:
--

Thanks [~umamaheswararao] for the review; the test failures are not related and 
pass locally.
I will fix the checkstyle issue (one line longer than 80 characters) while 
committing.

> Use LightWeightHashSet instead of LightWeightLinkedSet in 
> BlockManager#excessReplicateMap
> -
>
> Key: HDFS-8988
> URL: https://issues.apache.org/jira/browse/HDFS-8988
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: HDFS-8988.001.patch, HDFS-8988.002.patch
>
>
> {code}
> public final Map excessReplicateMap = 
> new HashMap<>();
> {code}
> {{LightWeightLinkedSet}} extends {{LightWeightHashSet}} and in addition it 
> stores elements in double linked list to ensure ordered traversal. So it 
> requires more memory for each entry (2 references = 8 + 8 bytes = 16 bytes, 
> assume 64-bits system/JVM).  
> I have traversed the source code, and we don't need ordered traversal for 
> excess replicated blocks, so could use  {{LightWeightHashSet}} to save memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-10 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9053:
-
Comment: was deleted

(was: \\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  22m 41s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 8 new or modified test files. |
| {color:green}+1{color} | javac |   8m 55s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  11m 28s | There were no new javadoc 
warning messages. |
| {color:red}-1{color} | release audit |   0m 22s | The applied patch generated 
1 release audit warnings. |
| {color:red}-1{color} | checkstyle |   2m  6s | The applied patch generated  8 
new checkstyle issues (total was 0, now 8). |
| {color:red}-1{color} | whitespace |   0m 10s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 47s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 38s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 47s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests |   7m 48s | Tests passed in 
hadoop-common. |
| {color:red}-1{color} | hdfs tests | 120m 50s | Tests failed in hadoop-hdfs. |
| | | 181m 56s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.TestWriteRead |
|   | hadoop.hdfs.server.namenode.ha.TestDNFencing |
|   | hadoop.hdfs.TestHFlush |
|   | hadoop.security.TestPermission |
|   | hadoop.hdfs.TestParallelRead |
|   | hadoop.fs.viewfs.TestViewFsHdfs |
|   | hadoop.hdfs.TestBlockReaderLocalLegacy |
|   | hadoop.hdfs.server.namenode.TestXAttrConfigFlag |
|   | hadoop.hdfs.TestPread |
|   | hadoop.hdfs.TestMiniDFSCluster |
|   | hadoop.hdfs.TestDFSStripedOutputStream |
|   | hadoop.hdfs.TestWriteConfigurationToDFS |
|   | hadoop.hdfs.TestDatanodeRegistration |
|   | hadoop.hdfs.web.TestWebHDFSXAttr |
|   | hadoop.hdfs.TestDFSRollback |
|   | hadoop.hdfs.TestDataTransferKeepalive |
|   | hadoop.hdfs.server.namenode.ha.TestDelegationTokensWithHA |
|   | hadoop.hdfs.TestDatanodeConfig |
|   | hadoop.hdfs.TestDFSFinalize |
|   | hadoop.hdfs.server.namenode.ha.TestGetGroupsWithHA |
|   | hadoop.fs.TestWebHdfsFileContextMainOperations |
|   | hadoop.fs.TestGlobPaths |
|   | hadoop.hdfs.TestDFSShell |
|   | hadoop.hdfs.tools.TestDFSHAAdminMiniCluster |
|   | hadoop.hdfs.TestSeekBug |
|   | hadoop.fs.loadGenerator.TestLoadGenerator |
|   | hadoop.hdfs.TestCrcCorruption |
|   | hadoop.fs.contract.hdfs.TestHDFSContractMkdir |
|   | hadoop.hdfs.TestAbandonBlock |
|   | hadoop.hdfs.TestGetFileChecksum |
|   | hadoop.hdfs.TestSafeModeWithStripedFile |
|   | 
hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewerForContentSummary |
|   | hadoop.hdfs.TestFileCreationDelete |
|   | hadoop.hdfs.TestReadWhileWriting |
|   | hadoop.fs.viewfs.TestViewFileSystemWithAcls |
|   | hadoop.fs.contract.hdfs.TestHDFSContractConcat |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010 |
|   | hadoop.hdfs.util.TestDiff |
|   | hadoop.hdfs.security.TestDelegationToken |
|   | hadoop.fs.TestSymlinkHdfsDisable |
|   | hadoop.fs.contract.hdfs.TestHDFSContractRootDirectory |
|   | hadoop.hdfs.TestMissingBlocksAlert |
|   | hadoop.hdfs.TestBlocksScheduledCounter |
|   | hadoop.hdfs.TestSmallBlock |
|   | hadoop.cli.TestDeleteCLI |
|   | hadoop.hdfs.TestDFSClientRetries |
|   | hadoop.fs.viewfs.TestViewFsWithXAttrs |
|   | hadoop.hdfs.TestDFSMkdirs |
|   | hadoop.hdfs.tools.TestDFSAdmin |
|   | hadoop.hdfs.server.namenode.TestFavoredNodesEndToEnd |
|   | hadoop.hdfs.server.namenode.TestNameNodeMXBean |
|   | hadoop.hdfs.web.TestWebHDFSForHA |
|   | hadoop.fs.viewfs.TestViewFsDefaultValue |
|   | hadoop.fs.contract.hdfs.TestHDFSContractOpen |
|   | hadoop.hdfs.server.namenode.TestDeleteRace |
|   | hadoop.hdfs.TestFSInputChecker |
|   | hadoop.hdfs.web.TestWebHdfsWithAuthenticationFilter |
|   | hadoop.fs.contract.hdfs.TestHDFSContractRename |
|   | hadoop.hdfs.server.namenode.ha.TestXAttrsWithHA |
|   | hadoop.hdfs.server.namenode.ha.TestHAMetrics |
|   | hadoop.cli.TestErasureCodingCLI |
|   | hadoop.hdfs.TestRollingUpgradeRollback |
|   | hadoop.hdfs.TestRemoteBlockReader |
|   | hadoop.hdfs.server.namenode.TestNameNodeResourceChecker |
|   | hadoop.hdfs.tools.TestDFSAdminWithHA |
|   | hadoop.hdfs.TestBlockStoragePolicy |
|   | hadoop.hdfs.TestLeaseRecovery |
|   | hadoop.fs.viewfs.TestViewFsAtHdfsRoot |
|   | hadoop.hdfs.TestBlockReaderLocal |
|   | hadoop.fs.contract.hdfs.TestHDFSContractGetFileStatus |
|   | hadoop.cli.TestCryptoAdminCLI |
|   | 

[jira] [Updated] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-09 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9053:
-
Attachment: HDFS-9053.005.patch

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-09 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949847#comment-14949847
 ] 

Yi Liu edited comment on HDFS-9053 at 10/9/15 8:49 AM:
---

Hi [~szetszwo], thanks for your comments.

I will follow your suggestion: use an array list if the children size is small 
(<= 4K), and otherwise use a B-Tree. But there may be some differences in the 
implementation details.

{quote}
The field in INodeDirectory is List children which may refer to either 
an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List, since with ArrayList we search before an indexed 
insert/delete and access elements by index, while the BTree is an ordered structure 
that we don't access by index. So I think we can do it in the following way:
# Actually I made an initial patch for switching between an array list and a b-tree 
a few days ago. The logic of ArrayList is not complicated, so I implemented it in a 
new class that supports shrinking; in the new class I control the array and its 
expansion, and the class also keeps a reference to a b-tree, so if the number of 
elements becomes large, it switches to the b-tree. The new class is an ordered data 
structure, unlike ArrayList where we need to search before operating.  In 
INodeDirectory we just need to use the new class. Another reason I implemented a 
new data structure class instead of doing the switching in INodeDirectory is that 
we need a {{SWITCH_THRESHOLD}} for switching from the array list to the b-tree, and 
a lower water mark for switching back; they should not be the same value, otherwise 
the switching can become frequent around that point (a rough sketch of this 
hysteresis follows below). So I don't want to expose too much of the internal 
switching logic in INodeDirectory. I will give the memory usage after posting a new 
patch.
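
For illustration only, a rough sketch of the switching idea with hysteresis (the 
class name, thresholds, and the use of TreeSet as a stand-in for the patch's 
B-Tree are assumptions):
{code}
// Switch to a tree above a high threshold and back to the array below a lower
// water mark, so the structure does not flap back and forth around one size.
class SortedCollection<E extends Comparable<E>> {
  private static final int SWITCH_THRESHOLD = 4096;  // assumed high-water mark
  private static final int LOW_WATER_MARK   = 2048;  // assumed low-water mark

  private java.util.ArrayList<E> array = new java.util.ArrayList<>();
  private java.util.TreeSet<E> tree;  // stand-in for the patch's BTree

  void add(E e) {
    if (tree != null) {
      tree.add(e);
      return;
    }
    int idx = java.util.Collections.binarySearch(array, e);
    if (idx < 0) {
      array.add(-idx - 1, e);  // keep the array sorted
    }
    if (array.size() > SWITCH_THRESHOLD) {  // grew past the threshold: move to the tree
      tree = new java.util.TreeSet<>(array);
      array = null;
    }
  }

  void remove(E e) {
    if (tree != null) {
      tree.remove(e);
      if (tree.size() < LOW_WATER_MARK) {   // shrank below the low water mark: move back
        array = new java.util.ArrayList<>(tree);
        tree = null;
      }
    } else {
      int idx = java.util.Collections.binarySearch(array, e);
      if (idx >= 0) {
        array.remove(idx);
      }
    }
  }
}
{code}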


{quote}
 I am also worry about the potiental bugs and the risk. If there is a bug in 
B-Tree, it is possible to lose one or more sub trees and, as a result, lose a 
lot of data. ArrayList is already well-tested. Anyway, we need more tests for 
the B-Tree, especially some long running random tests.
{quote}
Sure, I will add more tests for it; I have already added many, including some 
long-running ones. I agree that a bug-free data structure implementation is not 
easy; we should be careful and test the new implementation extensively :) 


was (Author: hitliuyi):
Hi [~szetszwo], thanks for your comments.

I will do as your suggestion: use an array list if the children size is small 
(<= 4K), otherwise use B-Tree. But there may be some differences for the 
implementation details.

{quote}
The field in INodeDirectory is List children which may refer to either 
an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List, since when we use ArrayList, we do searching 
before index/delete and access through index, but BTree is in-order and we 
don't access through index. So I think we do it in following way:
# actually I made an initial patch of switching between array list and b-tree 
few days ago. The logic of ArrayList is not complicated, so I implement it in a 
new class and support shrinkable, in the new class, I control the array and 
expanding, also the new class keeps reference to a b-tree, if the elements size 
becomes large, it switches to use the b-tree. The new class is an in-order data 
structure, not like ArrayList which we need to search before operating.   In 
INodeDirectory, we just need to use the new class, another reason of I 
implement a new data structure class and don't do the switching in 
INodeDirectory is: we should have a {{SWITCH_THRESHOLD}} for switching from 
array list to b-tree, and need a low water mark to switch back, they should not 
be the same value, otherwise, the switching becomes frequent at some point, so 
I don't want to expose too many internal logic of switching in INodeDirectly. 
But the final memory usage of the new data structure is the same as ArrayList, 
even it has a reference to b-tree, and it supports Shrinkable  I will give 
the memory usage after posting a new patch.


{quote}
 I am also worry about the potiental bugs and the risk. If there is a bug in 
B-Tree, it is possible to lose one or more sub trees and, as a result, lose a 
lot of data. ArrayList is already well-tested. Anyway, we need more tests for 
the B-Tree, especially some long running random tests.
{quote}
Sure, I will add more tests for it, I have added many tests including some long 
running. I agree with that a bug-free data structure implementation is not 
easy, we should be careful and test the new implementations extensively :) 

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>

[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-09 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950082#comment-14950082
 ] 

Yi Liu commented on HDFS-9053:
--

Updated the patch. The new patch uses an array list if the children size is 
small (<= 4K), and otherwise uses a B-Tree.
The new patch includes the following changes:
# Add {{SortedCollection}}, which stores elements in an array list if the size is 
small and otherwise uses a B-Tree. It implements a shrinkable array list and 
controls the expansion. The merits compared with using java ArrayList are: 
(1) less memory: it saves the object overhead/alignment of ArrayList; (2) the max 
capacity is 4K, so there is no need to expand to a capacity larger than 4K; 
(3) shrinkable: if the number of elements becomes small, the internal array will 
shrink (a small sketch of this follows below).
# Add more long-running tests for {{B-Tree}} and {{SortedCollection}}.

I am still running the long-running tests locally; they have all succeeded so far.
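
For illustration only, a small sketch of the shrinking behaviour in point (3) 
above (thresholds are assumptions, not the actual patch):
{code}
// After removals, re-allocate a smaller backing array once occupancy drops far
// enough; plain java.util.ArrayList never does this on its own.
public class ShrinkableArray {
  private Object[] elements = new Object[8];
  private int size;

  void removeLast() {
    if (size > 0) {
      elements[--size] = null;
      shrinkIfSparse();
    }
  }

  private void shrinkIfSparse() {
    if (elements.length > 8 && size < elements.length / 4) {
      Object[] smaller = new Object[Math.max(8, elements.length / 2)];
      System.arraycopy(elements, 0, smaller, 0, size);
      elements = smaller;
    }
  }
}
{code}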

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-09 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950084#comment-14950084
 ] 

Yi Liu commented on HDFS-9053:
--

The increased memory usage of {{SortedCollection}} is only 8 bytes compared with 
{{ArrayList}} when the elements size is small (<= 4K).
{noformat}
java.util.ArrayList object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int AbstractList.modCount  N/A
 20 4  (alignment/padding gap)N/A
 24 4  int ArrayList.size N/A
 28 4  (alignment/padding gap)N/A
 32 8 Object[] ArrayList.elementData  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
{noformat}

{noformat}
org.apache.hadoop.hdfs.util.SortedCollection object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int SortedCollection.initCapacity  N/A
 20 4  int SortedCollection.size  N/A
 24 4  int SortedCollection.modCount  N/A
 28 4  int SortedCollection.degreeN/A
 32 8 Object[] SortedCollection.elements  N/A
 40 8BTree SortedCollection.btree N/A
Instance size: 48 bytes (estimated, the sample instance is not available)
{noformat}

Hi [~jingzhao], [~szetszwo], could you review the new patch? Thanks.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8988) Use LightWeightHashSet instead of LightWeightLinkedSet in BlockManager#excessReplicateMap

2015-10-09 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-8988:
-
Attachment: HDFS-8988.002.patch

Rebase the patch.

> Use LightWeightHashSet instead of LightWeightLinkedSet in 
> BlockManager#excessReplicateMap
> -
>
> Key: HDFS-8988
> URL: https://issues.apache.org/jira/browse/HDFS-8988
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: HDFS-8988.001.patch, HDFS-8988.002.patch
>
>
> {code}
> public final Map excessReplicateMap = 
> new HashMap<>();
> {code}
> {{LightWeightLinkedSet}} extends {{LightWeightHashSet}} and in addition it 
> stores elements in double linked list to ensure ordered traversal. So it 
> requires more memory for each entry (2 references = 8 + 8 bytes = 16 bytes, 
> assume 64-bits system/JVM).  
> I have traversed the source code, and we don't need ordered traversal for 
> excess replicated blocks, so could use  {{LightWeightHashSet}} to save memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8988) Use LightWeightHashSet instead of LightWeightLinkedSet in BlockManager#excessReplicateMap

2015-10-09 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-8988:
-
Priority: Major  (was: Minor)

> Use LightWeightHashSet instead of LightWeightLinkedSet in 
> BlockManager#excessReplicateMap
> -
>
> Key: HDFS-8988
> URL: https://issues.apache.org/jira/browse/HDFS-8988
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yi Liu
>Assignee: Yi Liu
> Attachments: HDFS-8988.001.patch
>
>
> {code}
> public final Map excessReplicateMap = 
> new HashMap<>();
> {code}
> {{LightWeightLinkedSet}} extends {{LightWeightHashSet}} and in addition it 
> stores elements in double linked list to ensure ordered traversal. So it 
> requires more memory for each entry (2 references = 8 + 8 bytes = 16 bytes, 
> assume 64-bits system/JVM).  
> I have traversed the source code, and we don't need ordered traversal for 
> excess replicated blocks, so could use  {{LightWeightHashSet}} to save memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-09 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9053:
-
Attachment: HDFS-9053.006.patch

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-09 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950498#comment-14950498
 ] 

Yi Liu edited comment on HDFS-9053 at 10/9/15 2:51 PM:
---

I found a good approach to reduce the B-Tree memory overhead so that it only adds 
*8 bytes* of memory usage compared with {{ArrayList}} for a small number of 
elements. So we don't need to use ArrayList when #children is small (< 4K); we 
can always use the {{BTree}}.
The main idea is to let {{BTree}} extend the B-Tree Node, so we don't need a 
separate root node, since the {{BTree}} itself is the root.

{noformat}
java.util.ArrayList object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int AbstractList.modCount  N/A
 20 4  (alignment/padding gap)N/A
 24 4  int ArrayList.size N/A
 28 4  (alignment/padding gap)N/A
 32 8 Object[] ArrayList.elementData  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
{noformat}

{noformat}
org.apache.hadoop.util.btree.BTree object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
 40 4  int BTree.size N/A
 44 4  int BTree.modCount N/A
Instance size: 48 bytes (estimated, the sample instance is not available)
{noformat}
We can see that {{BTree}} only adds *8 bytes* compared with {{ArrayList}} for 
an {{INodeDirectory}}.
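
For illustration, a structural sketch of the "{{BTree}} extends Node" idea (field 
names follow the layout above; this is not the actual patch code):
{code}
// The root's fields live directly in the BTree object, so no separate root Node
// object (and no extra object header) is allocated for small directories.
class Node {
  int elementsSize;
  int childrenSize;
  Object[] elements;
  Object[] children;  // stays null until the root actually splits
}

class BTree extends Node {
  int size;      // total number of elements in the tree
  int modCount;  // modification counter for fail-fast iteration
}
{code}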

[~jingzhao], [~szetszwo], please look at the new patch {{006}}.


was (Author: hitliuyi):
I find a good approach to improve B-Tree overhead to make it only increase *8 
bytes* memory usage comparing with using {{ArrayList}} for small elements size. 
So we don't need to use ArrayList when #children is small (< 4K), and we can 
always use the {{BTree}}.
The main idea is to let {{BTree}} extend the BTree Node, then we don't need a 
separate root node, since {{BTree}} itself is the root.

{noformat}
java.util.ArrayList object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int AbstractList.modCount  N/A
 20 4  (alignment/padding gap)N/A
 24 4  int ArrayList.size N/A
 28 4  (alignment/padding gap)N/A
 32 8 Object[] ArrayList.elementData  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
{noformat}

{noformat}
org.apache.hadoop.util.btree.BTree object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
 40 4  int BTree.size N/A
 44 4  int BTree.modCount N/A
Instance size: 48 bytes (estimated, the sample instance is not available)
{noformat}
We can see {{BTree}} only increases *8 bytes* comparing with {{ArrayList}} for 
a {{INodeDirectory}}.

[~jingzhao], [~szetszwo], please look at the new patch {{006}}.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in 

[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-09 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950498#comment-14950498
 ] 

Yi Liu commented on HDFS-9053:
--

I find a good approach to improve B-Tree overhead to make it only increase *8 
bytes* memory usage comparing with using {{ArrayList}} for small elements size. 
So we don't need to use ArrayList when #children is small (< 4K), and we can 
always use the {{BTree}}.
The main idea is to let {{BTree}} extend the BTree Node, then we don't need a 
separate root node, since {{BTree}} itself is the root.

{noformat}
java.util.ArrayList object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int AbstractList.modCount  N/A
 20 4  (alignment/padding gap)N/A
 24 4  int ArrayList.size N/A
 28 4  (alignment/padding gap)N/A
 32 8 Object[] ArrayList.elementData  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
{noformat}

{noformat}
org.apache.hadoop.util.btree.BTree object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
 40 4  int BTree.size N/A
 44 4  int BTree.modCount N/A
Instance size: 48 bytes (estimated, the sample instance is not available)
{noformat}
We can see {{BTree}} only increases *8 bytes* comparing with {{ArrayList}} for 
a {{INodeDirectory}}.

[~jingzhao], [~szetszwo], please look at the new patch {{006}}.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-08 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948629#comment-14948629
 ] 

Yi Liu commented on HDFS-9053:
--

Hi [~szetszwo], do you have further comments about it? Thanks. 

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-08 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949847#comment-14949847
 ] 

Yi Liu commented on HDFS-9053:
--

Hi [~szetszwo], thanks for your comments.

I will do as your suggestion: use an array list if the children size is small 
(<= 4K), otherwise use B-Tree. But there may be some differences for the 
implementation details.

{quote}
The field in INodeDirectory is List children which may refer to either 
an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List, since when we use ArrayList, we do searching 
before index/delete and access through index, but BTree is in-order and we 
don't access through index. So I think we do it in following way:
# actually I made an initial patch of switching between array list and b-tree 
few days ago. The logic of ArrayList is not complicated, so I implement it in a 
new class and support shrinkable, in the new class, I control the array and 
expanding, also the new class keeps reference to a b-tree, if the elements size 
becomes large, it switches to use the b-tree. The new class is an in-order data 
structure, not like ArrayList which we need to search before operating.   In 
INodeDirectory, we just need to use the new class, another reason of I 
implement a new data structure class and don't do the switching in 
INodeDirectory is: we should have a {{SWITCH_THRESHOLD}} for switching from 
array list to b-tree, and need a low water mark to switch back, they should not 
be the same value, otherwise, the switching becomes frequent at some point, so 
I don't want to expose too many internal logic of switching in INodeDirectly. 
But the final memory usage of the new data structure is the same as ArrayList, 
even it has a reference to b-tree, and it supports Shrinkable  I will give 
the memory usage after posting a new patch.

{quote}
 I am also worry about the potiental bugs and the risk. If there is a bug in 
B-Tree, it is possible to lose one or more sub trees and, as a result, lose a 
lot of data. ArrayList is already well-tested. Anyway, we need more tests for 
the B-Tree, especially some long running random tests.
{quote}
Sure, I will add more tests for it, I have added many tests including some long 
running. I agree with that a bug-free data structure implementation is not 
easy, we should be careful and test the new implementations extensively :) 

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-08 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949847#comment-14949847
 ] 

Yi Liu edited comment on HDFS-9053 at 10/9/15 3:51 AM:
---

Hi [~szetszwo], thanks for your comments.

I will do as your suggestion: use an array list if the children size is small 
(<= 4K), otherwise use B-Tree. But there may be some differences for the 
implementation details.

{quote}
The field in INodeDirectory is List children which may refer to either 
an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List, since when we use ArrayList, we do searching 
before index/delete and access through index, but BTree is in-order and we 
don't access through index. So I think we do it in following way:
# actually I made an initial patch of switching between array list and b-tree 
few days ago. The logic of ArrayList is not complicated, so I implement it in a 
new class and support shrinkable, in the new class, I control the array and 
expanding, also the new class keeps reference to a b-tree, if the elements size 
becomes large, it switches to use the b-tree. The new class is an in-order data 
structure, not like ArrayList which we need to search before operating.   In 
INodeDirectory, we just need to use the new class, another reason of I 
implement a new data structure class and don't do the switching in 
INodeDirectory is: we should have a {{SWITCH_THRESHOLD}} for switching from 
array list to b-tree, and need a low water mark to switch back, they should not 
be the same value, otherwise, the switching becomes frequent at some point, so 
I don't want to expose too many internal logic of switching in INodeDirectly. 
But the final memory usage of the new data structure is the same as ArrayList, 
even it has a reference to b-tree, and it supports Shrinkable  I will give 
the memory usage after posting a new patch.


{quote}
 I am also worry about the potiental bugs and the risk. If there is a bug in 
B-Tree, it is possible to lose one or more sub trees and, as a result, lose a 
lot of data. ArrayList is already well-tested. Anyway, we need more tests for 
the B-Tree, especially some long running random tests.
{quote}
Sure, I will add more tests for it, I have added many tests including some long 
running. I agree with that a bug-free data structure implementation is not 
easy, we should be careful and test the new implementations extensively :) 


was (Author: hitliuyi):
Hi [~szetszwo], thanks for your comments.

I will do as your suggestion: use an array list if the children size is small 
(<= 4K), otherwise use B-Tree. But there may be some differences for the 
implementation details.

{quote}
The field in INodeDirectory is List children which may refer to either 
an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List, since when we use ArrayList, we do searching 
before index/delete and access through index, but BTree is in-order and we 
don't access through index. So I think we do it in following way:
# actually I made an initial patch of switching between array list and b-tree 
few days ago. The logic of ArrayList is not complicated, so I implement it in a 
new class and support shrinkable, in the new class, I control the array and 
expanding, also the new class keeps reference to a b-tree, if the elements size 
becomes large, it switches to use the b-tree. The new class is an in-order data 
structure, not like ArrayList which we need to search before operating.   In 
INodeDirectory, we just need to use the new class, another reason of I 
implement a new data structure class and don't do the switching in 
INodeDirectory is: we should have a {{SWITCH_THRESHOLD}} for switching from 
array list to b-tree, and need a low water mark to switch back, they should not 
be the same value, otherwise, the switching becomes frequent at some point, so 
I don't want to expose too many internal logic of switching in INodeDirectly. 
But the final memory usage of the new data structure is the same as ArrayList, 
even it has a reference to b-tree, and it supports Shrinkable  I will give 
the memory usage after posting a new patch.

{quote}
 I am also worry about the potiental bugs and the risk. If there is a bug in 
B-Tree, it is possible to lose one or more sub trees and, as a result, lose a 
lot of data. ArrayList is already well-tested. Anyway, we need more tests for 
the B-Tree, especially some long running random tests.
{quote}
Sure, I will add more tests for it, I have added many tests including some long 
running. I agree with that a bug-free data structure implementation is not 
easy, we should be careful and test the new implementations extensively :) 

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: 

[jira] [Commented] (HDFS-9137) DeadLock between DataNode#refreshVolumes and BPOfferService#registrationSucceeded

2015-10-07 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947963#comment-14947963
 ] 

Yi Liu commented on HDFS-9137:
--

+1, thanks Uma, Colin and Vinay.

> DeadLock between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded 
> --
>
> Key: HDFS-9137
> URL: https://issues.apache.org/jira/browse/HDFS-9137
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.7.1
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-9137.00.patch, 
> HDFS-9137.01-WithPreservingRootExceptions.patch, HDFSS-9137.02.patch
>
>
> I can see these code flows between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded could cause a deadlock.
> In practice the situation may be rare, since it requires a user to call 
> refreshVolumes at the same time the DN is registering with the NN, but it 
> seems the issue can happen.
>  Reason for the deadlock:
>   1) refreshVolumes is called with the DN lock held and, at the end, it also 
> triggers a block report. In the block report call, 
> BPServiceActor#triggerBlockReport calls toString on bpos, which takes the 
> read lock on bpos.
>  DN lock, then bpos lock.
> 2) BPOfferService#registrationSucceeded takes the write lock on bpos and 
> calls dn.bpRegistrationSucceeded, which is again a synchronized call on the 
> DN.
> bpos lock, then DN lock.
> So this can clearly create a deadlock.
> I think a simple fix could be to move the triggerBlockReport call outside the 
> DN lock; I feel that call may not really be needed inside the DN lock.
> Thoughts?
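
A minimal, self-contained sketch of this kind of lock-ordering deadlock 
(illustrative only, not the actual DataNode/BPOfferService code; class and 
method names are made up):

{code}
// Two monitors acquired in opposite order by two threads, mirroring
// "DN lock then bpos lock" vs "bpos lock then DN lock" described above.
public class LockOrderDeadlock {
  private final Object dnLock = new Object();   // stands in for the DataNode monitor
  private final Object bposLock = new Object(); // stands in for the BPOfferService lock

  // like refreshVolumes -> triggerBlockReport -> bpos read lock
  void refreshVolumesPath() {
    synchronized (dnLock) {
      synchronized (bposLock) {
        // trigger the block report, read bpos state
      }
    }
  }

  // like registrationSucceeded -> dn.bpRegistrationSucceeded
  void registrationPath() {
    synchronized (bposLock) {
      synchronized (dnLock) {
        // update the DataNode registration state
      }
    }
  }

  public static void main(String[] args) {
    LockOrderDeadlock d = new LockOrderDeadlock();
    new Thread(d::refreshVolumesPath).start();
    new Thread(d::registrationPath).start();
    // With unlucky timing each thread holds one lock and waits for the other.
    // The fix suggested above removes the nesting on one path (move
    // triggerBlockReport out of the DN lock) so both paths use one consistent order.
  }
}
{code}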



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9137) DeadLock between DataNode#refreshVolumes and BPOfferService#registrationSucceeded

2015-10-07 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu updated HDFS-9137:
-
      Resolution: Fixed
    Hadoop Flags: Reviewed
   Fix Version/s: 2.8.0
Target Version/s: 2.8.0
          Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2.

> DeadLock between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded 
> --
>
> Key: HDFS-9137
> URL: https://issues.apache.org/jira/browse/HDFS-9137
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.7.1
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Fix For: 2.8.0
>
> Attachments: HDFS-9137.00.patch, 
> HDFS-9137.01-WithPreservingRootExceptions.patch, HDFSS-9137.02.patch
>
>
> I can see these code flows between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded could cause a deadlock.
> In practice the situation may be rare, since it requires a user to call 
> refreshVolumes at the same time the DN is registering with the NN, but it 
> seems the issue can happen.
>  Reason for the deadlock:
>   1) refreshVolumes is called with the DN lock held and, at the end, it also 
> triggers a block report. In the block report call, 
> BPServiceActor#triggerBlockReport calls toString on bpos, which takes the 
> read lock on bpos.
>  DN lock, then bpos lock.
> 2) BPOfferService#registrationSucceeded takes the write lock on bpos and 
> calls dn.bpRegistrationSucceeded, which is again a synchronized call on the 
> DN.
> bpos lock, then DN lock.
> So this can clearly create a deadlock.
> I think a simple fix could be to move the triggerBlockReport call outside the 
> DN lock; I feel that call may not really be needed inside the DN lock.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-9212) Are there any official performance tests or reports using WebHDFS

2015-10-07 Thread Yi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Liu resolved HDFS-9212.
--
Resolution: Invalid

Send email to u...@hadoop.apache.org

> Are there any official performance tests or reports using WebHDFS
> -
>
> Key: HDFS-9212
> URL: https://issues.apache.org/jira/browse/HDFS-9212
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: webhdfs
>Reporter: Jingfei Hu
>Priority: Minor
>
> I'd like to know if there are any performance tests or reports when reading 
> and writing files using WebHDFS rest api. Or any design-time numbers?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9137) DeadLock between DataNode#refreshVolumes and BPOfferService#registrationSucceeded

2015-10-06 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946085#comment-14946085
 ] 

Yi Liu commented on HDFS-9137:
--

I think it's OK to do the fix using this approach and update 
{{BPOS#toString()}} in a follow-on.
The new patch looks good to me, +1 pending Jenkins. Thanks Uma, Vinay, Colin. 
What do you think, [~vinayrpet], [~cmccabe]?

> DeadLock between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded 
> --
>
> Key: HDFS-9137
> URL: https://issues.apache.org/jira/browse/HDFS-9137
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.7.1
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-9137.00.patch, 
> HDFS-9137.01-WithPreservingRootExceptions.patch
>
>
> I can see these code flows between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded could cause a deadlock.
> In practice the situation may be rare, since it requires a user to call 
> refreshVolumes at the same time the DN is registering with the NN, but it 
> seems the issue can happen.
>  Reason for the deadlock:
>   1) refreshVolumes is called with the DN lock held and, at the end, it also 
> triggers a block report. In the block report call, 
> BPServiceActor#triggerBlockReport calls toString on bpos, which takes the 
> read lock on bpos.
>  DN lock, then bpos lock.
> 2) BPOfferService#registrationSucceeded takes the write lock on bpos and 
> calls dn.bpRegistrationSucceeded, which is again a synchronized call on the 
> DN.
> bpos lock, then DN lock.
> So this can clearly create a deadlock.
> I think a simple fix could be to move the triggerBlockReport call outside the 
> DN lock; I feel that call may not really be needed inside the DN lock.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9182) Cleanup the findbugs and other issues after HDFS EC merged to trunk.

2015-10-06 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946080#comment-14946080
 ] 

Yi Liu commented on HDFS-9182:
--

The patch looks good to me too. Thanks [~umamaheswararao] and [~jingzhao].
Since the patch is straightforward, and it will also need a rebase whenever any 
committer lands a new patch, how about running a local test-patch in the 
meantime and attaching the local report? If Jenkins still fails, we can refer 
to the local report and rebase/commit directly, provided the local test-patch 
succeeds.

> Cleanup the findbugs and other issues after HDFS EC merged to trunk.
> 
>
> Key: HDFS-9182
> URL: https://issues.apache.org/jira/browse/HDFS-9182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Yi Liu
>Assignee: Uma Maheswara Rao G
>Priority: Critical
> Attachments: HDFSS-9182.00.patch, HDFSS-9182.01.patch
>
>
> https://builds.apache.org/job/PreCommit-HDFS-Build/12754/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs-client.html
> https://builds.apache.org/job/PreCommit-HDFS-Build/12754/artifact/patchprocess/patchReleaseAuditProblems.txt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9180) Update excluded DataNodes in DFSStripedOutputStream based on failures in data streamers

2015-10-06 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944873#comment-14944873
 ] 

Yi Liu commented on HDFS-9180:
--

+1, thanks [~jingzhao], also thanks Walter for the review.

> Update excluded DataNodes in DFSStripedOutputStream based on failures in data 
> streamers
> ---
>
> Key: HDFS-9180
> URL: https://issues.apache.org/jira/browse/HDFS-9180
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: erasure-coding
>Affects Versions: 3.0.0
>Reporter: Jing Zhao
>Assignee: Jing Zhao
> Attachments: HDFS-9180.000.patch, HDFS-9180.001.patch, 
> HDFS-9180.002.patch, HDFS-9180.003.patch
>
>
> This is a TODO in HDFS-9040: based on the failures all the striped data 
> streamers hit, the DFSStripedOutputStream should keep a record of all the 
> DataNodes that should be excluded.
> This jira will also fix several bugs in the DFSStripedOutputStream. Will 
> provide more details in the comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9137) DeadLock between DataNode#refreshVolumes and BPOfferService#registrationSucceeded

2015-10-03 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942520#comment-14942520
 ] 

Yi Liu commented on HDFS-9137:
--

Thanks Uma for the patch. I agree we don't need {{DataNode}} synchronization 
for DataNode#triggerBlockReport, so the current fix looks good overall. My 
comment is:

{code}
+  IOException exception = null;
   try {
 LOG.info("Reconfiguring " + property + " to " + newVal);
 this.refreshVolumes(newVal);
   } catch (IOException e) {
-throw new ReconfigurationException(property, newVal,
-getConf().get(property), e);
+exception = e;
+  } finally {
+// Send a full block report to let NN acknowledge the volume changes.
+try {
+  triggerBlockReport(
+  new BlockReportOptions.Factory().setIncremental(false).build());
+} catch (IOException e) {
+  LOG.warn("Exception while sending the block report after refresh"
+  + " volumes " + property + " to " + newVal, e);
+} finally {
+  if (exception != null) {
+throw new ReconfigurationException(property, newVal,
+getConf().get(property), exception);
+  }
+}
{code}

I think for the IOException from {{refreshVolumes}} we just need to wrap it in 
a ReconfigurationException and throw it, as the original code did. For the 
exception from {{triggerBlockReport}}, we also need to wrap it and throw.  I 
see the patch just logs a warning for the IOException from 
{{triggerBlockReport}} and ignores it; any reason to do this?  If we do want 
that behavior, we don't need to save the IOException from refreshVolumes and 
rethrow it in the finally clause around triggerBlockReport; we can just throw 
it directly, since the {{finally}} clause is executed before the exception 
propagates.
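
A rough sketch of that first alternative, reusing the identifiers from the 
patch fragment above and assuming it sits in the same DataNode reconfiguration 
method (simplified, hypothetical signature; not the actual patch):

{code}
void reconfigureDataDirs(String property, String newVal) throws ReconfigurationException {
  try {
    LOG.info("Reconfiguring " + property + " to " + newVal);
    this.refreshVolumes(newVal);
  } catch (IOException e) {
    // Wrap and throw right away, as the original code did; the block report
    // below is then skipped, matching the pre-patch behavior on failure.
    throw new ReconfigurationException(property, newVal, getConf().get(property), e);
  }
  try {
    // Send a full block report so the NN acknowledges the volume changes.
    triggerBlockReport(
        new BlockReportOptions.Factory().setIncremental(false).build());
  } catch (IOException e) {
    // Wrap and throw this failure as well instead of only logging a warning.
    throw new ReconfigurationException(property, newVal, getConf().get(property), e);
  }
}
{code}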

> DeadLock between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded 
> --
>
> Key: HDFS-9137
> URL: https://issues.apache.org/jira/browse/HDFS-9137
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.0.0, 2.7.1
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
> Attachments: HDFS-9137.00.patch
>
>
> I can see these code flows between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded could cause a deadlock.
> In practice the situation may be rare, since it requires a user to call 
> refreshVolumes at the same time the DN is registering with the NN, but it 
> seems the issue can happen.
>  Reason for the deadlock:
>   1) refreshVolumes is called with the DN lock held and, at the end, it also 
> triggers a block report. In the block report call, 
> BPServiceActor#triggerBlockReport calls toString on bpos, which takes the 
> read lock on bpos.
>  DN lock, then bpos lock.
> 2) BPOfferService#registrationSucceeded takes the write lock on bpos and 
> calls dn.bpRegistrationSucceeded, which is again a synchronized call on the 
> DN.
> bpos lock, then DN lock.
> So this can clearly create a deadlock.
> I think a simple fix could be to move the triggerBlockReport call outside the 
> DN lock; I feel that call may not really be needed inside the DN lock.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9180) Update excluded DataNodes in DFSStripedOutputStream based on failures in data streamers

2015-10-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942000#comment-14942000
 ] 

Yi Liu commented on HDFS-9180:
--

{quote}
I kicked the Jenkins again. If we all agree the fixes here are valid maybe we 
can consider commit them first and keep fixing remaining issues in separate 
jiras.
{quote}

+1, thanks Jing.

> Update excluded DataNodes in DFSStripedOutputStream based on failures in data 
> streamers
> ---
>
> Key: HDFS-9180
> URL: https://issues.apache.org/jira/browse/HDFS-9180
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: erasure-coding
>Affects Versions: 3.0.0
>Reporter: Jing Zhao
>Assignee: Jing Zhao
> Attachments: HDFS-9180.000.patch, HDFS-9180.001.patch, 
> HDFS-9180.002.patch
>
>
> This is a TODO in HDFS-9040: based on the failures all the striped data 
> streamers hit, the DFSStripedOutputStream should keep a record of all the 
> DataNodes that should be excluded.
> This jira will also fix several bugs in the DFSStripedOutputStream. Will 
> provide more details in the comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942004#comment-14942004
 ] 

Yi Liu edited comment on HDFS-9053 at 10/3/15 12:49 AM:


Thanks [~szetszwo], good comment; I have also considered this carefully. I want 
to convince you to allow me to use only B-Tree here:
# Take the case you mentioned, where the #children is small and < 4K. *1)* If 
children < 2K, the B-Tree contains only a root. As we counted before, the 
increased overhead is only 44 bytes, which is really small for a directory; a 
contiguous block takes 80 bytes of memory (details below), so we only add about 
half a contiguous block per directory in the NN. *2)* If children is > 2K and < 
4K (use 4K as the example), the B-Tree contains at most 1 root node and 3 leaf 
nodes. One leaf node adds about (40 bytes + 16 bytes elements array overhead) = 
56 bytes, the root node is (40 bytes + 16 bytes elements array overhead + 16 
bytes children overhead + 3 children * 8) = 96 bytes, and the b-tree object 
itself is 40 bytes; we then subtract the ArrayList (40 bytes + 16 bytes 
elements array overhead) = 56 bytes. So we add at most 56 * 3 + 96 + 40 - 56 = 
248 bytes of overhead, while an ArrayList of 4K references to INode needs more 
than 4K * 8 = 32K of memory, so the memory increase is only about *0.75%* (a 
quick arithmetic check appears below).
{noformat}
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguous object 
internals:
 OFFSET  SIZE                           TYPE  DESCRIPTION                  VALUE
      0    16                                 (object header)              N/A
     16     8                           long  Block.blockId                N/A
     24     8                           long  Block.numBytes               N/A
     32     8                           long  Block.generationStamp        N/A
     40     8                           long  BlockInfo.bcId               N/A
     48     2                          short  BlockInfo.replication        N/A
     50     6                                 (alignment/padding gap)      N/A
     56     8                  LinkedElement  BlockInfo.nextLinkedElement  N/A
     64     8                       Object[]  BlockInfo.triplets           N/A
     72     8  BlockUnderConstructionFeature  BlockInfo.uc                 N/A
Instance size: 80 bytes (estimated, the sample instance is not available)
{noformat}
# One advantage of B-Tree over ArrayList even for a small number of children is 
that the B-Tree can shrink. If the children of a directory decrease from 4K to 
less than 2K, about 2K * 8 = 16K of memory is wasted if we use ArrayList. 
# On the other hand, if we switch between ArrayList and B-Tree, we may need to 
write a class wrapping the two data structures, which still costs 16 bytes of 
object overhead + an 8-byte reference = 24 bytes. 

What do you think? Thanks, Nicholas.
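
A quick arithmetic check of the numbers in point 1, as a small stand-alone 
snippet (byte counts taken from the estimates above):

{code}
public class BTreeOverheadCheck {
  public static void main(String[] args) {
    int leafNodes = 3 * 56;        // 3 leaf nodes at 56 bytes each
    int rootNode  = 96;            // root node
    int bTreeObj  = 40;            // the b-tree object itself
    int arrayList = 56;            // ArrayList overhead no longer paid
    int extra = leafNodes + rootNode + bTreeObj - arrayList; // 248 bytes
    int childRefs = 4 * 1024 * 8;  // 4K children * 8-byte references = 32K
    System.out.printf("extra = %d bytes, ratio = %.2f%%%n",
        extra, 100.0 * extra / childRefs); // prints ~0.76%, i.e. the roughly 0.75% quoted above
  }
}
{code}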


was (Author: hitliuyi):
Thanks [~szetszwo], good comment, I ever considered it carefully too. I want to 
convince you to allow me only use B-Tree here:
# Use the case you said, the #children is small and < 4K. *1)* If children is < 
2K, then B-Tree only contains a root. As we counted before, the increased 
overhead is only 44 bytes which is really very small for a directory, a 
continuous block is 80 bytes memory (detail below), so we only increase about 
1/2 continuous block for a directory in NN. *2)* If the children is > 2K and < 
4K, here we use 4K as example, the B-Tree at most contains 3 branches: 1 root 
node, 3 leaf nodes. One leaf node increase about (40 bytes + 16 bytes elements 
array overhead) = 56 bytes, and 1 root node is (40 bytes + 16 bytes elements 
array overhead + 16 bytes children overhead + 3 children * 8) = 96 bytes, the 
b-tree itself is 40 bytes, and we need to subtract the ArrayList (40 bytes + 16 
bytes elements array overhead) = 56 bytes, so we at most increase 56 * 3 + 96 + 
40 - 56 = 248 bytes overhead, but ArrayList of 4K references to INode needs 
more than 4K * 8 = 32K memory, then we can get that the increased memory is 
only *0.75%*
{noformat}
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguous object 
internals:
 OFFSET  SIZE  TYPE DESCRIPTIONVALUE
  016   (object header)N/A
 16 8  long Block.blockId  N/A
 24 8  long Block.numBytes N/A
 32 8  long Block.generationStamp  N/A
 40 8  long BlockInfo.bcId N/A
 48 2 short BlockInfo.replication  N/A
 50 6   (alignment/padding gap)N/A
 56 8 LinkedElement BlockInfo.nextLinkedElementN/A
 64 8  
