[jira] [Comment Edited] (HADOOP-12862) LDAP Group Mapping over SSL can not specify trust store

2018-03-29 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419252#comment-16419252
 ] 

Zhe Zhang edited comment on HADOOP-12862 at 3/29/18 4:03 PM:
-

Thanks [~shv]. v9 patch LGTM. +1


was (Author: zhz):
Thanks [~shv]. +1 on v9 patch. 

> LDAP Group Mapping over SSL can not specify trust store
> ---
>
> Key: HADOOP-12862
> URL: https://issues.apache.org/jira/browse/HADOOP-12862
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: release-blocker
> Attachments: HADOOP-12862.001.patch, HADOOP-12862.002.patch, 
> HADOOP-12862.003.patch, HADOOP-12862.004.patch, HADOOP-12862.005.patch, 
> HADOOP-12862.006.patch, HADOOP-12862.007.patch, HADOOP-12862.008.patch, 
> HADOOP-12862.009.patch
>
>
> In a secure environment, SSL is used to encrypt LDAP requests for group 
> mapping resolution.
> We (+[~yoderme], +[~tgrayson]) have found that its implementation is strange.
> For background: the Hadoop NameNode, as an LDAP client, talks to an LDAP 
> server to resolve the group mapping of a user. In the case of LDAP over SSL, 
> the typical scenario is one-way authentication (the client verifies that the 
> server's certificate is genuine) by storing the server's certificate in the 
> client's truststore.
> A rarer scenario is two-way authentication: in addition to the client 
> verifying the server via its truststore, the server also verifies the 
> client's certificate, which the client stores in its own keystore.
> However, the current implementation of LDAP over SSL does not seem correct: 
> it only configures a keystore but no truststore, so the LDAP server can 
> verify Hadoop's certificate, but Hadoop may not be able to verify the LDAP 
> server's certificate.
> I think there should be an extra pair of properties to specify the 
> truststore and its password for the LDAP server, and those should be used to 
> set the system properties 
> {{javax.net.ssl.trustStore}}/{{javax.net.ssl.trustStorePassword}}.
> I am a security layman so my words may be imprecise, but I hope this makes 
> sense.
> Oracle's SSL LDAP documentation: 
> http://docs.oracle.com/javase/jndi/tutorial/ldap/security/ssl.html
> JSSE reference guide: 
> http://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/JSSERefGuide.html
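
A minimal sketch of the proposed direction (the config keys below are 
hypothetical; the actual patch defines its own property names):

{code}
// Sketch only: hypothetical config keys, not the ones introduced by the patch.
static void configureLdapTrustStore(org.apache.hadoop.conf.Configuration conf) {
  String trustStore = conf.get("hadoop.security.group.mapping.ldap.ssl.truststore");
  String password =
      conf.get("hadoop.security.group.mapping.ldap.ssl.truststore.password");
  if (trustStore != null) {
    // JSSE reads these system properties when the LDAP client negotiates SSL/TLS.
    System.setProperty("javax.net.ssl.trustStore", trustStore);
    System.setProperty("javax.net.ssl.trustStorePassword", password);
  }
}
{code}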



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-12862) LDAP Group Mapping over SSL can not specify trust store

2018-03-29 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419252#comment-16419252
 ] 

Zhe Zhang commented on HADOOP-12862:


Thanks [~shv]. +1 on v9 patch. 

> LDAP Group Mapping over SSL can not specify trust store
> ---
>
> Key: HADOOP-12862
> URL: https://issues.apache.org/jira/browse/HADOOP-12862
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: release-blocker
> Attachments: HADOOP-12862.001.patch, HADOOP-12862.002.patch, 
> HADOOP-12862.003.patch, HADOOP-12862.004.patch, HADOOP-12862.005.patch, 
> HADOOP-12862.006.patch, HADOOP-12862.007.patch, HADOOP-12862.008.patch, 
> HADOOP-12862.009.patch
>
>
> In a secure environment, SSL is used to encrypt LDAP requests for group 
> mapping resolution.
> We (+[~yoderme], +[~tgrayson]) have found that its implementation is strange.
> For background: the Hadoop NameNode, as an LDAP client, talks to an LDAP 
> server to resolve the group mapping of a user. In the case of LDAP over SSL, 
> the typical scenario is one-way authentication (the client verifies that the 
> server's certificate is genuine) by storing the server's certificate in the 
> client's truststore.
> A rarer scenario is two-way authentication: in addition to the client 
> verifying the server via its truststore, the server also verifies the 
> client's certificate, which the client stores in its own keystore.
> However, the current implementation of LDAP over SSL does not seem correct: 
> it only configures a keystore but no truststore, so the LDAP server can 
> verify Hadoop's certificate, but Hadoop may not be able to verify the LDAP 
> server's certificate.
> I think there should be an extra pair of properties to specify the 
> truststore and its password for the LDAP server, and those should be used to 
> set the system properties 
> {{javax.net.ssl.trustStore}}/{{javax.net.ssl.trustStorePassword}}.
> I am a security layman so my words may be imprecise, but I hope this makes 
> sense.
> Oracle's SSL LDAP documentation: 
> http://docs.oracle.com/javase/jndi/tutorial/ldap/security/ssl.html
> JSSE reference guide: 
> http://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/JSSERefGuide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15322) LDAPGroupMapping search tree base improvement

2018-03-17 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-15322:
---
Fix Version/s: (was: 2.7.6)

> LDAPGroupMapping search tree base improvement
> -
>
> Key: HADOOP-15322
> URL: https://issues.apache.org/jira/browse/HADOOP-15322
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: common, security
>Affects Versions: 2.7.4
>Reporter: Ganesh
>Priority: Major
>
> Currently the same LDAP base is used for searching both posixAccount and 
> posixGroup entries. This request is to allow a separate search base for each 
> container (i.e., the posixAccount container and the posixGroup container).
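
As a sketch of what separate bases could look like at the JNDI level (the base 
DNs and filters below are illustrative only, not an actual patch):

{code}
// Sketch only: today a single shared base serves both searches.
String userBase  = "ou=people,dc=example,dc=com";   // posixAccount container
String groupBase = "ou=groups,dc=example,dc=com";   // posixGroup container

// ctx is a javax.naming.directory.DirContext already bound to the LDAP server.
NamingEnumeration<SearchResult> users = ctx.search(userBase,
    "(&(objectClass=posixAccount)(uid={0}))", new Object[] {user}, searchControls);
NamingEnumeration<SearchResult> groups = ctx.search(groupBase,
    "(&(objectClass=posixGroup)(memberUid={0}))", new Object[] {user}, searchControls);
{code}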



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15322) LDAPGroupMapping search tree base improvement

2018-03-17 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-15322:
---
Component/s: security

> LDAPGroupMapping search tree base improvement
> -
>
> Key: HADOOP-15322
> URL: https://issues.apache.org/jira/browse/HADOOP-15322
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: common, security
>Affects Versions: 2.7.4
>Reporter: Ganesh
>Priority: Major
>
> Currently the same LDAP base is used for searching both posixAccount and 
> posixGroup entries. This request is to allow a separate search base for each 
> container (i.e., the posixAccount container and the posixGroup container).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14742) Document multi-URI replication Inode for ViewFS

2017-11-29 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271339#comment-16271339
 ] 

Zhe Zhang commented on HADOOP-14742:


Looks like the Description of HADOOP-12077 is a good starting point?

> Document multi-URI replication Inode for ViewFS
> ---
>
> Key: HADOOP-14742
> URL: https://issues.apache.org/jira/browse/HADOOP-14742
> Project: Hadoop Common
>  Issue Type: Task
>  Components: documentation, viewfs
>Affects Versions: 3.0.0-beta1
>Reporter: Chris Douglas
>
> HADOOP-12077 added client-side "replication" capabilities to ViewFS. Its 
> semantics and configuration should be documented.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14732) ProtobufRpcEngine should use Time.monotonicNow to measure durations

2017-08-29 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14732:
---
Fix Version/s: (was: 3.0.0-beta1)
   (was: 2.9.0)

> ProtobufRpcEngine should use Time.monotonicNow to measure durations
> ---
>
> Key: HADOOP-14732
> URL: https://issues.apache.org/jira/browse/HADOOP-14732
> Project: Hadoop Common
>  Issue Type: Sub-task
>Reporter: Hanisha Koneru
>Assignee: Hanisha Koneru
> Attachments: HADOOP-14732.001.patch
>
>
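
For context, not part of the patch: {{Time.now()}} follows the wall clock, 
which can jump under NTP adjustments, so durations measured with it can even 
go negative; {{Time.monotonicNow()}} (backed by {{System.nanoTime()}}) cannot. 
A minimal sketch:

{code}
// Sketch: measure a duration with the monotonic clock.
long start = org.apache.hadoop.util.Time.monotonicNow();
processRpcRequest();   // hypothetical call being timed
long elapsedMs = org.apache.hadoop.util.Time.monotonicNow() - start;
{code}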




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()

2017-07-26 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14214:
---
Fix Version/s: 2.7.4

> DomainSocketWatcher::add()/delete() should not self interrupt while looping 
> await()
> ---
>
> Key: HADOOP-14214
> URL: https://issues.apache.org/jira/browse/HADOOP-14214
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Mingliang Liu
>Assignee: Mingliang Liu
>Priority: Critical
> Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>
> Attachments: HADOOP-14214.000.patch
>
>
> Our Hive team found a TPCDS job whose queries running on LLAP seemed to be 
> stuck. Dozens of threads were waiting for the 
> {{DfsClientShmManager::lock}}, as shown in the following jstack:
> {code}
> Thread 251 (IO-Elevator-Thread-5):
>   State: WAITING
>   Blocked count: 3871
>   Wtaited count: 4565
>   Waiting on 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@16ead198
>   Stack:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
> 
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
> 
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
> 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181)
> 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118)
> org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1478)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1441)
> org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
> 
> org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
> 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166)
> 
> org.apache.hadoop.hive.llap.io.metadata.OrcStripeMetadata.<init>(OrcStripeMetadata.java:64)
> 
> org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.readStripesMetadata(OrcEncodedDataReader.java:622)
> {code}
> The thread that is expected to signal those threads is calling the 
> {{DomainSocketWatcher::add()}} method, but it gets stuck there handling 
> InterruptedException in an infinite loop. The jstack looks like:
> {code}
> Thread 44417 (TezTR-257387_2840_12_10_52_0):
>   State: RUNNABLE
>   Blocked count: 3
>   Wtaited count: 5
>   Stack:
> java.lang.Throwable.fillInStackTrace(Native Method)
> java.lang.Throwable.fillInStackTrace(Throwable.java:783)
> java.lang.Throwable.<init>(Throwable.java:250)
> java.lang.Exception.<init>(Exception.java:54)
> java.lang.InterruptedException.<init>(InterruptedException.java:57)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2034)
> 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325)
> 
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266)
> 
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
> 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181)
> 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118)

[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()

2017-07-26 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101756#comment-16101756
 ] 

Zhe Zhang commented on HADOOP-14214:


Thanks [~liuml07] for the fix. Indeed a major bug. I just backported to 
branch-2.7.
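
For readers hitting this: the failure mode is the classic 
self-interrupt-while-looping-await anti-pattern described in the quoted report 
below. A paraphrased sketch (not the actual Hadoop source) of the bug and the 
fix direction:

{code}
// Anti-pattern (sketch): re-interrupting ourselves re-arms the interrupt
// flag, so the very next await() throws again and the loop busy-spins.
while (!done) {
  try {
    cond.await();                          // cond is a j.u.c.locks.Condition
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
  }
}

// Fix direction (sketch): wait uninterruptibly, then restore the interrupt
// status once for callers.
boolean interrupted = Thread.interrupted();  // clear and remember
while (!done) {
  cond.awaitUninterruptibly();
}
if (interrupted) {
  Thread.currentThread().interrupt();
}
{code}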

> DomainSocketWatcher::add()/delete() should not self interrupt while looping 
> await()
> ---
>
> Key: HADOOP-14214
> URL: https://issues.apache.org/jira/browse/HADOOP-14214
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Mingliang Liu
>Assignee: Mingliang Liu
>Priority: Critical
> Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>
> Attachments: HADOOP-14214.000.patch
>
>
> Our Hive team found a TPCDS job whose queries running on LLAP seemed to be 
> stuck. Dozens of threads were waiting for the 
> {{DfsClientShmManager::lock}}, as shown in the following jstack:
> {code}
> Thread 251 (IO-Elevator-Thread-5):
>   State: WAITING
>   Blocked count: 3871
>   Wtaited count: 4565
>   Waiting on 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@16ead198
>   Stack:
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
> 
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
> 
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
> 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181)
> 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118)
> org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1478)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1441)
> org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
> 
> org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
> 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166)
> 
> org.apache.hadoop.hive.llap.io.metadata.OrcStripeMetadata.<init>(OrcStripeMetadata.java:64)
> 
> org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.readStripesMetadata(OrcEncodedDataReader.java:622)
> {code}
> The thread that is expected to signal those threads is calling the 
> {{DomainSocketWatcher::add()}} method, but it gets stuck there handling 
> InterruptedException in an infinite loop. The jstack looks like:
> {code}
> Thread 44417 (TezTR-257387_2840_12_10_52_0):
>   State: RUNNABLE
>   Blocked count: 3
>   Wtaited count: 5
>   Stack:
> java.lang.Throwable.fillInStackTrace(Native Method)
> java.lang.Throwable.fillInStackTrace(Throwable.java:783)
> java.lang.Throwable.<init>(Throwable.java:250)
> java.lang.Exception.<init>(Exception.java:54)
> java.lang.InterruptedException.<init>(InterruptedException.java:57)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2034)
> 
> org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325)
> 
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266)
> 
> org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
> 
> org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
> 
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
> 
> 

[jira] [Updated] (HADOOP-10829) Iteration on CredentialProviderFactory.serviceLoader is thread-unsafe

2017-07-25 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10829:
---
Fix Version/s: 2.8.3
   2.7.4

Thanks for the fix [~benoyantony]. Given this is a security bug fix, I just 
backported to branch-2.8 and branch-2.7.

> Iteration on CredentialProviderFactory.serviceLoader  is thread-unsafe
> --
>
> Key: HADOOP-10829
> URL: https://issues.apache.org/jira/browse/HADOOP-10829
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.6.0
>Reporter: Benoy Antony
>Assignee: Benoy Antony
>  Labels: BB2015-05-TBR
> Fix For: 2.9.0, 2.7.4, 3.0.0-beta1, 2.8.3
>
> Attachments: HADOOP-10829.003.patch, HADOOP-10829.patch, 
> HADOOP-10829.patch
>
>
> CredentialProviderFactory uses _ServiceLoader_ framework to load 
> _CredentialProviderFactory_
> {code}
>   private static final ServiceLoader<CredentialProviderFactory> serviceLoader =
>   ServiceLoader.load(CredentialProviderFactory.class);
> {code}
> The _ServiceLoader_ framework does lazy initialization of services, which 
> makes it thread-unsafe. If accessed from multiple threads, it is better to 
> synchronize the access.
> Similar synchronization has been done while loading compression codec 
> providers via HADOOP-8406.
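
One way to picture the fix (a sketch assuming it mirrors the HADOOP-8406 
codec-loading approach; see the attached patches for the real change): drain 
the lazy loader once, under a lock, so iteration never happens concurrently.

{code}
// Sketch only: eagerly materialize the providers in a static initializer.
private static final List<CredentialProviderFactory> providers;
static {
  List<CredentialProviderFactory> list = new ArrayList<>();
  synchronized (CredentialProviderFactory.class) {
    for (CredentialProviderFactory factory :
        ServiceLoader.load(CredentialProviderFactory.class)) {
      list.add(factory);   // lazy instantiation happens here, single-threaded
    }
  }
  providers = Collections.unmodifiableList(list);
}
{code}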



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14599) RPC queue time metrics omit timed out clients

2017-06-29 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068571#comment-16068571
 ] 

Zhe Zhang commented on HADOOP-14599:


Thanks for working on this [~aramesh2]. Could you re-upload the patch and name 
it HADOOP-14599..patch? Otherwise Jenkins won't pick it up. See 
https://wiki.apache.org/hadoop/HowToContribute#Naming_your_patch

> RPC queue time metrics omit timed out clients
> -
>
> Key: HADOOP-14599
> URL: https://issues.apache.org/jira/browse/HADOOP-14599
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: metrics, rpc-server
>Affects Versions: 2.7.0
>Reporter: Ashwin Ramesh
>Assignee: Ashwin Ramesh
> Attachments: HADOOP_14599.patch
>
>
> RPC average queue time metrics will now update even if the client who made 
> the call timed out while the call was in the call queue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps

2017-06-26 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063791#comment-16063791
 ] 

Zhe Zhang commented on HADOOP-14502:


Sorry, my bad.

> Confusion/name conflict between NameNodeActivity#BlockReportNumOps and 
> RpcDetailedActivity#BlockReportNumOps
> 
>
> Key: HADOOP-14502
> URL: https://issues.apache.org/jira/browse/HADOOP-14502
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
>  Labels: Incompatible
> Fix For: 3.0.0-alpha4
>
> Attachments: HADOOP-14502.000.patch, HADOOP-14502.001.patch, 
> HADOOP-14502.002.patch
>
>
> Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the 
> {{RpcDetailedActivity}} context and those emitted under the 
> {{NameNodeActivity}} context are actually reporting different things despite 
> having the same name. {{NameNodeActivity}} reports the count/time of _per 
> storage_ block reports, whereas {{RpcDetailedActivity}} reports the 
> count/time of _per datanode_ block reports. This makes for a confusing 
> experience with two metrics having the same name reporting different values. 
> We already have the {{StorageBlockReportsOps}} metric under 
> {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} 
> metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other 
> suggestions about how to address this as well. The 3.0 release seems a good 
> time to make this incompatible change.
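
A sketch of the {{MutableRate}} direction (metric names here are illustrative; 
the attached patches define the real ones):

{code}
// Sketch only: a metrics2 source exposing the per-storage block report as a
// MutableRate, which emits *NumOps and *AvgTime automatically.
@Metrics(about = "NameNode activity", context = "dfs")
class NameNodeActivitySketch {
  @Metric("Duration of per-storage block report processing")
  MutableRate storageBlockReport;

  void onStorageBlockReport(long elapsedMs) {
    storageBlockReport.add(elapsedMs);  // updates NumOps and AvgTime together
  }
}
{code}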



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps

2017-06-21 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14502:
---
   Resolution: Fixed
Fix Version/s: 3.0.0-alpha4
   Status: Resolved  (was: Patch Available)

> Confusion/name conflict between NameNodeActivity#BlockReportNumOps and 
> RpcDetailedActivity#BlockReportNumOps
> 
>
> Key: HADOOP-14502
> URL: https://issues.apache.org/jira/browse/HADOOP-14502
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
>  Labels: Incompatible
> Fix For: 3.0.0-alpha4
>
> Attachments: HADOOP-14502.000.patch, HADOOP-14502.001.patch, 
> HADOOP-14502.002.patch
>
>
> Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the 
> {{RpcDetailedActivity}} context and those emitted under the 
> {{NameNodeActivity}} context are actually reporting different things despite 
> having the same name. {{NameNodeActivity}} reports the count/time of _per 
> storage_ block reports, whereas {{RpcDetailedActivity}} reports the 
> count/time of _per datanode_ block reports. This makes for a confusing 
> experience with two metrics having the same name reporting different values. 
> We already have the {{StorageBlockReportsOps}} metric under 
> {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} 
> metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other 
> suggestions about how to address this as well. The 3.0 release seems a good 
> time to make this incompatible change.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps

2017-06-21 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14502:
---
Hadoop Flags: Incompatible change, Reviewed  (was: Incompatible change)

Thanks Erik! +1 on v2 patch as well. Tested with {{MiniHadoopClusterManager}} 
and it shows the desired behavior.
{code}
  }, {
"name" : "Hadoop:service=NameNode,name=NameNodeActivity",
"modelerType" : "NameNodeActivity",
"tag.ProcessName" : "NameNode",
"tag.SessionId" : null,
"tag.Context" : "dfs",
"tag.Hostname" : "zezhang-mn1",
"CreateFileOps" : 2,
"FilesCreated" : 12,
"FilesAppended" : 0,
"GetBlockLocations" : 0,
"FilesRenamed" : 0,
"FilesTruncated" : 0,
"GetListingOps" : 1,
"DeleteFileOps" : 0,
"FilesDeleted" : 0,
"FileInfoOps" : 6,
"AddBlockOps" : 2,
"GetAdditionalDatanodeOps" : 0,
"CreateSymlinkOps" : 0,
"GetLinkTargetOps" : 0,
"FilesInGetListingOps" : 0,
"AllowSnapshotOps" : 0,
"DisallowSnapshotOps" : 0,
"CreateSnapshotOps" : 0,
"DeleteSnapshotOps" : 0,
"RenameSnapshotOps" : 0,
"ListSnapshottableDirOps" : 0,
"SnapshotDiffReportOps" : 0,
"BlockReceivedAndDeletedOps" : 2,
"BlockOpsQueued" : 1,
"BlockOpsBatched" : 0,
"TransactionsNumOps" : 24,
"TransactionsAvgTime" : 1.7083,
"SyncsNumOps" : 14,
"SyncsAvgTime" : 0.2857142857142857,
"TransactionsBatchedInSync" : 10,
"StorageBlockReportNumOps" : 2,
"StorageBlockReportAvgTime" : 3.5,
"CacheReportNumOps" : 0,
"CacheReportAvgTime" : 0.0,
"GenerateEDEKTimeNumOps" : 0,
"GenerateEDEKTimeAvgTime" : 0.0,
"WarmUpEDEKTimeNumOps" : 0,
"WarmUpEDEKTimeAvgTime" : 0.0,
"ResourceCheckTimeNumOps" : 8,
"ResourceCheckTimeAvgTime" : 0.0,
"SafeModeTime" : 1,
"FsImageLoadTime" : 76,
"GetEditNumOps" : 0,
"GetEditAvgTime" : 0.0,
"GetImageNumOps" : 0,
"GetImageAvgTime" : 0.0,
"PutImageNumOps" : 0,
"PutImageAvgTime" : 0.0,
"TotalFileOps" : 11
  },
{code}

I'm committing to trunk soon. Let's write a short release note?

> Confusion/name conflict between NameNodeActivity#BlockReportNumOps and 
> RpcDetailedActivity#BlockReportNumOps
> 
>
> Key: HADOOP-14502
> URL: https://issues.apache.org/jira/browse/HADOOP-14502
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
>  Labels: Incompatible
> Attachments: HADOOP-14502.000.patch, HADOOP-14502.001.patch, 
> HADOOP-14502.002.patch
>
>
> Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the 
> {{RpcDetailedActivity}} context and those emitted under the 
> {{NameNodeActivity}} context are actually reporting different things despite 
> having the same name. {{NameNodeActivity}} reports the count/time of _per 
> storage_ block reports, whereas {{RpcDetailedActivity}} reports the 
> count/time of _per datanode_ block reports. This makes for a confusing 
> experience with two metrics having the same name reporting different values. 
> We already have the {{StorageBlockReportsOps}} metric under 
> {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} 
> metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other 
> suggestions about how to address this as well. The 3.0 release seems a good 
> time to make this incompatible change.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps

2017-06-20 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14502:
---
Hadoop Flags: Incompatible change

> Confusion/name conflict between NameNodeActivity#BlockReportNumOps and 
> RpcDetailedActivity#BlockReportNumOps
> 
>
> Key: HADOOP-14502
> URL: https://issues.apache.org/jira/browse/HADOOP-14502
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
>  Labels: Incompatible
> Attachments: HADOOP-14502.000.patch, HADOOP-14502.001.patch, 
> HADOOP-14502.002.patch
>
>
> Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the 
> {{RpcDetailedActivity}} context and those emitted under the 
> {{NameNodeActivity}} context are actually reporting different things despite 
> having the same name. {{NameNodeActivity}} reports the count/time of _per 
> storage_ block reports, whereas {{RpcDetailedActivity}} reports the 
> count/time of _per datanode_ block reports. This makes for a confusing 
> experience with two metrics having the same name reporting different values. 
> We already have the {{StorageBlockReportsOps}} metric under 
> {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} 
> metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other 
> suggestions about how to address this as well. The 3.0 release seems a good 
> time to make this incompatible change.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14440) Add metrics for connections dropped

2017-06-12 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14440:
---
Fix Version/s: 2.7.4

Thanks for the work [~ebadger]. I think this is a good improvement for 2.7.4; 
just backported to branch-2.7.

> Add metrics for connections dropped
> ---
>
> Key: HADOOP-14440
> URL: https://issues.apache.org/jira/browse/HADOOP-14440
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Eric Badger
>Assignee: Eric Badger
> Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>
> Attachments: HADOOP-14440.001.patch, HADOOP-14440.002.patch, 
> HADOOP-14440.003.patch
>
>
> Will be useful to figure out when the NN is getting overloaded with more 
> connections than it can handle.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps

2017-06-07 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041381#comment-16041381
 ] 

Zhe Zhang edited comment on HADOOP-14502 at 6/7/17 6:32 PM:


bq. Can we make StorageBlockReport a MutableRate metric and remove 
NameNodeActivity#BlockReport metric
This sounds good to me (as a 3.0 change).

Pinging [~andrew.wang] for opinion on breaking compatibility in this case.


was (Author: zhz):
bq. Can we make StorageBlockReport a MutableRate metric and remove 
NameNodeActivity#BlockReport metric
This sounds good to me (as a 3.0 change).

> Confusion/name conflict between NameNodeActivity#BlockReportNumOps and 
> RpcDetailedActivity#BlockReportNumOps
> 
>
> Key: HADOOP-14502
> URL: https://issues.apache.org/jira/browse/HADOOP-14502
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Priority: Minor
>
> Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the 
> {{RpcDetailedActivity}} context and those emitted under the 
> {{NameNodeActivity}} context are actually reporting different things despite 
> having the same name. {{NameNodeActivity}} reports the count/time of _per 
> storage_ block reports, whereas {{RpcDetailedActivity}} reports the 
> count/time of _per datanode_ block reports. This makes for a confusing 
> experience with two metrics having the same name reporting different values. 
> We already have the {{StorageBlockReportsOps}} metric under 
> {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} 
> metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other 
> suggestions about how to address this as well. The 3.0 release seems a good 
> time to make this incompatible change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps

2017-06-07 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041381#comment-16041381
 ] 

Zhe Zhang commented on HADOOP-14502:


bq. Can we make StorageBlockReport a MutableRate metric and remove 
NameNodeActivity#BlockReport metric
This sounds good to me (as a 3.0 change).

> Confusion/name conflict between NameNodeActivity#BlockReportNumOps and 
> RpcDetailedActivity#BlockReportNumOps
> 
>
> Key: HADOOP-14502
> URL: https://issues.apache.org/jira/browse/HADOOP-14502
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Priority: Minor
>
> Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the 
> {{RpcDetailedActivity}} context and those emitted under the 
> {{NameNodeActivity}} context are actually reporting different things despite 
> having the same name. {{NameNodeActivity}} reports the count/time of _per 
> storage_ block reports, whereas {{RpcDetailedActivity}} reports the 
> count/time of _per datanode_ block reports. This makes for a confusing 
> experience with two metrics having the same name reporting different values. 
> We already have the {{StorageBlockReportsOps}} metric under 
> {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} 
> metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other 
> suggestions about how to address this as well. The 3.0 release seems a good 
> time to make this incompatible change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13433) Race in UGI.reloginFromKeytab

2017-04-28 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989249#comment-15989249
 ] 

Zhe Zhang commented on HADOOP-13433:


[~daryn] Any chance you can upload the internal fix? I'll be very happy to help 
review. Thanks.
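
While waiting on that, a common mitigation (a sketch only, not the eventual 
HADOOP-13433 patch) is to serialize relogin attempts so concurrent threads 
never observe a half-refreshed Subject:

{code}
// Sketch only: funnel all relogin attempts through one lock.
private static final Object RELOGIN_LOCK = new Object();

static void ensureFreshTgt(UserGroupInformation ugi) throws IOException {
  synchronized (RELOGIN_LOCK) {
    // Existing UGI API; a no-op if the current TGT is still fresh enough.
    ugi.checkTGTAndReloginFromKeytab();
  }
}
{code}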

> Race in UGI.reloginFromKeytab
> -
>
> Key: HADOOP-13433
> URL: https://issues.apache.org/jira/browse/HADOOP-13433
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Fix For: 2.9.0, 2.7.4, 2.6.6, 2.8.1, 3.0.0-alpha3
>
> Attachments: HADOOP-13433-branch-2.7.patch, 
> HADOOP-13433-branch-2.7-v1.patch, HADOOP-13433-branch-2.7-v2.patch, 
> HADOOP-13433-branch-2.8.patch, HADOOP-13433-branch-2.8.patch, 
> HADOOP-13433-branch-2.8-v1.patch, HADOOP-13433-branch-2.patch, 
> HADOOP-13433.patch, HADOOP-13433-v1.patch, HADOOP-13433-v2.patch, 
> HADOOP-13433-v4.patch, HADOOP-13433-v5.patch, HADOOP-13433-v6.patch, 
> HBASE-13433-testcase-v3.patch
>
>
> This is a problem that has troubled us for several years. For our HBase 
> cluster, sometimes the RS will be stuck due to
> {noformat}
> 2016-06-20,03:44:12,936 INFO org.apache.hadoop.ipc.SecureClient: Exception 
> encountered while connecting to the server :
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: The ticket 
> isn't for us (35) - BAD TGS SERVER NAME)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:194)
> at 
> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:140)
> at 
> org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupSaslConnection(SecureClient.java:187)
> at 
> org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.access$700(SecureClient.java:95)
> at 
> org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:325)
> at 
> org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:322)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
> at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37)
> at org.apache.hadoop.hbase.security.User.call(User.java:607)
> at org.apache.hadoop.hbase.security.User.access$700(User.java:51)
> at 
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:461)
> at 
> org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupIOstreams(SecureClient.java:321)
> at 
> org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1164)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1004)
> at 
> org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107)
> at $Proxy24.replicateLogEntries(Unknown Source)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:962)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.runLoop(ReplicationSource.java:466)
> at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:515)
> Caused by: GSSException: No valid credentials provided (Mechanism level: The 
> ticket isn't for us (35) - BAD TGS SERVER NAME)
> at 
> sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:663)
> at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
> at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:180)
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:175)
> ... 23 more
> Caused by: KrbException: The ticket isn't for us (35) - BAD TGS SERVER NAME
> at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:64)
> at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:185)
> at 
> sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:294)
> at 
> sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:106)
> at 
> sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:557)
> at 
> 

[jira] [Updated] (HADOOP-14276) Add a nanosecond API to Time/Timer/FakeTimer

2017-04-07 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14276:
---
   Resolution: Fixed
Fix Version/s: 2.8.1
   2.7.4
   2.9.0
   Status: Resolved  (was: Patch Available)

Committed all the way to branch-2.7. Thanks Erik for the work.

> Add a nanosecond API to Time/Timer/FakeTimer
> 
>
> Key: HADOOP-14276
> URL: https://issues.apache.org/jira/browse/HADOOP-14276
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: util
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
> Fix For: 2.9.0, 2.7.4, 2.8.1, 3.0.0-alpha3
>
> Attachments: HADOOP-14276.000.patch
>
>
> Right now {{Time}}/{{Timer}} export functionality for retrieving time at a 
> millisecond-level precision but not at a nanosecond-level precision, which is 
> required for some applications (there's ~70 usages). Most of these seem not 
> to need mocking functionality for tests; only one class currently mocks this 
> out ({{LightWeightCache}}) but we would like to add another as part of 
> HDFS-11615 and want to avoid code duplication. This could be useful for other 
> classes in the future as well.
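
A sketch of the shape this could take (signatures here are assumptions; the 
attached patch defines the real API):

{code}
// Sketch: a monotonic nanosecond counterpart to Time.monotonicNow().
public static long monotonicNowNanos() {
  return System.nanoTime();
}

// A FakeTimer-style mock can then advance time deterministically in tests,
// e.g. advance the fake clock by 5000 ns and assert the measured delta.
{code}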



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14276) Add a nanosecond API to Time/Timer/FakeTimer

2017-04-06 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14276:
---
Fix Version/s: 3.0.0-alpha3

Committed to trunk. YARN-6288 is breaking the branch-2 build; waiting for the fix.

> Add a nanosecond API to Time/Timer/FakeTimer
> 
>
> Key: HADOOP-14276
> URL: https://issues.apache.org/jira/browse/HADOOP-14276
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: util
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
> Fix For: 3.0.0-alpha3
>
> Attachments: HADOOP-14276.000.patch
>
>
> Right now {{Time}}/{{Timer}} export functionality for retrieving time at a 
> millisecond-level precision but not at a nanosecond-level precision, which is 
> required for some applications (there's ~70 usages). Most of these seem not 
> to need mocking functionality for tests; only one class currently mocks this 
> out ({{LightWeightCache}}) but we would like to add another as part of 
> HDFS-11615 and want to avoid code duplication. This could be useful for other 
> classes in the future as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14276) Add a nanosecond API to Time/Timer/FakeTimer

2017-04-06 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14276:
---
Hadoop Flags: Reviewed

Thanks for confirming, Liang. I'll commit the patch to trunk~branch-2.7 soon.

> Add a nanosecond API to Time/Timer/FakeTimer
> 
>
> Key: HADOOP-14276
> URL: https://issues.apache.org/jira/browse/HADOOP-14276
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: util
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
> Attachments: HADOOP-14276.000.patch
>
>
> Right now {{Time}}/{{Timer}} export functionality for retrieving time at a 
> millisecond-level precision but not at a nanosecond-level precision, which is 
> required for some applications (there's ~70 usages). Most of these seem not 
> to need mocking functionality for tests; only one class currently mocks this 
> out ({{LightWeightCache}}) but we would like to add another as part of 
> HDFS-11615 and want to avoid code duplication. This could be useful for other 
> classes in the future as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14276) Add a nanosecond API to Time/Timer/FakeTimer

2017-04-06 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959573#comment-15959573
 ] 

Zhe Zhang commented on HADOOP-14276:


Thanks Erik. The analysis makes sense and +1 on the patch. Will wait for 2 
hours before committing.

> Add a nanosecond API to Time/Timer/FakeTimer
> 
>
> Key: HADOOP-14276
> URL: https://issues.apache.org/jira/browse/HADOOP-14276
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: util
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
> Attachments: HADOOP-14276.000.patch
>
>
> Right now {{Time}}/{{Timer}} export functionality for retrieving time at a 
> millisecond-level precision but not at a nanosecond-level precision, which is 
> required for some applications (there's ~70 usages). Most of these seem not 
> to need mocking functionality for tests; only one class currently mocks this 
> out ({{LightWeightCache}}) but we would like to add another as part of 
> HDFS-11615 and want to avoid code duplication. This could be useful for other 
> classes in the future as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14211) FilterFs and ChRootedFs are too aggressive about enforcing "authorityNeeded"

2017-03-24 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-14211:
---
Fix Version/s: 2.8.1
   2.7.4

Thanks [~xkrogen] for the work and [~andrew.wang] for the review. +1 on the 
patch as well. I just cherry-picked to branch-2.8 and branch-2.7.

> FilterFs and ChRootedFs are too aggressive about enforcing "authorityNeeded"
> 
>
> Key: HADOOP-14211
> URL: https://issues.apache.org/jira/browse/HADOOP-14211
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: viewfs
>Affects Versions: 2.6.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
> Fix For: 2.9.0, 2.7.4, 2.8.1, 3.0.0-alpha3
>
> Attachments: HADOOP-14211.000.patch, HADOOP-14211.001.patch
>
>
> Right now {{FilterFs}} and {{ChRootedFs}} pass the following up to the 
> {{AbstractFileSystem}} superconstructor:
> {code}
> super(fs.getUri(), fs.getUri().getScheme(),
> fs.getUri().getAuthority() != null, fs.getUriDefaultPort());
> {code}
> This passes a value of {{authorityNeeded==true}} for any {{fs}} which has an 
> authority, but this isn't necessarily the case - ViewFS is an example of 
> this. You will encounter this issue if you try to filter a ViewFS, or nest 
> one ViewFS within another. The {{authorityNeeded}} check isn't necessary in 
> this case anyway; {{fs}} is already an instantiated {{AbstractFileSystem}} 
> which means it has already used the same constructor with the value of 
> {{authorityNeeded}} (and corresponding validation) that it actually requires.
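
The likely direction of the fix (a sketch; see the attached patches for the 
actual change) is to stop deriving {{authorityNeeded}} from whether the 
wrapped URI happens to carry an authority:

{code}
// Sketch: fs is already a constructed AbstractFileSystem, so it has already
// performed whatever authority validation it required; don't re-require one.
super(fs.getUri(), fs.getUri().getScheme(),
    false /* authorityNeeded */, fs.getUriDefaultPort());
{code}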



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-9631) ViewFs should use underlying FileSystem's server side defaults

2017-03-22 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-9631:
--
   Resolution: Fixed
Fix Version/s: 3.0.0-alpha3
   2.8.1
   2.7.4
   2.9.0
   Status: Resolved  (was: Patch Available)

Committed to the above-mentioned branches. Thanks Lohit and Erik for the 
contribution!

> ViewFs should use underlying FileSystem's server side defaults
> --
>
> Key: HADOOP-9631
> URL: https://issues.apache.org/jira/browse/HADOOP-9631
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs, viewfs
>Affects Versions: 2.0.4-alpha
>Reporter: Lohit Vijayarenu
>Assignee: Erik Krogen
>  Labels: BB2015-05-TBR
> Fix For: 2.9.0, 2.7.4, 2.8.1, 3.0.0-alpha3
>
> Attachments: HADOOP-9631.005.patch, HADOOP-9631.006.patch, 
> HADOOP-9631.007.patch, HADOOP-9631.trunk.1.patch, HADOOP-9631.trunk.2.patch, 
> HADOOP-9631.trunk.3.patch, HADOOP-9631.trunk.4.patch, TestFileContext.java
>
>
> On a cluster with ViewFS as default FileSystem, creating files using 
> FileContext will always result with replication factor of 1, instead of 
> underlying filesystem default (like HDFS)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-9631) ViewFs should use underlying FileSystem's server side defaults

2017-03-22 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-9631:
--
Hadoop Flags: Reviewed
Target Version/s: 2.9.0, 2.7.4, 2.8.1, 3.0.0-alpha3

Thanks [~xkrogen] for the update! +1 on v7 patch. I just committed to trunk, 
working on backports. I think this should go into branch-2, branch-2.8, and 
branch-2.7. 

> ViewFs should use underlying FileSystem's server side defaults
> --
>
> Key: HADOOP-9631
> URL: https://issues.apache.org/jira/browse/HADOOP-9631
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs, viewfs
>Affects Versions: 2.0.4-alpha
>Reporter: Lohit Vijayarenu
>Assignee: Erik Krogen
>  Labels: BB2015-05-TBR
> Attachments: HADOOP-9631.005.patch, HADOOP-9631.006.patch, 
> HADOOP-9631.007.patch, HADOOP-9631.trunk.1.patch, HADOOP-9631.trunk.2.patch, 
> HADOOP-9631.trunk.3.patch, HADOOP-9631.trunk.4.patch, TestFileContext.java
>
>
> On a cluster with ViewFS as default FileSystem, creating files using 
> FileContext will always result with replication factor of 1, instead of 
> underlying filesystem default (like HDFS)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-9631) ViewFs should use underlying FileSystem's server side defaults

2017-03-22 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937454#comment-15937454
 ] 

Zhe Zhang commented on HADOOP-9631:
---

Thanks [~xkrogen] for the work! v6 patch LGTM, with a couple of minor comments:
# {{ViewFs#getServerDefaults}} has unnecessary exceptions in its signature
# Can we enhance the test to cover the case of {{ViewFs#getServerDefaults(f)}} 
where f is an internal dir?
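
For reference, the delegation under review looks roughly like this (a 
paraphrased sketch, not the committed diff):

{code}
// Sketch: resolve the ViewFs path to its mount target and ask that
// filesystem for server defaults (replication, block size, ...).
@Override
public FsServerDefaults getServerDefaults(final Path f) throws IOException {
  InodeTree.ResolveResult<AbstractFileSystem> res =
      fsState.resolve(getUriPath(f), true);
  return res.targetFileSystem.getServerDefaults(res.remainingPath);
}
{code}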

> ViewFs should use underlying FileSystem's server side defaults
> --
>
> Key: HADOOP-9631
> URL: https://issues.apache.org/jira/browse/HADOOP-9631
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs, viewfs
>Affects Versions: 2.0.4-alpha
>Reporter: Lohit Vijayarenu
>Assignee: Erik Krogen
>  Labels: BB2015-05-TBR
> Attachments: HADOOP-9631.005.patch, HADOOP-9631.006.patch, 
> HADOOP-9631.trunk.1.patch, HADOOP-9631.trunk.2.patch, 
> HADOOP-9631.trunk.3.patch, HADOOP-9631.trunk.4.patch, TestFileContext.java
>
>
> On a cluster with ViewFS as default FileSystem, creating files using 
> FileContext will always result with replication factor of 1, instead of 
> underlying filesystem default (like HDFS)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Resolved] (HADOOP-14147) Offline Image Viewer bug

2017-03-09 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang resolved HADOOP-14147.

Resolution: Duplicate

> Offline Image Viewer  bug
> -
>
> Key: HADOOP-14147
> URL: https://issues.apache.org/jira/browse/HADOOP-14147
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: gehaijiang
>
> $ hdfs oiv -p Delimited  -i fsimage_13752447421 -o fsimage.xml
> 17/03/04 08:40:22 INFO offlineImageViewer.FSImageHandler: Loading 757 strings
> 17/03/04 08:40:22 INFO offlineImageViewer.PBImageTextWriter: Loading 
> directories
> 17/03/04 08:40:22 INFO offlineImageViewer.PBImageTextWriter: Loading 
> directories in INode section.
> 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Found 4374109 
> directories in INode section.
> 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Finished loading 
> directories in 96798ms
> 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Loading INode 
> directory section.
> Exception in thread "main" java.lang.IllegalStateException
>   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.buildNamespace(PBImageTextWriter.java:570)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.loadINodeDirSection(PBImageTextWriter.java:522)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.visit(PBImageTextWriter.java:460)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.visit(PBImageDelimitedTextWriter.java:46)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:182)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:124)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Reopened] (HADOOP-14147) Offline Image Viewer bug

2017-03-09 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang reopened HADOOP-14147:


> Offline Image Viewer  bug
> -
>
> Key: HADOOP-14147
> URL: https://issues.apache.org/jira/browse/HADOOP-14147
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: gehaijiang
>
> $ hdfs oiv -p Delimited  -i fsimage_13752447421 -o fsimage.xml
> 17/03/04 08:40:22 INFO offlineImageViewer.FSImageHandler: Loading 757 strings
> 17/03/04 08:40:22 INFO offlineImageViewer.PBImageTextWriter: Loading 
> directories
> 17/03/04 08:40:22 INFO offlineImageViewer.PBImageTextWriter: Loading 
> directories in INode section.
> 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Found 4374109 
> directories in INode section.
> 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Finished loading 
> directories in 96798ms
> 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Loading INode 
> directory section.
> Exception in thread "main" java.lang.IllegalStateException
>   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:129)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.buildNamespace(PBImageTextWriter.java:570)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.loadINodeDirSection(PBImageTextWriter.java:522)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.visit(PBImageTextWriter.java:460)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.visit(PBImageDelimitedTextWriter.java:46)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:182)
>   at 
> org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:124)






[jira] [Commented] (HADOOP-14086) Improve DistCp Speed for small files

2017-02-15 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869273#comment-15869273
 ] 

Zhe Zhang commented on HADOOP-14086:


Thanks Zheng. This will be a very useful improvement. Any idea how to reduce NN 
workload? At the end of the day, if we distcp 1M files we need to make 1M 
{{getFileInfo}} calls. We thought about querying the SbNN but haven't 
investigated that too far.

> Improve DistCp Speed for small files
> 
>
> Key: HADOOP-14086
> URL: https://issues.apache.org/jira/browse/HADOOP-14086
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.6.5
>Reporter: Zheng Shao
>Assignee: Zheng Shao
>Priority: Minor
>
> When using distcp to copy lots of small files, the NameNode naturally becomes 
> a bottleneck.
> The current distcp code does *not* optimize to reduce NameNode calls. We 
> should restructure the code to reduce the number of NameNode calls as much as 
> possible to speed up the copying of small files.
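
For illustration, a hedged sketch of one way to cut per-file calls (this is
not the DistCp code; {{BatchedStatusFetch}} is a made-up name): list each
source directory once with {{listStatus}} and reuse the returned
{{FileStatus}} objects for all of its children, instead of issuing one
{{getFileInfo}} RPC per file.
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BatchedStatusFetch {
  public static List<FileStatus> statusesForDir(Configuration conf, Path dir)
      throws IOException {
    FileSystem fs = dir.getFileSystem(conf);
    List<FileStatus> result = new ArrayList<>();
    // One listStatus RPC covers every child of the directory.
    for (FileStatus st : fs.listStatus(dir)) {
      result.add(st);
    }
    return result;
  }
}
{code}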






[jira] [Updated] (HADOOP-13742) Expose "NumOpenConnectionsPerUser" as a metric

2016-11-28 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13742:
---
Fix Version/s: 2.7.4

Thanks Brahma for the work! I think this is a good improvement for branch-2.7 
as well. I just did the backport.

> Expose "NumOpenConnectionsPerUser" as a metric
> --
>
> Key: HADOOP-13742
> URL: https://issues.apache.org/jira/browse/HADOOP-13742
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha2
>
> Attachments: HADOOP-13742-002.patch, HADOOP-13742-003.patch, 
> HADOOP-13742-004.patch, HADOOP-13742-005.patch, HADOOP-13742-006.patch, 
> HADOOP-13742.patch
>
>
> Track user-level connections (how many connections each user holds) in a busy 
> cluster where there are many connections to the server.
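
A minimal sketch of per-user connection counting (illustrative, not the
attached patch):
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class PerUserConnectionTracker {
  private final ConcurrentHashMap<String, LongAdder> counts =
      new ConcurrentHashMap<>();

  public void onConnect(String user) {
    counts.computeIfAbsent(user, u -> new LongAdder()).increment();
  }

  public void onDisconnect(String user) {
    LongAdder c = counts.get(user);
    if (c != null) {
      c.decrement();
    }
  }

  // Snapshot for a "NumOpenConnectionsPerUser"-style metric.
  public Map<String, Long> snapshot() {
    Map<String, Long> snap = new HashMap<>();
    counts.forEach((user, c) -> snap.put(user, c.sum()));
    return snap;
  }
}
{code}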






[jira] [Updated] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read

2016-11-08 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13782:
---
   Resolution: Fixed
Fix Version/s: 3.0.0-alpha2
   2.7.4
   2.8.0
   Status: Resolved  (was: Patch Available)

I just committed the patch to trunk~branch-2.7. Thanks Erik for the 
contribution!

> Make MutableRates metrics thread-local write, aggregate-on-read
> ---
>
> Key: HADOOP-13782
> URL: https://issues.apache.org/jira/browse/HADOOP-13782
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha2
>
> Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, 
> HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch, 
> HADOOP-13782.005.patch, HADOOP-13782.006.patch
>
>
> Currently the {{MutableRates}} metrics class serializes all writes to metrics 
> it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two 
> increments of unrelated metrics contained within the same {{MutableRates}} 
> object will serialize w.r.t. each other). This class is used by 
> {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to 
> modify these metrics. Instead we should allow updates to unrelated metrics 
> objects to happen concurrently. To do so we can let each thread locally 
> collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the 
> threads. 
> I have collected some benchmark performance numbers in HADOOP-13747 
> (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) 
> which indicate that this can bring significantly higher performance in high 
> contention situations. 






[jira] [Updated] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read

2016-11-08 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13782:
---
Hadoop Flags: Reviewed

> Make MutableRates metrics thread-local write, aggregate-on-read
> ---
>
> Key: HADOOP-13782
> URL: https://issues.apache.org/jira/browse/HADOOP-13782
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
> Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, 
> HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch, 
> HADOOP-13782.005.patch, HADOOP-13782.006.patch
>
>
> Currently the {{MutableRates}} metrics class serializes all writes to metrics 
> it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two 
> increments of unrelated metrics contained within the same {{MutableRates}} 
> object will serialize w.r.t. each other). This class is used by 
> {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to 
> modify these metrics. Instead we should allow updates to unrelated metrics 
> objects to happen concurrently. To do so we can let each thread locally 
> collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the 
> threads. 
> I have collected some benchmark performance numbers in HADOOP-13747 
> (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) 
> which indicate that this can bring significantly higher performance in high 
> contention situations. 






[jira] [Commented] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read

2016-11-08 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649117#comment-15649117
 ] 

Zhe Zhang commented on HADOOP-13782:


Thanks Erik! +1 on v6 patch pending Jenkins.

> Make MutableRates metrics thread-local write, aggregate-on-read
> ---
>
> Key: HADOOP-13782
> URL: https://issues.apache.org/jira/browse/HADOOP-13782
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
> Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, 
> HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch, 
> HADOOP-13782.005.patch, HADOOP-13782.006.patch
>
>
> Currently the {{MutableRates}} metrics class serializes all writes to metrics 
> it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two 
> increments of unrelated metrics contained within the same {{MutableRates}} 
> object will serialize w.r.t. each other). This class is used by 
> {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to 
> modify these metrics. Instead we should allow updates to unrelated metrics 
> objects to happen concurrently. To do so we can let each thread locally 
> collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the 
> threads. 
> I have collected some benchmark performance numbers in HADOOP-13747 
> (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) 
> which indicate that this can bring significantly higher performance in high 
> contention situations. 






[jira] [Commented] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read

2016-11-08 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648515#comment-15648515
 ] 

Zhe Zhang commented on HADOOP-13782:


Thanks Erik for the update! With HADOOP-13804 change the new class is much 
cleaner :)

The current concurrency model is still a little complicated. {{snapshot}} has a 
nested synchronization on {{globalMetrics}} and {{stat}}, where {{stat}} is a 
local variable. Maybe we can simplify the concurrency model by:
# Make {{globalMetrics}} a ConcurrentMap
# Do we want to support multiple threads doing {{snapshot}} at the same time? 
If not, we should probably make it a synchronized method so it's easier to 
maintain and reason about
# Maybe creating a concurrent version of {{SampleStat}}, because that's the 
only object we want to protect from concurrent updating (local thread adding, 
and the snapshotting thread resetting).
{code}
  private class ConcurrentSampleStat extends SampleStat {
@Override
public synchronized void reset(){
  super.reset();
}
@Override
public synchronized SampleStat add(double x) {
  return super.add(x);
}
  }
{code}
# Could {{threadLocalMetricsMap}} be a regular map instead of a concurrent one?

Also, IIUC, {{snapshot}} is supposed to clear all metrics from the last window. 
In the v4 patch, if a certain type of metric appeared in the last window but 
disappears in the current window (e.g. because its thread dies), the entry in 
{{globalMetrics}} is not cleared. (A small sketch of the expected clearing 
behavior follows.)
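
A small sketch of that expected clearing behavior (names here are illustrative,
not from the patch):
{code}
import java.util.HashMap;
import java.util.Map;

public class WindowedAggregation {
  private final Map<String, Long> globalMetrics = new HashMap<>();

  // updatesThisWindow holds the values aggregated from the thread-local
  // maps during the current snapshot.
  public synchronized void snapshot(Map<String, Long> updatesThisWindow) {
    // Merge this window's updates into the global view.
    updatesThisWindow.forEach((name, v) ->
        globalMetrics.merge(name, v, Long::sum));
    // Drop entries that saw no updates this window (e.g. the thread died).
    globalMetrics.keySet().retainAll(updatesThisWindow.keySet());
  }
}
{code}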

> Make MutableRates metrics thread-local write, aggregate-on-read
> ---
>
> Key: HADOOP-13782
> URL: https://issues.apache.org/jira/browse/HADOOP-13782
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
> Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, 
> HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch
>
>
> Currently the {{MutableRates}} metrics class serializes all writes to metrics 
> it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two 
> increments of unrelated metrics contained within the same {{MutableRates}} 
> object will serialize w.r.t. each other). This class is used by 
> {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to 
> modify these metrics. Instead we should allow updates to unrelated metrics 
> objects to happen concurrently. To do so we can let each thread locally 
> collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the 
> threads. 
> I have collected some benchmark performance numbers in HADOOP-13747 
> (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) 
> which indicate that this can bring significantly higher performance in high 
> contention situations. 






[jira] [Updated] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read

2016-11-08 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13782:
---
Target Version/s: 2.7.4

> Make MutableRates metrics thread-local write, aggregate-on-read
> ---
>
> Key: HADOOP-13782
> URL: https://issues.apache.org/jira/browse/HADOOP-13782
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
> Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, 
> HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch
>
>
> Currently the {{MutableRates}} metrics class serializes all writes to metrics 
> it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two 
> increments of unrelated metrics contained within the same {{MutableRates}} 
> object will serialize w.r.t. each other). This class is used by 
> {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to 
> modify these metrics. Instead we should allow updates to unrelated metrics 
> objects to happen concurrently. To do so we can let each thread locally 
> collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the 
> threads. 
> I have collected some benchmark performance numbers in HADOOP-13747 
> (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) 
> which indicate that this can bring significantly higher performance in high 
> contention situations. 






[jira] [Updated] (HADOOP-13804) MutableStat mean loses accuracy if add(long, long) is used

2016-11-07 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13804:
---
   Resolution: Fixed
Fix Version/s: 3.0.0-alpha2
   2.7.4
   2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk~branch-2.7. Thanks Erik for the contribution.

> MutableStat mean loses accuracy if add(long, long) is used
> --
>
> Key: HADOOP-13804
> URL: https://issues.apache.org/jira/browse/HADOOP-13804
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.6.5
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha2
>
> Attachments: HADOOP-13804.000.patch
>
>
> Currently if the {{MutableStat.add(long numSamples, long sum)}} method is 
> used with a large sample count, the mean that is returned will be very 
> inaccurate. This is a result of using the Welford method for variance 
> calculation, which assumes that each sample is processed on its own, to 
> calculate the mean as well. For variance this is fine, since variance numbers 
> lose meaning if you add many samples at once, but the mean should still be 
> accurate. 






[jira] [Updated] (HADOOP-13804) MutableStat mean loses accuracy if add(long, long) is used

2016-11-07 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13804:
---
Target Version/s: 2.7.4
Hadoop Flags: Reviewed

Thanks Erik for the fix. The patch LGTM; +1 pending Jenkins.

Result of the added test without the change:
{code}
java.lang.AssertionError: Bad value for metric TestAvgVal 
Expected :1.5
Actual   :1.9995
{code}
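
A simplified, self-contained reconstruction of the arithmetic (this models the
described Welford-style update; it is not the actual {{SampleStat}} source):
{code}
// The batch sum is fed into the per-sample Welford update as if it were a
// single observation taken at the new total count.
public class WelfordBatchBugDemo {
  static long numSamples;
  static double mean;

  static void buggyAdd(long nSamples, double sum) {
    numSamples += nSamples;
    mean = mean + (sum - mean) / numSamples; // per-sample update, batch input
  }

  public static void main(String[] args) {
    buggyAdd(1000, 1000.0); // 1000 samples of value 1.0 -> mean becomes 1.0
    buggyAdd(1000, 2000.0); // 1000 samples of value 2.0
    // Correct mean: (1000*1.0 + 1000*2.0) / 2000 = 1.5
    // Buggy mean:   1.0 + (2000.0 - 1.0) / 2000 = 1.9995
    System.out.println(mean); // prints 1.9995, matching the failure above
  }
}
{code}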

> MutableStat mean loses accuracy if add(long, long) is used
> --
>
> Key: HADOOP-13804
> URL: https://issues.apache.org/jira/browse/HADOOP-13804
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.6.5
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Minor
> Attachments: HADOOP-13804.000.patch
>
>
> Currently if the {{MutableStat.add(long numSamples, long sum)}} method is 
> used with a large sample count, the mean that is returned will be very 
> inaccurate. This is a result of using the Welford method for variance 
> calculation, which assumes that each sample is processed on its own, to 
> calculate the mean as well. For variance this is fine, since variance numbers 
> lose meaning if you add many samples at once, but the mean should still be 
> accurate. 






[jira] [Commented] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read

2016-11-07 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645199#comment-15645199
 ] 

Zhe Zhang commented on HADOOP-13782:


Thanks Erik for the patch. LGTM overall. A few detailed comments:

# It'd be ideal if we could simplify the two internal classes 
{{LocalMutableRate}} and {{MutableRateInternal}}, and also fit them better with 
the existing {{MutableStat}} and {{MutableRate}} classes. We discussed offline an 
issue in the existing {{MutableStat}} batch add method around {{intervalStat}}. I 
think we should document the issue so other developers understand the 
motivation for creating a simpler rate class.
# The synchronization behavior below is different from {{MutableStat}}, where 
both the {{snapshot}} and {{add}} methods are {{synchronized}}. Should we allow 
thread-local {{add}} while one thread is doing {{snapshot}}? (A minimal sketch 
of that pattern appears after this list.)
{code}
  @Override
  public void snapshot(MetricsRecordBuilder rb, boolean all) {
    synchronized (globalMetrics) {
{code}
# Maybe we should add a comment below noting that this is where the aggregation 
(the main logic of this class) happens
{code}
} else {
  for (Map.Entry ent : map.entrySet()) {
{code}
# Cosmetic: since {{getLocalMetrics}} is short and is only used by {{add}} 
(which itself is short), can we merge the two methods?
# Cosmetic: as a follow-on we can consider consolidating the old 
{{MutableRates}} and the new {{MutableRatesWithAggregation}} to reduce 
duplication
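
A minimal sketch of the thread-local-write, aggregate-on-read pattern under
discussion (illustrative names only; this is not the patch itself):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class ThreadLocalRatesSketch {
  // Every thread registers its private map here so snapshot() can find it.
  private final ConcurrentLinkedQueue<Map<String, long[]>> allLocals =
      new ConcurrentLinkedQueue<>();

  private final ThreadLocal<Map<String, long[]>> local =
      ThreadLocal.withInitial(() -> {
        Map<String, long[]> m = new ConcurrentHashMap<>();
        allLocals.add(m);
        return m;
      });

  // add() touches only the calling thread's map: no cross-thread contention.
  public void add(String name, long elapsedMillis) {
    long[] stat = local.get().computeIfAbsent(name, n -> new long[2]);
    stat[0]++;                // sample count
    stat[1] += elapsedMillis; // total time
  }

  // Aggregate on read; synchronized so snapshots don't run concurrently.
  // Metrics tolerate slightly stale reads of in-flight updates.
  public synchronized Map<String, long[]> snapshot() {
    Map<String, long[]> global = new ConcurrentHashMap<>();
    for (Map<String, long[]> m : allLocals) {
      m.forEach((name, s) -> {
        long[] g = global.computeIfAbsent(name, n -> new long[2]);
        g[0] += s[0];
        g[1] += s[1];
      });
    }
    return global;
  }
}
{code}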

> Make MutableRates metrics thread-local write, aggregate-on-read
> ---
>
> Key: HADOOP-13782
> URL: https://issues.apache.org/jira/browse/HADOOP-13782
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Erik Krogen
>Assignee: Erik Krogen
> Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch
>
>
> Currently the {{MutableRates}} metrics class serializes all writes to metrics 
> it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two 
> increments of unrelated metrics contained within the same {{MutableRates}} 
> object will serialize w.r.t. each other). This class is used by 
> {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to 
> modify these metrics. Instead we should allow updates to unrelated metrics 
> objects to happen concurrently. To do so we can let each thread locally 
> collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the 
> threads. 
> I have collected some benchmark performance numbers in HADOOP-13747 
> (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) 
> which indicate that this can bring significantly higher performance in high 
> contention situations. 






[jira] [Updated] (HADOOP-12483) Maintain wrapped SASL ordering for postponed IPC responses

2016-11-02 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12483:
---
Fix Version/s: 2.7.4

I backported this bug fix to branch-2.7 since I just backported HADOOP-10300.

> Maintain wrapped SASL ordering for postponed IPC responses
> --
>
> Key: HADOOP-12483
> URL: https://issues.apache.org/jira/browse/HADOOP-12483
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: HADOOP-12483.patch
>
>
> A SASL encryption algorithm (wrapping) may require an ordering for 
> encrypted responses. The IPC layer encrypts when the response is set, on the 
> assumption that it is sent immediately. Postponed responses violate that 
> assumption.






[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses

2016-11-01 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10300:
---
   Resolution: Fixed
Fix Version/s: 2.7.4
   Status: Resolved  (was: Patch Available)

I just committed the patch to branch-2.7.

> Allowed deferred sending of call responses
> --
>
> Key: HADOOP-10300
> URL: https://issues.apache.org/jira/browse/HADOOP-10300
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: ipc
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: HADOOP-10300-branch-2.7.patch, HADOOP-10300.patch, 
> HADOOP-10300.patch, HADOOP-10300.patch
>
>
> RPC handlers currently do not return until the RPC call completes and the 
> response is sent, or a partially sent response has been queued for the 
> responder. It would be useful for a proxy method to notify the handler to 
> not yet send the call's response.
> A potential use case: a namespace handler in the NN might want to return 
> before the edit log is synced so it can service more requests and allow 
> increased batching of edits per sync. Background syncing could later trigger 
> the sending of the call response to the client.
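
A minimal sketch of the deferral idea (illustrative only; the real logic lives
in the IPC {{Server}}):
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DeferredResponseSketch {
  static class Call {
    final String response;
    Call(String response) { this.response = response; }
    void sendResponse() { System.out.println("sent: " + response); }
  }

  private final BlockingQueue<Call> deferred = new LinkedBlockingQueue<>();

  // Handler path: enqueue instead of sending, freeing the handler thread.
  public void defer(Call call) {
    deferred.add(call);
  }

  // Invoked later, e.g. after a batched edit-log sync completes.
  public void flushDeferred() {
    Call call;
    while ((call = deferred.poll()) != null) {
      call.sendResponse();
    }
  }
}
{code}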






[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses

2016-10-31 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10300:
---
Attachment: (was: HADOOP-10300-branch-2.7.0.patch)

> Allowed deferred sending of call responses
> --
>
> Key: HADOOP-10300
> URL: https://issues.apache.org/jira/browse/HADOOP-10300
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: ipc
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HADOOP-10300-branch-2.7.patch, HADOOP-10300.patch, 
> HADOOP-10300.patch, HADOOP-10300.patch
>
>
> RPC handlers currently do not return until the RPC call completes and the 
> response is sent, or a partially sent response has been queued for the 
> responder. It would be useful for a proxy method to notify the handler to 
> not yet send the call's response.
> A potential use case: a namespace handler in the NN might want to return 
> before the edit log is synced so it can service more requests and allow 
> increased batching of edits per sync. Background syncing could later trigger 
> the sending of the call response to the client.






[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses

2016-10-31 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10300:
---
Attachment: HADOOP-10300-branch-2.7.patch

> Allowed deferred sending of call responses
> --
>
> Key: HADOOP-10300
> URL: https://issues.apache.org/jira/browse/HADOOP-10300
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: ipc
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HADOOP-10300-branch-2.7.patch, HADOOP-10300.patch, 
> HADOOP-10300.patch, HADOOP-10300.patch
>
>
> RPC handlers currently do not return until the RPC call completes and the 
> response is sent, or a partially sent response has been queued for the 
> responder. It would be useful for a proxy method to notify the handler to 
> not yet send the call's response.
> A potential use case: a namespace handler in the NN might want to return 
> before the edit log is synced so it can service more requests and allow 
> increased batching of edits per sync. Background syncing could later trigger 
> the sending of the call response to the client.






[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses

2016-10-31 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10300:
---
Attachment: HADOOP-10300-branch-2.7.0.patch

> Allowed deferred sending of call responses
> --
>
> Key: HADOOP-10300
> URL: https://issues.apache.org/jira/browse/HADOOP-10300
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: ipc
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HADOOP-10300-branch-2.7.0.patch, HADOOP-10300.patch, 
> HADOOP-10300.patch, HADOOP-10300.patch
>
>
> RPC handlers currently do not return until the RPC call completes and the 
> response is sent, or a partially sent response has been queued for the 
> responder. It would be useful for a proxy method to notify the handler to 
> not yet send the call's response.
> A potential use case: a namespace handler in the NN might want to return 
> before the edit log is synced so it can service more requests and allow 
> increased batching of edits per sync. Background syncing could later trigger 
> the sending of the call response to the client.






[jira] [Reopened] (HADOOP-10300) Allowed deferred sending of call responses

2016-10-31 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang reopened HADOOP-10300:


I think this'd be a good addition to branch-2.7; all other subtasks under the 
umbrella JIRA are actually in 2.3. Attaching a branch-2.7 patch to trigger 
Jenkins.

[~daryn] [~kihwal] LMK if you have any concerns about the backport.

> Allowed deferred sending of call responses
> --
>
> Key: HADOOP-10300
> URL: https://issues.apache.org/jira/browse/HADOOP-10300
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: ipc
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HADOOP-10300-branch-2.7.0.patch, HADOOP-10300.patch, 
> HADOOP-10300.patch, HADOOP-10300.patch
>
>
> RPC handlers currently do not return until the RPC call completes and the 
> response is sent, or a partially sent response has been queued for the 
> responder. It would be useful for a proxy method to notify the handler to 
> not yet send the call's response.
> A potential use case: a namespace handler in the NN might want to return 
> before the edit log is synced so it can service more requests and allow 
> increased batching of edits per sync. Background syncing could later trigger 
> the sending of the call response to the client.






[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses

2016-10-31 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10300:
---
Status: Patch Available  (was: Reopened)

> Allowed deferred sending of call responses
> --
>
> Key: HADOOP-10300
> URL: https://issues.apache.org/jira/browse/HADOOP-10300
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: ipc
>Affects Versions: 3.0.0-alpha1, 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HADOOP-10300-branch-2.7.0.patch, HADOOP-10300.patch, 
> HADOOP-10300.patch, HADOOP-10300.patch
>
>
> RPC handlers currently do not return until the RPC call completes and the 
> response is sent, or a partially sent response has been queued for the 
> responder. It would be useful for a proxy method to notify the handler to 
> not yet send the call's response.
> A potential use case: a namespace handler in the NN might want to return 
> before the edit log is synced so it can service more requests and allow 
> increased batching of edits per sync. Background syncing could later trigger 
> the sending of the call response to the client.






[jira] [Updated] (HADOOP-12325) RPC Metrics: Add the ability to track and log slow RPCs

2016-10-31 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12325:
---
   Resolution: Fixed
Fix Version/s: 2.7.4
   Status: Resolved  (was: Patch Available)

I verified test failures and pushed to branch-2.7.

> RPC Metrics: Add the ability to track and log slow RPCs
> -
>
> Key: HADOOP-12325
> URL: https://issues.apache.org/jira/browse/HADOOP-12325
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: ipc, metrics
>Affects Versions: 2.7.1
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: Callers of WritableRpcEngine.call.png, 
> HADOOP-12325-branch-2.7.00.patch, HADOOP-12325.001.patch, 
> HADOOP-12325.002.patch, HADOOP-12325.003.patch, HADOOP-12325.004.patch, 
> HADOOP-12325.005.patch, HADOOP-12325.005.test.patch, HADOOP-12325.006.patch
>
>
> This JIRA proposes to add a counter called RpcSlowCalls and also a 
> configuration setting that allows users to log really slow RPCs. Slow RPCs 
> are RPCs that fall above the 99th percentile. This is useful for 
> troubleshooting why certain services, like the name node, freeze under heavy 
> load.
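
A rough sketch of the proposed behavior (this is an illustration, not the
attached patch): track recent call durations and log any call that lands above
the 99th percentile.
{code}
import java.util.Arrays;

public class SlowRpcLogger {
  private final long[] window = new long[1024]; // ring buffer of durations, ms
  private int idx;
  private boolean full;

  public synchronized void record(String method, long elapsedMillis) {
    window[idx] = elapsedMillis;
    idx = (idx + 1) % window.length;
    full |= (idx == 0);
    int n = full ? window.length : idx;
    if (n < 100) {
      return; // too few samples for a meaningful percentile
    }
    long[] sorted = Arrays.copyOf(window, n);
    Arrays.sort(sorted);
    long p99 = sorted[(int) (n * 0.99)];
    if (elapsedMillis > p99) {
      System.out.println("Slow RPC: " + method + " took " + elapsedMillis + " ms");
    }
  }
}
{code}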






[jira] [Updated] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full

2016-10-31 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10597:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> RPC Server signals backoff to clients when all request queues are full
> --
>
> Key: HADOOP-10597
> URL: https://issues.apache.org/jira/browse/HADOOP-10597
> Project: Hadoop Common
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, 
> HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, 
> HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, 
> MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf
>
>
> Currently, if an application hits the NN too hard, RPC requests will sit in a 
> blocking state, assuming the OS doesn't run out of connections. Alternatively, 
> the RPC layer or the NN can throw a well-defined exception back to the client, 
> based on certain policies, when it is under heavy load; the client will 
> understand such an exception and do exponential backoff, as another 
> implementation of RetryInvocationHandler.
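
A client-side sketch of the exponential backoff reaction (illustrative; the
real behavior is implemented via RetryInvocationHandler and retry policies):
{code}
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class BackoffClient {
  public static <T> T callWithBackoff(Callable<T> rpc, int maxRetries)
      throws Exception {
    final long baseDelayMs = 100;
    for (int attempt = 0; ; attempt++) {
      try {
        return rpc.call();
      } catch (Exception serverBusy) { // e.g. a backoff signal from the server
        if (attempt >= maxRetries) {
          throw serverBusy;
        }
        // Exponential backoff with jitter: ~100ms, ~200ms, ~400ms, ...
        long delay = baseDelayMs << Math.min(attempt, 10);
        Thread.sleep(delay / 2
            + ThreadLocalRandom.current().nextLong(delay / 2 + 1));
      }
    }
  }
}
{code}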






[jira] [Updated] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full

2016-10-31 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10597:
---
Fix Version/s: 2.7.4

Thanks Ming for confirming this. I verified reported test failures (cannot 
reproduce locally) and pushed to branch-2.7.

> RPC Server signals backoff to clients when all request queues are full
> --
>
> Key: HADOOP-10597
> URL: https://issues.apache.org/jira/browse/HADOOP-10597
> Project: Hadoop Common
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, 
> HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, 
> HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, 
> MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf
>
>
> Currently, if an application hits the NN too hard, RPC requests will sit in a 
> blocking state, assuming the OS doesn't run out of connections. Alternatively, 
> the RPC layer or the NN can throw a well-defined exception back to the client, 
> based on certain policies, when it is under heavy load; the client will 
> understand such an exception and do exponential backoff, as another 
> implementation of RetryInvocationHandler.






[jira] [Updated] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full

2016-10-28 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10597:
---
Status: Patch Available  (was: Reopened)

> RPC Server signals backoff to clients when all request queues are full
> --
>
> Key: HADOOP-10597
> URL: https://issues.apache.org/jira/browse/HADOOP-10597
> Project: Hadoop Common
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, 
> HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, 
> HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, 
> MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf
>
>
> Currently, if an application hits the NN too hard, RPC requests will sit in a 
> blocking state, assuming the OS doesn't run out of connections. Alternatively, 
> the RPC layer or the NN can throw a well-defined exception back to the client, 
> based on certain policies, when it is under heavy load; the client will 
> understand such an exception and do exponential backoff, as another 
> implementation of RetryInvocationHandler.






[jira] [Updated] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full

2016-10-28 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-10597:
---
Attachment: HADOOP-10597-branch-2.7.patch

> RPC Server signals backoff to clients when all request queues are full
> --
>
> Key: HADOOP-10597
> URL: https://issues.apache.org/jira/browse/HADOOP-10597
> Project: Hadoop Common
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, 
> HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, 
> HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, 
> MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf
>
>
> Currently, if an application hits the NN too hard, RPC requests will sit in a 
> blocking state, assuming the OS doesn't run out of connections. Alternatively, 
> the RPC layer or the NN can throw a well-defined exception back to the client, 
> based on certain policies, when it is under heavy load; the client will 
> understand such an exception and do exponential backoff, as another 
> implementation of RetryInvocationHandler.






[jira] [Reopened] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full

2016-10-28 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang reopened HADOOP-10597:


Thanks [~mingma] for the nice work!

Since the umbrella JIRA HADOOP-9640 (and the FairCallQueue feature) is 
available in 2.7 and 2.6, I think it makes sense to backport this to at least 
2.7. Reopening to test branch-2.7 patch. Please let me know if you have any 
concern about adding this to branch-2.7.

> RPC Server signals backoff to clients when all request queues are full
> --
>
> Key: HADOOP-10597
> URL: https://issues.apache.org/jira/browse/HADOOP-10597
> Project: Hadoop Common
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, 
> HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, 
> HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, 
> MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf
>
>
> Currently, if an application hits the NN too hard, RPC requests will sit in a 
> blocking state, assuming the OS doesn't run out of connections. Alternatively, 
> the RPC layer or the NN can throw a well-defined exception back to the client, 
> based on certain policies, when it is under heavy load; the client will 
> understand such an exception and do exponential backoff, as another 
> implementation of RetryInvocationHandler.






[jira] [Commented] (HADOOP-13747) Use LongAdder for more efficient metrics tracking

2016-10-26 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609257#comment-15609257
 ] 

Zhe Zhang commented on HADOOP-13747:


Thanks Erik! It seems once again the discussion is leading to another JIRA 
(convert {{MutableRates}} to aggregate on read) :)

Benchmark results look good. I imagine the benefits of this optimization will 
be more significant when the number of threads increases -- e.g. 256 as used in 
some production clusters.

{{MutableRatesWithAggregation}} LGTM overall. The only structural concern I 
have is the assumption of long-lived threads. Right now {{MutableRates}} is 
only used by detailed RPC metrics so the assumption still holds. But it might 
limit its applicability as a general-purpose metrics class. I'm happy to have 
other people's opinions on this as well (whether we foresee any short-lived 
threads using {{MutableRates}}).

If we do want to support short-lived threads, an alternative is to use a 
similar idea to {{LongAdder}}: use a set of variables to hold {{}} tuples and, 
on snapshotting, apply these "log entries" one by one.


> Use LongAdder for more efficient metrics tracking
> -
>
> Key: HADOOP-13747
> URL: https://issues.apache.org/jira/browse/HADOOP-13747
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Zhe Zhang
>Assignee: Erik Krogen
> Attachments: HADOOP-13747.patch, benchmark_results
>
>
> Currently many metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, 
> use a synchronized counter to be updated by all handler threads (multiple 
> hundreds in large production clusters). As [~andrew.wang] suggested, it'd be 
> more efficient to use the [LongAdder | 
> http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co]
>  library, which dynamically creates intermediate-result variables.
> Assigning to [~xkrogen] who has already done some investigation on this.






[jira] [Updated] (HADOOP-12325) RPC Metrics: Add the ability to track and log slow RPCs

2016-10-25 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12325:
---
Status: Patch Available  (was: Reopened)

> RPC Metrics: Add the ability to track and log slow RPCs
> -
>
> Key: HADOOP-12325
> URL: https://issues.apache.org/jira/browse/HADOOP-12325
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: ipc, metrics
>Affects Versions: 2.7.1
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: Callers of WritableRpcEngine.call.png, 
> HADOOP-12325-branch-2.7.00.patch, HADOOP-12325.001.patch, 
> HADOOP-12325.002.patch, HADOOP-12325.003.patch, HADOOP-12325.004.patch, 
> HADOOP-12325.005.patch, HADOOP-12325.005.test.patch, HADOOP-12325.006.patch
>
>
> This JIRA proposes to add a counter called RpcSlowCalls and also a 
> configuration setting that allows users to log really slow RPCs. Slow RPCs 
> are RPCs that fall above the 99th percentile. This is useful for 
> troubleshooting why certain services, like the name node, freeze under heavy 
> load.






[jira] [Updated] (HADOOP-12325) RPC Metrics: Add the ability to track and log slow RPCs

2016-10-25 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12325:
---
Attachment: HADOOP-12325-branch-2.7.00.patch

> RPC Metrics: Add the ability to track and log slow RPCs
> -
>
> Key: HADOOP-12325
> URL: https://issues.apache.org/jira/browse/HADOOP-12325
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: ipc, metrics
>Affects Versions: 2.7.1
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: Callers of WritableRpcEngine.call.png, 
> HADOOP-12325-branch-2.7.00.patch, HADOOP-12325.001.patch, 
> HADOOP-12325.002.patch, HADOOP-12325.003.patch, HADOOP-12325.004.patch, 
> HADOOP-12325.005.patch, HADOOP-12325.005.test.patch, HADOOP-12325.006.patch
>
>
> This JIRA proposes to add a counter called RpcSlowCalls and also a 
> configuration setting that allows users to log really slow RPCs. Slow RPCs 
> are RPCs that fall above the 99th percentile. This is useful for 
> troubleshooting why certain services, like the name node, freeze under heavy 
> load.






[jira] [Reopened] (HADOOP-12325) RPC Metrics: Add the ability to track and log slow RPCs

2016-10-25 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang reopened HADOOP-12325:


Sorry to reopen the JIRA. I think it is a good addition to branch-2.7 and want 
to test the branch-2.7 patch.

> RPC Metrics: Add the ability to track and log slow RPCs
> -
>
> Key: HADOOP-12325
> URL: https://issues.apache.org/jira/browse/HADOOP-12325
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: ipc, metrics
>Affects Versions: 2.7.1
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: Callers of WritableRpcEngine.call.png, 
> HADOOP-12325-branch-2.7.00.patch, HADOOP-12325.001.patch, 
> HADOOP-12325.002.patch, HADOOP-12325.003.patch, HADOOP-12325.004.patch, 
> HADOOP-12325.005.patch, HADOOP-12325.005.test.patch, HADOOP-12325.006.patch
>
>
> This JIRA proposes to add a counter called RpcSlowCalls and also a 
> configuration setting that allows users to log really slow RPCs. Slow RPCs 
> are RPCs that fall above the 99th percentile. This is useful for 
> troubleshooting why certain services, like the name node, freeze under heavy 
> load.






[jira] [Updated] (HADOOP-13747) Use LongAdder for more efficient metrics tracking

2016-10-21 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13747:
---
Description: 
Currently many metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, 
use a synchronized counter to be updated by all handler threads (multiple 
hundreds in large production clusters). As [~andrew.wang] suggested, it'd be 
more efficient to use the [LongAdder | 
http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co]
 library, which dynamically creates intermediate-result variables.

Assigning to [~xkrogen] who has already done some investigation on this.

  was:
Currently most metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, 
use a synchronized counter to be updated by all handler threads (multiple 
hundreds in large production clusters). As [~andrew.wang] suggested, it'd be 
more efficient to use the [LongAdder | 
http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co]
 library, which dynamically creates intermediate-result variables.

Assigning to [~xkrogen] who has already done some investigation on this.


> Use LongAdder for more efficient metrics tracking
> -
>
> Key: HADOOP-13747
> URL: https://issues.apache.org/jira/browse/HADOOP-13747
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: metrics
>Reporter: Zhe Zhang
>Assignee: Erik Krogen
>
> Currently many metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, 
> use a synchronized counter to be updated by all handler threads (multiple 
> hundreds in large production clusters). As [~andrew.wang] suggested, it'd be 
> more efficient to use the [LongAdder | 
> http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co]
>  library, which dynamically creates intermediate-result variables.
> Assigning to [~xkrogen] who has already done some investigation on this.






[jira] [Created] (HADOOP-13747) Use LongAdder for more efficient metrics tracking

2016-10-21 Thread Zhe Zhang (JIRA)
Zhe Zhang created HADOOP-13747:
--

 Summary: Use LongAdder for more efficient metrics tracking
 Key: HADOOP-13747
 URL: https://issues.apache.org/jira/browse/HADOOP-13747
 Project: Hadoop Common
  Issue Type: Improvement
  Components: metrics
Reporter: Zhe Zhang
Assignee: Erik Krogen


Currently most metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, 
use a synchronized counter to be updated by all handler threads (multiple 
hundreds in large production clusters). As [~andrew.wang] suggested, it'd be 
more efficient to use the [LongAdder | 
http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co]
 library, which dynamically creates intermediate-result variables.

Assigning to [~xkrogen] who has already done some investigation on this.
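
A minimal sketch of the contrast (the class below is illustrative, but
{{java.util.concurrent.atomic.LongAdder}} itself is the standard JDK 8 API):
{code}
import java.util.concurrent.atomic.LongAdder;

public class CounterComparison {
  // Before: all handler threads serialize on one lock per increment.
  private long syncCount;
  public synchronized void incSynchronized() {
    syncCount++;
  }

  // After: LongAdder stripes updates across internal cells, so hundreds of
  // threads can increment with little contention; sum() folds the cells
  // together when the metric is actually read.
  private final LongAdder adderCount = new LongAdder();
  public void incAdder() {
    adderCount.increment();
  }
  public long readAdder() {
    return adderCount.sum();
  }
}
{code}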






[jira] [Commented] (HADOOP-12259) Utility to Dynamic port allocation

2016-10-18 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586074#comment-15586074
 ] 

Zhe Zhang commented on HADOOP-12259:


Thanks for the work Brahma. This is a good addition to branch-2.7 and I just 
did the backport.

> Utility to Dynamic port allocation
> --
>
> Key: HADOOP-12259
> URL: https://issues.apache.org/jira/browse/HADOOP-12259
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: test, util
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: HADOOP-12259.patch
>
>
> As per discussion in YARN-3528 and [~rkanter] comment [here | 
> https://issues.apache.org/jira/browse/YARN-3528?focusedCommentId=14637700=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14637700
>  ]






[jira] [Updated] (HADOOP-12259) Utility to Dynamic port allocation

2016-10-18 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12259:
---
Fix Version/s: 2.7.4

> Utility to Dynamic port allocation
> --
>
> Key: HADOOP-12259
> URL: https://issues.apache.org/jira/browse/HADOOP-12259
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: test, util
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: HADOOP-12259.patch
>
>
> As per discussion in YARN-3528 and [~rkanter] comment [here | 
> https://issues.apache.org/jira/browse/YARN-3528?focusedCommentId=14637700=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14637700
>  ]






[jira] [Commented] (HADOOP-13558) UserGroupInformation created from a Subject incorrectly tries to renew the Kerberos ticket

2016-10-11 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567020#comment-15567020
 ] 

Zhe Zhang commented on HADOOP-13558:


Thanks much Xiao!

> UserGroupInformation created from a Subject incorrectly tries to renew the 
> Kerberos ticket
> --
>
> Key: HADOOP-13558
> URL: https://issues.apache.org/jira/browse/HADOOP-13558
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.7.2, 2.6.4, 3.0.0-alpha2
>Reporter: Alejandro Abdelnur
>Assignee: Xiao Chen
> Fix For: 2.8.0, 2.9.0, 2.7.4, 3.0.0-alpha2
>
> Attachments: HADOOP-13558.01.patch, HADOOP-13558.02.patch, 
> HADOOP-13558.branch-2.7.patch
>
>
> The UGI {{checkTGTAndReloginFromKeytab()}} method checks certain conditions 
> and if they are met it invokes the {{reloginFromKeytab()}}. The 
> {{reloginFromKeytab()}} method then fails with an {{IOException}} 
> "loginUserFromKeyTab must be done first" because there is no keytab 
> associated with the UGI.
> The {{checkTGTAndReloginFromKeytab()}} method checks if there is a keytab 
> ({{isKeytab}} UGI instance variable) associated with the UGI, if there is one 
> it triggers a call to {{reloginFromKeytab()}}. The problem is that the 
> {{keytabFile}} UGI instance variable is NULL, and that triggers the mentioned 
> {{IOException}}.
> The root of the problem seems to be when creating a UGI via the 
> {{UGI.loginUserFromSubject(Subject)}} method, this method uses the 
> {{UserGroupInformation(Subject)}} constructor, and this constructor does the 
> following to determine if there is a keytab or not.
> {code}
>   this.isKeytab = KerberosUtil.hasKerberosKeyTab(subject);
> {code}
> If the {{Subject}} given had a keytab, then the UGI instance will have the 
> {{isKeytab}} set to TRUE.
> It marks the UGI instance as if it had a keytab because the Subject has a 
> keytab. This has 2 problems:
> First, it does not set the keytab file; this combination ({{isKeytab}} set 
> to TRUE and {{keytabFile}} set to NULL) is what triggers the 
> {{IOException}} in the method {{reloginFromKeytab()}}.
> Second (and even if the first problem is fixed, this is still a problem), it 
> assumes that because the subject has a keytab it is up to UGI to do the 
> relogin using the keytab. This is incorrect if the UGI was created using the 
> {{UGI.loginUserFromSubject(Subject)}} method. In that case, the owner of the 
> Subject is not the UGI, but the caller, so the caller is responsible for 
> renewing the Kerberos tickets and the UGI should not try to do so.
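
A simplified model of the described state (not the real
{{UserGroupInformation}} class), showing why the Subject-based path trips the
precondition:
{code}
import java.io.IOException;

public class UgiStateModel {
  private final boolean isKeytab;
  private final String keytabFile; // never set on the Subject-based path

  UgiStateModel(boolean subjectHasKeytab) {
    this.isKeytab = subjectHasKeytab; // mirrors hasKerberosKeyTab(subject)
    this.keytabFile = null;
  }

  void checkTGTAndRelogin() throws IOException {
    if (isKeytab) {
      if (keytabFile == null) { // the state produced by the constructor above
        throw new IOException("loginUserFromKeyTab must be done first");
      }
      // ... relogin from keytabFile ...
    }
    // Otherwise the Subject's owner (the caller) renews the tickets.
  }
}
{code}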






[jira] [Updated] (HADOOP-13378) Common features between YARN and HDFS Router-based federation

2016-10-05 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13378:
---
Description: HDFS-10467 uses a similar architecture to the one proposed in 
YARN-2915. This JIRA tries to identify what is common between these two efforts 
and to build a common framework.  (was: HDFS-10647 uses a similar 
architecture to the one proposed in YARN-2915. This JIRA tries to identify what 
is common between these two efforts and try to build a common framework.)

> Common features between YARN and HDFS Router-based federation
> -
>
> Key: HADOOP-13378
> URL: https://issues.apache.org/jira/browse/HADOOP-13378
> Project: Hadoop Common
>  Issue Type: New Feature
>Reporter: Inigo Goiri
>
> HDFS-10467 uses a similar architecture to the one proposed in YARN-2915. This 
> JIRA tries to identify what is common between these two efforts and to 
> build a common framework.






[jira] [Updated] (HADOOP-13061) Refactor erasure coders

2016-10-04 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13061:
---
Labels: hdfs-ec-3.0-must-do  (was: )

> Refactor erasure coders
> ---
>
> Key: HADOOP-13061
> URL: https://issues.apache.org/jira/browse/HADOOP-13061
> Project: Hadoop Common
>  Issue Type: Sub-task
>Reporter: Rui Li
>Assignee: Kai Sasaki
>  Labels: hdfs-ec-3.0-must-do
> Attachments: HADOOP-13061.01.patch, HADOOP-13061.02.patch, 
> HADOOP-13061.03.patch, HADOOP-13061.04.patch, HADOOP-13061.05.patch, 
> HADOOP-13061.06.patch, HADOOP-13061.07.patch, HADOOP-13061.08.patch, 
> HADOOP-13061.09.patch, HADOOP-13061.10.patch, HADOOP-13061.11.patch, 
> HADOOP-13061.12.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13665) Erasure Coding codec should support fallback coder

2016-10-04 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13665:
---
Labels: hdfs-ec-3.0-must-do  (was: )

> Erasure Coding codec should support fallback coder
> --
>
> Key: HADOOP-13665
> URL: https://issues.apache.org/jira/browse/HADOOP-13665
> Project: Hadoop Common
>  Issue Type: Sub-task
>  Components: io
>Reporter: Wei-Chiu Chuang
>  Labels: hdfs-ec-3.0-must-do
>
> The current EC codec supports only a single coder (by default the pure Java 
> implementation). If the native coder is specified but unavailable, it 
> should fall back to the pure Java implementation.
> One possible solution is to follow the convention of existing Hadoop native 
> codecs, such as transport encryption (see {{CryptoCodec.java}}), which 
> supports fallback by specifying two or more coders as the value of a 
> property and loading them in order.
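
For illustration, the configuration could mirror the {{CryptoCodec}} 
convention along these lines; the first property below is the real AES codec 
key, while the EC property name and factory names are purely hypothetical:

{code}
# Existing CryptoCodec convention: implementations are tried in order,
# and the first one that is available wins.
hadoop.security.crypto.codec.classes.aes.ctr.nopadding=
  org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec,
  org.apache.hadoop.crypto.JceAesCtrCryptoCodec

# Hypothetical analogue for an EC codec: native coder first, Java fallback.
io.erasurecode.codec.rs.rawcoder.classes=
  NativeRSRawErasureCoderFactory,RSRawErasureCoderFactory
{code}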



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13200) Seeking a better approach allowing to customize and configure erasure coders

2016-10-04 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13200:
---
Labels: hdfs-ec-3.0-must-do  (was: )

> Seeking a better approach allowing to customize and configure erasure coders
> 
>
> Key: HADOOP-13200
> URL: https://issues.apache.org/jira/browse/HADOOP-13200
> Project: Hadoop Common
>  Issue Type: Sub-task
>Reporter: Kai Zheng
>Assignee: Kai Zheng
>  Labels: hdfs-ec-3.0-must-do
>
> This is a follow-on task for HADOOP-13010, as discussed over there. There may 
> be a better approach to customizing and configuring erasure coders 
> than the current raw coder factory, as [~cmccabe] suggested. Will copy 
> the relevant comments here to continue the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-10-03 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13055:
---
Assignee: (was: Zhe Zhang)

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
> Attachments: HADOOP-13055.00.patch, HADOOP-13055.01.patch, 
> HADOOP-13055.02.patch
>
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this cases the root of the mount table is merged with the root of
>  *hdfs://nn99/  
> {code}
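
For context, a minimal mount-table configuration contrasting regular links 
with the proposed slash merge (namenode names are illustrative):

{code}
# Regular ViewFs links: each top-level path is mapped explicitly.
fs.viewfs.mounttable.default.link./user=hdfs://nn1/user
fs.viewfs.mounttable.default.link./data=hdfs://nn2/data

# With linkMergeSlash, the mount table root is merged with the root of one
# cluster, so per-path links are unnecessary (and, per this work, disallowed).
fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
{code}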



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-10-03 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543769#comment-15543769
 ] 

Zhe Zhang commented on HADOOP-13055:


Sorry for getting back to this late.

[~shv] Yes, the patch only implements {{linkMergeSlash}}, not 
{{linkMerge}} in general.

[~manojg] Thanks for the interest! Yes, it would be great if you could take over 
this task. Unassigning myself now. I'll get back to your question after 
refreshing my memory of the patch.

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HADOOP-13055.00.patch, HADOOP-13055.01.patch, 
> HADOOP-13055.02.patch
>
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this cases the root of the mount table is merged with the root of
>  *hdfs://nn99/  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive

2016-09-26 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524152#comment-15524152
 ] 

Zhe Zhang commented on HADOOP-13657:


Thanks [~kihwal]. Linking the issue for now. I think the {{Reader}} died for 
different reasons in these two issues, but maybe the solution is similar. I 
don't have a patch either.

> IPC Reader thread could silently die and leave NameNode unresponsive
> 
>
> Key: HADOOP-13657
> URL: https://issues.apache.org/jira/browse/HADOOP-13657
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Reporter: Zhe Zhang
>Priority: Critical
>
> For each listening port, IPC {{Server#Listener#Reader}} is a single thread in 
> charge of moving {{Connection}} items from {{pendingConnections}} (capacity 
> 100) to the {{callQueue}}.
> We have experienced an incident where the {{Reader}} thread for an HDFS 
> NameNode died from a runtime exception. Then the {{pendingConnections}} queue 
> became full and the NameNode port became inaccessible.
> In our particular case, what killed {{Reader}} was an NPE caused by 
> https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types 
> of runtime exceptions could cause this issue as well.
> We should add logic to either make the {{Reader}} more robust in the face of 
> runtime exceptions, or at least treat such an exception as FATAL so that the 
> NameNode can fail over to standby and admins are alerted to the real issue.
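
As a rough sketch of the "more robust" option (illustrative only; 
{{doRunLoop()}} stands for the existing select/accept logic in the {{Reader}} 
inner class of {{Server.java}}), the run loop could survive unexpected 
throwables instead of dying:

{code}
@Override
public void run() {
  while (running) {
    try {
      doRunLoop(); // existing logic: accept connections, read and queue calls
    } catch (Throwable t) {
      // Today an unchecked exception here silently kills the thread and
      // pendingConnections eventually fills up. Instead, log loudly and
      // keep going -- or escalate as FATAL to trigger NameNode failover.
      LOG.error("Reader thread caught unexpected exception; continuing", t);
    }
  }
}
{code}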



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive

2016-09-26 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13657:
---
Description: 
For each listening port, IPC {{Server#Listener#Reader}} is a single thread in 
charge of moving {{Connection}} items from {{pendingConnections}} (capacity 
100) to the {{callQueue}}.

We have experienced an incident where the {{Reader}} thread for an HDFS NameNode 
died from a runtime exception. Then the {{pendingConnections}} queue became full 
and the NameNode port became inaccessible.

In our particular case, what killed {{Reader}} was an NPE caused by 
https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types 
of runtime exceptions could cause this issue as well.

We should add logic to either make the {{Reader}} more robust in the face of 
runtime exceptions, or at least treat such an exception as FATAL so that the 
NameNode can fail over to standby and admins are alerted to the real issue.

  was:
For each listening port, IPC {{Server#Listener#Reader}} is a single thread in 
charge of moving {{Connection}} items from {{pendingConnections}} (capacity 
100) to the {{callQueue}}.

We have experienced an incident where the {{Reader}} thread for HDFS NameNode 
died from run time exception. Then the {{pendingConnections}} queue became full 
and the NameNode port became inaccessible.

In our particular case, what killed {{Reader}} was a NPE caused by 
https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types 
of runtime exceptions could cause this issue as well.

We should add logic to either make the {{Reader}} more robust in case of 
runtime exceptions, or at least treat it as a FATAL exception so that NameNode 
can fail over to standby, and admins get alerted of the real issue.


> IPC Reader thread could silently die and leave NameNode unresponsive
> 
>
> Key: HADOOP-13657
> URL: https://issues.apache.org/jira/browse/HADOOP-13657
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: ipc
>Reporter: Zhe Zhang
>Priority: Critical
>
> For each listening port, IPC {{Server#Listener#Reader}} is a single thread in 
> charge of moving {{Connection}} items from {{pendingConnections}} (capacity 
> 100) to the {{callQueue}}.
> We have experienced an incident where the {{Reader}} thread for an HDFS 
> NameNode died from a runtime exception. Then the {{pendingConnections}} queue 
> became full and the NameNode port became inaccessible.
> In our particular case, what killed {{Reader}} was an NPE caused by 
> https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types 
> of runtime exceptions could cause this issue as well.
> We should add logic to either make the {{Reader}} more robust in the face of 
> runtime exceptions, or at least treat such an exception as FATAL so that the 
> NameNode can fail over to standby and admins are alerted to the real issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive

2016-09-26 Thread Zhe Zhang (JIRA)
Zhe Zhang created HADOOP-13657:
--

 Summary: IPC Reader thread could silently die and leave NameNode 
unresponsive
 Key: HADOOP-13657
 URL: https://issues.apache.org/jira/browse/HADOOP-13657
 Project: Hadoop Common
  Issue Type: Bug
  Components: ipc
Reporter: Zhe Zhang
Priority: Critical


For each listening port, IPC {{Server#Listener#Reader}} is a single thread in 
charge of moving {{Connection}} items from {{pendingConnections}} (capacity 
100) to the {{callQueue}}.

We have experienced an incident where the {{Reader}} thread for an HDFS NameNode 
died from a runtime exception. Then the {{pendingConnections}} queue became full 
and the NameNode port became inaccessible.

In our particular case, what killed {{Reader}} was an NPE caused by 
https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types 
of runtime exceptions could cause this issue as well.

We should add logic to either make the {{Reader}} more robust in the face of 
runtime exceptions, or at least treat such an exception as FATAL so that the 
NameNode can fail over to standby and admins are alerted to the real issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13558) UserGroupInformation created from a Subject incorrectly tries to renew the Kerberos ticket

2016-09-09 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478664#comment-15478664
 ] 

Zhe Zhang commented on HADOOP-13558:


[~xiaochen] Thanks for the fix! Since it affects earlier versions, do you mind 
porting it to branch-2.7? I can also help with that.

> UserGroupInformation created from a Subject incorrectly tries to renew the 
> Kerberos ticket
> --
>
> Key: HADOOP-13558
> URL: https://issues.apache.org/jira/browse/HADOOP-13558
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.7.2, 2.6.4, 3.0.0-alpha2
>Reporter: Alejandro Abdelnur
>Assignee: Xiao Chen
> Fix For: 2.9.0, 3.0.0-alpha2
>
> Attachments: HADOOP-13558.01.patch, HADOOP-13558.02.patch
>
>
> The UGI {{checkTGTAndReloginFromKeytab()}} method checks certain conditions 
> and, if they are met, invokes {{reloginFromKeytab()}}. The 
> {{reloginFromKeytab()}} method then fails with an {{IOException}} 
> "loginUserFromKeyTab must be done first" because there is no keytab 
> associated with the UGI.
> The {{checkTGTAndReloginFromKeytab()}} method checks if there is a keytab 
> ({{isKeytab}} UGI instance variable) associated with the UGI; if there is 
> one, it triggers a call to {{reloginFromKeytab()}}. The problem is that the 
> {{keytabFile}} UGI instance variable is NULL, and that triggers the mentioned 
> {{IOException}}.
> The root of the problem seems to be in creating a UGI via the 
> {{UGI.loginUserFromSubject(Subject)}} method: this method uses the 
> {{UserGroupInformation(Subject)}} constructor, which does the 
> following to determine whether there is a keytab.
> {code}
>   this.isKeytab = KerberosUtil.hasKerberosKeyTab(subject);
> {code}
> If the given {{Subject}} had a keytab, then the UGI instance will have 
> {{isKeytab}} set to TRUE.
> That is, it marks the UGI instance as if it had a keytab because the Subject 
> has one. This has two problems:
> First, it does not set the keytab file; having {{isKeytab}} set to TRUE 
> while {{keytabFile}} is NULL is exactly what triggers the 
> {{IOException}} in the {{reloginFromKeytab()}} method.
> Second (and this remains a problem even if the first is fixed), it 
> assumes that because the subject has a keytab it is up to the UGI to do the 
> relogin using the keytab. This is incorrect if the UGI was created using the 
> {{UGI.loginUserFromSubject(Subject)}} method. In such a case, the owner of the 
> Subject is not the UGI but the caller, so the caller is responsible for 
> renewing the Kerberos tickets and the UGI should not try to do so.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13535) Add jetty6 acceptor startup issue workaround to branch-2

2016-08-29 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446690#comment-15446690
 ] 

Zhe Zhang commented on HADOOP-13535:


Thanks Min for taking on this work. Please try again.

> Add jetty6 acceptor startup issue workaround to branch-2
> 
>
> Key: HADOOP-13535
> URL: https://issues.apache.org/jira/browse/HADOOP-13535
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Wei-Chiu Chuang
>Assignee: Min Shen
>
> After HADOOP-12765 is committed to branch-2, the handling of SSL connection 
> by HttpServer2 may suffer the same Jetty bug described in HADOOP-10588. We 
> should consider adding the same workaround for SSL connection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13535) Add jetty6 acceptor startup issue workaround to branch-2

2016-08-29 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13535:
---
Assignee: Min Shen

> Add jetty6 acceptor startup issue workaround to branch-2
> 
>
> Key: HADOOP-13535
> URL: https://issues.apache.org/jira/browse/HADOOP-13535
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.9.0
>Reporter: Wei-Chiu Chuang
>Assignee: Min Shen
>
> After HADOOP-12765 is committed to branch-2, the handling of SSL connection 
> by HttpServer2 may suffer the same Jetty bug described in HADOOP-10588. We 
> should consider adding the same workaround for SSL connection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-29 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12765:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 2.8.0, 2.9.0, 2.7.4, 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.
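
For illustration, on the Jetty 6 ({{org.mortbay.jetty}}) API that HttpServer2 
used at the time, the switch amounts to roughly the following; this is a 
sketch under the assumption of the Jetty 6.1 connector setters, with 
placeholder port, keystore path, and password:

{code}
import org.mortbay.jetty.Server;
import org.mortbay.jetty.security.SslSelectChannelConnector;

public class NonBlockingSslSketch {
  public static void main(String[] args) throws Exception {
    Server webServer = new Server();

    // Non-blocking SSL connector: an idle SSL connection no longer pins a
    // handler thread for the whole maxIdleTime (200 s by default).
    SslSelectChannelConnector sslConnector = new SslSelectChannelConnector();
    sslConnector.setPort(50470);                        // placeholder port
    sslConnector.setKeystore("/path/to/keystore.jks");  // placeholder path
    sslConnector.setPassword("keystorePassword");       // placeholder secret
    webServer.addConnector(sslConnector);
    webServer.start();
  }
}
{code}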



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-29 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446455#comment-15446455
 ] 

Zhe Zhang commented on HADOOP-12765:


Thanks for the feedback, [~jojochuang]. I resolved both conflicts and backported 
this change to branch-2.7. Agreed, HADOOP-12688 would be a nice improvement. I 
tried backporting it, but it was not quite clean.

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 2.8.0, 2.9.0, 2.7.4, 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-29 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12765:
---
Fix Version/s: 2.7.4

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 2.8.0, 2.9.0, 2.7.4, 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-08-24 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13055:
---
Attachment: HADOOP-13055.02.patch

Updating the patch to fix a unit test failure and improve the {{resolve}} logic.

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HADOOP-13055.00.patch, HADOOP-13055.01.patch, 
> HADOOP-13055.02.patch
>
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this cases the root of the mount table is merged with the root of
>  *hdfs://nn99/  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-23 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12765:
---
Fix Version/s: 2.9.0

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 2.8.0, 2.9.0, 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-23 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433872#comment-15433872
 ] 

Zhe Zhang commented on HADOOP-12765:


I committed this to branch-2 and branch-2.8, but backporting to branch-2.7 hits 
a conflict in the pom files. [~mshen] [~jojochuang] Could you help take a look? 
Thanks.

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 2.8.0, 2.9.0, 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-12668) Support excluding weak Ciphers in HttpServer2 through ssl-server.conf

2016-08-23 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433735#comment-15433735
 ] 

Zhe Zhang commented on HADOOP-12668:


I cherry-picked this to branch-2.7 in support of HADOOP-12765.

> Support excluding weak Ciphers in HttpServer2 through ssl-server.conf 
> --
>
> Key: HADOOP-12668
> URL: https://issues.apache.org/jira/browse/HADOOP-12668
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 2.7.1
>Reporter: Vijay Singh
>Assignee: Vijay Singh
>Priority: Critical
>  Labels: common, ha, hadoop, hdfs, security
> Fix For: 2.8.0, 2.7.4
>
> Attachments: Hadoop-12668.006.patch, Hadoop-12668.007.patch, 
> Hadoop-12668.008.patch, Hadoop-12668.009.patch, Hadoop-12668.010.patch, 
> Hadoop-12668.011.patch, Hadoop-12668.012.patch, test.log
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, the embedded jetty server used across all hadoop services is 
> configured through the ssl-server.xml file from each service's configuration 
> section. However, the SSL/TLS protocol used by these jetty servers can be 
> downgraded to weak cipher suites. This code change aims to add the following 
> functionality:
> 1) Add logic in hadoop common (HttpServer2.java and associated interfaces) to 
> spawn jetty servers with the ability to exclude weak cipher suites. I propose 
> we make this configurable through ssl-server.xml so that each service can 
> choose to disable specific ciphers.
> 2) Modify DFSUtil.java used by HDFS code to supply the new parameter 
> ssl.server.exclude.cipher.list to the hadoop-common code, so it can exclude 
> the ciphers supplied through this key.
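
Concretely, the ssl-server.xml entry would look something like the following; 
the property name comes from this change, while the listed suites are just 
examples of commonly excluded weak ciphers:

{code}
<property>
  <name>ssl.server.exclude.cipher.list</name>
  <value>TLS_ECDHE_RSA_WITH_RC4_128_SHA,
         SSL_DHE_RSA_EXPORT_WITH_DES40_CBC_SHA,
         SSL_RSA_WITH_DES_CBC_SHA,
         SSL_DHE_RSA_WITH_DES_CBC_SHA</value>
</property>
{code}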



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-12668) Support excluding weak Ciphers in HttpServer2 through ssl-server.conf

2016-08-23 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12668:
---
Fix Version/s: 2.7.4

> Support excluding weak Ciphers in HttpServer2 through ssl-server.conf 
> --
>
> Key: HADOOP-12668
> URL: https://issues.apache.org/jira/browse/HADOOP-12668
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 2.7.1
>Reporter: Vijay Singh
>Assignee: Vijay Singh
>Priority: Critical
>  Labels: common, ha, hadoop, hdfs, security
> Fix For: 2.8.0, 2.7.4
>
> Attachments: Hadoop-12668.006.patch, Hadoop-12668.007.patch, 
> Hadoop-12668.008.patch, Hadoop-12668.009.patch, Hadoop-12668.010.patch, 
> Hadoop-12668.011.patch, Hadoop-12668.012.patch, test.log
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, the embedded jetty server used across all hadoop services is 
> configured through the ssl-server.xml file from each service's configuration 
> section. However, the SSL/TLS protocol used by these jetty servers can be 
> downgraded to weak cipher suites. This code change aims to add the following 
> functionality:
> 1) Add logic in hadoop common (HttpServer2.java and associated interfaces) to 
> spawn jetty servers with the ability to exclude weak cipher suites. I propose 
> we make this configurable through ssl-server.xml so that each service can 
> choose to disable specific ciphers.
> 2) Modify DFSUtil.java used by HDFS code to supply the new parameter 
> ssl.server.exclude.cipher.list to the hadoop-common code, so it can exclude 
> the ciphers supplied through this key.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-23 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12765:
---
Fix Version/s: 2.8.0

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 2.8.0, 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-23 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12765:
---
Target Version/s: 2.7.4  (was: 2.9.0)

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-23 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433671#comment-15433671
 ] 

Zhe Zhang commented on HADOOP-12765:


Just noticed that the branch-2 patch has already passed Jenkins. +1. I will 
commit shortly.

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-08-23 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13055:
---
Attachment: HADOOP-13055.01.patch

Updating the patch:
# Fixing a bug in initializing {{root}} which caused the unit test failures
# Enforcing that merge slash and regular links don't co-exist
# Adding a unit test for the above

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HADOOP-13055.00.patch, HADOOP-13055.01.patch
>
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this cases the root of the mount table is merged with the root of
>  *hdfs://nn99/  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-08-23 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13055:
---
Attachment: HADOOP-13055.00.patch

Pretty rough initial patch to test whether the idea breaks any existing unit 
tests. I'm still working on:
# Enforcing that {{linkMergeSlash}} is not used together with regular links
# An issue in {{ViewFileSystem#getFileStatus}} causing the returned status to 
have a wrong path (the {{LocatedFileStatus}} that it wraps is correct)
# More unit tests

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HADOOP-13055.00.patch
>
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this cases the root of the mount table is merged with the root of
>  *hdfs://nn99/  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-08-23 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13055:
---
Status: Patch Available  (was: Open)

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this cases the root of the mount table is merged with the root of
>  *hdfs://nn99/  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-08-23 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433234#comment-15433234
 ] 

Zhe Zhang commented on HADOOP-13055:


I think we need to make two changes to {{InodeTree}}:
# The {{root}} of a mount table can be either an {{INodeDir}} or an 
{{INodeLink}}, so we should make it an {{INode}} and assign its value after 
checking the configurations (at the end of the for loop in the constructor).
# Enforce that when {{linkMergeSlash}} is configured, no other links can be 
configured for that mount table.

I'm writing a patch to implement the above; a rough sketch of the idea follows. 
Any thoughts are very welcome.
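
A minimal sketch of what the constructor change could look like; the helper 
names ({{mountTableEntries}}, {{isLinkMergeSlash}}, {{createSlashLink}}, 
{{addRegularLink}}) are invented for illustration and do not match the actual 
{{InodeTree}} code:

{code}
abstract class InodeTree<T> {
  // Previously an INodeDir<T>; widened so linkMergeSlash can make the
  // root itself a link to another file system's root.
  private INode<T> root;

  protected InodeTree(Configuration config, String mountTableName)
      throws IOException {
    INodeDir<T> dirRoot = null;
    INodeLink<T> slashRoot = null;
    for (Map.Entry<String, String> e
        : mountTableEntries(config, mountTableName)) {   // hypothetical helper
      if (isLinkMergeSlash(e.getKey())) {                // hypothetical helper
        if (dirRoot != null) {
          throw new IOException("linkMergeSlash cannot be mixed with links");
        }
        slashRoot = createSlashLink(e.getValue());       // hypothetical helper
      } else {
        if (slashRoot != null) {
          throw new IOException("links cannot be mixed with linkMergeSlash");
        }
        dirRoot = addRegularLink(dirRoot, e);            // hypothetical helper
      }
    }
    // Assign root only after all configuration entries have been checked.
    root = (slashRoot != null) ? slashRoot : dirRoot;
  }
}
{code}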

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this cases the root of the mount table is merged with the root of
>  *hdfs://nn99/  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-23 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433134#comment-15433134
 ] 

Zhe Zhang commented on HADOOP-12765:


Thanks [~jojochuang]. Branch-2 patch LGTM. +1 pending Jenkins.

The conflict is caused by HADOOP-10588; it's only in branch-2, not trunk. I'll 
file a JIRA to address it.

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections

2016-08-23 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-12765:
---
Fix Version/s: 3.0.0-alpha2

> HttpServer2 should switch to using the non-blocking SslSelectChannelConnector 
> to prevent performance degradation when handling SSL connections
> --
>
> Key: HADOOP-12765
> URL: https://issues.apache.org/jira/browse/HADOOP-12765
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.7.2, 2.6.3
>Reporter: Min Shen
>Assignee: Min Shen
> Fix For: 3.0.0-alpha2
>
> Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, 
> HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, 
> HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, 
> blocking_2.png, unblocking.png
>
>
> The current implementation uses the blocking SslSocketConnector, which takes 
> the default maxIdleTime of 200 seconds. We noticed in our cluster that when 
> users use a custom client that accesses the WebHDFS REST APIs through https, 
> it could block all 250 handler threads in the NN jetty server, causing severe 
> performance degradation for accessing WebHDFS and the NN web UI. Attached 
> screenshots (blocking_1.png and blocking_2.png) illustrate that when using 
> SslSocketConnector, the jetty handler threads are not released until the 
> 200-second maxIdleTime has passed. With a sufficient number of SSL 
> connections, this issue could render the NN HttpServer entirely unresponsive.
> We propose to use the non-blocking SslSelectChannelConnector as a fix. We 
> have deployed the attached patch within our cluster and have seen 
> significant improvement. The attached screenshot (unblocking.png) further 
> illustrates the behavior of the NN jetty server after switching to 
> SslSelectChannelConnector.
> The patch further disables the SSLv3 protocol on the server side to preserve 
> the spirit of HADOOP-11260.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-08-19 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13055:
---
Comment: was deleted

(was: Actually, after looking more at the code, what I really want is the 
ability to mount the root of an HDFS cluster at some path in the mount table 
(not necessarily the root of the mount table). This is already possible. 
Unassigning myself from this JIRA.)

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this cases the root of the mount table is merged with the root of
>  *hdfs://nn99/  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-08-19 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13055:
---
Target Version/s: 2.7.4  (was: 2.8.0)

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this cases the root of the mount table is merged with the root of
>  *hdfs://nn99/  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Assigned] (HADOOP-13055) Implement linkMergeSlash for ViewFs

2016-08-19 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang reassigned HADOOP-13055:
--

Assignee: Zhe Zhang

> Implement linkMergeSlash for ViewFs
> ---
>
> Key: HADOOP-13055
> URL: https://issues.apache.org/jira/browse/HADOOP-13055
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: fs, viewfs
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>
> In a multi-cluster environment it is sometimes useful to operate on the root 
> / slash directory of an HDFS cluster. E.g., list all top level directories. 
> Quoting the comment in {{ViewFs}}:
> {code}
>  *   A special case of the merge mount is where mount table's root is merged
>  *   with the root (slash) of another file system:
>  *   
>  *   fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/
>  *   
>  *   In this case the root of the mount table is merged with the root of
>  *   hdfs://nn99/
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13408) TestUTF8 fails in branch-2.6

2016-07-22 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13408:
---
Assignee: Ye Zhou

> TestUTF8 fails in branch-2.6
> 
>
> Key: HADOOP-13408
> URL: https://issues.apache.org/jira/browse/HADOOP-13408
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: test
>Reporter: Zhe Zhang
>Assignee: Ye Zhou
>Priority: Minor
>  Labels: newbie
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Moved] (HADOOP-13408) TestUTF8 fails in branch-2.6

2016-07-22 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang moved HDFS-10680 to HADOOP-13408:
---

Target Version/s: 2.6.5  (was: 2.6.5)
 Component/s: (was: test)
  test
 Key: HADOOP-13408  (was: HDFS-10680)
 Project: Hadoop Common  (was: Hadoop HDFS)

> TestUTF8 fails in branch-2.6
> 
>
> Key: HADOOP-13408
> URL: https://issues.apache.org/jira/browse/HADOOP-13408
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: test
>Reporter: Zhe Zhang
>Priority: Minor
>  Labels: newbie
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13206) Delegation token cannot be fetched and used by different versions of client

2016-07-20 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386692#comment-15386692
 ] 

Zhe Zhang commented on HADOOP-13206:


Thanks much for the suggestion [~cnauroth].

Although in our case the issue happened between 2.6 and 2.3 clients, I now think 
it can also happen between clients of the same version, for two reasons. Assume 
there is a client {{A}}, which fetches tokens, and a client {{B}}, which uses them.
# Clients {{A}} and {{B}} could use different values of 
{{hadoop.security.token.service.use_ip}}. Should we treat this as a 
misconfiguration and enforce the same value across the entire production 
environment?
# Client {{A}}, when fetching the token, could use a numerical IP address to 
refer to the NameNode, such as {{webhdfs://123.45.67.89:50070}}, while client 
{{B}}, when using the token, could use a logical URI such as 
{{webhdfs://clusterNN}}.

Good point about the DNS overhead. How about we update the patch to perform the 
newly added check only if one URI is logical and the other is not?
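A minimal sketch of that idea ({{HAUtil.isLogicalUri}} exists in HDFS; the 
surrounding variables and the {{resolveAndCompare}} helper are hypothetical):
{code}
// Sketch only: pay the DNS-resolution cost of the equivalence check
// only when exactly one of the two URIs is logical.
boolean fsUriLogical = HAUtil.isLogicalUri(conf, fsUri);
boolean tokenUriLogical = HAUtil.isLogicalUri(conf, tokenUri);
boolean matches;
if (fsUriLogical != tokenUriLogical) {
  matches = resolveAndCompare(fsUri, tokenUri);  // hypothetical helper
} else {
  matches = fsUri.equals(tokenUri);
}
{code}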

> Delegation token cannot be fetched and used by different versions of client
> ---
>
> Key: HADOOP-13206
> URL: https://issues.apache.org/jira/browse/HADOOP-13206
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.3.0, 2.6.1
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HADOOP-13206.00.patch, HADOOP-13206.01.patch, 
> HADOOP-13206.02.patch
>
>
> We have observed that an HDFS delegation token fetched by a 2.3.0 client 
> cannot be used by a 2.6.1 client, and vice versa. Through some debugging I 
> found that it is caused by a mismatch between the token's {{service}} and the 
> {{service}} of the filesystem (e.g. {{webhdfs://host.something.com:50070/}}): 
> one would be a numerical IP address and the other a non-numerical hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13206) Delegation token cannot be fetched and used by different versions of client

2016-07-20 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386428#comment-15386428
 ] 

Zhe Zhang commented on HADOOP-13206:


Thanks for the review [~shv]. I'll post a patch to address it soon.

> Delegation token cannot be fetched and used by different versions of client
> ---
>
> Key: HADOOP-13206
> URL: https://issues.apache.org/jira/browse/HADOOP-13206
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.3.0, 2.6.1
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HADOOP-13206.00.patch, HADOOP-13206.01.patch, 
> HADOOP-13206.02.patch
>
>
> We have observed that an HDFS delegation token fetched by a 2.3.0 client 
> cannot be used by a 2.6.1 client, and vice versa. Through some debugging I 
> found that it is caused by a mismatch between the token's {{service}} and the 
> {{service}} of the filesystem (e.g. {{webhdfs://host.something.com:50070/}}): 
> one would be a numerical IP address and the other a non-numerical hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13206) Delegation token cannot be fetched and used by different versions of client

2016-07-20 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386426#comment-15386426
 ] 

Zhe Zhang commented on HADOOP-13206:


I just found that HADOOP-7733 had some very relevant discussions on this issue.

[~cnauroth], [~daryn]: could you take a look at the patch posted here too? 
Thanks.

I don't think this is necessarily a misconfiguration, because a token can 
potentially be fetched and used by different clients (and possibly by different 
versions of the client).

> Delegation token cannot be fetched and used by different versions of client
> ---
>
> Key: HADOOP-13206
> URL: https://issues.apache.org/jira/browse/HADOOP-13206
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.3.0, 2.6.1
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HADOOP-13206.00.patch, HADOOP-13206.01.patch, 
> HADOOP-13206.02.patch
>
>
> We have observed that an HDFS delegation token fetched by a 2.3.0 client 
> cannot be used by a 2.6.1 client, and vice versa. Through some debugging I 
> found that it is caused by a mismatch between the token's {{service}} and the 
> {{service}} of the filesystem (e.g. {{webhdfs://host.something.com:50070/}}): 
> one would be a numerical IP address and the other a non-numerical hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13206) Delegation token cannot be fetched and used by different versions of client

2016-07-19 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385049#comment-15385049
 ] 

Zhe Zhang commented on HADOOP-13206:


I did more debugging and found the reason why different versions of the client 
return different formats of {{service}}.

In *trunk*, {{WebHdfsFileSystem#getDelegationToken}} sets the {{service}} as 
follows:
{code}
if (token != null) {
  token.setService(tokenServiceName);
}
{code}

{{tokenServiceName}} is set as follows:
{code}
this.tokenServiceName = isLogicalUri ?
    HAUtilClient.buildTokenServiceForLogicalUri(uri, getScheme())
    : SecurityUtil.buildTokenService(getCanonicalUri());
{code}

This essentially creates a logical URI such as {{webhdfs://myhost}}.

In *branch-2.3*, the logic is as below, which results in numerical IP addresses.
{code}
SecurityUtil.setTokenService(token, getCurrentNNAddr());
...
this.nnAddrs = DFSUtil.resolveWebHdfsUri(this.uri, conf);
...
  /**
   * Resolve an HDFS URL into real InetSocketAddress. It works like a DNS
   * resolver when the URL points to a non-HA cluster. When the URL points to
   * an HA cluster, the resolver further resolves the logical name (i.e., the
   * authority in the URL) into real namenode addresses.
   */
  public static InetSocketAddress[] resolveWebHdfsUri(URI uri, Configuration conf)
      throws IOException {
    int defaultPort;
    String scheme = uri.getScheme();
    if (WebHdfsFileSystem.SCHEME.equals(scheme)) {
      defaultPort = DFSConfigKeys.DFS_NAMENODE_HTTP_PORT_DEFAULT;
    } else if (SWebHdfsFileSystem.SCHEME.equals(scheme)) {
      defaultPort = DFSConfigKeys.DFS_NAMENODE_HTTPS_PORT_DEFAULT;
    } else {
      throw new IllegalArgumentException("Unsupported scheme: " + scheme);
    }

    ArrayList<InetSocketAddress> ret = new ArrayList<InetSocketAddress>();

    if (!HAUtil.isLogicalUri(conf, uri)) {
      InetSocketAddress addr = NetUtils.createSocketAddr(uri.getAuthority(),
          defaultPort);
      ret.add(addr);
    } else {
      Map<String, Map<String, InetSocketAddress>> addresses = DFSUtil
          .getHaNnWebHdfsAddresses(conf, scheme);

      for (Map<String, InetSocketAddress> addrs : addresses.values()) {
        for (InetSocketAddress addr : addrs.values()) {
          ret.add(addr);
        }
      }
    }

    InetSocketAddress[] r = new InetSocketAddress[ret.size()];
    return ret.toArray(r);
  }
{code}
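For the same HA cluster, the two code paths therefore produce differently 
formatted services. An illustrative comparison (hypothetical host and address 
values):
{code}
// trunk:       token.getService() -> "webhdfs://myhost"     (logical URI)
// branch-2.3:  token.getService() -> "123.45.67.89:50070"   (resolved address)
{code}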

It's hard to add a unit test because we can't emulate a version 2.3 client in 
trunk code, but I hope the above explanation is clear enough.

> Delegation token cannot be fetched and used by different versions of client
> ---
>
> Key: HADOOP-13206
> URL: https://issues.apache.org/jira/browse/HADOOP-13206
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.3.0, 2.6.1
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Attachments: HADOOP-13206.00.patch, HADOOP-13206.01.patch, 
> HADOOP-13206.02.patch
>
>
> We have observed that an HDFS delegation token fetched by a 2.3.0 client 
> cannot be used by a 2.6.1 client, and vice versa. Through some debugging I 
> found that it is caused by a mismatch between the token's {{service}} and the 
> {{service}} of the filesystem (e.g. {{webhdfs://host.something.com:50070/}}): 
> one would be a numerical IP address and the other a non-numerical hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-13289) Remove unused variables in TestFairCallQueue

2016-07-15 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380226#comment-15380226
 ] 

Zhe Zhang commented on HADOOP-13289:


Thanks Ye for the patch and Akira for reviewing / committing. I'd like to 
backport this to 2.6, but the cherry-pick from 2.8 to 2.7 is not clean.

[~zhouyejoe] Do you mind posting a patch for branch-2.7? Thanks.

> Remove unused variables in TestFairCallQueue
> 
>
> Key: HADOOP-13289
> URL: https://issues.apache.org/jira/browse/HADOOP-13289
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: test
>Affects Versions: 2.6.0
>Reporter: Konstantin Shvachko
>Assignee: Ye Zhou
>Priority: Minor
>  Labels: newbie
> Fix For: 2.8.0
>
> Attachments: HADOOP-13289.001.patch
>
>
> # Remove unused member {{alwaysZeroScheduler}} and related initialization in 
> {{TestFairCallQueue}}
> # Remove unused local variable {{sched}} in 
> {{testOfferSucceedsWhenScheduledLowPriority()}}
> And propagate to applicable release branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-13290) Appropriate use of generics in FairCallQueue

2016-07-13 Thread Zhe Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-13290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhe Zhang updated HADOOP-13290:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0-alpha1
   2.6.5
   2.9.0
   2.7.3
   2.8.0
   Status: Resolved  (was: Patch Available)

Thanks Jonathan for the fix and Konstantin for the review. I just committed the 
patch to all branches from trunk down to branch-2.6.

> Appropriate use of generics in FairCallQueue
> 
>
> Key: HADOOP-13290
> URL: https://issues.apache.org/jira/browse/HADOOP-13290
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: ipc
>Affects Versions: 2.6.0
>Reporter: Konstantin Shvachko
>Assignee: Jonathan Hung
>  Labels: newbie++
> Fix For: 2.8.0, 2.7.3, 2.9.0, 2.6.5, 3.0.0-alpha1
>
> Attachments: HADOOP-13290.001.patch, HADOOP-13290.002.patch
>
>
> # {{BlockingQueue}} is intermittently used with and without generic 
> parameters in the {{FairCallQueue}} class. It should be parameterized.
> # Same for {{FairCallQueue}} itself; it should be parameterized. That could be 
> a bit trickier.
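For illustration, the kind of change being asked for looks roughly like this; 
the element type {{Schedulable}} and the surrounding variables are assumptions, 
not the actual signatures:
{code}
// Before: raw type, as the issue describes.
BlockingQueue queue = queues.get(priorityLevel);

// After: parameterized, so element types are checked at compile time.
BlockingQueue<Schedulable> typedQueue = typedQueues.get(priorityLevel);
{code}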



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org


