[jira] [Comment Edited] (HADOOP-12862) LDAP Group Mapping over SSL can not specify trust store
[ https://issues.apache.org/jira/browse/HADOOP-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419252#comment-16419252 ] Zhe Zhang edited comment on HADOOP-12862 at 3/29/18 4:03 PM: - Thanks [~shv]. v9 patch LGTM. +1 was (Author: zhz): Thanks [~shv]. +1 on v9 patch. > LDAP Group Mapping over SSL can not specify trust store > --- > > Key: HADOOP-12862 > URL: https://issues.apache.org/jira/browse/HADOOP-12862 > Project: Hadoop Common > Issue Type: Bug >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Major > Labels: release-blocker > Attachments: HADOOP-12862.001.patch, HADOOP-12862.002.patch, > HADOOP-12862.003.patch, HADOOP-12862.004.patch, HADOOP-12862.005.patch, > HADOOP-12862.006.patch, HADOOP-12862.007.patch, HADOOP-12862.008.patch, > HADOOP-12862.009.patch > > > In a secure environment, SSL is used to encrypt LDAP requests for group > mapping resolution. > We (+[~yoderme], +[~tgrayson]) have found that its implementation is strange. > For information, the Hadoop name node, as an LDAP client, talks to an LDAP server > to resolve the group mapping of a user. In the case of LDAP over SSL, a > typical scenario is to establish one-way authentication (the client verifies > the server's certificate is real) by storing the server's certificate in the > client's truststore. > A rarer scenario is to establish two-way authentication: in addition to storing a > truststore for the client to verify the server, the server also verifies the > client's certificate is real, and the client stores its own certificate in > its keystore. 
> However, the current implementation for LDAP over SSL does not seem to be > correct in that it only configures a keystore but no truststore (so the LDAP server > can verify Hadoop's certificate, but Hadoop may not be able to verify the LDAP > server's certificate). > I think there should be an extra pair of properties to specify the > truststore/password for the LDAP server, and use that to configure the system > properties {{javax.net.ssl.trustStore}}/{{javax.net.ssl.trustStorePassword}}. > I am a security layman so my words can be imprecise. But I hope this makes > sense. > Oracle's SSL LDAP documentation: > http://docs.oracle.com/javase/jndi/tutorial/ldap/security/ssl.html > JSSE reference guide: > http://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/JSSERefGuide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
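The proposal above boils down to standard JSSE/JNDI usage. Below is an illustrative sketch (not the actual HADOOP-12862 patch; the class and method names are made up for this example) of wiring a truststore into an LDAPS connection via the two system properties named in the description:

```java
import java.util.Hashtable;
import javax.naming.Context;

// Illustrative sketch, not the Hadoop patch itself: the truststore is passed
// to JSSE via the standard system properties, and a JNDI environment is set
// up for an ldaps:// URL so the handshake verifies the server certificate
// against that truststore.
public class LdapsTrustStoreSketch {

    // Sets the JSSE trust material; note this affects all SSL sockets in the JVM.
    static void configureTrustStore(String trustStorePath, String password) {
        System.setProperty("javax.net.ssl.trustStore", trustStorePath);
        System.setProperty("javax.net.ssl.trustStorePassword", password);
    }

    // Minimal JNDI environment for LDAP over SSL.
    static Hashtable<String, String> ldapsEnv(String providerUrl) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, providerUrl); // e.g. "ldaps://ldap.example.com:636"
        env.put(Context.SECURITY_PROTOCOL, "ssl");
        return env;
    }
}
```

In one-way authentication only `configureTrustStore` is needed; for the two-way case described above, the `javax.net.ssl.keyStore`/`javax.net.ssl.keyStorePassword` properties would additionally carry the client's own certificate.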
[jira] [Commented] (HADOOP-12862) LDAP Group Mapping over SSL can not specify trust store
[ https://issues.apache.org/jira/browse/HADOOP-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419252#comment-16419252 ] Zhe Zhang commented on HADOOP-12862: Thanks [~shv]. +1 on v9 patch. > LDAP Group Mapping over SSL can not specify trust store > --- > > Key: HADOOP-12862 > URL: https://issues.apache.org/jira/browse/HADOOP-12862 > Project: Hadoop Common > Issue Type: Bug >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang >Priority: Major > Labels: release-blocker > Attachments: HADOOP-12862.001.patch, HADOOP-12862.002.patch, > HADOOP-12862.003.patch, HADOOP-12862.004.patch, HADOOP-12862.005.patch, > HADOOP-12862.006.patch, HADOOP-12862.007.patch, HADOOP-12862.008.patch, > HADOOP-12862.009.patch > > > In a secure environment, SSL is used to encrypt LDAP requests for group > mapping resolution. > We (+[~yoderme], +[~tgrayson]) have found that its implementation is strange. > For information, the Hadoop name node, as an LDAP client, talks to an LDAP server > to resolve the group mapping of a user. In the case of LDAP over SSL, a > typical scenario is to establish one-way authentication (the client verifies > the server's certificate is real) by storing the server's certificate in the > client's truststore. > A rarer scenario is to establish two-way authentication: in addition to storing a > truststore for the client to verify the server, the server also verifies the > client's certificate is real, and the client stores its own certificate in > its keystore. 
> However, the current implementation for LDAP over SSL does not seem to be > correct in that it only configures a keystore but no truststore (so the LDAP server > can verify Hadoop's certificate, but Hadoop may not be able to verify the LDAP > server's certificate). > I think there should be an extra pair of properties to specify the > truststore/password for the LDAP server, and use that to configure the system > properties {{javax.net.ssl.trustStore}}/{{javax.net.ssl.trustStorePassword}}. > I am a security layman so my words can be imprecise. But I hope this makes > sense. > Oracle's SSL LDAP documentation: > http://docs.oracle.com/javase/jndi/tutorial/ldap/security/ssl.html > JSSE reference guide: > http://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/JSSERefGuide.html
[jira] [Updated] (HADOOP-15322) LDAPGroupMapping search tree base improvement
[ https://issues.apache.org/jira/browse/HADOOP-15322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-15322: --- Fix Version/s: (was: 2.7.6) > LDAPGroupMapping search tree base improvement > - > > Key: HADOOP-15322 > URL: https://issues.apache.org/jira/browse/HADOOP-15322 > Project: Hadoop Common > Issue Type: Improvement > Components: common, security >Affects Versions: 2.7.4 >Reporter: Ganesh >Priority: Major > > Currently the same LDAP base is used for searching posixAccount and > posixGroup. This request is to make a separate base for each container (i.e., the > posixAccount and posixGroup containers).
[jira] [Updated] (HADOOP-15322) LDAPGroupMapping search tree base improvement
[ https://issues.apache.org/jira/browse/HADOOP-15322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-15322: --- Component/s: security > LDAPGroupMapping search tree base improvement > - > > Key: HADOOP-15322 > URL: https://issues.apache.org/jira/browse/HADOOP-15322 > Project: Hadoop Common > Issue Type: Improvement > Components: common, security >Affects Versions: 2.7.4 >Reporter: Ganesh >Priority: Major > > Currently the same LDAP base is used for searching posixAccount and > posixGroup. This request is to make a separate base for each container (i.e., the > posixAccount and posixGroup containers).
[jira] [Commented] (HADOOP-14742) Document multi-URI replication Inode for ViewFS
[ https://issues.apache.org/jira/browse/HADOOP-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271339#comment-16271339 ] Zhe Zhang commented on HADOOP-14742: Looks like the Description of HADOOP-12077 is a good starting point? > Document multi-URI replication Inode for ViewFS > --- > > Key: HADOOP-14742 > URL: https://issues.apache.org/jira/browse/HADOOP-14742 > Project: Hadoop Common > Issue Type: Task > Components: documentation, viewfs >Affects Versions: 3.0.0-beta1 >Reporter: Chris Douglas > > HADOOP-12077 added client-side "replication" capabilities to ViewFS. Its > semantics and configuration should be documented. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-14732) ProtobufRpcEngine should use Time.monotonicNow to measure durations
[ https://issues.apache.org/jira/browse/HADOOP-14732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14732: --- Fix Version/s: (was: 3.0.0-beta1) (was: 2.9.0) > ProtobufRpcEngine should use Time.monotonicNow to measure durations > --- > > Key: HADOOP-14732 > URL: https://issues.apache.org/jira/browse/HADOOP-14732 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Hanisha Koneru >Assignee: Hanisha Koneru > Attachments: HADOOP-14732.001.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
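For context on the HADOOP-14732 title: `System.currentTimeMillis()` can move backwards when the wall clock is adjusted (e.g. by NTP), so elapsed-time measurements should come from a monotonic source. Hadoop's `Time.monotonicNow()` is based on `System.nanoTime()`; the stand-in below (names are this example's, not Hadoop's) mirrors that idea:

```java
// Minimal sketch of monotonic duration measurement. Only differences between
// two monotonicNow() readings are meaningful; the absolute value is not a
// wall-clock timestamp.
public class MonotonicClockSketch {

    // Milliseconds from a monotonic source, analogous to Time.monotonicNow().
    static long monotonicNow() {
        return System.nanoTime() / 1_000_000L;
    }

    // A duration measured this way can never be negative, unlike a
    // currentTimeMillis() delta taken across a clock adjustment.
    static long elapsedMillis(long startMonotonicMillis) {
        return monotonicNow() - startMonotonicMillis;
    }
}
```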
[jira] [Updated] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14214: --- Fix Version/s: 2.7.4 > DomainSocketWatcher::add()/delete() should not self interrupt while looping > await() > --- > > Key: HADOOP-14214 > URL: https://issues.apache.org/jira/browse/HADOOP-14214 > Project: Hadoop Common > Issue Type: Bug > Components: hdfs-client >Reporter: Mingliang Liu >Assignee: Mingliang Liu >Priority: Critical > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HADOOP-14214.000.patch > > > Our hive team found a TPCDS job whose queries running on LLAP seem to be > getting stuck. Dozens of threads were waiting for the > {{DfsClientShmManager::lock}}, as following jstack: > {code} > Thread 251 (IO-Elevator-Thread-5): > State: WAITING > Blocked count: 3871 > Wtaited count: 4565 > Waiting on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@16ead198 > Stack: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976) > > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017) > > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718) > > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422) > > 
org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333) > > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181) > > org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118) > org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1478) > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1441) > org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121) > > org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111) > > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166) > > org.apache.hadoop.hive.llap.io.metadata.OrcStripeMetadata.<init>(OrcStripeMetadata.java:64) > > org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.readStripesMetadata(OrcEncodedDataReader.java:622) > {code} > The thread that is expected to signal those threads is calling the > {{DomainSocketWatcher::add()}} method, but it gets stuck there dealing with > InterruptedException infinitely. 
The jstack is like: > {code} > Thread 44417 (TezTR-257387_2840_12_10_52_0): > State: RUNNABLE > Blocked count: 3 > Wtaited count: 5 > Stack: > java.lang.Throwable.fillInStackTrace(Native Method) > java.lang.Throwable.fillInStackTrace(Throwable.java:783) > java.lang.Throwable.<init>(Throwable.java:250) > java.lang.Exception.<init>(Exception.java:54) > java.lang.InterruptedException.<init>(InterruptedException.java:57) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2034) > > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325) > > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266) > > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017) > > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718) > > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422) > > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333) > > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181) > > org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118)
[jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
[ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101756#comment-16101756 ] Zhe Zhang commented on HADOOP-14214: Thanks [~liuml07] for the fix. Indeed a major bug. I just backported to branch-2.7. > DomainSocketWatcher::add()/delete() should not self interrupt while looping > await() > --- > > Key: HADOOP-14214 > URL: https://issues.apache.org/jira/browse/HADOOP-14214 > Project: Hadoop Common > Issue Type: Bug > Components: hdfs-client >Reporter: Mingliang Liu >Assignee: Mingliang Liu >Priority: Critical > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HADOOP-14214.000.patch > > > Our hive team found a TPCDS job whose queries running on LLAP seem to be > getting stuck. Dozens of threads were waiting for the > {{DfsClientShmManager::lock}}, as following jstack: > {code} > Thread 251 (IO-Elevator-Thread-5): > State: WAITING > Blocked count: 3871 > Wtaited count: 4565 > Waiting on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@16ead198 > Stack: > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976) > > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255) > > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017) > > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718) > > 
org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422) > > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333) > > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181) > > org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118) > org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1478) > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1441) > org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121) > > org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111) > > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166) > > org.apache.hadoop.hive.llap.io.metadata.OrcStripeMetadata.<init>(OrcStripeMetadata.java:64) > > org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.readStripesMetadata(OrcEncodedDataReader.java:622) > {code} > The thread that is expected to signal those threads is calling the > {{DomainSocketWatcher::add()}} method, but it gets stuck there dealing with > InterruptedException infinitely. 
The jstack is like: > {code} > Thread 44417 (TezTR-257387_2840_12_10_52_0): > State: RUNNABLE > Blocked count: 3 > Wtaited count: 5 > Stack: > java.lang.Throwable.fillInStackTrace(Native Method) > java.lang.Throwable.fillInStackTrace(Throwable.java:783) > java.lang.Throwable.<init>(Throwable.java:250) > java.lang.Exception.<init>(Exception.java:54) > java.lang.InterruptedException.<init>(InterruptedException.java:57) > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2034) > > org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325) > > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266) > > org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017) > > org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784) > > org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718) > > org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422) > > org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333) > >
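The failure mode described in this issue — catching `InterruptedException` inside a wait loop and immediately re-interrupting the thread — makes every subsequent `await()` throw at once, so the loop never blocks and never makes progress. A minimal sketch of the interrupt-safe pattern (record the interrupt, restore it once on exit) that avoids this; the class and method names here are invented for illustration, not taken from the actual patch:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of waiting on a condition without self-interrupting inside the loop.
// Re-interrupting inside the catch would make the next await() throw
// immediately, turning the wait loop into a busy spin; instead the interrupt
// is remembered and the thread's status is restored once, after the loop.
public class InterruptSafeWait {
    private final Lock lock = new ReentrantLock();
    private final Condition cond = lock.newCondition();
    private boolean ready = false;

    public void signalReady() {
        lock.lock();
        try {
            ready = true;
            cond.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public void awaitReady() {
        boolean interrupted = false;
        lock.lock();
        try {
            while (!ready) {
                try {
                    cond.await();
                } catch (InterruptedException e) {
                    interrupted = true; // remember, but do NOT re-interrupt here
                }
            }
        } finally {
            lock.unlock();
            if (interrupted) {
                Thread.currentThread().interrupt(); // restore status once, on exit
            }
        }
    }
}
```

`Condition.awaitUninterruptibly()` achieves a similar effect when the interrupt status does not need to be observed at all.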
[jira] [Updated] (HADOOP-10829) Iteration on CredentialProviderFactory.serviceLoader is thread-unsafe
[ https://issues.apache.org/jira/browse/HADOOP-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10829: --- Fix Version/s: 2.8.3 2.7.4 Thanks for the fix [~benoyantony]. Given this is a security bug fix, I just backported to branch-2.8 and branch-2.7. > Iteration on CredentialProviderFactory.serviceLoader is thread-unsafe > -- > > Key: HADOOP-10829 > URL: https://issues.apache.org/jira/browse/HADOOP-10829 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.6.0 >Reporter: Benoy Antony >Assignee: Benoy Antony > Labels: BB2015-05-TBR > Fix For: 2.9.0, 2.7.4, 3.0.0-beta1, 2.8.3 > > Attachments: HADOOP-10829.003.patch, HADOOP-10829.patch, > HADOOP-10829.patch > > > CredentialProviderFactory uses the _ServiceLoader_ framework to load > _CredentialProviderFactory_ > {code} > private static final ServiceLoader<CredentialProviderFactory> serviceLoader > = > ServiceLoader.load(CredentialProviderFactory.class); > {code} > The _ServiceLoader_ framework does lazy initialization of services, which > makes it thread-unsafe. If accessed from multiple threads, it is better to > synchronize the access. > Similar synchronization has been done while loading compression codec > providers via HADOOP-8406.
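For context: `ServiceLoader` instantiates providers lazily while its iterator runs, and that iterator is not thread-safe, so concurrent lookups must serialize iteration. A self-contained sketch of the synchronization pattern the fix applies (using `Runnable` as a stand-in service type, since `CredentialProviderFactory` lives in Hadoop; the class name is this example's own):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Sketch of guarding ServiceLoader iteration with a shared lock, the same
// approach HADOOP-8406 used for compression codec providers. The loader
// populates its provider cache lazily during iteration, so unsynchronized
// concurrent iteration can corrupt that cache.
public class SafeProviderLookup {

    // Runnable is a stand-in for CredentialProviderFactory here.
    private static final ServiceLoader<Runnable> SERVICE_LOADER =
        ServiceLoader.load(Runnable.class);

    public static List<Runnable> listProviders() {
        List<Runnable> providers = new ArrayList<>();
        // Serialize iteration: lazy instantiation happens inside this block.
        synchronized (SERVICE_LOADER) {
            for (Runnable provider : SERVICE_LOADER) {
                providers.add(provider);
            }
        }
        return providers;
    }
}
```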
[jira] [Commented] (HADOOP-14599) RPC queue time metrics omit timed out clients
[ https://issues.apache.org/jira/browse/HADOOP-14599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16068571#comment-16068571 ] Zhe Zhang commented on HADOOP-14599: Thanks for working on this [~aramesh2]. Could you re-upload the patch and name it HADOOP-14599..patch? Otherwise Jenkins won't work. See https://wiki.apache.org/hadoop/HowToContribute#Naming_your_patch > RPC queue time metrics omit timed out clients > - > > Key: HADOOP-14599 > URL: https://issues.apache.org/jira/browse/HADOOP-14599 > Project: Hadoop Common > Issue Type: Bug > Components: metrics, rpc-server >Affects Versions: 2.7.0 >Reporter: Ashwin Ramesh >Assignee: Ashwin Ramesh > Attachments: HADOOP_14599.patch > > > RPC average queue time metrics will now update even if the client who made > the call timed out while the call was in the call queue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps
[ https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063791#comment-16063791 ] Zhe Zhang commented on HADOOP-14502: Sorry, my bad. > Confusion/name conflict between NameNodeActivity#BlockReportNumOps and > RpcDetailedActivity#BlockReportNumOps > > > Key: HADOOP-14502 > URL: https://issues.apache.org/jira/browse/HADOOP-14502 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Labels: Incompatible > Fix For: 3.0.0-alpha4 > > Attachments: HADOOP-14502.000.patch, HADOOP-14502.001.patch, > HADOOP-14502.002.patch > > > Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the > {{RpcDetailedActivity}} context and those emitted under the > {{NameNodeActivity}} context are actually reporting different things despite > having the same name. {{NameNodeActivity}} reports the count/time of _per > storage_ block reports, whereas {{RpcDetailedActivity}} reports the > count/time of _per datanode_ block reports. This makes for a confusing > experience with two metrics having the same name reporting different values. > We already have the {{StorageBlockReportsOps}} metric under > {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} > metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other > suggestions about how to address this as well. The 3.0 release seems a good > time to make this incompatible change. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps
[ https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14502: --- Resolution: Fixed Fix Version/s: 3.0.0-alpha4 Status: Resolved (was: Patch Available) > Confusion/name conflict between NameNodeActivity#BlockReportNumOps and > RpcDetailedActivity#BlockReportNumOps > > > Key: HADOOP-14502 > URL: https://issues.apache.org/jira/browse/HADOOP-14502 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Labels: Incompatible > Fix For: 3.0.0-alpha4 > > Attachments: HADOOP-14502.000.patch, HADOOP-14502.001.patch, > HADOOP-14502.002.patch > > > Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the > {{RpcDetailedActivity}} context and those emitted under the > {{NameNodeActivity}} context are actually reporting different things despite > having the same name. {{NameNodeActivity}} reports the count/time of _per > storage_ block reports, whereas {{RpcDetailedActivity}} reports the > count/time of _per datanode_ block reports. This makes for a confusing > experience with two metrics having the same name reporting different values. > We already have the {{StorageBlockReportsOps}} metric under > {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} > metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other > suggestions about how to address this as well. The 3.0 release seems a good > time to make this incompatible change. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps
[ https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14502: --- Hadoop Flags: Incompatible change,Reviewed (was: Incompatible change) Thanks Erik! +1 on v2 patch as well. Tested with {{MiniHadoopClusterManager}} and it shows the desired behavior. {code}
}, {
  "name" : "Hadoop:service=NameNode,name=NameNodeActivity",
  "modelerType" : "NameNodeActivity",
  "tag.ProcessName" : "NameNode",
  "tag.SessionId" : null,
  "tag.Context" : "dfs",
  "tag.Hostname" : "zezhang-mn1",
  "CreateFileOps" : 2,
  "FilesCreated" : 12,
  "FilesAppended" : 0,
  "GetBlockLocations" : 0,
  "FilesRenamed" : 0,
  "FilesTruncated" : 0,
  "GetListingOps" : 1,
  "DeleteFileOps" : 0,
  "FilesDeleted" : 0,
  "FileInfoOps" : 6,
  "AddBlockOps" : 2,
  "GetAdditionalDatanodeOps" : 0,
  "CreateSymlinkOps" : 0,
  "GetLinkTargetOps" : 0,
  "FilesInGetListingOps" : 0,
  "AllowSnapshotOps" : 0,
  "DisallowSnapshotOps" : 0,
  "CreateSnapshotOps" : 0,
  "DeleteSnapshotOps" : 0,
  "RenameSnapshotOps" : 0,
  "ListSnapshottableDirOps" : 0,
  "SnapshotDiffReportOps" : 0,
  "BlockReceivedAndDeletedOps" : 2,
  "BlockOpsQueued" : 1,
  "BlockOpsBatched" : 0,
  "TransactionsNumOps" : 24,
  "TransactionsAvgTime" : 1.7083,
  "SyncsNumOps" : 14,
  "SyncsAvgTime" : 0.2857142857142857,
  "TransactionsBatchedInSync" : 10,
  "StorageBlockReportNumOps" : 2,
  "StorageBlockReportAvgTime" : 3.5,
  "CacheReportNumOps" : 0,
  "CacheReportAvgTime" : 0.0,
  "GenerateEDEKTimeNumOps" : 0,
  "GenerateEDEKTimeAvgTime" : 0.0,
  "WarmUpEDEKTimeNumOps" : 0,
  "WarmUpEDEKTimeAvgTime" : 0.0,
  "ResourceCheckTimeNumOps" : 8,
  "ResourceCheckTimeAvgTime" : 0.0,
  "SafeModeTime" : 1,
  "FsImageLoadTime" : 76,
  "GetEditNumOps" : 0,
  "GetEditAvgTime" : 0.0,
  "GetImageNumOps" : 0,
  "GetImageAvgTime" : 0.0,
  "PutImageNumOps" : 0,
  "PutImageAvgTime" : 0.0,
  "TotalFileOps" : 11
}, {code} I'm committing to trunk soon. Let's write a short release note? 
> Confusion/name conflict between NameNodeActivity#BlockReportNumOps and > RpcDetailedActivity#BlockReportNumOps > > > Key: HADOOP-14502 > URL: https://issues.apache.org/jira/browse/HADOOP-14502 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Labels: Incompatible > Attachments: HADOOP-14502.000.patch, HADOOP-14502.001.patch, > HADOOP-14502.002.patch > > > Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the > {{RpcDetailedActivity}} context and those emitted under the > {{NameNodeActivity}} context are actually reporting different things despite > having the same name. {{NameNodeActivity}} reports the count/time of _per > storage_ block reports, whereas {{RpcDetailedActivity}} reports the > count/time of _per datanode_ block reports. This makes for a confusing > experience with two metrics having the same name reporting different values. > We already have the {{StorageBlockReportsOps}} metric under > {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} > metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other > suggestions about how to address this as well. The 3.0 release seems a good > time to make this incompatible change. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
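The `NumOps`/`AvgTime` pairs in the JMX output above come from rate-style metrics. As a rough stand-alone illustration of what a `MutableRate`-style metric tracks (not Hadoop's actual metrics2 implementation; the class name is invented for this sketch):

```java
// Stand-in for a MutableRate-style metric: each add() records one operation
// and its duration, and the metric exposes the NumOps / AvgTime pair seen in
// the JMX output (e.g. StorageBlockReportNumOps / StorageBlockReportAvgTime).
public class RateMetricSketch {
    private long numOps;
    private double totalTime;

    public synchronized void add(long durationMillis) {
        numOps++;
        totalTime += durationMillis;
    }

    public synchronized long numOps() {
        return numOps;
    }

    public synchronized double avgTime() {
        return numOps == 0 ? 0.0 : totalTime / numOps;
    }
}
```

For example, two storage block reports taking 3 ms and 4 ms would surface as `NumOps = 2`, `AvgTime = 3.5`, matching the shape of the values in the dump above.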
[jira] [Updated] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps
[ https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14502: --- Hadoop Flags: Incompatible change > Confusion/name conflict between NameNodeActivity#BlockReportNumOps and > RpcDetailedActivity#BlockReportNumOps > > > Key: HADOOP-14502 > URL: https://issues.apache.org/jira/browse/HADOOP-14502 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Labels: Incompatible > Attachments: HADOOP-14502.000.patch, HADOOP-14502.001.patch, > HADOOP-14502.002.patch > > > Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the > {{RpcDetailedActivity}} context and those emitted under the > {{NameNodeActivity}} context are actually reporting different things despite > having the same name. {{NameNodeActivity}} reports the count/time of _per > storage_ block reports, whereas {{RpcDetailedActivity}} reports the > count/time of _per datanode_ block reports. This makes for a confusing > experience with two metrics having the same name reporting different values. > We already have the {{StorageBlockReportsOps}} metric under > {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} > metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other > suggestions about how to address this as well. The 3.0 release seems a good > time to make this incompatible change. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-14440) Add metrics for connections dropped
[ https://issues.apache.org/jira/browse/HADOOP-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14440: --- Fix Version/s: 2.7.4 Thanks for the work [~ebadger]. I think this is a good improvement for 2.7.4; just backported to branch-2.7. > Add metrics for connections dropped > --- > > Key: HADOOP-14440 > URL: https://issues.apache.org/jira/browse/HADOOP-14440 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Eric Badger >Assignee: Eric Badger > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 > > Attachments: HADOOP-14440.001.patch, HADOOP-14440.002.patch, > HADOOP-14440.003.patch > > > Will be useful to figure out when the NN is getting overloaded with more > connections than it can handle -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps
[ https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041381#comment-16041381 ] Zhe Zhang edited comment on HADOOP-14502 at 6/7/17 6:32 PM: bq. Can we make StorageBlockReport a MutableRate metric and remove NameNodeActivity#BlockReport metric This sounds good to me (as a 3.0 change). Pinging [~andrew.wang] for opinion on breaking compatibility in this case. was (Author: zhz): bq. Can we make StorageBlockReport a MutableRate metric and remove NameNodeActivity#BlockReport metric This sounds good to me (as a 3.0 change). > Confusion/name conflict between NameNodeActivity#BlockReportNumOps and > RpcDetailedActivity#BlockReportNumOps > > > Key: HADOOP-14502 > URL: https://issues.apache.org/jira/browse/HADOOP-14502 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Priority: Minor > > Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the > {{RpcDetailedActivity}} context and those emitted under the > {{NameNodeActivity}} context are actually reporting different things despite > having the same name. {{NameNodeActivity}} reports the count/time of _per > storage_ block reports, whereas {{RpcDetailedActivity}} reports the > count/time of _per datanode_ block reports. This makes for a confusing > experience with two metrics having the same name reporting different values. > We already have the {{StorageBlockReportsOps}} metric under > {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} > metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other > suggestions about how to address this as well. The 3.0 release seems a good > time to make this incompatible change. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-14502) Confusion/name conflict between NameNodeActivity#BlockReportNumOps and RpcDetailedActivity#BlockReportNumOps
[ https://issues.apache.org/jira/browse/HADOOP-14502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16041381#comment-16041381 ] Zhe Zhang commented on HADOOP-14502: bq. Can we make StorageBlockReport a MutableRate metric and remove NameNodeActivity#BlockReport metric This sounds good to me (as a 3.0 change). > Confusion/name conflict between NameNodeActivity#BlockReportNumOps and > RpcDetailedActivity#BlockReportNumOps > > > Key: HADOOP-14502 > URL: https://issues.apache.org/jira/browse/HADOOP-14502 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Priority: Minor > > Currently the {{BlockReport(NumOps|AvgTime)}} metrics emitted under the > {{RpcDetailedActivity}} context and those emitted under the > {{NameNodeActivity}} context are actually reporting different things despite > having the same name. {{NameNodeActivity}} reports the count/time of _per > storage_ block reports, whereas {{RpcDetailedActivity}} reports the > count/time of _per datanode_ block reports. This makes for a confusing > experience with two metrics having the same name reporting different values. > We already have the {{StorageBlockReportsOps}} metric under > {{NameNodeActivity}}. Can we make {{StorageBlockReport}} a {{MutableRate}} > metric and remove {{NameNodeActivity#BlockReport}} metric? Open to other > suggestions about how to address this as well. The 3.0 release seems a good > time to make this incompatible change. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
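[Editor's sketch] The conflict described above is two metric registries each deriving the names {{BlockReportNumOps}}/{{BlockReportAvgTime}} from a rate metric named {{BlockReport}}, while feeding it different events (per-storage vs. per-datanode reports). A minimal self-contained sketch of what such a {{MutableRate}}-style metric tracks; the class and method names here are illustrative, not the real {{org.apache.hadoop.metrics2.lib}} API:

```java
// Simplified stand-in for a MutableRate-style metric: it tracks an
// operation count and the average elapsed time per operation. From a
// metric named "BlockReport", a registry would derive and emit
// "BlockReportNumOps" and "BlockReportAvgTime".
public class RateMetricSketch {
    private long numOps;
    private double totalTimeMs;

    // Record one completed operation and its elapsed time.
    public synchronized void add(double elapsedMs) {
        numOps++;
        totalTimeMs += elapsedMs;
    }

    public synchronized long getNumOps() {
        return numOps;
    }

    // Average time per operation; 0 if nothing has been recorded yet.
    public synchronized double getAvgTimeMs() {
        return numOps == 0 ? 0.0 : totalTimeMs / numOps;
    }
}
```

The confusion in the issue arises because two independent instances of such a metric carry the same derived name while one counts per-storage block reports and the other per-datanode RPCs, so the two emitted values silently disagree.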
[jira] [Commented] (HADOOP-13433) Race in UGI.reloginFromKeytab
[ https://issues.apache.org/jira/browse/HADOOP-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15989249#comment-15989249 ] Zhe Zhang commented on HADOOP-13433: [~daryn] Any chance you can upload the internal fix? I'll be very happy to help review. Thanks. > Race in UGI.reloginFromKeytab > - > > Key: HADOOP-13433 > URL: https://issues.apache.org/jira/browse/HADOOP-13433 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.9.0, 2.7.4, 2.6.6, 2.8.1, 3.0.0-alpha3 > > Attachments: HADOOP-13433-branch-2.7.patch, > HADOOP-13433-branch-2.7-v1.patch, HADOOP-13433-branch-2.7-v2.patch, > HADOOP-13433-branch-2.8.patch, HADOOP-13433-branch-2.8.patch, > HADOOP-13433-branch-2.8-v1.patch, HADOOP-13433-branch-2.patch, > HADOOP-13433.patch, HADOOP-13433-v1.patch, HADOOP-13433-v2.patch, > HADOOP-13433-v4.patch, HADOOP-13433-v5.patch, HADOOP-13433-v6.patch, > HBASE-13433-testcase-v3.patch > > > This is a problem that has troubled us for several years. 
For our HBase > cluster, sometimes the RS will be stuck due to > {noformat} > 2016-06-20,03:44:12,936 INFO org.apache.hadoop.ipc.SecureClient: Exception > encountered while connecting to the server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: The ticket > isn't for us (35) - BAD TGS SERVER NAME)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:194) > at > org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:140) > at > org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupSaslConnection(SecureClient.java:187) > at > org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.access$700(SecureClient.java:95) > at > org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:325) > at > org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:322) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781) > at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37) > at org.apache.hadoop.hbase.security.User.call(User.java:607) > at org.apache.hadoop.hbase.security.User.access$700(User.java:51) > at > org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:461) > at > org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupIOstreams(SecureClient.java:321) > at > org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1164) > at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1004) > at > 
org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107) > at $Proxy24.replicateLogEntries(Unknown Source) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:962) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.runLoop(ReplicationSource.java:466) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:515) > Caused by: GSSException: No valid credentials provided (Mechanism level: The > ticket isn't for us (35) - BAD TGS SERVER NAME) > at > sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:663) > at > sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248) > at > sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:180) > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:175) > ... 23 more > Caused by: KrbException: The ticket isn't for us (35) - BAD TGS SERVER NAME > at sun.security.krb5.KrbTgsRep.(KrbTgsRep.java:64) > at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:185) > at > sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:294) > at > sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:106) > at > sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:557) > at >
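[Editor's sketch] The failure above is the classic symptom of two threads triggering a keytab relogin concurrently, leaving the Subject's Kerberos credentials briefly inconsistent. The usual fix is to serialize the relogin and have later threads re-check validity under the lock. A self-contained sketch of that pattern, using an illustrative class rather than the real {{UserGroupInformation}} implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for UGI.reloginFromKeytab(): only one thread
// performs the relogin; others re-check validity inside the lock and
// skip the redundant (and racy) second relogin. Not the real Hadoop code.
public class ReloginSketch {
    private boolean ticketValid = false;
    final AtomicInteger reloginCount = new AtomicInteger();

    public synchronized void reloginIfNeeded() {
        if (ticketValid) {
            return; // another thread already completed the relogin
        }
        // Stands in for the actual Kerberos keytab relogin.
        reloginCount.incrementAndGet();
        ticketValid = true;
    }
}
```

With this shape, any number of RPC threads hitting an expired ticket at once results in exactly one relogin, instead of overlapping relogins corrupting each other's credentials.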
[jira] [Updated] (HADOOP-14276) Add a nanosecond API to Time/Timer/FakeTimer
[ https://issues.apache.org/jira/browse/HADOOP-14276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14276: --- Resolution: Fixed Fix Version/s: 2.8.1 2.7.4 2.9.0 Status: Resolved (was: Patch Available) Committed all the way to branch-2.7. Thanks Erik for the work. > Add a nanosecond API to Time/Timer/FakeTimer > > > Key: HADOOP-14276 > URL: https://issues.apache.org/jira/browse/HADOOP-14276 > Project: Hadoop Common > Issue Type: Improvement > Components: util >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Fix For: 2.9.0, 2.7.4, 2.8.1, 3.0.0-alpha3 > > Attachments: HADOOP-14276.000.patch > > > Right now {{Time}}/{{Timer}} export functionality for retrieving time at a > millisecond-level precision but not at a nanosecond-level precision, which is > required for some applications (there's ~70 usages). Most of these seem not > to need mocking functionality for tests; only one class currently mocks this > out ({{LightWeightCache}}) but we would like to add another as part of > HDFS-11615 and want to avoid code duplication. This could be useful for other > classes in the future as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-14276) Add a nanosecond API to Time/Timer/FakeTimer
[ https://issues.apache.org/jira/browse/HADOOP-14276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14276: --- Fix Version/s: 3.0.0-alpha3 Committed to trunk. YARN-6288 is breaking branch-2 build, waiting for the fix. > Add a nanosecond API to Time/Timer/FakeTimer > > > Key: HADOOP-14276 > URL: https://issues.apache.org/jira/browse/HADOOP-14276 > Project: Hadoop Common > Issue Type: Improvement > Components: util >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Fix For: 3.0.0-alpha3 > > Attachments: HADOOP-14276.000.patch > > > Right now {{Time}}/{{Timer}} export functionality for retrieving time at a > millisecond-level precision but not at a nanosecond-level precision, which is > required for some applications (there's ~70 usages). Most of these seem not > to need mocking functionality for tests; only one class currently mocks this > out ({{LightWeightCache}}) but we would like to add another as part of > HDFS-11615 and want to avoid code duplication. This could be useful for other > classes in the future as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-14276) Add a nanosecond API to Time/Timer/FakeTimer
[ https://issues.apache.org/jira/browse/HADOOP-14276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14276: --- Hadoop Flags: Reviewed Thanks for confirming Liang. I'll commit the patch to trunk~branch-2.7 soon. > Add a nanosecond API to Time/Timer/FakeTimer > > > Key: HADOOP-14276 > URL: https://issues.apache.org/jira/browse/HADOOP-14276 > Project: Hadoop Common > Issue Type: Improvement > Components: util >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Attachments: HADOOP-14276.000.patch > > > Right now {{Time}}/{{Timer}} export functionality for retrieving time at a > millisecond-level precision but not at a nanosecond-level precision, which is > required for some applications (there's ~70 usages). Most of these seem not > to need mocking functionality for tests; only one class currently mocks this > out ({{LightWeightCache}}) but we would like to add another as part of > HDFS-11615 and want to avoid code duplication. This could be useful for other > classes in the future as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-14276) Add a nanosecond API to Time/Timer/FakeTimer
[ https://issues.apache.org/jira/browse/HADOOP-14276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15959573#comment-15959573 ] Zhe Zhang commented on HADOOP-14276: Thanks Erik. The analysis makes sense and +1 on the patch. Will wait for 2 hours before committing. > Add a nanosecond API to Time/Timer/FakeTimer > > > Key: HADOOP-14276 > URL: https://issues.apache.org/jira/browse/HADOOP-14276 > Project: Hadoop Common > Issue Type: Improvement > Components: util >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Attachments: HADOOP-14276.000.patch > > > Right now {{Time}}/{{Timer}} export functionality for retrieving time at a > millisecond-level precision but not at a nanosecond-level precision, which is > required for some applications (there's ~70 usages). Most of these seem not > to need mocking functionality for tests; only one class currently mocks this > out ({{LightWeightCache}}) but we would like to add another as part of > HDFS-11615 and want to avoid code duplication. This could be useful for other > classes in the future as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
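[Editor's sketch] The HADOOP-14276 messages above add a nanosecond counterpart to the millisecond clock, with a fake implementation so tests like the one wanted for HDFS-11615 can control time deterministically. A hedged, self-contained sketch of the shape such an API might take; names and signatures are assumptions, not the committed {{org.apache.hadoop.util}} classes:

```java
// Illustrative Timer/FakeTimer pair for nanosecond-precision timing.
public class NanoTimerSketch {
    // Production clock. System.nanoTime() is monotonic but has an
    // arbitrary origin, so only differences between readings are meaningful.
    public static class Timer {
        public long monotonicNowNanos() {
            return System.nanoTime();
        }
    }

    // Test clock: time advances only when the test says so, making
    // time-dependent code (e.g. cache eviction) deterministic under test.
    public static class FakeTimer extends Timer {
        private long nowNanos = 0;

        @Override
        public long monotonicNowNanos() {
            return nowNanos;
        }

        public void advanceNanos(long delta) {
            nowNanos += delta;
        }
    }
}
```

Code under test takes a {{Timer}} by injection; production passes the real clock, tests pass the fake and call {{advanceNanos}} to simulate elapsed time.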
[jira] [Updated] (HADOOP-14211) FilterFs and ChRootedFs are too aggressive about enforcing "authorityNeeded"
[ https://issues.apache.org/jira/browse/HADOOP-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-14211: --- Fix Version/s: 2.8.1 2.7.4 Thanks [~xkrogen] for the work and [~andrew.wang] for the review. +1 on the patch as well. I just cherry-picked to branch-2.8 and branch-2.7. > FilterFs and ChRootedFs are too aggressive about enforcing "authorityNeeded" > > > Key: HADOOP-14211 > URL: https://issues.apache.org/jira/browse/HADOOP-14211 > Project: Hadoop Common > Issue Type: Bug > Components: viewfs >Affects Versions: 2.6.0 >Reporter: Erik Krogen >Assignee: Erik Krogen > Fix For: 2.9.0, 2.7.4, 2.8.1, 3.0.0-alpha3 > > Attachments: HADOOP-14211.000.patch, HADOOP-14211.001.patch > > > Right now {{FilterFs}} and {{ChRootedFs}} pass the following up to the > {{AbstractFileSystem}} superconstructor: > {code} > super(fs.getUri(), fs.getUri().getScheme(), > fs.getUri().getAuthority() != null, fs.getUriDefaultPort()); > {code} > This passes a value of {{authorityNeeded==true}} for any {{fs}} which has an > authority, but this isn't necessarily the case - ViewFS is an example of > this. You will encounter this issue if you try to filter a ViewFS, or nest > one ViewFS within another. The {{authorityNeeded}} check isn't necessary in > this case anyway; {{fs}} is already an instantiated {{AbstractFileSystem}} > which means it has already used the same constructor with the value of > {{authorityNeeded}} (and corresponding validation) that it actually requires. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-9631) ViewFs should use underlying FileSystem's server side defaults
[ https://issues.apache.org/jira/browse/HADOOP-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-9631: -- Resolution: Fixed Fix Version/s: 3.0.0-alpha3 2.8.1 2.7.4 2.9.0 Status: Resolved (was: Patch Available) Committed to above mentioned branches. Thanks Lohit and Erik for the contribution! > ViewFs should use underlying FileSystem's server side defaults > -- > > Key: HADOOP-9631 > URL: https://issues.apache.org/jira/browse/HADOOP-9631 > Project: Hadoop Common > Issue Type: Bug > Components: fs, viewfs >Affects Versions: 2.0.4-alpha >Reporter: Lohit Vijayarenu >Assignee: Erik Krogen > Labels: BB2015-05-TBR > Fix For: 2.9.0, 2.7.4, 2.8.1, 3.0.0-alpha3 > > Attachments: HADOOP-9631.005.patch, HADOOP-9631.006.patch, > HADOOP-9631.007.patch, HADOOP-9631.trunk.1.patch, HADOOP-9631.trunk.2.patch, > HADOOP-9631.trunk.3.patch, HADOOP-9631.trunk.4.patch, TestFileContext.java > > > On a cluster with ViewFS as default FileSystem, creating files using > FileContext will always result with replication factor of 1, instead of > underlying filesystem default (like HDFS) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-9631) ViewFs should use underlying FileSystem's server side defaults
[ https://issues.apache.org/jira/browse/HADOOP-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-9631: -- Hadoop Flags: Reviewed Target Version/s: 2.9.0, 2.7.4, 2.8.1, 3.0.0-alpha3 Thanks [~xkrogen] for the update! +1 on v7 patch. I just committed to trunk, working on backports. I think this should go into branch-2, branch-2.8, and branch-2.7. > ViewFs should use underlying FileSystem's server side defaults > -- > > Key: HADOOP-9631 > URL: https://issues.apache.org/jira/browse/HADOOP-9631 > Project: Hadoop Common > Issue Type: Bug > Components: fs, viewfs >Affects Versions: 2.0.4-alpha >Reporter: Lohit Vijayarenu >Assignee: Erik Krogen > Labels: BB2015-05-TBR > Attachments: HADOOP-9631.005.patch, HADOOP-9631.006.patch, > HADOOP-9631.007.patch, HADOOP-9631.trunk.1.patch, HADOOP-9631.trunk.2.patch, > HADOOP-9631.trunk.3.patch, HADOOP-9631.trunk.4.patch, TestFileContext.java > > > On a cluster with ViewFS as default FileSystem, creating files using > FileContext will always result with replication factor of 1, instead of > underlying filesystem default (like HDFS) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-9631) ViewFs should use underlying FileSystem's server side defaults
[ https://issues.apache.org/jira/browse/HADOOP-9631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937454#comment-15937454 ] Zhe Zhang commented on HADOOP-9631: --- Thanks [~xkrogen] for the work! v6 patch LGTM, with a couple of minor comments: # {{ViewFs#getServerDefaults}} has unnecessary exceptions in signature # Can we enhance the test to cover the case of calling {{ViewFs#getServerDefaults(f)}} where {{f}} is an internal dir? > ViewFs should use underlying FileSystem's server side defaults > -- > > Key: HADOOP-9631 > URL: https://issues.apache.org/jira/browse/HADOOP-9631 > Project: Hadoop Common > Issue Type: Bug > Components: fs, viewfs >Affects Versions: 2.0.4-alpha >Reporter: Lohit Vijayarenu >Assignee: Erik Krogen > Labels: BB2015-05-TBR > Attachments: HADOOP-9631.005.patch, HADOOP-9631.006.patch, > HADOOP-9631.trunk.1.patch, HADOOP-9631.trunk.2.patch, > HADOOP-9631.trunk.3.patch, HADOOP-9631.trunk.4.patch, TestFileContext.java > > > On a cluster with ViewFS as default FileSystem, creating files using > FileContext will always result with replication factor of 1, instead of > underlying filesystem default (like HDFS) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
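[Editor's sketch] The fix discussed in the HADOOP-9631 messages is for the view filesystem to resolve a path through its mount table and return the backing filesystem's server-side defaults, rather than a hardcoded client-side value (which is how replication factor 1 leaked through). A self-contained analogue of that delegation, with illustrative names in place of the real {{ViewFs}} API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy mount-table resolver: resolve a path to its backing filesystem
// and return that filesystem's server-side default, instead of a value
// hardcoded on the client. Names are illustrative, not the ViewFs API.
public class MountTableSketch {
    // Stand-in for a backing filesystem with a server-side default.
    static class BackingFs {
        final short defaultReplication;
        BackingFs(short r) { defaultReplication = r; }
    }

    private final Map<String, BackingFs> mounts = new HashMap<>();

    void mount(String prefix, BackingFs fs) {
        mounts.put(prefix, fs);
    }

    // Longest-prefix mount resolution, then delegate to the target fs.
    short defaultReplicationFor(String path) {
        String best = null;
        for (String prefix : mounts.keySet()) {
            if (path.startsWith(prefix)
                    && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("no mount for " + path);
        }
        return mounts.get(best).defaultReplication;
    }
}
```

The review comment about internal dirs corresponds to paths that resolve to the mount tree itself rather than to any backing filesystem, which the real patch has to handle as a separate case.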
[jira] [Resolved] (HADOOP-14147) Offline Image Viewer bug
[ https://issues.apache.org/jira/browse/HADOOP-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang resolved HADOOP-14147. Resolution: Duplicate > Offline Image Viewer bug > - > > Key: HADOOP-14147 > URL: https://issues.apache.org/jira/browse/HADOOP-14147 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: gehaijiang > > $ hdfs oiv -p Delimited -i fsimage_13752447421 -o fsimage.xml > 17/03/04 08:40:22 INFO offlineImageViewer.FSImageHandler: Loading 757 strings > 17/03/04 08:40:22 INFO offlineImageViewer.PBImageTextWriter: Loading > directories > 17/03/04 08:40:22 INFO offlineImageViewer.PBImageTextWriter: Loading > directories in INode section. > 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Found 4374109 > directories in INode section. > 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Finished loading > directories in 96798ms > 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Loading INode > directory section. 
> Exception in thread "main" java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.buildNamespace(PBImageTextWriter.java:570) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.loadINodeDirSection(PBImageTextWriter.java:522) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.visit(PBImageTextWriter.java:460) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.visit(PBImageDelimitedTextWriter.java:46) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:182) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:124) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Reopened] (HADOOP-14147) Offline Image Viewer bug
[ https://issues.apache.org/jira/browse/HADOOP-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang reopened HADOOP-14147: > Offline Image Viewer bug > - > > Key: HADOOP-14147 > URL: https://issues.apache.org/jira/browse/HADOOP-14147 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: gehaijiang > > $ hdfs oiv -p Delimited -i fsimage_13752447421 -o fsimage.xml > 17/03/04 08:40:22 INFO offlineImageViewer.FSImageHandler: Loading 757 strings > 17/03/04 08:40:22 INFO offlineImageViewer.PBImageTextWriter: Loading > directories > 17/03/04 08:40:22 INFO offlineImageViewer.PBImageTextWriter: Loading > directories in INode section. > 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Found 4374109 > directories in INode section. > 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Finished loading > directories in 96798ms > 17/03/04 08:41:59 INFO offlineImageViewer.PBImageTextWriter: Loading INode > directory section. > Exception in thread "main" java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:129) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.buildNamespace(PBImageTextWriter.java:570) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.loadINodeDirSection(PBImageTextWriter.java:522) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageTextWriter.visit(PBImageTextWriter.java:460) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.PBImageDelimitedTextWriter.visit(PBImageDelimitedTextWriter.java:46) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:182) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:124) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: 
common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-14086) Improve DistCp Speed for small files
[ https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869273#comment-15869273 ] Zhe Zhang commented on HADOOP-14086: Thanks Zheng. This will be a very useful improvement. Any idea how to reduce NN workload? At the end of the day, if we distcp 1M files we need to call 1M {{getFileInfo}}.. We thought about querying the SbNN but haven't investigated too far. > Improve DistCp Speed for small files > > > Key: HADOOP-14086 > URL: https://issues.apache.org/jira/browse/HADOOP-14086 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Affects Versions: 2.6.5 >Reporter: Zheng Shao >Assignee: Zheng Shao >Priority: Minor > > When using distcp to copy lots of small files, NameNode naturally becomes a > bottleneck. > The current distcp code did *not* optimize to reduce the NameNode calls. We > should restructure the code to reduce the number of NameNode calls as much as > possible to speed up the copy of small files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
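[Editor's sketch] The RPC-count argument in the comment above can be made concrete: listing each source directory once replaces one {{getFileInfo}} call per file, so the NameNode load scales with the number of directories rather than the number of files. A toy model with a mock NameNode that counts RPCs; the names are illustrative and the real DistCp/NameNode APIs differ:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the RPC-count tradeoff for copy-listing construction.
public class ListingSketch {
    static class MockNameNode {
        final Map<String, List<String>> dirs = new HashMap<>();
        int rpcCount = 0;

        void stat(String file) {          // getFileInfo analogue: 1 RPC per file
            rpcCount++;
        }

        List<String> list(String dir) {   // listStatus analogue: 1 RPC per dir
            rpcCount++;
            return dirs.getOrDefault(dir, new ArrayList<>());
        }
    }

    // Naive plan: one RPC per file in the listing.
    static int perFilePlan(MockNameNode nn, List<String> files) {
        for (String f : files) nn.stat(f);
        return nn.rpcCount;
    }

    // Batched plan: one RPC per directory, reusing the returned statuses.
    static int perDirPlan(MockNameNode nn) {
        for (String d : new ArrayList<>(nn.dirs.keySet())) nn.list(d);
        return nn.rpcCount;
    }
}
```

For 1M files spread over, say, 10K directories, this is the difference between 1M and 10K NameNode RPCs; querying a standby NameNode, as mentioned above, attacks the same load from the other direction.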
[jira] [Updated] (HADOOP-13742) Expose "NumOpenConnectionsPerUser" as a metric
[ https://issues.apache.org/jira/browse/HADOOP-13742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13742: --- Fix Version/s: 2.7.4 Thanks Brahma for the work! I think this is a good improvement for branch-2.7 as well. I just did the backport. > Expose "NumOpenConnectionsPerUser" as a metric > -- > > Key: HADOOP-13742 > URL: https://issues.apache.org/jira/browse/HADOOP-13742 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha2 > > Attachments: HADOOP-13742-002.patch, HADOOP-13742-003.patch, > HADOOP-13742-004.patch, HADOOP-13742-005.patch, HADOOP-13742-006.patch, > HADOOP-13742.patch > > > To track user level connections( How many connections for each user) in busy > cluster where so many connections to server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read
[ https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13782: --- Resolution: Fixed Fix Version/s: 3.0.0-alpha2 2.7.4 2.8.0 Status: Resolved (was: Patch Available) I just committed the patch to trunk~branch-2.7. Thanks Erik for the contribution! > Make MutableRates metrics thread-local write, aggregate-on-read > --- > > Key: HADOOP-13782 > URL: https://issues.apache.org/jira/browse/HADOOP-13782 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha2 > > Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, > HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch, > HADOOP-13782.005.patch, HADOOP-13782.006.patch > > > Currently the {{MutableRates}} metrics class serializes all writes to metrics > it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two > increments of unrelated metrics contained within the same {{MutableRates}} > object will serialize w.r.t. each other). This class is used by > {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to > modify these metrics. Instead we should allow updates to unrelated metrics > objects to happen concurrently. To do so we can let each thread locally > collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the > threads. > I have collected some benchmark performance numbers in HADOOP-13747 > (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) > which indicate that this can bring significantly higher performance in high > contention situations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read
[ https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13782: --- Hadoop Flags: Reviewed > Make MutableRates metrics thread-local write, aggregate-on-read > --- > > Key: HADOOP-13782 > URL: https://issues.apache.org/jira/browse/HADOOP-13782 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen > Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, > HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch, > HADOOP-13782.005.patch, HADOOP-13782.006.patch > > > Currently the {{MutableRates}} metrics class serializes all writes to metrics > it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two > increments of unrelated metrics contained within the same {{MutableRates}} > object will serialize w.r.t. each other). This class is used by > {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to > modify these metrics. Instead we should allow updates to unrelated metrics > objects to happen concurrently. To do so we can let each thread locally > collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the > threads. > I have collected some benchmark performance numbers in HADOOP-13747 > (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) > which indicate that this can bring significantly higher performance in high > contention situations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read
[ https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649117#comment-15649117 ] Zhe Zhang commented on HADOOP-13782: Thanks Erik! +1 on v6 patch pending Jenkins. > Make MutableRates metrics thread-local write, aggregate-on-read > --- > > Key: HADOOP-13782 > URL: https://issues.apache.org/jira/browse/HADOOP-13782 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen > Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, > HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch, > HADOOP-13782.005.patch, HADOOP-13782.006.patch > > > Currently the {{MutableRates}} metrics class serializes all writes to metrics > it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two > increments of unrelated metrics contained within the same {{MutableRates}} > object will serialize w.r.t. each other). This class is used by > {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to > modify these metrics. Instead we should allow updates to unrelated metrics > objects to happen concurrently. To do so we can let each thread locally > collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the > threads. > I have collected some benchmark performance numbers in HADOOP-13747 > (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) > which indicate that this can bring significantly higher performance in high > contention situations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read
[ https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648515#comment-15648515 ] Zhe Zhang commented on HADOOP-13782: Thanks Erik for the update! With HADOOP-13804 change the new class is much cleaner :) The current concurrency model is still a little complicated. {{snapshot}} has a nested synchronization on {{globalMetrics}} and {{stat}}, where {{stat}} is a local variable. Maybe we can simplify the concurrency model by: # Make {{globalMetrics}} a ConcurrentMap # Do we want to support multiple threads doing {{snapshot}} at the same time? If not, we should probably make it a synchronized method so it's easier to maintain and reason about # Maybe creating a concurrent version of {{SampleStat}}, because that's the only object we want to protect from concurrent updating (local thread adding, and the snapshotting thread resetting). {code} private class ConcurrentSampleStat extends SampleStat { @Override public synchronized void reset(){ super.reset(); } @Override public synchronized SampleStat add(double x) { return super.add(x); } } {code} # {{threadLocalMetricsMap}} can be a regular instead of concurrent map? Also, IIUC, {{snapshot}} is supposed to clear all metrics from the last window. In the v4 patch, if a certain type of metrics appeared in the last window but disappears in the current window (e.g. thread dies), the entry in {{globalMetrics}} is not cleared. 
> Make MutableRates metrics thread-local write, aggregate-on-read > --- > > Key: HADOOP-13782 > URL: https://issues.apache.org/jira/browse/HADOOP-13782 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen > Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, > HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch > > > Currently the {{MutableRates}} metrics class serializes all writes to metrics > it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two > increments of unrelated metrics contained within the same {{MutableRates}} > object will serialize w.r.t. each other). This class is used by > {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to > modify these metrics. Instead we should allow updates to unrelated metrics > objects to happen concurrently. To do so we can let each thread locally > collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the > threads. > I have collected some benchmark performance numbers in HADOOP-13747 > (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) > which indicate that this can bring significantly higher performance in high > contention situations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
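[Editor's sketch] The thread-local-write, aggregate-on-read design reviewed above can be reduced to a small self-contained pattern: each thread updates only its own stats object on the hot path, and {{snapshot}} walks all per-thread objects and sums them. This is an illustration of the technique only; the real {{MutableRatesWithAggregation}} class differs in naming and detail:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal thread-local-write / aggregate-on-read metric sketch.
public class ThreadLocalRatesSketch {
    // Per-thread stats; synchronized only so the snapshot thread reads
    // a consistent view - the owning thread is the only writer.
    static class Stat {
        private long numOps;
        private double totalTime;
        synchronized void add(double t) { numOps++; totalTime += t; }
        synchronized long ops() { return numOps; }
    }

    private final Map<Long, Stat> perThread = new ConcurrentHashMap<>();

    // Hot path: fetch (or lazily create) this thread's private Stat and
    // update it. Threads contend on the map only once, at insertion, so
    // unrelated updates no longer serialize against each other.
    public void add(double elapsed) {
        perThread.computeIfAbsent(Thread.currentThread().getId(), id -> new Stat())
                 .add(elapsed);
    }

    // Read path: aggregate across all threads' private stats.
    public long snapshotNumOps() {
        long ops = 0;
        for (Stat s : perThread.values()) {
            ops += s.ops();
        }
        return ops;
    }
}
```

The review comments above (resetting between snapshot windows, cleaning up entries for dead threads, whether concurrent snapshots are allowed) are exactly the corners this toy version glosses over.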
[jira] [Updated] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read
[ https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13782: --- Target Version/s: 2.7.4 > Make MutableRates metrics thread-local write, aggregate-on-read > --- > > Key: HADOOP-13782 > URL: https://issues.apache.org/jira/browse/HADOOP-13782 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen > Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch, > HADOOP-13782.002.patch, HADOOP-13782.003.patch, HADOOP-13782.004.patch > > > Currently the {{MutableRates}} metrics class serializes all writes to metrics > it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two > increments of unrelated metrics contained within the same {{MutableRates}} > object will serialize w.r.t. each other). This class is used by > {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to > modify these metrics. Instead we should allow updates to unrelated metrics > objects to happen concurrently. To do so we can let each thread locally > collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the > threads. > I have collected some benchmark performance numbers in HADOOP-13747 > (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) > which indicate that this can bring significantly higher performance in high > contention situations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13804) MutableStat mean loses accuracy if add(long, long) is used
[ https://issues.apache.org/jira/browse/HADOOP-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13804: --- Resolution: Fixed Fix Version/s: 3.0.0-alpha2 2.7.4 2.8.0 Status: Resolved (was: Patch Available) Committed to trunk~branch-2.7. Thanks Erik for the contribution. > MutableStat mean loses accuracy if add(long, long) is used > -- > > Key: HADOOP-13804 > URL: https://issues.apache.org/jira/browse/HADOOP-13804 > Project: Hadoop Common > Issue Type: Bug > Components: metrics >Affects Versions: 2.6.5 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha2 > > Attachments: HADOOP-13804.000.patch > > > Currently if the {{MutableStat.add(long numSamples, long sum)}} method is > used with a large sample count, the mean that is returned will be very > inaccurate. This is a result of using the Welford method for variance > calculation, which assumes that each sample is processed on its own, to > calculate the mean as well. For variance this is fine, since variance numbers > lose meaning if you add many samples at once, but the mean should still be > accurate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13804) MutableStat mean loses accuracy if add(long, long) is used
[ https://issues.apache.org/jira/browse/HADOOP-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13804: --- Target Version/s: 2.7.4 Hadoop Flags: Reviewed Thanks Erik for the fix. The patch LGTM; +1 pending Jenkins. Result of the added test without the change: {code} java.lang.AssertionError: Bad value for metric TestAvgVal Expected :1.5 Actual :1.9995 {code} > MutableStat mean loses accuracy if add(long, long) is used > -- > > Key: HADOOP-13804 > URL: https://issues.apache.org/jira/browse/HADOOP-13804 > Project: Hadoop Common > Issue Type: Bug > Components: metrics >Affects Versions: 2.6.5 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Minor > Attachments: HADOOP-13804.000.patch > > > Currently if the {{MutableStat.add(long numSamples, long sum)}} method is > used with a large sample count, the mean that is returned will be very > inaccurate. This is a result of using the Welford method for variance > calculation, which assumes that each sample is processed on its own, to > calculate the mean as well. For variance this is fine, since variance numbers > lose meaning if you add many samples at once, but the mean should still be > accurate. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
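The failing assertion above can be reproduced with a small standalone model of the bug. This is an illustrative reconstruction, not the actual {{SampleStat}} code: Welford's mean update assumes one sample per call, so feeding it a batch sum skews the mean, while weighting the batch by its sample count keeps the mean exact.

```java
// Reconstruction of the HADOOP-13804 mean bug (names and structure are
// illustrative only). Each batch is {sampleCount, sum}.
public class WelfordBatchMean {
    static double buggyMean(long[][] batches) {
        long n = 0; double mean = 0;
        for (long[] b : batches) {
            n += b[0];
            mean += (b[1] - mean) / n;         // treats the whole sum as one sample
        }
        return mean;
    }

    static double fixedMean(long[][] batches) {
        long n = 0; double mean = 0;
        for (long[] b : batches) {
            n += b[0];
            mean += (b[1] - mean * b[0]) / n;  // equivalent to (oldMean*oldN + sum) / newN
        }
        return mean;
    }

    public static void main(String[] args) {
        // 1000 samples summing to 1000, then 1000 samples summing to 2000:
        // the true mean is 3000 / 2000 = 1.5.
        long[][] batches = { {1000, 1000}, {1000, 2000} };
        System.out.println(buggyMean(batches));  // ~1.9995, matching the failed assertion
        System.out.println(fixedMean(batches));  // 1.5
    }
}
```

The variance caveat from the description still applies: even with the mean fixed, Welford's variance loses meaning when many samples are added at once.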
[jira] [Commented] (HADOOP-13782) Make MutableRates metrics thread-local write, aggregate-on-read
[ https://issues.apache.org/jira/browse/HADOOP-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645199#comment-15645199 ] Zhe Zhang commented on HADOOP-13782: Thanks Erik for the patch. LGTM overall. A few detailed comments: # It'd be ideal if we could simplify the two internal classes {{LocalMutableRate}} and {{MutableRateInternal}}, and also better fit them with the existing {{MutableStat}} or {{MutableRate}} classes. We discussed offline an issue in the existing {{MutableStat}} batch add method around {{intervalStat}}. I think we should document the issue so other developers understand the motivation for creating a simpler rate class. # The synchronization behavior below is different from {{MutableStat}}, where both the {{snapshot}} and {{add}} methods are {{synchronized}}. Should we allow thread-local {{add}} while one thread is doing {{snapshot}}? {code} @Override public void snapshot(MetricsRecordBuilder rb, boolean all) { synchronized (globalMetrics) { {code} # Maybe we should add a comment below noting that this is where aggregation (the main logic in this class) happens {code} } else { for (Map.Entry ent : map.entrySet()) { {code} # Cosmetic: since {{getLocalMetrics}} is short and is only used by {{add}} (which itself is short), can we merge the two methods?
# Cosmetic: as a follow-on we can consider consolidating the old {{MutableRates}} and the new {{MutableRatesWithAggregation}} to reduce duplication > Make MutableRates metrics thread-local write, aggregate-on-read > --- > > Key: HADOOP-13782 > URL: https://issues.apache.org/jira/browse/HADOOP-13782 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Erik Krogen >Assignee: Erik Krogen > Attachments: HADOOP-13782.000.patch, HADOOP-13782.001.patch > > > Currently the {{MutableRates}} metrics class serializes all writes to metrics > it contains because of its use of {{MetricsRegistry.add()}} (i.e., even two > increments of unrelated metrics contained within the same {{MutableRates}} > object will serialize w.r.t. each other). This class is used by > {{RpcDetailedMetrics}}, which may have many hundreds of threads contending to > modify these metrics. Instead we should allow updates to unrelated metrics > objects to happen concurrently. To do so we can let each thread locally > collect metrics, and on a {{snapshot}}, aggregate the metrics from all of the > threads. > I have collected some benchmark performance numbers in HADOOP-13747 > (https://issues.apache.org/jira/secure/attachment/12835043/benchmark_results) > which indicate that this can bring significantly higher performance in high > contention situations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-12483) Maintain wrapped SASL ordering for postponed IPC responses
[ https://issues.apache.org/jira/browse/HADOOP-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12483: --- Fix Version/s: 2.7.4 I backported this bug fix to branch-2.7 since I just backported HADOOP-10300. > Maintain wrapped SASL ordering for postponed IPC responses > -- > > Key: HADOOP-12483 > URL: https://issues.apache.org/jira/browse/HADOOP-12483 > Project: Hadoop Common > Issue Type: Bug > Components: ipc >Affects Versions: 2.8.0 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1 > > Attachments: HADOOP-12483.patch > > > A SASL encryption algorithm (wrapping) may have a required ordering for > encrypted responses. The IPC layer encrypts when the response is set based > on the assumption it is being immediately sent. Postponed responses violate > that assumption. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses
[ https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10300: --- Resolution: Fixed Fix Version/s: 2.7.4 Status: Resolved (was: Patch Available) I just committed the patch to branch-2.7. > Allowed deferred sending of call responses > -- > > Key: HADOOP-10300 > URL: https://issues.apache.org/jira/browse/HADOOP-10300 > Project: Hadoop Common > Issue Type: Sub-task > Components: ipc >Affects Versions: 2.0.0-alpha, 3.0.0-alpha1 >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Labels: BB2015-05-TBR > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1 > > Attachments: HADOOP-10300-branch-2.7.patch, HADOOP-10300.patch, > HADOOP-10300.patch, HADOOP-10300.patch > > > RPC handlers currently do not return until the RPC call completes and > response is sent, or a partially sent response has been queued for the > responder. It would be useful for a proxy method to notify the handler to > not yet send the call's response. > A potential use case is a namespace handler in the NN might want to return > before the edit log is synced so it can service more requests and allow > increased batching of edits per sync. Background syncing could later trigger > the sending of the call response to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
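The deferred-response idea in the description can be sketched roughly as follows. This is a hypothetical illustration, not the HADOOP-10300 implementation: the handler marks a call as postponed and returns, and a background thread (standing in for the edit-log sync) sends the response later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of deferring an RPC response: the handler returns immediately,
// freeing itself for more work, and the response is completed asynchronously.
public class DeferredResponse {
    static class Call {
        private final CompletableFuture<String> response = new CompletableFuture<>();
        void postponeResponse() { /* marker: handler will not send on return */ }
        void sendResponse(String payload) { response.complete(payload); }
        String awaitResponse() throws Exception { return response.get(1, TimeUnit.SECONDS); }
    }

    public static void main(String[] args) throws Exception {
        Call call = new Call();
        call.postponeResponse();                   // handler returns without replying
        ScheduledExecutorService sync = Executors.newSingleThreadScheduledExecutor();
        // later, e.g. after a batched edit-log sync, the response is actually sent
        sync.schedule(() -> call.sendResponse("ok"), 50, TimeUnit.MILLISECONDS);
        System.out.println(call.awaitResponse());  // ok
        sync.shutdown();
    }
}
```

This is also why HADOOP-12483 matters: once responses can be sent out of handler order, a wrapped (SASL-encrypted) connection must preserve the encryption ordering itself.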
[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses
[ https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10300: --- Attachment: (was: HADOOP-10300-branch-2.7.0.patch) > Allowed deferred sending of call responses > -- > > Key: HADOOP-10300 > URL: https://issues.apache.org/jira/browse/HADOOP-10300 > Project: Hadoop Common > Issue Type: Sub-task > Components: ipc >Affects Versions: 2.0.0-alpha, 3.0.0-alpha1 >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Labels: BB2015-05-TBR > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HADOOP-10300-branch-2.7.patch, HADOOP-10300.patch, > HADOOP-10300.patch, HADOOP-10300.patch > > > RPC handlers currently do not return until the RPC call completes and > response is sent, or a partially sent response has been queued for the > responder. It would be useful for a proxy method to notify the handler to > not yet the send the call's response. > An potential use case is a namespace handler in the NN might want to return > before the edit log is synced so it can service more requests and allow > increased batching of edits per sync. Background syncing could later trigger > the sending of the call response to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses
[ https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10300: --- Attachment: HADOOP-10300-branch-2.7.patch > Allowed deferred sending of call responses > -- > > Key: HADOOP-10300 > URL: https://issues.apache.org/jira/browse/HADOOP-10300 > Project: Hadoop Common > Issue Type: Sub-task > Components: ipc >Affects Versions: 2.0.0-alpha, 3.0.0-alpha1 >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Labels: BB2015-05-TBR > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HADOOP-10300-branch-2.7.patch, HADOOP-10300.patch, > HADOOP-10300.patch, HADOOP-10300.patch > > > RPC handlers currently do not return until the RPC call completes and > response is sent, or a partially sent response has been queued for the > responder. It would be useful for a proxy method to notify the handler to > not yet the send the call's response. > An potential use case is a namespace handler in the NN might want to return > before the edit log is synced so it can service more requests and allow > increased batching of edits per sync. Background syncing could later trigger > the sending of the call response to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses
[ https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10300: --- Attachment: HADOOP-10300-branch-2.7.0.patch > Allowed deferred sending of call responses > -- > > Key: HADOOP-10300 > URL: https://issues.apache.org/jira/browse/HADOOP-10300 > Project: Hadoop Common > Issue Type: Sub-task > Components: ipc >Affects Versions: 2.0.0-alpha, 3.0.0-alpha1 >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Labels: BB2015-05-TBR > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HADOOP-10300-branch-2.7.0.patch, HADOOP-10300.patch, > HADOOP-10300.patch, HADOOP-10300.patch > > > RPC handlers currently do not return until the RPC call completes and > response is sent, or a partially sent response has been queued for the > responder. It would be useful for a proxy method to notify the handler to > not yet the send the call's response. > An potential use case is a namespace handler in the NN might want to return > before the edit log is synced so it can service more requests and allow > increased batching of edits per sync. Background syncing could later trigger > the sending of the call response to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Reopened] (HADOOP-10300) Allowed deferred sending of call responses
[ https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang reopened HADOOP-10300: I think this'd be a good addition to branch-2.7; all other subtasks under the umbrella JIRA are actually in 2.3. Attaching a branch-2.7 patch to trigger Jenkins. [~daryn] [~kihwal] LMK if you have any concerns about the backport. > Allowed deferred sending of call responses > -- > > Key: HADOOP-10300 > URL: https://issues.apache.org/jira/browse/HADOOP-10300 > Project: Hadoop Common > Issue Type: Sub-task > Components: ipc >Affects Versions: 2.0.0-alpha, 3.0.0-alpha1 >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Labels: BB2015-05-TBR > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HADOOP-10300-branch-2.7.0.patch, HADOOP-10300.patch, > HADOOP-10300.patch, HADOOP-10300.patch > > > RPC handlers currently do not return until the RPC call completes and > response is sent, or a partially sent response has been queued for the > responder. It would be useful for a proxy method to notify the handler to > not yet the send the call's response. > An potential use case is a namespace handler in the NN might want to return > before the edit log is synced so it can service more requests and allow > increased batching of edits per sync. Background syncing could later trigger > the sending of the call response to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-10300) Allowed deferred sending of call responses
[ https://issues.apache.org/jira/browse/HADOOP-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10300: --- Status: Patch Available (was: Reopened) > Allowed deferred sending of call responses > -- > > Key: HADOOP-10300 > URL: https://issues.apache.org/jira/browse/HADOOP-10300 > Project: Hadoop Common > Issue Type: Sub-task > Components: ipc >Affects Versions: 3.0.0-alpha1, 2.0.0-alpha >Reporter: Daryn Sharp >Assignee: Daryn Sharp > Labels: BB2015-05-TBR > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HADOOP-10300-branch-2.7.0.patch, HADOOP-10300.patch, > HADOOP-10300.patch, HADOOP-10300.patch > > > RPC handlers currently do not return until the RPC call completes and > response is sent, or a partially sent response has been queued for the > responder. It would be useful for a proxy method to notify the handler to > not yet the send the call's response. > An potential use case is a namespace handler in the NN might want to return > before the edit log is synced so it can service more requests and allow > increased batching of edits per sync. Background syncing could later trigger > the sending of the call response to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-12325) RPC Metrics : Add the ability track and log slow RPCs
[ https://issues.apache.org/jira/browse/HADOOP-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12325: --- Resolution: Fixed Fix Version/s: 2.7.4 Status: Resolved (was: Patch Available) I verified test failures and pushed to branch-2.7. > RPC Metrics : Add the ability track and log slow RPCs > - > > Key: HADOOP-12325 > URL: https://issues.apache.org/jira/browse/HADOOP-12325 > Project: Hadoop Common > Issue Type: Improvement > Components: ipc, metrics >Affects Versions: 2.7.1 >Reporter: Anu Engineer >Assignee: Anu Engineer > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1 > > Attachments: Callers of WritableRpcEngine.call.png, > HADOOP-12325-branch-2.7.00.patch, HADOOP-12325.001.patch, > HADOOP-12325.002.patch, HADOOP-12325.003.patch, HADOOP-12325.004.patch, > HADOOP-12325.005.patch, HADOOP-12325.005.test.patch, HADOOP-12325.006.patch > > > This JIRA proposes to add a counter called RpcSlowCalls and also a > configuration setting that allows users to log really slow RPCs. Slow RPCs > are RPCs that fall at 99th percentile. This is useful to troubleshoot why > certain services like name node freezes under heavy load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
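The slow-RPC counter described above can be approximated with a small sketch. This is illustrative only: HADOOP-12325 derives its threshold from the 99th percentile of observed call times, while the running-mean multiple used here is a made-up stand-in, and the class and method names do not match the patch.

```java
// Hypothetical sketch: count and log a call as "slow" when its processing
// time exceeds a threshold derived from previously observed samples.
public class SlowRpcTracker {
    private long slowCalls = 0;   // analogous in spirit to the RpcSlowCalls counter
    private long count = 0;
    private double mean = 0;

    synchronized boolean record(String method, double millis) {
        count++;
        mean += (millis - mean) / count;                 // running mean of call times
        boolean slow = count > 10 && millis > 3 * mean;  // warm up, then apply threshold
        if (slow) {
            slowCalls++;
            System.out.println("Slow RPC: " + method + " took " + millis + " ms");
        }
        return slow;
    }

    synchronized long getSlowCalls() { return slowCalls; }

    public static void main(String[] args) {
        SlowRpcTracker tracker = new SlowRpcTracker();
        for (int i = 0; i < 20; i++) tracker.record("getBlockLocations", 1.0);
        tracker.record("getBlockLocations", 100.0);      // logged as slow
        System.out.println("slow calls: " + tracker.getSlowCalls());
    }
}
```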
[jira] [Updated] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10597: --- Resolution: Fixed Status: Resolved (was: Patch Available) > RPC Server signals backoff to clients when all request queues are full > -- > > Key: HADOOP-10597 > URL: https://issues.apache.org/jira/browse/HADOOP-10597 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Ming Ma > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1 > > Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, > HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, > HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, > MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf > > > Currently if an application hits the NN too hard, RPC requests will be in a blocking > state, assuming OS connections don't run out. Alternatively, RPC or the NN can > throw some well-defined exception back to the client based on certain > policies when it is under heavy load; the client will understand such an exception > and do exponential backoff, as another implementation of > RetryInvocationHandler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10597: --- Fix Version/s: 2.7.4 Thanks Ming for confirming this. I verified reported test failures (cannot reproduce locally) and pushed to branch-2.7. > RPC Server signals backoff to clients when all request queues are full > -- > > Key: HADOOP-10597 > URL: https://issues.apache.org/jira/browse/HADOOP-10597 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Ming Ma > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1 > > Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, > HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, > HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, > MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf > > > Currently if an application hits NN too hard, RPC requests be in blocking > state, assuming OS connection doesn't run out. Alternatively RPC or NN can > throw some well defined exception back to the client based on certain > policies when it is under heavy load; client will understand such exception > and do exponential back off, as another implementation of > RetryInvocationHandler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10597: --- Status: Patch Available (was: Reopened) > RPC Server signals backoff to clients when all request queues are full > -- > > Key: HADOOP-10597 > URL: https://issues.apache.org/jira/browse/HADOOP-10597 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Ming Ma > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, > HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, > HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, > MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf > > > Currently if an application hits NN too hard, RPC requests be in blocking > state, assuming OS connection doesn't run out. Alternatively RPC or NN can > throw some well defined exception back to the client based on certain > policies when it is under heavy load; client will understand such exception > and do exponential back off, as another implementation of > RetryInvocationHandler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-10597: --- Attachment: HADOOP-10597-branch-2.7.patch > RPC Server signals backoff to clients when all request queues are full > -- > > Key: HADOOP-10597 > URL: https://issues.apache.org/jira/browse/HADOOP-10597 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Ming Ma > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, > HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, > HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, > MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf > > > Currently if an application hits NN too hard, RPC requests be in blocking > state, assuming OS connection doesn't run out. Alternatively RPC or NN can > throw some well defined exception back to the client based on certain > policies when it is under heavy load; client will understand such exception > and do exponential back off, as another implementation of > RetryInvocationHandler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Reopened] (HADOOP-10597) RPC Server signals backoff to clients when all request queues are full
[ https://issues.apache.org/jira/browse/HADOOP-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang reopened HADOOP-10597: Thanks [~mingma] for the nice work! Since the umbrella JIRA HADOOP-9640 (and the FairCallQueue feature) is available in 2.7 and 2.6, I think it makes sense to backport this to at least 2.7. Reopening to test branch-2.7 patch. Please let me know if you have any concern about adding this to branch-2.7. > RPC Server signals backoff to clients when all request queues are full > -- > > Key: HADOOP-10597 > URL: https://issues.apache.org/jira/browse/HADOOP-10597 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Ming Ma >Assignee: Ming Ma > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: HADOOP-10597-2.patch, HADOOP-10597-3.patch, > HADOOP-10597-4.patch, HADOOP-10597-5.patch, HADOOP-10597-6.patch, > HADOOP-10597-branch-2.7.patch, HADOOP-10597.patch, > MoreRPCClientBackoffEvaluation.pdf, RPCClientBackoffDesignAndEvaluation.pdf > > > Currently if an application hits NN too hard, RPC requests be in blocking > state, assuming OS connection doesn't run out. Alternatively RPC or NN can > throw some well defined exception back to the client based on certain > policies when it is under heavy load; client will understand such exception > and do exponential back off, as another implementation of > RetryInvocationHandler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
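The client-side reaction to the backoff signal can be sketched as follows. This is a hypothetical illustration, not the HADOOP-10597 or RetryInvocationHandler code: on each backoff signal the retry delay grows exponentially, is capped, and is jittered so that many backed-off clients do not retry in lockstep.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of exponential backoff with a cap and "full jitter":
// the actual sleep is drawn uniformly from [0, min(base * 2^retry, cap)].
public class ExponentialBackoff {
    static long delayMillis(int retry, long baseMillis, long capMillis) {
        long exp = baseMillis << Math.min(retry, 20);   // cap the shift to avoid overflow
        long capped = Math.min(exp, capMillis);
        return ThreadLocalRandom.current().nextLong(capped + 1);
    }

    public static void main(String[] args) {
        for (int retry = 0; retry < 6; retry++) {
            System.out.println("retry " + retry + ": sleep up to "
                + Math.min(100L << retry, 5000L) + " ms, chose "
                + delayMillis(retry, 100, 5000) + " ms");
        }
    }
}
```

The base delay, cap, and jitter strategy here are illustrative choices, not values taken from the patch or its design documents.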
[jira] [Commented] (HADOOP-13747) Use LongAdder for more efficient metrics tracking
[ https://issues.apache.org/jira/browse/HADOOP-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609257#comment-15609257 ] Zhe Zhang commented on HADOOP-13747: Thanks Erik! It seems once again the discussion is leading to another JIRA (convert {{MutableRates}} to aggregate on read) :) Benchmark results look good. I imagine the benefits of this optimization will be more significant when the number of threads increases -- e.g. 256 as used in some production clusters. {{MutableRatesWithAggregation}} LGTM overall. The only structural concern I have is the assumption of long-lived threads. Right now {{MutableRates}} is only used by detailed RPC metrics, so the assumption still holds. But it might limit its applicability as a general-purpose metrics class. I'm happy to have other people's opinions on this as well (whether we foresee any short-lived threads using {{MutableRates}}). If we do want to support short-lived threads, an alternative is to use a similar idea to {{LongAdder}}, and use a set of variables to hold update tuples. On snapshotting, apply these "log entries" one by one. > Use LongAdder for more efficient metrics tracking > - > > Key: HADOOP-13747 > URL: https://issues.apache.org/jira/browse/HADOOP-13747 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Zhe Zhang >Assignee: Erik Krogen > Attachments: HADOOP-13747.patch, benchmark_results > > > Currently many metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, > use a synchronized counter to be updated by all handler threads (multiple > hundreds in large production clusters). As [~andrew.wang] suggested, it'd be > more efficient to use the [LongAdder | > http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co] > library which dynamically create intermediate-result variables. > Assigning to [~xkrogen] who has already done some investigation on this.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
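The synchronized-counter vs {{LongAdder}} contrast discussed in this thread can be demonstrated directly with the JDK classes (this sketch is not Hadoop code). {{LongAdder}} spreads increments over per-thread cells and only sums them on read, so heavily contended writers scale much better than with a single shared counter, at the cost of a slightly more expensive read.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Both counters end at the same value; under contention LongAdder's
// per-thread cells avoid the CAS retry storms a single AtomicLong suffers.
public class CounterContention {
    public static void main(String[] args) throws Exception {
        LongAdder adder = new LongAdder();
        AtomicLong atomic = new AtomicLong();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                adder.increment();
                atomic.incrementAndGet();
            }
        };
        Thread[] ts = new Thread[8];
        for (int i = 0; i < ts.length; i++) { ts[i] = new Thread(work); ts[i].start(); }
        for (Thread t : ts) t.join();
        System.out.println(adder.sum());   // 800000
        System.out.println(atomic.get());  // 800000
    }
}
```

This read-side aggregation is exactly the "aggregate on read" idea the comment proposes extending to {{MutableRates}} in HADOOP-13782.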
[jira] [Updated] (HADOOP-12325) RPC Metrics : Add the ability track and log slow RPCs
[ https://issues.apache.org/jira/browse/HADOOP-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12325: --- Status: Patch Available (was: Reopened) > RPC Metrics : Add the ability track and log slow RPCs > - > > Key: HADOOP-12325 > URL: https://issues.apache.org/jira/browse/HADOOP-12325 > Project: Hadoop Common > Issue Type: Improvement > Components: ipc, metrics >Affects Versions: 2.7.1 >Reporter: Anu Engineer >Assignee: Anu Engineer > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: Callers of WritableRpcEngine.call.png, > HADOOP-12325-branch-2.7.00.patch, HADOOP-12325.001.patch, > HADOOP-12325.002.patch, HADOOP-12325.003.patch, HADOOP-12325.004.patch, > HADOOP-12325.005.patch, HADOOP-12325.005.test.patch, HADOOP-12325.006.patch > > > This JIRA proposes to add a counter called RpcSlowCalls and also a > configuration setting that allows users to log really slow RPCs. Slow RPCs > are RPCs that fall at 99th percentile. This is useful to troubleshoot why > certain services like name node freezes under heavy load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-12325) RPC Metrics : Add the ability track and log slow RPCs
[ https://issues.apache.org/jira/browse/HADOOP-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12325: --- Attachment: HADOOP-12325-branch-2.7.00.patch > RPC Metrics : Add the ability track and log slow RPCs > - > > Key: HADOOP-12325 > URL: https://issues.apache.org/jira/browse/HADOOP-12325 > Project: Hadoop Common > Issue Type: Improvement > Components: ipc, metrics >Affects Versions: 2.7.1 >Reporter: Anu Engineer >Assignee: Anu Engineer > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: Callers of WritableRpcEngine.call.png, > HADOOP-12325-branch-2.7.00.patch, HADOOP-12325.001.patch, > HADOOP-12325.002.patch, HADOOP-12325.003.patch, HADOOP-12325.004.patch, > HADOOP-12325.005.patch, HADOOP-12325.005.test.patch, HADOOP-12325.006.patch > > > This JIRA proposes to add a counter called RpcSlowCalls and also a > configuration setting that allows users to log really slow RPCs. Slow RPCs > are RPCs that fall at 99th percentile. This is useful to troubleshoot why > certain services like name node freezes under heavy load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Reopened] (HADOOP-12325) RPC Metrics : Add the ability track and log slow RPCs
[ https://issues.apache.org/jira/browse/HADOOP-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang reopened HADOOP-12325: Sorry to reopen the JIRA. I think it is a good addition to branch-2.7 and want to test branch-2.7 patch. > RPC Metrics : Add the ability track and log slow RPCs > - > > Key: HADOOP-12325 > URL: https://issues.apache.org/jira/browse/HADOOP-12325 > Project: Hadoop Common > Issue Type: Improvement > Components: ipc, metrics >Affects Versions: 2.7.1 >Reporter: Anu Engineer >Assignee: Anu Engineer > Fix For: 2.8.0, 3.0.0-alpha1 > > Attachments: Callers of WritableRpcEngine.call.png, > HADOOP-12325-branch-2.7.00.patch, HADOOP-12325.001.patch, > HADOOP-12325.002.patch, HADOOP-12325.003.patch, HADOOP-12325.004.patch, > HADOOP-12325.005.patch, HADOOP-12325.005.test.patch, HADOOP-12325.006.patch > > > This JIRA proposes to add a counter called RpcSlowCalls and also a > configuration setting that allows users to log really slow RPCs. Slow RPCs > are RPCs that fall at 99th percentile. This is useful to troubleshoot why > certain services like name node freezes under heavy load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13747) Use LongAdder for more efficient metrics tracking
[ https://issues.apache.org/jira/browse/HADOOP-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13747: --- Description: Currently many metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, use a synchronized counter to be updated by all handler threads (multiple hundreds in large production clusters). As [~andrew.wang] suggested, it'd be more efficient to use the [LongAdder | http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co] library which dynamically creates intermediate-result variables. Assigning to [~xkrogen] who has already done some investigation on this. was: Currently most metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, use a synchronized counter to be updated by all handler threads (multiple hundreds in large production clusters). As [~andrew.wang] suggested, it'd be more efficient to use the [LongAdder | http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co] library which dynamically creates intermediate-result variables. Assigning to [~xkrogen] who has already done some investigation on this. > Use LongAdder for more efficient metrics tracking > - > > Key: HADOOP-13747 > URL: https://issues.apache.org/jira/browse/HADOOP-13747 > Project: Hadoop Common > Issue Type: Improvement > Components: metrics >Reporter: Zhe Zhang >Assignee: Erik Krogen > > Currently many metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, > use a synchronized counter to be updated by all handler threads (multiple > hundreds in large production clusters). As [~andrew.wang] suggested, it'd be > more efficient to use the [LongAdder | > http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co] > library which dynamically creates intermediate-result variables. > Assigning to [~xkrogen] who has already done some investigation on this. 
[jira] [Created] (HADOOP-13747) Use LongAdder for more efficient metrics tracking
Zhe Zhang created HADOOP-13747: -- Summary: Use LongAdder for more efficient metrics tracking Key: HADOOP-13747 URL: https://issues.apache.org/jira/browse/HADOOP-13747 Project: Hadoop Common Issue Type: Improvement Components: metrics Reporter: Zhe Zhang Assignee: Erik Krogen Currently most metrics, including {{RpcMetrics}} and {{RpcDetailedMetrics}}, use a synchronized counter to be updated by all handler threads (multiple hundreds in large production clusters). As [~andrew.wang] suggested, it'd be more efficient to use the [LongAdder | http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/jsr166e/LongAdder.java?view=co] library which dynamically creates intermediate-result variables. Assigning to [~xkrogen] who has already done some investigation on this.
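The suggestion above rests on LongAdder's design: instead of a single synchronized or CAS-updated counter contended by hundreds of handler threads, LongAdder spreads writes over per-thread cells (the "intermediate-result variables" mentioned above) and only folds them together in {{sum()}}. A minimal, self-contained demonstration using the JDK 8 {{java.util.concurrent.atomic.LongAdder}}, the standard-library descendant of the jsr166e class linked above:

```java
import java.util.concurrent.atomic.LongAdder;

/**
 * Minimal LongAdder demonstration: many writer threads increment,
 * one reader sums. Under contention, increments mostly hit separate
 * internal cells, so writers rarely block or retry against each other.
 */
public class LongAdderDemo {

    /** Increment a shared LongAdder from several "handler" threads, then sum. */
    static long countFromThreads(int threads, int incrementsPerThread) {
        final LongAdder calls = new LongAdder();
        Thread[] handlers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            handlers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    calls.increment();       // cheap write, striped across cells
                }
            });
            handlers[i].start();
        }
        for (Thread t : handlers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return calls.sum();                  // folds the per-thread cells together
    }

    public static void main(String[] args) {
        System.out.println(countFromThreads(8, 100_000));   // 800000
    }
}
```

The trade-off is that {{sum()}} is not an atomic snapshot, which is acceptable for metrics that are read far less often than they are written.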
[jira] [Commented] (HADOOP-12259) Utility to Dynamic port allocation
[ https://issues.apache.org/jira/browse/HADOOP-12259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586074#comment-15586074 ] Zhe Zhang commented on HADOOP-12259: Thanks for the work Brahma. This is a good addition to branch-2.7 and I just did the backport. > Utility to Dynamic port allocation > -- > > Key: HADOOP-12259 > URL: https://issues.apache.org/jira/browse/HADOOP-12259 > Project: Hadoop Common > Issue Type: Improvement > Components: test, util >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1 > > Attachments: HADOOP-12259.patch > > > As per discussion in YARN-3528 and [~rkanter] comment [here | > https://issues.apache.org/jira/browse/YARN-3528?focusedCommentId=14637700=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14637700 > ] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-12259) Utility to Dynamic port allocation
[ https://issues.apache.org/jira/browse/HADOOP-12259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12259: --- Fix Version/s: 2.7.4 > Utility to Dynamic port allocation > -- > > Key: HADOOP-12259 > URL: https://issues.apache.org/jira/browse/HADOOP-12259 > Project: Hadoop Common > Issue Type: Improvement > Components: test, util >Reporter: Brahma Reddy Battula >Assignee: Brahma Reddy Battula > Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1 > > Attachments: HADOOP-12259.patch > > > As per discussion in YARN-3528 and [~rkanter] comment [here | > https://issues.apache.org/jira/browse/YARN-3528?focusedCommentId=14637700=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14637700 > ] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13558) UserGroupInformation created from a Subject incorrectly tries to renew the Kerberos ticket
[ https://issues.apache.org/jira/browse/HADOOP-13558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567020#comment-15567020 ] Zhe Zhang commented on HADOOP-13558: Thanks much Xiao! > UserGroupInformation created from a Subject incorrectly tries to renew the > Kerberos ticket > -- > > Key: HADOOP-13558 > URL: https://issues.apache.org/jira/browse/HADOOP-13558 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.7.2, 2.6.4, 3.0.0-alpha2 >Reporter: Alejandro Abdelnur >Assignee: Xiao Chen > Fix For: 2.8.0, 2.9.0, 2.7.4, 3.0.0-alpha2 > > Attachments: HADOOP-13558.01.patch, HADOOP-13558.02.patch, > HADOOP-13558.branch-2.7.patch > > > The UGI {{checkTGTAndReloginFromKeytab()}} method checks certain conditions > and if they are met it invokes the {{reloginFromKeytab()}}. The > {{reloginFromKeytab()}} method then fails with an {{IOException}} > "loginUserFromKeyTab must be done first" because there is no keytab > associated with the UGI. > The {{checkTGTAndReloginFromKeytab()}} method checks if there is a keytab > ({{isKeytab}} UGI instance variable) associated with the UGI, if there is one > it triggers a call to {{reloginFromKeytab()}}. The problem is that the > {{keytabFile}} UGI instance variable is NULL, and that triggers the mentioned > {{IOException}}. > The root of the problem seems to be when creating a UGI via the > {{UGI.loginUserFromSubject(Subject)}} method, this method uses the > {{UserGroupInformation(Subject)}} constructor, and this constructor does the > following to determine if there is a keytab or not. > {code} > this.isKeytab = KerberosUtil.hasKerberosKeyTab(subject); > {code} > If the {{Subject}} given had a keytab, then the UGI instance will have the > {{isKeytab}} set to TRUE. > It sets the UGI instance as it would have a keytab because the Subject has a > keytab. 
This has 2 problems: > First, it does not set the keytab file (and this, having the {{isKeytab}} set > to TRUE and the {{keytabFile}} set to NULL) is what triggers the > {{IOException}} in the method {{reloginFromKeytab()}}. > Second (and even if the first problem is fixed, this still is a problem), it > assumes that because the subject has a keytab it is up to UGI to do the > relogin using the keytab. This is incorrect if the UGI was created using the > {{UGI.loginUserFromSubject(Subject)}} method. In such case, the owner of the > Subject is not the UGI, but the caller, so the caller is responsible for > renewing the Kerberos tickets and the UGI should not try to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
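The faulty condition described above, and the shape of a fix, can be shown in miniature. This is NOT the real {{UserGroupInformation}} code; the field and method names here ({{subjectHasKeytab}}, {{externalSubject}}, {{shouldRelogin*}}) are invented for illustration of the two problems listed:

```java
/**
 * Illustrative sketch of the relogin guard discussed above. The bug:
 * isKeytab is derived from the Subject alone, so a UGI built via
 * loginUserFromSubject() can have isKeytab == true while keytabFile is
 * null, and the later relogin attempt fails with
 * "loginUserFromKeyTab must be done first".
 */
public class ReloginGuard {
    final boolean subjectHasKeytab;  // what the constructor infers from the Subject
    final String keytabFile;         // only set by loginUserFromKeytab()
    final boolean externalSubject;   // true when the caller owns the Subject

    ReloginGuard(boolean subjectHasKeytab, String keytabFile, boolean externalSubject) {
        this.subjectHasKeytab = subjectHasKeytab;
        this.keytabFile = keytabFile;
        this.externalSubject = externalSubject;
    }

    /** Buggy check: trusts the Subject-derived flag alone. */
    boolean shouldReloginBuggy() {
        return subjectHasKeytab;
    }

    /** Fixed check: UGI relogs in only for keytabs it manages itself. */
    boolean shouldReloginFixed() {
        return subjectHasKeytab && keytabFile != null && !externalSubject;
    }

    public static void main(String[] args) {
        // UGI created from a caller-owned Subject that carries a keytab:
        ReloginGuard fromSubject = new ReloginGuard(true, null, true);
        System.out.println(fromSubject.shouldReloginBuggy());  // true: leads to IOException
        System.out.println(fromSubject.shouldReloginFixed());  // false: caller renews tickets
    }
}
```

The second condition captures both fixes at once: no relogin without a keytab file, and no relogin when the Subject (and thus ticket renewal) belongs to the caller.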
[jira] [Updated] (HADOOP-13378) Common features between YARN and HDFS Router-based federation
[ https://issues.apache.org/jira/browse/HADOOP-13378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13378: --- Description: HDFS-10467 uses a similar architecture to the one proposed in YARN-2915. This JIRA tries to identify what is common between these two efforts and tries to build a common framework. (was: HDFS-10647 uses a similar architecture to the one proposed in YARN-2915. This JIRA tries to identify what is common between these two efforts and tries to build a common framework.) > Common features between YARN and HDFS Router-based federation > - > > Key: HADOOP-13378 > URL: https://issues.apache.org/jira/browse/HADOOP-13378 > Project: Hadoop Common > Issue Type: New Feature >Reporter: Inigo Goiri > > HDFS-10467 uses a similar architecture to the one proposed in YARN-2915. This > JIRA tries to identify what is common between these two efforts and tries to > build a common framework.
[jira] [Updated] (HADOOP-13061) Refactor erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13061: --- Labels: hdfs-ec-3.0-must-do (was: ) > Refactor erasure coders > --- > > Key: HADOOP-13061 > URL: https://issues.apache.org/jira/browse/HADOOP-13061 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Rui Li >Assignee: Kai Sasaki > Labels: hdfs-ec-3.0-must-do > Attachments: HADOOP-13061.01.patch, HADOOP-13061.02.patch, > HADOOP-13061.03.patch, HADOOP-13061.04.patch, HADOOP-13061.05.patch, > HADOOP-13061.06.patch, HADOOP-13061.07.patch, HADOOP-13061.08.patch, > HADOOP-13061.09.patch, HADOOP-13061.10.patch, HADOOP-13061.11.patch, > HADOOP-13061.12.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13665) Erasure Coding codec should support fallback coder
[ https://issues.apache.org/jira/browse/HADOOP-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13665: --- Labels: hdfs-ec-3.0-must-do (was: ) > Erasure Coding codec should support fallback coder > -- > > Key: HADOOP-13665 > URL: https://issues.apache.org/jira/browse/HADOOP-13665 > Project: Hadoop Common > Issue Type: Sub-task > Components: io >Reporter: Wei-Chiu Chuang > Labels: hdfs-ec-3.0-must-do > > The current EC codec supports a single coder only (by default the pure Java > implementation). If the native coder is specified but unavailable, it > should fall back to the pure Java implementation. > One possible solution is to follow the convention of existing Hadoop native > codecs, such as transport encryption (see {{CryptoCodec.java}}): it supports > fallback by specifying two or more coders as the value of the property, and > loads the coders in order.
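The CryptoCodec-style convention suggested above (list several coders in preference order and use the first one that actually loads) can be sketched as follows. The names {{CoderLoader}} and {{RawErasureCoder}} are hypothetical stand-ins, not the actual Hadoop classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/**
 * Sketch of ordered-fallback coder loading: a property lists several
 * implementations in preference order, and the first one that can be
 * instantiated wins. Illustrative names only, not Hadoop's real API.
 */
public class CoderLoader {

    interface RawErasureCoder {
        String name();
    }

    /** Try each factory in order; skip any that fail (e.g. missing native libs). */
    static RawErasureCoder loadFirstAvailable(List<Supplier<RawErasureCoder>> factories) {
        for (Supplier<RawErasureCoder> f : factories) {
            try {
                RawErasureCoder c = f.get();
                if (c != null) {
                    return c;
                }
            } catch (RuntimeException | UnsatisfiedLinkError e) {
                // native coder unavailable: fall through to the next entry
            }
        }
        throw new IllegalStateException("no erasure coder available");
    }

    /** Simulate a native coder that fails to load, falling back to pure Java. */
    static String demo() {
        List<Supplier<RawErasureCoder>> order = new ArrayList<>();
        order.add(() -> { throw new UnsatisfiedLinkError("libisal not found"); });
        order.add(() -> () -> "pure-java-rs");   // pure-Java fallback coder
        return loadFirstAvailable(order).name();
    }

    public static void main(String[] args) {
        System.out.println(demo());              // pure-java-rs
    }
}
```

In the real system the ordered list would come from a configuration property value, mirroring how {{CryptoCodec}} resolves its implementations.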
[jira] [Updated] (HADOOP-13200) Seeking a better approach allowing to customize and configure erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13200: --- Labels: hdfs-ec-3.0-must-do (was: ) > Seeking a better approach allowing to customize and configure erasure coders > > > Key: HADOOP-13200 > URL: https://issues.apache.org/jira/browse/HADOOP-13200 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Labels: hdfs-ec-3.0-must-do > > This is a follow-on task for HADOOP-13010 as discussed over there. There may > be a better approach for customizing and configuring erasure coders > than the current raw coder factory, as [~cmccabe] suggested. Will copy > the relevant comments here to continue the discussion.
[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13055: --- Assignee: (was: Zhe Zhang) > Implement linkMergeSlash for ViewFs > --- > > Key: HADOOP-13055 > URL: https://issues.apache.org/jira/browse/HADOOP-13055 > Project: Hadoop Common > Issue Type: New Feature > Components: fs, viewfs >Reporter: Zhe Zhang > Attachments: HADOOP-13055.00.patch, HADOOP-13055.01.patch, > HADOOP-13055.02.patch > > > In a multi-cluster environment it is sometimes useful to operate on the root > / slash directory of an HDFS cluster. E.g., list all top level directories. > Quoting the comment in {{ViewFs}}: > {code} > * A special case of the merge mount is where mount table's root is merged > * with the root (slash) of another file system: > * > * fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/ > * > * In this cases the root of the mount table is merged with the root of > *hdfs://nn99/ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543769#comment-15543769 ] Zhe Zhang commented on HADOOP-13055: Sorry for getting back to this late. [~shv] Yes the patch only implements {{linkMergeSlash}} instead of {{linkMerge}} in general. [~manojg] Thanks for the interest! Yes it would be great if you can take over this task. Unassigning myself now. I'll get back to your question after refreshing my own memory on the patch. > Implement linkMergeSlash for ViewFs > --- > > Key: HADOOP-13055 > URL: https://issues.apache.org/jira/browse/HADOOP-13055 > Project: Hadoop Common > Issue Type: New Feature > Components: fs, viewfs >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HADOOP-13055.00.patch, HADOOP-13055.01.patch, > HADOOP-13055.02.patch > > > In a multi-cluster environment it is sometimes useful to operate on the root > / slash directory of an HDFS cluster. E.g., list all top level directories. > Quoting the comment in {{ViewFs}}: > {code} > * A special case of the merge mount is where mount table's root is merged > * with the root (slash) of another file system: > * > * fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/ > * > * In this cases the root of the mount table is merged with the root of > *hdfs://nn99/ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive
[ https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15524152#comment-15524152 ] Zhe Zhang commented on HADOOP-13657: Thanks [~kihwal]. Linking the issue for now. I think in these 2 issues {{Reader}} died for different reasons, but maybe the solution is similar. I don't have a patch either. > IPC Reader thread could silently die and leave NameNode unresponsive > > > Key: HADOOP-13657 > URL: https://issues.apache.org/jira/browse/HADOOP-13657 > Project: Hadoop Common > Issue Type: Bug > Components: ipc >Reporter: Zhe Zhang >Priority: Critical > > For each listening port, IPC {{Server#Listener#Reader}} is a single thread in > charge of moving {{Connection}} items from {{pendingConnections}} (capacity > 100) to the {{callQueue}}. > We have experienced an incident where the {{Reader}} thread for HDFS NameNode > died from runtime exception. Then the {{pendingConnections}} queue became > full and the NameNode port became inaccessible. > In our particular case, what killed {{Reader}} was a NPE caused by > https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types > of runtime exceptions could cause this issue as well. > We should add logic to either make the {{Reader}} more robust in case of > runtime exceptions, or at least treat it as a FATAL exception so that > NameNode can fail over to standby, and admins get alerted of the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive
[ https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13657: --- Description: For each listening port, IPC {{Server#Listener#Reader}} is a single thread in charge of moving {{Connection}} items from {{pendingConnections}} (capacity 100) to the {{callQueue}}. We have experienced an incident where the {{Reader}} thread for HDFS NameNode died from runtime exception. Then the {{pendingConnections}} queue became full and the NameNode port became inaccessible. In our particular case, what killed {{Reader}} was a NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types of runtime exceptions could cause this issue as well. We should add logic to either make the {{Reader}} more robust in case of runtime exceptions, or at least treat it as a FATAL exception so that NameNode can fail over to standby, and admins get alerted of the real issue. was: For each listening port, IPC {{Server#Listener#Reader}} is a single thread in charge of moving {{Connection}} items from {{pendingConnections}} (capacity 100) to the {{callQueue}}. We have experienced an incident where the {{Reader}} thread for HDFS NameNode died from run time exception. Then the {{pendingConnections}} queue became full and the NameNode port became inaccessible. In our particular case, what killed {{Reader}} was a NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types of runtime exceptions could cause this issue as well. We should add logic to either make the {{Reader}} more robust in case of runtime exceptions, or at least treat it as a FATAL exception so that NameNode can fail over to standby, and admins get alerted of the real issue. 
> IPC Reader thread could silently die and leave NameNode unresponsive > > > Key: HADOOP-13657 > URL: https://issues.apache.org/jira/browse/HADOOP-13657 > Project: Hadoop Common > Issue Type: Bug > Components: ipc >Reporter: Zhe Zhang >Priority: Critical > > For each listening port, IPC {{Server#Listener#Reader}} is a single thread in > charge of moving {{Connection}} items from {{pendingConnections}} (capacity > 100) to the {{callQueue}}. > We have experienced an incident where the {{Reader}} thread for HDFS NameNode > died from runtime exception. Then the {{pendingConnections}} queue became > full and the NameNode port became inaccessible. > In our particular case, what killed {{Reader}} was a NPE caused by > https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types > of runtime exceptions could cause this issue as well. > We should add logic to either make the {{Reader}} more robust in case of > runtime exceptions, or at least treat it as a FATAL exception so that > NameNode can fail over to standby, and admins get alerted of the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Created] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive
Zhe Zhang created HADOOP-13657: -- Summary: IPC Reader thread could silently die and leave NameNode unresponsive Key: HADOOP-13657 URL: https://issues.apache.org/jira/browse/HADOOP-13657 Project: Hadoop Common Issue Type: Bug Components: ipc Reporter: Zhe Zhang Priority: Critical For each listening port, IPC {{Server#Listener#Reader}} is a single thread in charge of moving {{Connection}} items from {{pendingConnections}} (capacity 100) to the {{callQueue}}. We have experienced an incident where the {{Reader}} thread for HDFS NameNode died from run time exception. Then the {{pendingConnections}} queue became full and the NameNode port became inaccessible. In our particular case, what killed {{Reader}} was a NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883. But in general, other types of runtime exceptions could cause this issue as well. We should add logic to either make the {{Reader}} more robust in case of runtime exceptions, or at least treat it as a FATAL exception so that NameNode can fail over to standby, and admins get alerted of the real issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
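The two remedies proposed above (make the Reader survive runtime exceptions, or escalate them as FATAL) can be sketched as follows. This is an illustrative stand-in, not Hadoop's {{ipc.Server}} code; the queue of {{Runnable}} items models {{pendingConnections}}:

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Sketch of a Reader loop hardened against per-connection runtime
 * exceptions. Option 1 (shown): catch and keep going so one bad
 * connection cannot silently kill the thread. Option 2 (commented):
 * rethrow as a FATAL condition so the NameNode fails over instead of
 * sitting unresponsive behind a full pendingConnections queue.
 */
public class RobustReader {

    /** Drain the pending-connection queue, surviving per-item failures. */
    static int drain(Queue<Runnable> pendingConnections) {
        int moved = 0;                      // connections moved to the callQueue
        Runnable conn;
        while ((conn = pendingConnections.poll()) != null) {
            try {
                conn.run();                 // doRead(): may hit e.g. a JDK-8024883 NPE
                moved++;
            } catch (RuntimeException e) {
                // Option 1: log loudly and keep the Reader alive.
                // Option 2 alternative: escalate as FATAL / terminate the
                // process so standby failover and alerting kick in.
            }
        }
        return moved;
    }

    /** One healthy connection, one that blows up, one more healthy. */
    static int demo() {
        Queue<Runnable> q = new ArrayDeque<>();
        q.add(() -> { });
        q.add(() -> { throw new NullPointerException("JDK-8024883"); });
        q.add(() -> { });
        return drain(q);
    }

    public static void main(String[] args) {
        System.out.println(demo());         // 2: the bad connection did not kill the loop
    }
}
```

Either option beats the status quo, where the thread dies and the only symptom is an inaccessible port.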
[jira] [Commented] (HADOOP-13558) UserGroupInformation created from a Subject incorrectly tries to renew the Kerberos ticket
[ https://issues.apache.org/jira/browse/HADOOP-13558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15478664#comment-15478664 ] Zhe Zhang commented on HADOOP-13558: [~xiaochen] Thanks for the fix! Since it affects earlier versions do you mind porting it to branch-2.7? I can also help with that. > UserGroupInformation created from a Subject incorrectly tries to renew the > Kerberos ticket > -- > > Key: HADOOP-13558 > URL: https://issues.apache.org/jira/browse/HADOOP-13558 > Project: Hadoop Common > Issue Type: Bug > Components: security >Affects Versions: 2.7.2, 2.6.4, 3.0.0-alpha2 >Reporter: Alejandro Abdelnur >Assignee: Xiao Chen > Fix For: 2.9.0, 3.0.0-alpha2 > > Attachments: HADOOP-13558.01.patch, HADOOP-13558.02.patch > > > The UGI {{checkTGTAndReloginFromKeytab()}} method checks certain conditions > and if they are met it invokes the {{reloginFromKeytab()}}. The > {{reloginFromKeytab()}} method then fails with an {{IOException}} > "loginUserFromKeyTab must be done first" because there is no keytab > associated with the UGI. > The {{checkTGTAndReloginFromKeytab()}} method checks if there is a keytab > ({{isKeytab}} UGI instance variable) associated with the UGI, if there is one > it triggers a call to {{reloginFromKeytab()}}. The problem is that the > {{keytabFile}} UGI instance variable is NULL, and that triggers the mentioned > {{IOException}}. > The root of the problem seems to be when creating a UGI via the > {{UGI.loginUserFromSubject(Subject)}} method, this method uses the > {{UserGroupInformation(Subject)}} constructor, and this constructor does the > following to determine if there is a keytab or not. > {code} > this.isKeytab = KerberosUtil.hasKerberosKeyTab(subject); > {code} > If the {{Subject}} given had a keytab, then the UGI instance will have the > {{isKeytab}} set to TRUE. > It sets the UGI instance as it would have a keytab because the Subject has a > keytab. 
This has 2 problems: > First, it does not set the keytab file (and this, having the {{isKeytab}} set > to TRUE and the {{keytabFile}} set to NULL) is what triggers the > {{IOException}} in the method {{reloginFromKeytab()}}. > Second (and even if the first problem is fixed, this still is a problem), it > assumes that because the subject has a keytab it is up to UGI to do the > relogin using the keytab. This is incorrect if the UGI was created using the > {{UGI.loginUserFromSubject(Subject)}} method. In such case, the owner of the > Subject is not the UGI, but the caller, so the caller is responsible for > renewing the Kerberos tickets and the UGI should not try to do so. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13535) Add jetty6 acceptor startup issue workaround to branch-2
[ https://issues.apache.org/jira/browse/HADOOP-13535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446690#comment-15446690 ] Zhe Zhang commented on HADOOP-13535: Thanks Min for taking on this work. Please try again. > Add jetty6 acceptor startup issue workaround to branch-2 > > > Key: HADOOP-13535 > URL: https://issues.apache.org/jira/browse/HADOOP-13535 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Wei-Chiu Chuang >Assignee: Min Shen > > After HADOOP-12765 is committed to branch-2, the handling of SSL connection > by HttpServer2 may suffer the same Jetty bug described in HADOOP-10588. We > should consider adding the same workaround for SSL connection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13535) Add jetty6 acceptor startup issue workaround to branch-2
[ https://issues.apache.org/jira/browse/HADOOP-13535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13535: --- Assignee: Min Shen > Add jetty6 acceptor startup issue workaround to branch-2 > > > Key: HADOOP-13535 > URL: https://issues.apache.org/jira/browse/HADOOP-13535 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.9.0 >Reporter: Wei-Chiu Chuang >Assignee: Min Shen > > After HADOOP-12765 is committed to branch-2, the handling of SSL connection > by HttpServer2 may suffer the same Jetty bug described in HADOOP-10588. We > should consider adding the same workaround for SSL connection. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12765: --- Resolution: Fixed Status: Resolved (was: Patch Available) > HttpServer2 should switch to using the non-blocking SslSelectChannelConnector > to prevent performance degradation when handling SSL connections > -- > > Key: HADOOP-12765 > URL: https://issues.apache.org/jira/browse/HADOOP-12765 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.3 >Reporter: Min Shen >Assignee: Min Shen > Fix For: 2.8.0, 2.9.0, 2.7.4, 3.0.0-alpha2 > > Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, > HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, > HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, > blocking_2.png, unblocking.png > > > The current implementation uses the blocking SslSocketConnector which takes > the default maxIdleTime as 200 seconds. We noticed in our cluster that when > users use a custom client that accesses the WebHDFS REST APIs through https, > it could block all the 250 handler threads in NN jetty server, causing severe > performance degradation for accessing WebHDFS and NN web UI. Attached > screenshots (blocking_1.png and blocking_2.png) illustrate that when using > SslSocketConnector, the jetty handler threads are not released until the 200 > seconds maxIdleTime has passed. With sufficient number of SSL connections, > this issue could render NN HttpServer to become entirely irresponsive. > We propose to use the non-blocking SslSelectChannelConnector as a fix. We > have deployed the attached patch within our cluster, and have seen > significant improvement. The attached screenshot (unblocking.png) further > illustrates the behavior of NN jetty server after switching to using > SslSelectChannelConnector. > The patch further disables SSLv3 protocol on server side to preserve the > spirit of HADOOP-11260. 
[jira] [Commented] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446455#comment-15446455 ] Zhe Zhang commented on HADOOP-12765: Thanks for the feedback [~jojochuang]. I resolved both conflicts and backported this change to branch-2.7. Agreed HADOOP-12688 would be a nice improvement. I tried backporting but it was not quite clean. > HttpServer2 should switch to using the non-blocking SslSelectChannelConnector > to prevent performance degradation when handling SSL connections > -- > > Key: HADOOP-12765 > URL: https://issues.apache.org/jira/browse/HADOOP-12765 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.3 >Reporter: Min Shen >Assignee: Min Shen > Fix For: 2.8.0, 2.9.0, 2.7.4, 3.0.0-alpha2 > > Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, > HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, > HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, > blocking_2.png, unblocking.png > > > The current implementation uses the blocking SslSocketConnector which takes > the default maxIdleTime as 200 seconds. We noticed in our cluster that when > users use a custom client that accesses the WebHDFS REST APIs through https, > it could block all the 250 handler threads in NN jetty server, causing severe > performance degradation for accessing WebHDFS and NN web UI. Attached > screenshots (blocking_1.png and blocking_2.png) illustrate that when using > SslSocketConnector, the jetty handler threads are not released until the 200 > seconds maxIdleTime has passed. With sufficient number of SSL connections, > this issue could render NN HttpServer to become entirely irresponsive. > We propose to use the non-blocking SslSelectChannelConnector as a fix. We > have deployed the attached patch within our cluster, and have seen > significant improvement. 
The attached screenshot (unblocking.png) further > illustrates the behavior of NN jetty server after switching to using > SslSelectChannelConnector. > The patch further disables SSLv3 protocol on server side to preserve the > spirit of HADOOP-11260. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
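On the SSLv3 point in the description above: the actual patch configures this through Jetty's SslSelectChannelConnector, but the same exclusion can be illustrated with plain JSSE from the standard library. The code below is an illustration of the idea, not the HADOOP-12765 change:

```java
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.SSLServerSocket;
import javax.net.ssl.SSLServerSocketFactory;

/**
 * Stdlib illustration of disabling SSLv3 on a server socket, in the
 * spirit of HADOOP-11260 (POODLE). The real patch does the equivalent
 * through the Jetty connector's configuration.
 */
public class DisableSslv3 {

    /** Return the given protocol list minus any SSLv3 entry. */
    static String[] withoutSslv3(String[] protocols) {
        List<String> kept = new ArrayList<>();
        for (String p : protocols) {
            if (!p.startsWith("SSLv3")) {
                kept.add(p);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        SSLServerSocketFactory factory =
            (SSLServerSocketFactory) SSLServerSocketFactory.getDefault();
        try (SSLServerSocket server =
                 (SSLServerSocket) factory.createServerSocket(0)) {
            // Enable everything the JVM supports except SSLv3.
            server.setEnabledProtocols(withoutSslv3(server.getSupportedProtocols()));
            for (String p : server.getEnabledProtocols()) {
                System.out.println(p);   // e.g. TLSv1.2, TLSv1.3; no SSLv3
            }
        }
    }
}
```

Modern JDKs already disable SSLv3 by default, but on the 2016-era JVMs this thread targets the explicit exclusion mattered.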
[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12765: --- Fix Version/s: 2.7.4 > HttpServer2 should switch to using the non-blocking SslSelectChannelConnector > to prevent performance degradation when handling SSL connections > -- > > Key: HADOOP-12765 > URL: https://issues.apache.org/jira/browse/HADOOP-12765 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.7.2, 2.6.3 >Reporter: Min Shen >Assignee: Min Shen > Fix For: 2.8.0, 2.9.0, 2.7.4, 3.0.0-alpha2 > > Attachments: HADOOP-12765-branch-2.patch, HADOOP-12765.001.patch, > HADOOP-12765.001.patch, HADOOP-12765.002.patch, HADOOP-12765.003.patch, > HADOOP-12765.004.patch, HADOOP-12765.005.patch, blocking_1.png, > blocking_2.png, unblocking.png > > > The current implementation uses the blocking SslSocketConnector which takes > the default maxIdleTime as 200 seconds. We noticed in our cluster that when > users use a custom client that accesses the WebHDFS REST APIs through https, > it could block all the 250 handler threads in NN jetty server, causing severe > performance degradation for accessing WebHDFS and NN web UI. Attached > screenshots (blocking_1.png and blocking_2.png) illustrate that when using > SslSocketConnector, the jetty handler threads are not released until the 200 > seconds maxIdleTime has passed. With sufficient number of SSL connections, > this issue could render NN HttpServer to become entirely irresponsive. > We propose to use the non-blocking SslSelectChannelConnector as a fix. We > have deployed the attached patch within our cluster, and have seen > significant improvement. The attached screenshot (unblocking.png) further > illustrates the behavior of NN jetty server after switching to using > SslSelectChannelConnector. > The patch further disables SSLv3 protocol on server side to preserve the > spirit of HADOOP-11260. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
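The SSLv3 removal the description mentions can be illustrated with a small stdlib-only sketch; the class and helper names below are illustrative, not the patch's actual code. It filters SSLv3 out of an engine's enabled protocols, which is the kind of filtering a secure connector subclass would apply when it creates each SSLEngine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class DisableSslv3 {
    // Return a copy of the protocol list with SSLv3 removed; a connector
    // subclass would apply this in its SSLEngine-creation hook.
    static String[] withoutSslv3(String[] protocols) {
        List<String> kept = new ArrayList<>();
        for (String p : protocols) {
            if (!"SSLv3".equals(p)) {
                kept.add(p);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) throws Exception {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setEnabledProtocols(withoutSslv3(engine.getEnabledProtocols()));
        // SSLv3 is now guaranteed absent from the enabled set.
        System.out.println(Arrays.asList(engine.getEnabledProtocols()).contains("SSLv3")); // prints false
    }
}
```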
[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13055: --- Attachment: HADOOP-13055.02.patch Updating patch to fix a unit test failure and improve the {{resolve}} logic. > Implement linkMergeSlash for ViewFs > --- > > Key: HADOOP-13055 > URL: https://issues.apache.org/jira/browse/HADOOP-13055 > Project: Hadoop Common > Issue Type: New Feature > Components: fs, viewfs >Reporter: Zhe Zhang >Assignee: Zhe Zhang > Attachments: HADOOP-13055.00.patch, HADOOP-13055.01.patch, > HADOOP-13055.02.patch > > > In a multi-cluster environment it is sometimes useful to operate on the root > / slash directory of an HDFS cluster. E.g., list all top level directories. > Quoting the comment in {{ViewFs}}: > {code} > * A special case of the merge mount is where mount table's root is merged > * with the root (slash) of another file system: > * > * fs.viewfs.mounttable.default.linkMergeSlash=hdfs://nn99/ > * > * In this cases the root of the mount table is merged with the root of > *hdfs://nn99/ > {code}
[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12765: --- Fix Version/s: 2.9.0
[jira] [Commented] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433872#comment-15433872 ] Zhe Zhang commented on HADOOP-12765: I committed to branch-2 and branch-2.8, but the backport to branch-2.7 hits a conflict in the pom files. [~mshen] [~jojochuang] Could you help take a look? Thanks.
[jira] [Commented] (HADOOP-12668) Support excluding weak Ciphers in HttpServer2 through ssl-server.conf
[ https://issues.apache.org/jira/browse/HADOOP-12668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433735#comment-15433735 ] Zhe Zhang commented on HADOOP-12668: I cherry-picked this to branch-2.7 in support of HADOOP-12765. > Support excluding weak Ciphers in HttpServer2 through ssl-server.conf > -- > > Key: HADOOP-12668 > URL: https://issues.apache.org/jira/browse/HADOOP-12668 > Project: Hadoop Common > Issue Type: Improvement > Components: security >Affects Versions: 2.7.1 >Reporter: Vijay Singh >Assignee: Vijay Singh >Priority: Critical > Labels: common, ha, hadoop, hdfs, security > Fix For: 2.8.0, 2.7.4 > > Attachments: Hadoop-12668.006.patch, Hadoop-12668.007.patch, > Hadoop-12668.008.patch, Hadoop-12668.009.patch, Hadoop-12668.010.patch, > Hadoop-12668.011.patch, Hadoop-12668.012.patch, test.log > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently the embedded Jetty server used across all Hadoop services is configured > through the ssl-server.xml file from each service's respective configuration section. > However, the SSL/TLS protocol used by these Jetty servers can be > downgraded to weak cipher suites. This change aims to add the following > functionality: > 1) Add logic in hadoop-common (HttpServer2.java and associated interfaces) to > spawn Jetty servers with the ability to exclude weak cipher suites. I propose we > do this through ssl-server.xml, so each service can choose to disable > specific ciphers. > 2) Modify DFSUtil.java used by HDFS code to supply the new parameter > ssl.server.exclude.cipher.list to the hadoop-common code, so it can exclude the > ciphers supplied through this key.
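The exclude-list mechanism described above, a comma-separated ssl.server.exclude.cipher.list value whose entries are removed from the enabled suites, can be sketched as follows. The helper name and sample suite names are illustrative; this is not the patch's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExcludeCiphers {
    // Drop every suite named in a comma-separated exclude list, mirroring
    // how a value of ssl.server.exclude.cipher.list could be applied.
    static String[] excludeCiphers(String[] enabled, String excludeList) {
        Set<String> excluded = new HashSet<>(Arrays.asList(excludeList.split("\\s*,\\s*")));
        List<String> kept = new ArrayList<>();
        for (String suite : enabled) {
            if (!excluded.contains(suite)) {
                kept.add(suite);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] enabled = {"TLS_AES_128_GCM_SHA256", "SSL_RSA_WITH_DES_CBC_SHA"};
        String exclude = "SSL_RSA_WITH_DES_CBC_SHA";
        System.out.println(Arrays.toString(excludeCiphers(enabled, exclude))); // prints [TLS_AES_128_GCM_SHA256]
    }
}
```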
[jira] [Updated] (HADOOP-12668) Support excluding weak Ciphers in HttpServer2 through ssl-server.conf
[ https://issues.apache.org/jira/browse/HADOOP-12668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12668: --- Fix Version/s: 2.7.4
[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12765: --- Fix Version/s: 2.8.0
[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12765: --- Target Version/s: 2.7.4 (was: 2.9.0)
[jira] [Commented] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433671#comment-15433671 ] Zhe Zhang commented on HADOOP-12765: Just noticed that the branch-2 patch has already passed Jenkins. +1. I will commit shortly.
[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13055: --- Attachment: HADOOP-13055.01.patch Updating patch: # Fixing a bug in initializing {{root}} which caused the unit test failures # Enforcing that merge slash and regular links don't co-exist # Adding a unit test for the above
[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13055: --- Attachment: HADOOP-13055.00.patch A rough initial patch to test whether the idea breaks any existing unit tests. I'm still working on: # Enforcing that {{linkMergeSlash}} is not used together with regular links # An issue in {{ViewFileSystem#getFileStatus}} causing the returned status to have a wrong path (the {{LocatedFileStatus}} that it wraps is correct) # More unit tests
[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13055: --- Status: Patch Available (was: Open)
[jira] [Commented] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433234#comment-15433234 ] Zhe Zhang commented on HADOOP-13055: I think we need to make two changes to {{InodeTree}}: # {{root}} of a mount table can be either an {{INodeDir}} or an {{INodeLink}}. So we should make it an {{INode}} and assign its value after checking the configurations (at the end of the for loop in the constructor). # Enforce that when {{linkMergeSlash}} is configured, no other links can be configured for that mount table. I'm writing a patch to implement the above. Any thoughts are very welcome.
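The two proposed changes can be sketched with toy classes (all names hypothetical, not Hadoop's actual {{InodeTree}} code): the mount-table root is typed as the common {{INode}} and chosen only after the configuration has been read, and a linkMergeSlash is rejected when regular links are also present.

```java
// Toy model: an abstract INode with a directory and a link variant.
abstract class INode {
    final String fullPath;
    INode(String fullPath) { this.fullPath = fullPath; }
}

class INodeDir extends INode {
    INodeDir(String fullPath) { super(fullPath); }
}

class INodeLink extends INode {
    final String targetUri;
    INodeLink(String fullPath, String targetUri) {
        super(fullPath);
        this.targetUri = targetUri;
    }
}

class MountTable {
    final INode root;  // change 1: root is the common INode type
    MountTable(String linkMergeSlashTarget, int regularLinkCount) {
        if (linkMergeSlashTarget != null) {
            // change 2: linkMergeSlash excludes all other links
            if (regularLinkCount > 0) {
                throw new IllegalArgumentException(
                    "linkMergeSlash cannot coexist with other mount links");
            }
            root = new INodeLink("/", linkMergeSlashTarget);
        } else {
            root = new INodeDir("/");
        }
    }
}

public class LinkMergeSlashDemo {
    public static void main(String[] args) {
        MountTable merged = new MountTable("hdfs://nn99/", 0);
        System.out.println(merged.root instanceof INodeLink); // prints true
        MountTable regular = new MountTable(null, 3);
        System.out.println(regular.root instanceof INodeDir); // prints true
    }
}
```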
[jira] [Commented] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15433134#comment-15433134 ] Zhe Zhang commented on HADOOP-12765: Thanks [~jojochuang]. Branch-2 patch LGTM. +1 pending Jenkins. The conflict is caused by HADOOP-10588. It's only in branch-2, not trunk. I'll file a JIRA to address it.
[jira] [Updated] (HADOOP-12765) HttpServer2 should switch to using the non-blocking SslSelectChannelConnector to prevent performance degradation when handling SSL connections
[ https://issues.apache.org/jira/browse/HADOOP-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-12765: --- Fix Version/s: 3.0.0-alpha2
[jira] [Issue Comment Deleted] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13055: --- Comment: was deleted (was: Actually, after looking more at the code, what I really want is the ability to mount the root of an HDFS to some path in the mount table (not necessarily the root of the mount table). This is already possible. Unassigning myself from this JIRA.)
[jira] [Updated] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13055: --- Target Version/s: 2.7.4 (was: 2.8.0)
[jira] [Assigned] (HADOOP-13055) Implement linkMergeSlash for ViewFs
[ https://issues.apache.org/jira/browse/HADOOP-13055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang reassigned HADOOP-13055: -- Assignee: Zhe Zhang
[jira] [Updated] (HADOOP-13408) TestUTF8 fails in branch-2.6
[ https://issues.apache.org/jira/browse/HADOOP-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HADOOP-13408: --- Assignee: Ye Zhou > TestUTF8 fails in branch-2.6 > > > Key: HADOOP-13408 > URL: https://issues.apache.org/jira/browse/HADOOP-13408 > Project: Hadoop Common > Issue Type: Bug > Components: test >Reporter: Zhe Zhang >Assignee: Ye Zhou >Priority: Minor > Labels: newbie >
[jira] [Moved] (HADOOP-13408) TestUTF8 fails in branch-2.6
[ https://issues.apache.org/jira/browse/HADOOP-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhe Zhang moved HDFS-10680 to HADOOP-13408:
-------------------------------------------
    Target Version/s: 2.6.5  (was: 2.6.5)
         Component/s: (was: test)
                      test
                 Key: HADOOP-13408  (was: HDFS-10680)
             Project: Hadoop Common  (was: Hadoop HDFS)

> TestUTF8 fails in branch-2.6
> ----------------------------
>
>                 Key: HADOOP-13408
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13408
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: test
>            Reporter: Zhe Zhang
>            Priority: Minor
>              Labels: newbie
>
[jira] [Commented] (HADOOP-13206) Delegation token cannot be fetched and used by different versions of client
[ https://issues.apache.org/jira/browse/HADOOP-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386692#comment-15386692 ]

Zhe Zhang commented on HADOOP-13206:
------------------------------------
Thanks much for the suggestion, [~cnauroth]. Although in our case the issue happened between 2.6 and 2.3 clients, I now think it can also happen between clients of the same version, for two reasons. Assume there is a client {{A}}, which fetches tokens, and a client {{B}}, which uses them.
# Client {{A}} and client {{B}} could use different values of {{hadoop.security.token.service.use_ip}}. Should we treat this as a misconfiguration and enforce the same value across an entire production environment?
# Client {{A}}, when fetching the token, could refer to the NameNode by a numerical IP address, such as {{webhdfs://123.45.67.89:50070}}, while client {{B}}, when using the token, could use a logical URI such as {{webhdfs://clusterNN}}.

Good point about DNS overhead. How about we update the patch so that the newly added check is performed only when one URI is logical and the other is not?

> Delegation token cannot be fetched and used by different versions of client
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-13206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13206
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.3.0, 2.6.1
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HADOOP-13206.00.patch, HADOOP-13206.01.patch,
> HADOOP-13206.02.patch
>
> We have observed that an HDFS delegation token fetched by a 2.3.0 client
> cannot be used by a 2.6.1 client, and vice versa. Through some debugging I
> found that it's a mismatch between the token's {{service}} and the
> {{service}} of the filesystem (e.g. {{webhdfs://host.something.com:50070/}}):
> one would be in numerical IP address format and the other in hostname format.
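To make reason 2 concrete, here is a small, hypothetical sketch (the class and method names are mine, not from the patch or from Hadoop) of why a token whose service was recorded as a numeric IP fails a plain string comparison against a hostname-based service, even when both strings refer to the same NameNode:

```java
// Hypothetical sketch, not the actual Hadoop implementation: token "service"
// strings are conceptually "host:port", and token selection effectively
// relies on string equality between the token's service and the
// filesystem's service.
public class TokenServiceMismatch {
    // Build a token service string of the shape "host:port".
    static String buildService(String host, int port) {
        return host + ":" + port;
    }

    public static void main(String[] args) {
        // Client A fetched the token using the NameNode's numeric IP...
        String fetched = buildService("123.45.67.89", 50070);
        // ...while client B addresses the same NameNode by hostname.
        String used = buildService("host.something.com", 50070);
        // String equality fails, so the token is not matched to the
        // filesystem even though both services name the same node.
        System.out.println(fetched.equals(used)); // false
    }
}
```

In real Hadoop these strings are produced by {{SecurityUtil.buildTokenService}}; the sketch only mirrors the host:port shape to illustrate the mismatch.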
[jira] [Commented] (HADOOP-13206) Delegation token cannot be fetched and used by different versions of client
[ https://issues.apache.org/jira/browse/HADOOP-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386428#comment-15386428 ]

Zhe Zhang commented on HADOOP-13206:
------------------------------------
Thanks for the review, [~shv]. I'll post a patch to address this soon.

> Delegation token cannot be fetched and used by different versions of client
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-13206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13206
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.3.0, 2.6.1
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HADOOP-13206.00.patch, HADOOP-13206.01.patch,
> HADOOP-13206.02.patch
>
> We have observed that an HDFS delegation token fetched by a 2.3.0 client
> cannot be used by a 2.6.1 client, and vice versa. Through some debugging I
> found that it's a mismatch between the token's {{service}} and the
> {{service}} of the filesystem (e.g. {{webhdfs://host.something.com:50070/}}):
> one would be in numerical IP address format and the other in hostname format.
[jira] [Commented] (HADOOP-13206) Delegation token cannot be fetched and used by different versions of client
[ https://issues.apache.org/jira/browse/HADOOP-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386426#comment-15386426 ]

Zhe Zhang commented on HADOOP-13206:
------------------------------------
Just found that HADOOP-7733 had some very relevant discussion of this issue. [~cnauroth], [~daryn]: could you take a look at the patch posted here too? Thanks.

I don't think this is necessarily a misconfiguration, because a token can potentially be fetched and used by different clients (and possibly different versions of the client).

> Delegation token cannot be fetched and used by different versions of client
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-13206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13206
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.3.0, 2.6.1
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HADOOP-13206.00.patch, HADOOP-13206.01.patch,
> HADOOP-13206.02.patch
>
> We have observed that an HDFS delegation token fetched by a 2.3.0 client
> cannot be used by a 2.6.1 client, and vice versa. Through some debugging I
> found that it's a mismatch between the token's {{service}} and the
> {{service}} of the filesystem (e.g. {{webhdfs://host.something.com:50070/}}):
> one would be in numerical IP address format and the other in hostname format.
[jira] [Commented] (HADOOP-13206) Delegation token cannot be fetched and used by different versions of client
[ https://issues.apache.org/jira/browse/HADOOP-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385049#comment-15385049 ]

Zhe Zhang commented on HADOOP-13206:
------------------------------------
I did more debugging and found the reason why different versions of the client return different formats of {{service}}.

In *trunk*, {{WebHdfsFileSystem#getDelegationToken}} sets {{service}} as:
{code}
if (token != null) {
  token.setService(tokenServiceName);
{code}
{{tokenServiceName}} is set as follows:
{code}
this.tokenServiceName = isLogicalUri ?
    HAUtilClient.buildTokenServiceForLogicalUri(uri, getScheme())
    : SecurityUtil.buildTokenService(getCanonicalUri());
{code}
This essentially creates a logical URI like {{webhdfs://myhost}}.

In *branch-2.3*, the logic is as below, which results in numerical IPs:
{code}
SecurityUtil.setTokenService(token, getCurrentNNAddr());
...
this.nnAddrs = DFSUtil.resolveWebHdfsUri(this.uri, conf);
...
/**
 * Resolve an HDFS URL into real InetSocketAddresses. It works like a DNS
 * resolver when the URL points to a non-HA cluster. When the URL points to
 * an HA cluster, the resolver further resolves the logical name (i.e., the
 * authority in the URL) into real namenode addresses.
 */
public static InetSocketAddress[] resolveWebHdfsUri(URI uri, Configuration conf)
    throws IOException {
  int defaultPort;
  String scheme = uri.getScheme();
  if (WebHdfsFileSystem.SCHEME.equals(scheme)) {
    defaultPort = DFSConfigKeys.DFS_NAMENODE_HTTP_PORT_DEFAULT;
  } else if (SWebHdfsFileSystem.SCHEME.equals(scheme)) {
    defaultPort = DFSConfigKeys.DFS_NAMENODE_HTTPS_PORT_DEFAULT;
  } else {
    throw new IllegalArgumentException("Unsupported scheme: " + scheme);
  }
  ArrayList<InetSocketAddress> ret = new ArrayList<InetSocketAddress>();
  if (!HAUtil.isLogicalUri(conf, uri)) {
    InetSocketAddress addr = NetUtils.createSocketAddr(uri.getAuthority(),
        defaultPort);
    ret.add(addr);
  } else {
    Map<String, Map<String, InetSocketAddress>> addresses = DFSUtil
        .getHaNnWebHdfsAddresses(conf, scheme);
    for (Map<String, InetSocketAddress> addrs : addresses.values()) {
      for (InetSocketAddress addr : addrs.values()) {
        ret.add(addr);
      }
    }
  }
  InetSocketAddress[] r = new InetSocketAddress[ret.size()];
  return ret.toArray(r);
{code}
It's hard to add a unit test because we can't emulate a version 2.3 client in trunk code. But I hope the above explanation is clear enough.

> Delegation token cannot be fetched and used by different versions of client
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-13206
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13206
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.3.0, 2.6.1
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HADOOP-13206.00.patch, HADOOP-13206.01.patch,
> HADOOP-13206.02.patch
>
> We have observed that an HDFS delegation token fetched by a 2.3.0 client
> cannot be used by a 2.6.1 client, and vice versa. Through some debugging I
> found that it's a mismatch between the token's {{service}} and the
> {{service}} of the filesystem (e.g. {{webhdfs://host.something.com:50070/}}):
> one would be in numerical IP address format and the other in hostname format.
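As a side note for readers comparing the two code paths above, here is a hypothetical, self-contained illustration (the class and method names are mine, not Hadoop's) of how the two styles yield differently formatted service strings: the trunk path keeps a logical authority verbatim, while the branch-2.3 path resolves the authority to an address first, which is where the numeric form comes from.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration of the two service-string styles discussed above.
public class ServiceFormats {

    // trunk-style: a logical (HA) authority is kept verbatim and prefixed
    // with the scheme, e.g. "webhdfs://myhost".
    static String logicalService(String scheme, String authority) {
        return scheme + "://" + authority;
    }

    // branch-2.3-style: the authority is resolved like a DNS name and the
    // service becomes "ip:port".
    static String resolvedService(String host, int port)
            throws UnknownHostException {
        InetAddress addr = InetAddress.getByName(host);
        return addr.getHostAddress() + ":" + port;
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println(logicalService("webhdfs", "myhost"));
        System.out.println(resolvedService("127.0.0.1", 50070));
    }
}
```

The two outputs ({{webhdfs://myhost}} vs. {{127.0.0.1:50070}}) never compare equal as strings, which is the mismatch this issue describes.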
[jira] [Commented] (HADOOP-13289) Remove unused variables in TestFairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15380226#comment-15380226 ]

Zhe Zhang commented on HADOOP-13289:
------------------------------------
Thanks Ye for the patch and Akira for reviewing / committing. I'd like to backport this to 2.6, but the cherry-pick from 2.8 to 2.7 is not clean. [~zhouyejoe], do you mind posting a patch for branch-2.7? Thanks.

> Remove unused variables in TestFairCallQueue
> --------------------------------------------
>
>                 Key: HADOOP-13289
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13289
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 2.6.0
>            Reporter: Konstantin Shvachko
>            Assignee: Ye Zhou
>            Priority: Minor
>              Labels: newbie
>             Fix For: 2.8.0
>
>         Attachments: HADOOP-13289.001.patch
>
> # Remove unused member {{alwaysZeroScheduler}} and related initialization in
> {{TestFairCallQueue}}
> # Remove unused local variable {{sched}} in
> {{testOfferSucceedsWhenScheduledLowPriority()}}
> And propagate to applicable release branches.
[jira] [Updated] (HADOOP-13290) Appropriate use of generics in FairCallQueue
[ https://issues.apache.org/jira/browse/HADOOP-13290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhe Zhang updated HADOOP-13290:
-------------------------------
          Resolution: Fixed
        Hadoop Flags: Reviewed
       Fix Version/s: 3.0.0-alpha1
                      2.6.5
                      2.9.0
                      2.7.3
                      2.8.0
              Status: Resolved  (was: Patch Available)

Thanks Jonathan for the fix and Konstantin for the review. I just committed the patch to trunk through branch-2.6.

> Appropriate use of generics in FairCallQueue
> --------------------------------------------
>
>                 Key: HADOOP-13290
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13290
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 2.6.0
>            Reporter: Konstantin Shvachko
>            Assignee: Jonathan Hung
>              Labels: newbie++
>             Fix For: 2.8.0, 2.7.3, 2.9.0, 2.6.5, 3.0.0-alpha1
>
>         Attachments: HADOOP-13290.001.patch, HADOOP-13290.002.patch
>
> # {{BlockingQueue}} is intermittently used with and without generic
> parameters in {{FairCallQueue}} class. Should be parameterized.
> # Same for {{FairCallQueue}}. Should be parameterized. Could be a bit more
> tricky for that one.