[jira] [Resolved] (HDFS-16686) GetJournalEditServlet fails to authorize valid Kerberos request
[ https://issues.apache.org/jira/browse/HDFS-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-16686. - Fix Version/s: 3.3.9 Hadoop Flags: Reviewed Resolution: Fixed > GetJournalEditServlet fails to authorize valid Kerberos request > --- > > Key: HDFS-16686 > URL: https://issues.apache.org/jira/browse/HDFS-16686 > Project: Hadoop HDFS > Issue Type: Improvement > Components: journal-node >Affects Versions: 3.4.0, 3.3.9 > Environment: Running in Kubernetes using Java 11 in an HA > configuration. JournalNodes run on separate pods and have their own Kerberos > principal "jn/@". >Reporter: Steve Vaughan >Assignee: Steve Vaughan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.9 > > > GetJournalEditServlet uses request.getRemoteUser() to determine the > remoteShortName for Kerberos authorization, which fails to match when the > JournalNode uses its own Kerberos principal (e.g. jn/@). > This can be fixed by using the UserGroupInformation provided by the base > DfsServlet class via the getUGI(request, conf) call. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
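For context, the mismatch above comes down to how a Kerberos "short name" relates to a full principal. The actual patch relies on DfsServlet's getUGI(request, conf), which applies the cluster's auth_to_local rules; the simplified ShortName class below is a hypothetical illustration that only strips the instance and realm parts:

```java
// Hypothetical sketch: deriving a "short name" from a Kerberos principal such
// as "jn/host.example.com@EXAMPLE.COM". The real fix uses the UGI from
// DfsServlet.getUGI(request, conf), which applies auth_to_local rules; this
// version only strips the instance ("/host") and realm ("@REALM") parts.
public class ShortName {
    static String toShortName(String principal) {
        // Drop the realm ("@EXAMPLE.COM") if present.
        int at = principal.indexOf('@');
        String withoutRealm = (at >= 0) ? principal.substring(0, at) : principal;
        // Drop the instance ("/host.example.com") if present.
        int slash = withoutRealm.indexOf('/');
        return (slash >= 0) ? withoutRealm.substring(0, slash) : withoutRealm;
    }

    public static void main(String[] args) {
        // A service principal and a plain user principal both map to a short name.
        System.out.println(toShortName("jn/host.example.com@EXAMPLE.COM"));
        System.out.println(toShortName("hdfs@EXAMPLE.COM"));
    }
}
```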
[jira] [Resolved] (HDFS-4043) Namenode Kerberos Login does not use proper hostname for host qualified hdfs principal name.
[ https://issues.apache.org/jira/browse/HDFS-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-4043. Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Namenode Kerberos Login does not use proper hostname for host qualified hdfs > principal name. > > > Key: HDFS-4043 > URL: https://issues.apache.org/jira/browse/HDFS-4043 > Project: Hadoop HDFS > Issue Type: Bug > Components: security >Affects Versions: 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 2.0.3-alpha, > 3.4.0, 3.3.9 > Environment: CDH4U1 on Ubuntu 12.04 >Reporter: Ahad Rana >Assignee: Steve Vaughan >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Original Estimate: 24h > Time Spent: 50m > Remaining Estimate: 23h 10m > > The Namenode uses the loginAsNameNodeUser method in NameNode.java to login > using the hdfs principal. This method in turn invokes SecurityUtil.login with > a hostname (last parameter) obtained via a call to InetAddress.getHostName. > This call does not always return the fully qualified host name, and thus > causes the namenode login to fail due to Kerberos's inability to find a > matching hdfs principal in the hdfs.keytab file. Instead it should use > InetAddress.getCanonicalHostName. This is consistent with what is used > internally by SecurityUtil.java to log in other services, such as the > DataNode.
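The difference between the two InetAddress calls discussed above can be seen with a small stand-alone program. The output depends on the local resolver configuration, so no specific host name is assumed; the point is that getCanonicalHostName() performs the reverse lookup that yields the fully qualified name Kerberos principal matching needs:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustration only: getHostName() may return a short name, while
// getCanonicalHostName() attempts a reverse lookup for the fully qualified
// domain name. Actual results depend on /etc/hosts and DNS configuration.
public class HostNames {
    public static void main(String[] args) throws UnknownHostException {
        InetAddress addr = InetAddress.getLocalHost();
        System.out.println("getHostName():          " + addr.getHostName());
        System.out.println("getCanonicalHostName(): " + addr.getCanonicalHostName());
    }
}
```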
[jira] [Resolved] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error
[ https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-16702. - Hadoop Flags: Reviewed Resolution: Fixed > MiniDFSCluster should report cause of exception in assertion error > -- > > Key: HDFS-16702 > URL: https://issues.apache.org/jira/browse/HDFS-16702 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs > Environment: Tests running in the Hadoop dev environment image. >Reporter: Steve Vaughan >Assignee: Steve Vaughan >Priority: Minor > Labels: pull-request-available > Time Spent: 3h 40m > Remaining Estimate: 0h > > When the MiniDFSCluster detects that an exception caused an exit, it should > include that exception as the cause of the AssertionError that it throws. > The current AssertionError simply reports the message "Test resulted in an > unexpected exit" and provides a stack trace to the location of the check for > an exit exception.
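The improvement amounts to using the two-argument AssertionError constructor (available since Java 7) so the triggering exception travels with the assertion failure. A minimal sketch, with a hypothetical checkExit helper standing in for the MiniDFSCluster check:

```java
// Sketch of the improvement: wrap the detected exit exception as the cause of
// the AssertionError, so test logs show the real failure instead of only the
// location of the check. The checkExit helper here is illustrative, not the
// actual MiniDFSCluster method.
public class ExitCheck {
    static void checkExit(Throwable exitException) {
        if (exitException != null) {
            // AssertionError(String, Throwable) preserves the original stack.
            throw new AssertionError("Test resulted in an unexpected exit", exitException);
        }
    }

    public static void main(String[] args) {
        try {
            checkExit(new RuntimeException("disk failure during test"));
        } catch (AssertionError e) {
            System.out.println("cause: " + e.getCause());
        }
    }
}
```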
[jira] [Resolved] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress
[ https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-16507. - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > [SBN read] Avoid purging edit log which is in progress > -- > > Key: HDFS-16507 > URL: https://issues.apache.org/jira/browse/HDFS-16507 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: tomscut >Assignee: tomscut >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL > exception. It looks like it is purging an edit log which is still in progress. > According to the analysis, I suspect that the in-progress edit log to be > purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before the > ANN rolls its own edit log. > The stack: > {code:java} > java.lang.Thread.getStackTrace(Thread.java:1552) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > > org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185) > > org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388) > > org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620) > > org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512) > org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177) > > org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516) > java.security.AccessController.doPrivileged(Native Method) >
javax.security.auth.Subject.doAs(Subject.java:422) > > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > > org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515) > javax.servlet.http.HttpServlet.service(HttpServlet.java:710) > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > org.eclipse.jetty.server.Server.handle(Server.java:539) > org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) > > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > > 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > >
[jira] [Resolved] (HDFS-16410) Insecure Xml parsing in OfflineEditsXmlLoader
[ https://issues.apache.org/jira/browse/HDFS-16410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-16410. - Fix Version/s: 3.4.0 3.3.2 Resolution: Fixed > Insecure Xml parsing in OfflineEditsXmlLoader > -- > > Key: HDFS-16410 > URL: https://issues.apache.org/jira/browse/HDFS-16410 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.1 >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Minor > Labels: pull-request-available, security > Fix For: 3.4.0, 3.3.2 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Insecure Xml parsing in OfflineEditsXmlLoader > [https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/OfflineEditsXmlLoader.java#L88]
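For context, the standard JAXP hardening for a SAX-based loader such as OfflineEditsXmlLoader looks like the sketch below: it disables DOCTYPEs and external entities so a crafted edits XML file cannot trigger XXE. This is the generic mitigation pattern, not necessarily the exact patch that was committed:

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Generic XXE hardening for a SAX parser. The feature URIs are the documented
// JAXP/Xerces ones; a parser configured this way rejects DOCTYPE declarations
// outright and never resolves external general or parameter entities.
public class SecureSax {
    public static SAXParser newSecureParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        return factory.newSAXParser();
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = newSecureParser();
        System.out.println("hardened parser created, namespaceAware=" + parser.isNamespaceAware());
    }
}
```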
[jira] [Resolved] (HDFS-15754) Create packet metrics for DataNode
[ https://issues.apache.org/jira/browse/HDFS-15754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-15754. - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Create packet metrics for DataNode > -- > > Key: HDFS-15754 > URL: https://issues.apache.org/jira/browse/HDFS-15754 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > In BlockReceiver, right now when there is slowness in writeToMirror, > writeToDisk and writeToOsCache, it is dumped in the debug log. In practice we > have found these are quite useful signals for detecting issues in a DataNode, > so it would be great if these metrics could be exposed via JMX. > We also introduced a total-packets-received count so that a percentage can be > used as a signal to detect potentially underperforming datanodes, since > datanodes across one HDFS cluster may receive different total numbers of packets.
[jira] [Resolved] (HDFS-15708) TestURLConnectionFactory fails by NoClassDefFoundError in branch-3.3 and branch-3.2
[ https://issues.apache.org/jira/browse/HDFS-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-15708. - Fix Version/s: 3.3.1 3.2.2 Hadoop Flags: Reviewed Resolution: Fixed > TestURLConnectionFactory fails by NoClassDefFoundError in branch-3.3 and > branch-3.2 > --- > > Key: HDFS-15708 > URL: https://issues.apache.org/jira/browse/HDFS-15708 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: test >Reporter: Akira Ajisaka >Assignee: Chao Sun >Priority: Blocker > Labels: pull-request-available > Fix For: 3.2.2, 3.3.1 > > Time Spent: 50m > Remaining Estimate: 0h > > TestURLConnectionFactory#testSSLFactoryCleanup fails: > {noformat} > [ERROR] > testSSLFactoryCleanup(org.apache.hadoop.hdfs.web.TestURLConnectionFactory) > Time elapsed: 0.28 s <<< ERROR! > java.lang.NoClassDefFoundError: > org/bouncycastle/x509/X509V1CertificateGenerator > at > org.apache.hadoop.security.ssl.KeyStoreTestUtil.generateCertificate(KeyStoreTestUtil.java:86) > at > org.apache.hadoop.security.ssl.KeyStoreTestUtil.setupSSLConfig(KeyStoreTestUtil.java:273) > at > org.apache.hadoop.security.ssl.KeyStoreTestUtil.setupSSLConfig(KeyStoreTestUtil.java:228) > at > org.apache.hadoop.hdfs.web.TestURLConnectionFactory.testSSLFactoryCleanup(TestURLConnectionFactory.java:83) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > Caused by: java.lang.ClassNotFoundException: > org.bouncycastle.x509.X509V1CertificateGenerator > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:418) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:351) > ... 29 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-15690) Add lz4-java as hadoop-hdfs test dependency
[ https://issues.apache.org/jira/browse/HDFS-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-15690. - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Committed to trunk. I'll backport this to 3.3 branch together with HADOOP-17292 later. > Add lz4-java as hadoop-hdfs test dependency > --- > > Key: HDFS-15690 > URL: https://issues.apache.org/jira/browse/HDFS-15690 > Project: Hadoop HDFS > Issue Type: Test >Reporter: L. C. Hsieh >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > TestFSImage.testNativeCompression fails with "java.lang.NoClassDefFoundError: > net/jpountz/lz4/LZ4Factory": > https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/305/testReport/junit/org.apache.hadoop.hdfs.server.namenode/TestFSImage/testNativeCompression/ > We need to add lz4-java as a hadoop-hdfs test dependency.
[jira] [Created] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature
Chao Sun created HDFS-15601: --- Summary: Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature Key: HDFS-15601 URL: https://issues.apache.org/jira/browse/HDFS-15601 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Chao Sun HDFS-13616 requires both server and client side changes. However, it is common that users use a newer client to talk to an older HDFS (say 2.10). Currently the client will simply fail in this scenario. A better approach, perhaps, is to have the client fall back to non-batched listing on the input directories.
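The proposed fallback could look roughly like the following. The Lister interface and its method names are purely illustrative stand-ins for the HDFS client API; the point is the shape of the degradation path, not the actual signatures:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the fallback strategy: try the batched call, and if
// the server rejects it as unsupported (an older NameNode), degrade to
// listing each input directory individually.
public class BatchedListingFallback {
    interface Lister {
        List<String> listBatched(List<String> dirs);  // may throw if unsupported
        List<String> list(String dir);                // always available
    }

    static List<String> listAll(Lister client, List<String> dirs) {
        try {
            return client.listBatched(dirs);          // fast path: newer server
        } catch (UnsupportedOperationException e) {
            List<String> out = new ArrayList<>();     // slow path: older server
            for (String dir : dirs) {
                out.addAll(client.list(dir));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        // Stub simulating an older server that rejects the batched API.
        Lister oldServer = new Lister() {
            public List<String> listBatched(List<String> dirs) {
                throw new UnsupportedOperationException("older NameNode");
            }
            public List<String> list(String dir) {
                return List.of(dir + "/file");
            }
        };
        System.out.println(listAll(oldServer, List.of("/a", "/b")));
    }
}
```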
[jira] [Resolved] (HDFS-15014) RBF: WebHdfs chooseDatanode shouldn't call getDatanodeReport
[ https://issues.apache.org/jira/browse/HDFS-15014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-15014. - Resolution: Duplicate > RBF: WebHdfs chooseDatanode shouldn't call getDatanodeReport > - > > Key: HDFS-15014 > URL: https://issues.apache.org/jira/browse/HDFS-15014 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Reporter: Chao Sun >Priority: Major > > Currently the {{chooseDatanode}} call (which is shared by {{open}}, > {{create}}, {{append}} and {{getFileChecksum}}) in RBF WebHDFS calls > {{getDatanodeReport}} from ALL downstream namenodes: > {code} > private DatanodeInfo chooseDatanode(final Router router, > final String path, final HttpOpParam.Op op, final long openOffset, > final String excludeDatanodes) throws IOException { > // We need to get the DNs as a privileged user > final RouterRpcServer rpcServer = getRPCServer(router); > UserGroupInformation loginUser = UserGroupInformation.getLoginUser(); > RouterRpcServer.setCurrentUser(loginUser); > DatanodeInfo[] dns = null; > try { > dns = rpcServer.getDatanodeReport(DatanodeReportType.LIVE); > } catch (IOException e) { > LOG.error("Cannot get the datanodes from the RPC server", e); > } finally { > // Reset ugi to remote user for remaining operations. > RouterRpcServer.resetCurrentUser(); > } > HashSet excludes = new HashSet(); > if (excludeDatanodes != null) { > Collection collection = > getTrimmedStringCollection(excludeDatanodes); > for (DatanodeInfo dn : dns) { > if (collection.contains(dn.getName())) { > excludes.add(dn); > } > } > } > ... > {code} > The {{getDatanodeReport}} is very expensive (particularly in a large cluster) > as it needs to lock the {{DatanodeManager}} which is also shared by calls such > as processing heartbeats. Check HDFS-14366 for a similar issue.
[jira] [Resolved] (HDFS-15465) Support WebHDFS accesses to the data stored in secure Datanode through insecure Namenode
[ https://issues.apache.org/jira/browse/HDFS-15465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-15465. - Fix Version/s: 3.4.0 Resolution: Fixed > Support WebHDFS accesses to the data stored in secure Datanode through > insecure Namenode > > > Key: HDFS-15465 > URL: https://issues.apache.org/jira/browse/HDFS-15465 > Project: Hadoop HDFS > Issue Type: Wish > Components: federation, webhdfs >Reporter: Toshihiko Uchida >Assignee: Toshihiko Uchida >Priority: Minor > Fix For: 3.4.0 > > Attachments: webhdfs-federation.pdf > > > We're federating a secure HDFS cluster with an insecure cluster. > Using HDFS RPC, we can access the data managed by insecure Namenode and > stored in secure Datanode. > However, it does not work for WebHDFS due to HadoopIllegalArgumentException. > {code} > $ curl -i "http://:/webhdfs/v1/?op=OPEN" > HTTP/1.1 307 TEMPORARY_REDIRECT > (omitted) > Location: > http://:/webhdfs/v1/?op=OPEN==0 > $ curl -i > "http://:/webhdfs/v1/?op=OPEN==0" > HTTP/1.1 400 Bad Request > (omitted) > {"RemoteException":{"exception":"HadoopIllegalArgumentException","javaClassName":"org.apache.hadoop.HadoopIllegalArgumentException","message":"Invalid > argument, newValue is null"}} > {code} > This is because the secure Datanode expects a delegation token, but the > insecure Namenode does not return one to the client. > - org.apache.hadoop.security.token.Token.decodeWritable > {code} > private static void decodeWritable(Writable obj, > String newValue) throws IOException { > if (newValue == null) { > throw new HadoopIllegalArgumentException( > "Invalid argument, newValue is null"); > } > {code} > This issue proposes to support such access for WebHDFS as well. > The attached PDF file [^webhdfs-federation.pdf] depicts our current > architecture and proposal.
[jira] [Created] (HDFS-15423) RBF: WebHDFS create shouldn't choose DN from all sub-clusters
Chao Sun created HDFS-15423: --- Summary: RBF: WebHDFS create shouldn't choose DN from all sub-clusters Key: HDFS-15423 URL: https://issues.apache.org/jira/browse/HDFS-15423 Project: Hadoop HDFS Issue Type: Bug Components: rbf Reporter: Chao Sun In {{RouterWebHdfsMethods}} and for a {{CREATE}} call, {{chooseDatanode}} first gets all DNs via {{getDatanodeReport}}, and then randomly picks one from the list via {{getRandomDatanode}}. This logic doesn't seem correct, as it should pick a DN from the specific sub-cluster(s) of the input {{path}}.
[jira] [Created] (HDFS-15335) Report top N metrics for files in get listing ops
Chao Sun created HDFS-15335: --- Summary: Report top N metrics for files in get listing ops Key: HDFS-15335 URL: https://issues.apache.org/jira/browse/HDFS-15335 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs, metrics Reporter: Chao Sun Currently HDFS has a {{filesInGetListingOps}} metric which reports the total number of files across all listing ops. However, it would be useful to report the top N users who contribute most to this. This can help identify potentially abusive users and stop abuse of the NameNode.
[jira] [Created] (HDFS-15029) RBF: Supporting batched listing
Chao Sun created HDFS-15029: --- Summary: RBF: Supporting batched listing Key: HDFS-15029 URL: https://issues.apache.org/jira/browse/HDFS-15029 Project: Hadoop HDFS Issue Type: Sub-task Components: rbf Reporter: Chao Sun After the work for batched listing in HDFS is implemented, we should also support the API for RBF.
[jira] [Created] (HDFS-15015) Backport HDFS-5040 to branch-2
Chao Sun created HDFS-15015: --- Summary: Backport HDFS-5040 to branch-2 Key: HDFS-15015 URL: https://issues.apache.org/jira/browse/HDFS-15015 Project: Hadoop HDFS Issue Type: Improvement Components: logging Reporter: Chao Sun Assignee: Chao Sun HDFS-5040 added audit logging for several admin commands which are useful for diagnosing and debugging. For instance, {{getDatanodeReport}} is an expensive call and can be invoked by components such as RBF for metrics and others. It's better to track them in the audit log.
[jira] [Created] (HDFS-15014) [RBF] WebHdfs chooseDatanode shouldn't call getDatanodeReport
Chao Sun created HDFS-15014: --- Summary: [RBF] WebHdfs chooseDatanode shouldn't call getDatanodeReport Key: HDFS-15014 URL: https://issues.apache.org/jira/browse/HDFS-15014 Project: Hadoop HDFS Issue Type: Bug Components: rbf Reporter: Chao Sun Currently the {{chooseDatanode}} call (which is shared by {{open}}, {{create}}, {{append}} and {{getFileChecksum}}) in RBF WebHDFS calls {{getDatanodeReport}} from ALL downstream namenodes:
{code}
private DatanodeInfo chooseDatanode(final Router router,
    final String path, final HttpOpParam.Op op, final long openOffset,
    final String excludeDatanodes) throws IOException {
  // We need to get the DNs as a privileged user
  final RouterRpcServer rpcServer = getRPCServer(router);
  UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
  RouterRpcServer.setCurrentUser(loginUser);

  DatanodeInfo[] dns = null;
  try {
    dns = rpcServer.getDatanodeReport(DatanodeReportType.LIVE);
  } catch (IOException e) {
    LOG.error("Cannot get the datanodes from the RPC server", e);
  } finally {
    // Reset ugi to remote user for remaining operations.
    RouterRpcServer.resetCurrentUser();
  }

  HashSet<DatanodeInfo> excludes = new HashSet<>();
  if (excludeDatanodes != null) {
    Collection<String> collection =
        getTrimmedStringCollection(excludeDatanodes);
    for (DatanodeInfo dn : dns) {
      if (collection.contains(dn.getName())) {
        excludes.add(dn);
      }
    }
  }
  ...
{code}
The {{getDatanodeReport}} is very expensive (particularly in a large cluster) as it needs to lock the {{DatanodeManager}} which is also shared by calls such as processing heartbeats. Check HDFS-14366 for a similar issue.
[jira] [Created] (HDFS-15005) Backport HDFS-12300 to branch-2
Chao Sun created HDFS-15005: --- Summary: Backport HDFS-12300 to branch-2 Key: HDFS-15005 URL: https://issues.apache.org/jira/browse/HDFS-15005 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Reporter: Chao Sun Assignee: Chao Sun Having delegation token (DT) related information in the audit log is very useful. This tracks the effort to backport HDFS-12300 to branch-2.
[jira] [Reopened] (HDFS-14034) Support getQuotaUsage API in WebHDFS
[ https://issues.apache.org/jira/browse/HDFS-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reopened HDFS-14034: - Re-opening this for backporting to branch-2. > Support getQuotaUsage API in WebHDFS > > > Key: HDFS-14034 > URL: https://issues.apache.org/jira/browse/HDFS-14034 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: fs, webhdfs >Reporter: Erik Krogen >Assignee: Chao Sun >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14034-branch-2.000.patch, > HDFS-14034-branch-2.001.patch, HDFS-14034.000.patch, HDFS-14034.001.patch, > HDFS-14034.002.patch, HDFS-14034.004.patch > > > HDFS-8898 added support for a new API, {{getQuotaUsage}} which can fetch > quota usage on a directory with significantly lower impact than the similar > {{getContentSummary}}. This JIRA is to track adding support for this API to > WebHDFS.
[jira] [Resolved] (HDFS-14671) WebHDFS: Add erasureCodingPolicy to ContentSummary
[ https://issues.apache.org/jira/browse/HDFS-14671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-14671. - Resolution: Duplicate > WebHDFS: Add erasureCodingPolicy to ContentSummary > -- > > Key: HDFS-14671 > URL: https://issues.apache.org/jira/browse/HDFS-14671 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: webhdfs >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > HDFS-11647 added {{erasureCodingPolicy}} to {{ContentSummary}}. We should add > this info to the result from WebHDFS {{getContentSummary}} call as well.
[jira] [Created] (HDFS-14671) WebHDFS: Add erasureCodingPolicy to ContentSummary
Chao Sun created HDFS-14671: --- Summary: WebHDFS: Add erasureCodingPolicy to ContentSummary Key: HDFS-14671 URL: https://issues.apache.org/jira/browse/HDFS-14671 Project: Hadoop HDFS Issue Type: Sub-task Components: webhdfs Reporter: Chao Sun Assignee: Chao Sun HDFS-11647 added {{erasureCodingPolicy}} to {{ContentSummary}}. We should add this info to the result from WebHDFS {{getContentSummary}} call as well.
[jira] [Resolved] (HDFS-14110) NPE when serving http requests while NameNode is starting up
[ https://issues.apache.org/jira/browse/HDFS-14110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved HDFS-14110.
-----------------------------
    Resolution: Duplicate

> NPE when serving http requests while NameNode is starting up
> ------------------------------------------------------------
>
>                 Key: HDFS-14110
>                 URL: https://issues.apache.org/jira/browse/HDFS-14110
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.8.2
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Minor
>
> In 2.8.2 we saw this exception when a security-enabled NameNode is still
> loading edits. The NPE is raised in JspHelper.getTokenUGI (JspHelper.java:283),
> reached from JspHelper.getUGI while serving a WebHDFS request; the full stack
> trace is quoted in the original "Created" message for HDFS-14110 below.
[jira] [Created] (HDFS-14660) [SBN Read] ObserverNameNode should throw StandbyException for requests not from ObserverProxyProvider
Chao Sun created HDFS-14660:
-------------------------------

    Summary: [SBN Read] ObserverNameNode should throw StandbyException for requests not from ObserverProxyProvider
        Key: HDFS-14660
        URL: https://issues.apache.org/jira/browse/HDFS-14660
    Project: Hadoop HDFS
 Issue Type: Bug
   Reporter: Chao Sun
   Assignee: Chao Sun

In an HDFS HA cluster with consistent reads enabled (HDFS-12943), clients could be using {{ObserverReadProxyProvider}}, {{ConfiguredFailoverProxyProvider}}, or something else. Since an observer is just a special type of SBN and we allow transitions between them, a client NOT using {{ObserverReadProxyProvider}} will need to have {{dfs.ha.namenodes.}} include all NameNodes in the cluster, and therefore it may send requests to an observer node. For this case, we should check whether the {{stateId}} in the incoming RPC header is set, and throw a {{StandbyException}} when it is not.
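A minimal sketch of the proposed check, using simplified stand-ins rather than the actual Hadoop RPC types (the sentinel value and method names here are assumptions for illustration): if a call reaches an observer without a client stateId in its RPC header, reject it with a StandbyException so the client fails over.

```java
// Illustrative sketch only: NO_STATE_ID and these method names are
// assumptions, not the real Hadoop RPC header API.
public class ObserverStateIdCheck {
    static final long NO_STATE_ID = -1L; // assumed sentinel for "unset"

    static class StandbyException extends Exception {
        StandbyException(String msg) { super(msg); }
    }

    /** Returns true if the call may be served by the observer. */
    static boolean mayServeOnObserver(long headerStateId) {
        return headerStateId != NO_STATE_ID;
    }

    static void checkObserverCall(long headerStateId) throws StandbyException {
        if (!mayServeOnObserver(headerStateId)) {
            // Client is not state-aware; make it retry the active NN.
            throw new StandbyException(
                "Observer rejects call without a stateId");
        }
    }
}
```

A client using a state-aware proxy provider would pass the check; any other client would get the exception and fail over.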
[jira] [Resolved] (HDFS-13189) Standby NameNode should roll active edit log when checkpointing
[ https://issues.apache.org/jira/browse/HDFS-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved HDFS-13189.
-----------------------------
    Resolution: Duplicate

> Standby NameNode should roll active edit log when checkpointing
> ---------------------------------------------------------------
>
>                 Key: HDFS-13189
>                 URL: https://issues.apache.org/jira/browse/HDFS-13189
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: Chao Sun
>            Priority: Minor
>
> When the SBN is doing checkpointing, it holds the {{cpLock}}. In the current
> implementation of the edit log tailer thread, it will first check and roll the
> active edit log, and then tail and apply edits. In the case of checkpointing,
> it will be blocked on the {{cpLock}} and will not roll the edit log.
> It seems there is no dependency between the edit log roll and tailing edits,
> so a better approach may be to do these in separate threads. This would be
> helpful for people who use the observer feature without in-progress edit log
> tailing.
> An alternative is to configure
> {{dfs.namenode.edit.log.autoroll.multiplier.threshold}} and
> {{dfs.namenode.edit.log.autoroll.check.interval.ms}} to let the ANN roll its
> own log more frequently in case the SBN is stuck on the lock.
[jira] [Created] (HDFS-14415) Backport HDFS-13799 to branch-2
Chao Sun created HDFS-14415:
-------------------------------

    Summary: Backport HDFS-13799 to branch-2
        Key: HDFS-14415
        URL: https://issues.apache.org/jira/browse/HDFS-14415
    Project: Hadoop HDFS
 Issue Type: Improvement
   Reporter: Chao Sun
   Assignee: Chao Sun

As the multi-SBN feature is already backported to branch-2, this is a follow-up to backport HDFS-13799.
[jira] [Created] (HDFS-14399) Backport HDFS-10536 to branch-2
Chao Sun created HDFS-14399:
-------------------------------

    Summary: Backport HDFS-10536 to branch-2
        Key: HDFS-14399
        URL: https://issues.apache.org/jira/browse/HDFS-14399
    Project: Hadoop HDFS
 Issue Type: Bug
   Reporter: Chao Sun
   Assignee: Chao Sun

As the multi-SBN feature is already backported to branch-2, this is a follow-up to backport HDFS-10536.
[jira] [Created] (HDFS-14397) Backport HADOOP-15684 to branch-2
Chao Sun created HDFS-14397:
-------------------------------

    Summary: Backport HADOOP-15684 to branch-2
        Key: HDFS-14397
        URL: https://issues.apache.org/jira/browse/HDFS-14397
    Project: Hadoop HDFS
 Issue Type: Bug
   Reporter: Chao Sun
   Assignee: Chao Sun

As the multi-SBN feature is already backported to branch-2, this is a follow-up to backport HADOOP-15684.
[jira] [Created] (HDFS-14392) Backport HDFS-9787 to branch-2
Chao Sun created HDFS-14392:
-------------------------------

    Summary: Backport HDFS-9787 to branch-2
        Key: HDFS-14392
        URL: https://issues.apache.org/jira/browse/HDFS-14392
    Project: Hadoop HDFS
 Issue Type: Improvement
 Components: hdfs
   Reporter: Chao Sun
   Assignee: Chao Sun

As the multi-SBN feature is already backported to branch-2, this is a follow-up to backport HDFS-9787.
[jira] [Created] (HDFS-14391) Backport HDFS-9659 to branch-2
Chao Sun created HDFS-14391:
-------------------------------

    Summary: Backport HDFS-9659 to branch-2
        Key: HDFS-14391
        URL: https://issues.apache.org/jira/browse/HDFS-14391
    Project: Hadoop HDFS
 Issue Type: Improvement
   Reporter: Chao Sun
   Assignee: Chao Sun

As the multi-SBN feature is already backported to branch-2, this is a follow-up to backport HDFS-9659.
[jira] [Created] (HDFS-14366) Improve HDFS append performance
Chao Sun created HDFS-14366:
-------------------------------

    Summary: Improve HDFS append performance
        Key: HDFS-14366
        URL: https://issues.apache.org/jira/browse/HDFS-14366
    Project: Hadoop HDFS
 Issue Type: Improvement
 Components: hdfs
   Reporter: Chao Sun
   Assignee: Chao Sun

In our HDFS cluster we observed that the {{append}} operation can take as much as 10X the write-lock time of other write operations. By collecting a flamegraph on the NameNode (see attachment), we found that most of the append call is spent in {{getNumLiveDataNodes()}}:

{code}
/** @return the number of live datanodes. */
public int getNumLiveDataNodes() {
  int numLive = 0;
  synchronized (this) {
    for (DatanodeDescriptor dn : datanodeMap.values()) {
      if (!isDatanodeDead(dn)) {
        numLive++;
      }
    }
  }
  return numLive;
}
{code}

This method synchronizes on the {{DatanodeManager}}, which is particularly expensive in large clusters since {{datanodeMap}} is modified in many places, such as when processing DN heartbeats. For the {{append}} operation, {{getNumLiveDataNodes()}} is invoked in {{isSufficientlyReplicated}}:

{code}
/**
 * Check if a block is replicated to at least the minimum replication.
 */
public boolean isSufficientlyReplicated(BlockInfo b) {
  // Compare against the lesser of the minReplication and number of live DNs.
  final int replication = Math.min(minReplication,
      getDatanodeManager().getNumLiveDataNodes());
  return countNodes(b).liveReplicas() >= replication;
}
{code}

The way {{replication}} is calculated is suboptimal: it calls {{getNumLiveDataNodes()}} on every invocation, even though {{minReplication}} is usually much smaller than the number of live DNs.
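One way to avoid the expensive call, sketched below with simplified types (the {{LiveNodeCounter}} interface is illustrative, not the actual {{BlockManager}} API), is to compare the live-replica count against {{minReplication}} first and fall back to counting live datanodes only in the rare case where that check fails:

```java
// Sketch under simplified types: short-circuit before touching the
// synchronized live-datanode count, which is the common case.
public class SufficientReplicationSketch {
    interface LiveNodeCounter { int getNumLiveDataNodes(); } // expensive call

    static boolean isSufficientlyReplicated(
            int liveReplicas, int minReplication, LiveNodeCounter counter) {
        if (liveReplicas >= minReplication) {
            return true; // common case: no need to count live datanodes
        }
        // Rare case: a tiny cluster may have fewer live DNs than minReplication.
        return liveReplicas >= Math.min(minReplication,
                counter.getNumLiveDataNodes());
    }
}
```

The result is identical to the original `Math.min` formulation, but the synchronized scan of `datanodeMap` is skipped whenever the block already has enough live replicas.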
[jira] [Created] (HDFS-14346) EditLogTailer loses precision for sub-second edit log tailing and rolling interval
Chao Sun created HDFS-14346:
-------------------------------

    Summary: EditLogTailer loses precision for sub-second edit log tailing and rolling interval
        Key: HDFS-14346
        URL: https://issues.apache.org/jira/browse/HDFS-14346
    Project: Hadoop HDFS
 Issue Type: Bug
 Components: namenode
   Reporter: Chao Sun
   Assignee: Chao Sun

{{EditLogTailer}} currently uses the following:

{code}
logRollPeriodMs = conf.getTimeDuration(
    DFSConfigKeys.DFS_HA_LOGROLL_PERIOD_KEY,
    DFSConfigKeys.DFS_HA_LOGROLL_PERIOD_DEFAULT, TimeUnit.SECONDS) * 1000;
sleepTimeMs = conf.getTimeDuration(
    DFSConfigKeys.DFS_HA_TAILEDITS_PERIOD_KEY,
    DFSConfigKeys.DFS_HA_TAILEDITS_PERIOD_DEFAULT, TimeUnit.SECONDS) * 1000;
{code}

to determine the edit log roll and tail frequency. However, if the user specifies a sub-second frequency such as {{100ms}}, it loses precision and becomes 0. This is not ideal for some scenarios such as standby reads (HDFS-12943).
[jira] [Created] (HDFS-14305) Serial number in BlockTokenSecretManager could overlap between different namenodes
Chao Sun created HDFS-14305:
-------------------------------

    Summary: Serial number in BlockTokenSecretManager could overlap between different namenodes
        Key: HDFS-14305
        URL: https://issues.apache.org/jira/browse/HDFS-14305
    Project: Hadoop HDFS
 Issue Type: Improvement
 Components: security
   Reporter: Chao Sun
   Assignee: Chao Sun

Currently, a {{BlockTokenSecretManager}} starts with a random integer as the initial serial number, and then uses this formula to rotate it:

{code:java}
this.intRange = Integer.MAX_VALUE / numNNs;
this.nnRangeStart = intRange * nnIndex;
this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
{code}

where {{numNNs}} is the total number of NameNodes in the cluster, and {{nnIndex}} is the index of the current NameNode specified in the configuration {{dfs.ha.namenodes.}}.

However, with this approach, different NameNodes could have overlapping ranges for the serial number. For simplicity, let's assume {{Integer.MAX_VALUE}} is 100, and we have 2 NameNodes {{nn1}} and {{nn2}} in the configuration. Then the ranges for these two are:

{code}
nn1 -> [-49, 49]
nn2 -> [1, 99]
{code}

This is because the initial serial number could be any negative integer. Moreover, when the keys are updated, the serial number is again updated with the formula:

{code}
this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
{code}

which means the new serial number could land in a range that belongs to a different NameNode, increasing the chance of collision again. When a collision happens, DataNodes could overwrite an existing key, which causes clients to fail with an {{InvalidToken}} error.
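The overlap can be reproduced numerically with the simplified {{Integer.MAX_VALUE}} = 100 from the description. The `rotate` method below mirrors the rotation formula as a standalone sketch (not the actual {{BlockTokenSecretManager}} code); because Java's `%` keeps the sign of the dividend, a negative initial serial number pushes nn1's range below zero, and nn2's low end overlaps nn1's positive values.

```java
public class SerialRangeOverlap {
    /** Mirrors: serialNo = (serialNo % intRange) + nnRangeStart. */
    static int rotate(int serialNo, int numNNs, int nnIndex, int maxValue) {
        int intRange = maxValue / numNNs;       // 100 / 2 = 50
        int nnRangeStart = intRange * nnIndex;  // nn1 -> 0, nn2 -> 50
        // Java's % is signed, so a negative serialNo stays negative here.
        return (serialNo % intRange) + nnRangeStart;
    }
}
```

With two NameNodes, nn1 (index 0) maps into [-49, 49] and nn2 (index 1) maps into [1, 99]; the value 1 is reachable by both, which is exactly the collision described above.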
[jira] [Created] (HDFS-14250) [Standby Reads] msync should sync with active NameNode to fetch the latest stateID
Chao Sun created HDFS-14250:
-------------------------------

    Summary: [Standby Reads] msync should sync with active NameNode to fetch the latest stateID
        Key: HDFS-14250
        URL: https://issues.apache.org/jira/browse/HDFS-14250
    Project: Hadoop HDFS
 Issue Type: Improvement
 Components: namenode
   Reporter: Chao Sun
   Assignee: Chao Sun

Currently the {{msync}} call is a dummy operation against the observer, without really syncing. Instead, it should:
# Get the latest stateID from the active NN.
# Use the stateID to talk to the observer NN and make sure it is synced.
[jira] [Resolved] (HDFS-14168) Fix TestWebHdfsTimeouts
[ https://issues.apache.org/jira/browse/HDFS-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved HDFS-14168.
-----------------------------
    Resolution: Duplicate

> Fix TestWebHdfsTimeouts
> -----------------------
>
>                 Key: HDFS-14168
>                 URL: https://issues.apache.org/jira/browse/HDFS-14168
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: webhdfs
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>
> The test TestWebHdfsTimeouts keeps failing.
[jira] [Created] (HDFS-14168) Fix TestWebHdfsTimeouts
Chao Sun created HDFS-14168:
-------------------------------

    Summary: Fix TestWebHdfsTimeouts
        Key: HDFS-14168
        URL: https://issues.apache.org/jira/browse/HDFS-14168
    Project: Hadoop HDFS
 Issue Type: Test
 Components: webhdfs
   Reporter: Chao Sun
   Assignee: Chao Sun

The test TestWebHdfsTimeouts keeps failing.
[jira] [Created] (HDFS-14154) Add explanation for dfs.ha.tail-edits.period in user guide.
Chao Sun created HDFS-14154:
-------------------------------

    Summary: Add explanation for dfs.ha.tail-edits.period in user guide.
        Key: HDFS-14154
        URL: https://issues.apache.org/jira/browse/HDFS-14154
    Project: Hadoop HDFS
 Issue Type: New Feature
 Components: documentation
   Reporter: Chao Sun
   Assignee: Chao Sun

We should document {{dfs.ha.tail-edits.period}} in the user guide. The default value is too large for {{ObserverReadProxyProvider}}, and we should recommend a value. We can also address some remaining issues from HDFS-14131.
[jira] [Created] (HDFS-14146) Handle exception from internalQueueCall
Chao Sun created HDFS-14146:
-------------------------------

    Summary: Handle exception from internalQueueCall
        Key: HDFS-14146
        URL: https://issues.apache.org/jira/browse/HDFS-14146
    Project: Hadoop HDFS
 Issue Type: Sub-task
 Components: ipc
   Reporter: Chao Sun
   Assignee: Chao Sun

When we re-queue an RPC call, {{internalQueueCall}} can throw exceptions (e.g., on RPC backoff), which are then swallowed. This causes the RPC to be silently discarded without a response to the client, which is not good.
[jira] [Created] (HDFS-14110) NPE when serving http requests while NameNode is starting up
Chao Sun created HDFS-14110:
-------------------------------

    Summary: NPE when serving http requests while NameNode is starting up
        Key: HDFS-14110
        URL: https://issues.apache.org/jira/browse/HDFS-14110
    Project: Hadoop HDFS
 Issue Type: Improvement
 Affects Versions: 2.8.2
   Reporter: Chao Sun
   Assignee: Chao Sun

In 2.8.2 we saw this exception when a security-enabled NameNode is still loading edits:

{code:java}
2018-11-28 00:21:02,909 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2068ms
GC pool 'ParNew' had collection(s): count=1 time=2325ms
2018-11-28 00:21:05,768 WARN org.apache.hadoop.hdfs.web.resources.ExceptionHandler: INTERNAL_SERVER_ERROR
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.common.JspHelper.getTokenUGI(JspHelper.java:283)
    at org.apache.hadoop.hdfs.server.common.JspHelper.getUGI(JspHelper.java:226)
    at org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:54)
    at org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:42)
    at com.sun.jersey.server.impl.inject.InjectableValuesProvider.getInjectableValues(InjectableValuesProvider.java:46)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$EntityParamInInvoker.getParams(AbstractResourceMethodDispatchProvider.java:153)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:203)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:87)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1353)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
{code}

Looking at the code, this is where the NPE happened (the line with
[jira] [Created] (HDFS-14067) Allow manual failover between standby and observer
Chao Sun created HDFS-14067:
-------------------------------

    Summary: Allow manual failover between standby and observer
        Key: HDFS-14067
        URL: https://issues.apache.org/jira/browse/HDFS-14067
    Project: Hadoop HDFS
 Issue Type: Sub-task
   Reporter: Chao Sun
   Assignee: Chao Sun

Currently, if automatic failover is enabled in an HA environment, the transition from standby to observer is blocked:

{code}
[hdfs@*** hadoop-3.3.0-SNAPSHOT]$ bin/hdfs haadmin -transitionToObserver ha2
Automatic failover is enabled for NameNode at
Refusing to manually manage HA state, since it may cause a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please specify the --forcemanual flag.
{code}

We should allow manual transition between standby and observer in this case.
[jira] [Created] (HDFS-13931) Backport HDFS-6440 to branch 2.8.2
Chao Sun created HDFS-13931:
-------------------------------

    Summary: Backport HDFS-6440 to branch 2.8.2
        Key: HDFS-13931
        URL: https://issues.apache.org/jira/browse/HDFS-13931
    Project: Hadoop HDFS
 Issue Type: Bug
   Reporter: Chao Sun
   Assignee: Chao Sun

Currently HDFS-6440 is only in branch-3. This aims at backporting it to branch-2.8.2.
[jira] [Created] (HDFS-13924) Handle BlockMissingException when reading from observer
Chao Sun created HDFS-13924:
-------------------------------

    Summary: Handle BlockMissingException when reading from observer
        Key: HDFS-13924
        URL: https://issues.apache.org/jira/browse/HDFS-13924
    Project: Hadoop HDFS
 Issue Type: Sub-task
   Reporter: Chao Sun

Internally we found that reading from an ObserverNode may result in a {{BlockMissingException}}. This may happen when the observer sees a smaller number of DNs than the active (maybe due to a communication issue with those DNs), or (we guess) late block reports from some DNs to the observer.

This error happens in [DFSInputStream#chooseDataNode|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L846], when no valid DN can be found for the {{LocatedBlock}} obtained from the NN side.

One potential solution (although a little hacky) is to ask the {{DFSInputStream}} to retry the active when this happens. The retry logic is already present in the code; we just have to dynamically set a flag to ask the {{ObserverReadProxyProvider}} to try the active in this case.

cc [~shv], [~xkrogen], [~vagarychen], [~zero45] for discussion.
[jira] [Created] (HDFS-13898) Throw retriable exception for getBlockLocations when ObserverNameNode is in safemode
Chao Sun created HDFS-13898:
-------------------------------

    Summary: Throw retriable exception for getBlockLocations when ObserverNameNode is in safemode
        Key: HDFS-13898
        URL: https://issues.apache.org/jira/browse/HDFS-13898
    Project: Hadoop HDFS
 Issue Type: Sub-task
   Reporter: Chao Sun

When the ObserverNameNode is in safe mode, {{getBlockLocations}} may throw a safe mode exception if the given file doesn't have any block yet.

{code}
try {
  checkOperation(OperationCategory.READ);
  res = FSDirStatAndListingOp.getBlockLocations(
      dir, pc, srcArg, offset, length, true);
  if (isInSafeMode()) {
    for (LocatedBlock b : res.blocks.getLocatedBlocks()) {
      // if safemode & no block locations yet then throw safemodeException
      if ((b.getLocations() == null) || (b.getLocations().length == 0)) {
        SafeModeException se = new SafeModeException(
            "Zero blocklocations for " + srcArg);
        if (haEnabled && haContext != null &&
            haContext.getState().getServiceState() == HAServiceState.ACTIVE) {
          throw new RetriableException(se);
        } else {
          throw se;
        }
      }
    }
  }
}
{code}

It only throws {{RetriableException}} for the active NN, so requests served by an observer may just fail.
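A minimal sketch of one possible fix, using simplified stand-in types rather than the actual Hadoop classes: treat an observer like the active when deciding whether to wrap the {{SafeModeException}} in a {{RetriableException}}, so the client retries instead of failing outright.

```java
// Sketch only: the enum below stands in for the real HAServiceState;
// the actual condition lives inside FSNamesystem's getBlockLocations path.
public class SafeModeRetrySketch {
    enum HAServiceState { ACTIVE, STANDBY, OBSERVER }

    /**
     * One possible fix: wrap as retriable on the observer as well as the
     * active, instead of only on the active.
     */
    static boolean shouldWrapAsRetriable(boolean haEnabled, HAServiceState state) {
        return haEnabled
            && (state == HAServiceState.ACTIVE
                || state == HAServiceState.OBSERVER);
    }
}
```

With this predicate, an observer in safe mode would surface a `RetriableException` for zero-location blocks, and the client's existing retry machinery would take over.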
[jira] [Created] (HDFS-13814) Remove super user privilege requirement for HAServiceProtocol#getServiceStatus
Chao Sun created HDFS-13814:
-------------------------------

    Summary: Remove super user privilege requirement for HAServiceProtocol#getServiceStatus
        Key: HDFS-13814
        URL: https://issues.apache.org/jira/browse/HDFS-13814
    Project: Hadoop HDFS
 Issue Type: Improvement
   Reporter: Chao Sun
   Assignee: Chao Sun

See details [in the discussion of HDFS-13749|https://issues.apache.org/jira/browse/HDFS-13749?focusedCommentId=16568693=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16568693]. Currently {{HAServiceProtocol#getServiceStatus}} requires super user privilege, which doesn't seem necessary. For comparison, {{DFSAdmin#report}}, as well as {{SAFEMODE_GET}}, doesn't require super user privilege.
[jira] [Created] (HDFS-13792) Fix FSN read/write lock metrics name
Chao Sun created HDFS-13792:
-------------------------------

    Summary: Fix FSN read/write lock metrics name
        Key: HDFS-13792
        URL: https://issues.apache.org/jira/browse/HDFS-13792
    Project: Hadoop HDFS
 Issue Type: Bug
 Components: documentation, metrics
   Reporter: Chao Sun
   Assignee: Chao Sun

The metrics name for the FSN read/write lock should be in the format:

{code}
FSN(Read|Write)Lock`*OperationName*`NanosNumOps
{code}

not

{code}
FSN(Read|Write)Lock`*OperationName*`NumOps
{code}
[jira] [Created] (HDFS-13749) Implement a new client protocol method to get NameNode state
Chao Sun created HDFS-13749:
-------------------------------

    Summary: Implement a new client protocol method to get NameNode state
        Key: HDFS-13749
        URL: https://issues.apache.org/jira/browse/HDFS-13749
    Project: Hadoop HDFS
 Issue Type: Sub-task
   Reporter: Chao Sun
   Assignee: Chao Sun

Currently {{HAServiceProtocol#getServiceStatus}} requires super user privilege. Therefore, as a temporary solution, in HDFS-12976 we discover the NameNode state by calling {{reportBadBlocks}}. Here, we'll implement this properly by adding a new method to the client protocol to get the NameNode state.
[jira] [Created] (HDFS-13735) Make QJM HTTP URL connection timeout configurable
Chao Sun created HDFS-13735:
-------------------------------

    Summary: Make QJM HTTP URL connection timeout configurable
        Key: HDFS-13735
        URL: https://issues.apache.org/jira/browse/HDFS-13735
    Project: Hadoop HDFS
 Issue Type: Improvement
 Components: qjm
   Reporter: Chao Sun
   Assignee: Chao Sun

We've seen "connect timed out" failures internally when the QJM tries to open HTTP connections to JNs. It currently uses {{newDefaultURLConnectionFactory}}, which applies the default timeout of 60s and is not configurable. It would be better for this to be configurable, especially for the ObserverNameNode (HDFS-12943), where latency is important and 60s may not be a good value.
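A hedged sketch of what "configurable" could look like, using a plain `Map` in place of Hadoop's `Configuration` and a hypothetical key name (`dfs.qjm.http.connect.timeout.ms` is an illustration, not an actual Hadoop property): resolve a millisecond timeout with a 60s default and apply it to the URL connection.

```java
import java.net.URLConnection;
import java.util.Map;

public class QjmTimeoutSketch {
    // Hypothetical key name, for illustration only.
    static final String TIMEOUT_KEY = "dfs.qjm.http.connect.timeout.ms";
    static final int DEFAULT_TIMEOUT_MS = 60_000; // current hard-coded default

    /** Resolve the timeout from config, falling back to the 60s default. */
    static int resolveTimeout(Map<String, String> conf) {
        String v = conf.get(TIMEOUT_KEY);
        return v == null ? DEFAULT_TIMEOUT_MS : Integer.parseInt(v);
    }

    /** Apply the resolved timeout to a JN HTTP connection. */
    static void applyTimeout(URLConnection conn, Map<String, String> conf) {
        int t = resolveTimeout(conf);
        conn.setConnectTimeout(t);
        conn.setReadTimeout(t);
    }
}
```

`setConnectTimeout`/`setReadTimeout` are standard `java.net.URLConnection` methods; the real fix would thread the value through the QJM's URL connection factory instead.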
[jira] [Created] (HDFS-13687) ConfiguredFailoverProxyProvider could direct requests to SBN
Chao Sun created HDFS-13687:
-------------------------------

    Summary: ConfiguredFailoverProxyProvider could direct requests to SBN
        Key: HDFS-13687
        URL: https://issues.apache.org/jira/browse/HDFS-13687
    Project: Hadoop HDFS
 Issue Type: Bug
   Reporter: Chao Sun
   Assignee: Chao Sun

In case there are multiple SBNs and {{dfs.ha.allow.stale.reads}} is set to true, failover could go to an SBN, which may then serve read requests from the client. This may not be the expected behavior. This issue arose while we were working on HDFS-12943 and HDFS-12976.

A better approach could be to check the {{HAServiceState}} and find the active NN when performing failover. This can also reduce the number of failovers the client has to do in case of multiple SBNs.
[jira] [Created] (HDFS-13674) Improve documentation on Metrics
Chao Sun created HDFS-13674:
-------------------------------

    Summary: Improve documentation on Metrics
        Key: HDFS-13674
        URL: https://issues.apache.org/jira/browse/HDFS-13674
    Project: Hadoop HDFS
 Issue Type: Improvement
 Components: documentation, metrics
   Reporter: Chao Sun
   Assignee: Chao Sun

There are a few confusing places in the [Hadoop Metrics page|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html]. For instance, there are duplicated entries such as {{FsImageLoadTime}}; some quantile metrics do not have corresponding entries; and the descriptions of some quantile metrics are not specific about what the {{num}} variable in the metrics name is. This JIRA aims to improve this.
[jira] [Created] (HDFS-13664) Refactor ConfiguredFailoverProxyProvider to make inheritance easier
Chao Sun created HDFS-13664:
-------------------------------

    Summary: Refactor ConfiguredFailoverProxyProvider to make inheritance easier
        Key: HDFS-13664
        URL: https://issues.apache.org/jira/browse/HDFS-13664
    Project: Hadoop HDFS
 Issue Type: Bug
 Components: hdfs-client
   Reporter: Chao Sun
   Assignee: Chao Sun

In HDFS-12943 we'd like to introduce a new proxy provider that inherits {{ConfiguredFailoverProxyProvider}}. Some refactoring is necessary to allow easier code sharing.
[jira] [Created] (HDFS-13641) Add metrics for edit log tailing
Chao Sun created HDFS-13641:
-------------------------------

    Summary: Add metrics for edit log tailing
        Key: HDFS-13641
        URL: https://issues.apache.org/jira/browse/HDFS-13641
    Project: Hadoop HDFS
 Issue Type: Sub-task
   Reporter: Chao Sun
[jira] [Resolved] (HDFS-13600) Add toString() for RemoteMethod
[ https://issues.apache.org/jira/browse/HDFS-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved HDFS-13600.
-----------------------------
    Resolution: Duplicate

> Add toString() for RemoteMethod
> -------------------------------
>
>                 Key: HDFS-13600
>                 URL: https://issues.apache.org/jira/browse/HDFS-13600
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Minor
>
> Saw messages like:
> {code}
> 2018-05-21 18:23:19,011 ERROR
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient: Invocation
> to "XXX" for
> "org.apache.hadoop.hdfs.server.federation.router.RemoteMethod@390c38d2" timed
> out
> {code}
> I think {{RemoteMethod}} needs a {{toString}} method.
[jira] [Created] (HDFS-13600) Add toString() for RemoteMethod
Chao Sun created HDFS-13600:
-------------------------------

    Summary: Add toString() for RemoteMethod
        Key: HDFS-13600
        URL: https://issues.apache.org/jira/browse/HDFS-13600
    Project: Hadoop HDFS
 Issue Type: Sub-task
   Reporter: Chao Sun
   Assignee: Chao Sun

Saw messages like:

{code}
2018-05-21 18:23:19,011 ERROR org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient: Invocation to "XXX" for "org.apache.hadoop.hdfs.server.federation.router.RemoteMethod@390c38d2" timed out
{code}

I think {{RemoteMethod}} needs a {{toString}} method.
[jira] [Created] (HDFS-13578) Add ReadOnly annotation to methods in ClientProtocol
Chao Sun created HDFS-13578: --- Summary: Add ReadOnly annotation to methods in ClientProtocol Key: HDFS-13578 URL: https://issues.apache.org/jira/browse/HDFS-13578 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Chao Sun Assignee: Chao Sun For those read-only methods in {{ClientProtocol}}, we may want to use a {{@ReadOnly}} annotation to mark them, and then check it in the proxy provider for the observer.
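The idea can be sketched as a runtime-retained marker annotation that a proxy provider inspects via reflection to decide whether a call may be served by an observer. All names below are illustrative assumptions, not the actual ClientProtocol API:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical @ReadOnly marker, retained at runtime so a proxy can see it.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface ReadOnly {}

// Stand-in for a couple of ClientProtocol-style methods.
interface ClientProtocolSketch {
  @ReadOnly
  String getFileInfo(String path);

  void delete(String path);
}

class ReadOnlyCheck {
  // True if the named method carries @ReadOnly, i.e. safe to route to an observer.
  static boolean isReadOnly(String name, Class<?>... argTypes)
      throws NoSuchMethodException {
    Method m = ClientProtocolSketch.class.getMethod(name, argTypes);
    return m.isAnnotationPresent(ReadOnly.class);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(isReadOnly("getFileInfo", String.class)); // true
    System.out.println(isReadOnly("delete", String.class));      // false
  }
}
```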
[jira] [Created] (HDFS-13286) Add haadmin commands to transition between standby and observer
Chao Sun created HDFS-13286: --- Summary: Add haadmin commands to transition between standby and observer Key: HDFS-13286 URL: https://issues.apache.org/jira/browse/HDFS-13286 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Chao Sun Assignee: Chao Sun As discussed in HDFS-12975, we should allow explicit transition between standby and observer through the haadmin command, such as: {code} haadmin -transitionToObserver {code} Initially we should support transition from observer to standby, and standby to observer.
[jira] [Resolved] (HDFS-13266) Fix TestWebHdfsTimeouts
[ https://issues.apache.org/jira/browse/HDFS-13266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-13266. - Resolution: Duplicate > Fix TestWebHdfsTimeouts > --- > > Key: HDFS-13266 > URL: https://issues.apache.org/jira/browse/HDFS-13266 > Project: Hadoop HDFS > Issue Type: Bug > Components: test, webhdfs >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > {{TestWebHdfsTimeouts}} fails on Linux, in my case, > {code} > Linux version 4.4.38 (jenkins@debbuilder02-sjc1) (gcc version 4.9.2 (Debian > 4.9.2-10) ) #1 SMP Mon Dec 12 09:01:31 UTC 2016 > {code} > However, the test succeeds on Mac. It seems this is due to a change in the backlog queue implementation since Linux 2.2. See > http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. > Therefore, > [{{consumeConnectionBacklog}}|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsTimeouts.java#L353] > doesn't work as intended under Linux. We should figure out a way to fix this > or disable the relevant tests.
[jira] [Resolved] (HDFS-13267) Fix TestWebHdfsTimeouts
[ https://issues.apache.org/jira/browse/HDFS-13267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved HDFS-13267. - Resolution: Duplicate Oops. Created this twice because of JIRA lag. Closing this one now. > Fix TestWebHdfsTimeouts > --- > > Key: HDFS-13267 > URL: https://issues.apache.org/jira/browse/HDFS-13267 > Project: Hadoop HDFS > Issue Type: Bug > Components: test, webhdfs >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > {{TestWebHdfsTimeouts}} fails on Linux, in my case, > {code} > Linux version 4.4.38 (jenkins@debbuilder02-sjc1) (gcc version 4.9.2 (Debian > 4.9.2-10) ) #1 SMP Mon Dec 12 09:01:31 UTC 2016 > {code} > However, the test succeeds on Mac. It seems this is due to a change in the backlog queue implementation since Linux 2.2. See > http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. > Therefore, > [{{consumeConnectionBacklog}}|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsTimeouts.java#L353] > doesn't work as intended under Linux. We should figure out a way to fix this > or disable the relevant tests.
[jira] [Created] (HDFS-13267) Fix TestWebHdfsTimeouts
Chao Sun created HDFS-13267: --- Summary: Fix TestWebHdfsTimeouts Key: HDFS-13267 URL: https://issues.apache.org/jira/browse/HDFS-13267 Project: Hadoop HDFS Issue Type: Bug Components: test, webhdfs Reporter: Chao Sun Assignee: Chao Sun {{TestWebHdfsTimeouts}} fails on Linux, in my case, {code} Linux version 4.4.38 (jenkins@debbuilder02-sjc1) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Mon Dec 12 09:01:31 UTC 2016 {code} However, the test succeeds on Mac. It seems this is due to a change in the backlog queue implementation since Linux 2.2. See http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. Therefore, [{{consumeConnectionBacklog}}|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsTimeouts.java#L353] doesn't work as intended under Linux. We should figure out a way to fix this or disable the relevant tests.
[jira] [Created] (HDFS-13266) Fix TestWebHdfsTimeouts
Chao Sun created HDFS-13266: --- Summary: Fix TestWebHdfsTimeouts Key: HDFS-13266 URL: https://issues.apache.org/jira/browse/HDFS-13266 Project: Hadoop HDFS Issue Type: Bug Components: test, webhdfs Reporter: Chao Sun Assignee: Chao Sun {{TestWebHdfsTimeouts}} fails on Linux, in my case, {code} Linux version 4.4.38 (jenkins@debbuilder02-sjc1) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Mon Dec 12 09:01:31 UTC 2016 {code} However, the test succeeds on Mac. It seems this is due to a change in the backlog queue implementation since Linux 2.2. See http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. Therefore, [{{consumeConnectionBacklog}}|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsTimeouts.java#L353] doesn't work as intended under Linux. We should figure out a way to fix this or disable the relevant tests.
[jira] [Created] (HDFS-13202) Fix javadoc in HAUtil and small refactoring
Chao Sun created HDFS-13202: --- Summary: Fix javadoc in HAUtil and small refactoring Key: HDFS-13202 URL: https://issues.apache.org/jira/browse/HDFS-13202 Project: Hadoop HDFS Issue Type: Bug Reporter: Chao Sun Assignee: Chao Sun There are a few outdated javadocs in {{HAUtil}}.
[jira] [Created] (HDFS-13189) Standby NameNode should roll active edit log when checkpointing
Chao Sun created HDFS-13189: --- Summary: Standby NameNode should roll active edit log when checkpointing Key: HDFS-13189 URL: https://issues.apache.org/jira/browse/HDFS-13189 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Chao Sun When the SBN is doing checkpointing, it will hold the {{cpLock}}. In the current implementation of the edit log tailer thread, it will first check and roll the active edit log, and then tail and apply edits. In the case of checkpointing, it will be blocked on the {{cpLock}} and will not roll the edit log. It seems there is no dependency between the edit log roll and tailing edits, so a better approach may be to do these in separate threads. This will be helpful for people who use the observer feature without in-progress edit log tailing. An alternative is to configure {{dfs.namenode.edit.log.autoroll.multiplier.threshold}} and {{dfs.namenode.edit.log.autoroll.check.interval.ms}} to let the ANN roll its own log more frequently in case the SBN is stuck on the lock.
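The alternative mentioned above could be sketched in hdfs-site.xml roughly as follows; the two keys are the ones named in the message, but the values below are illustrative assumptions, not tuned recommendations:

```xml
<!-- Sketch of the workaround above: make the ANN roll its own edit log
     more aggressively, so a lock-blocked SBN matters less.
     Values are examples only. -->
<property>
  <name>dfs.namenode.edit.log.autoroll.multiplier.threshold</name>
  <value>0.5</value>
</property>
<property>
  <name>dfs.namenode.edit.log.autoroll.check.interval.ms</name>
  <value>60000</value>
</property>
```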
[jira] [Created] (HDFS-13182) Allow Observer to participate in NameNode failover
Chao Sun created HDFS-13182: --- Summary: Allow Observer to participate in NameNode failover Key: HDFS-13182 URL: https://issues.apache.org/jira/browse/HDFS-13182 Project: Hadoop HDFS Issue Type: Sub-task Components: ha, namenode Reporter: Chao Sun As discussed in the design doc, when there is no SBN available, Observer should be eligible for namenode failover. See HDFS-12975 for some preliminary findings in this effort.
[jira] [Created] (HDFS-13152) Support observer reads for WebHDFS
Chao Sun created HDFS-13152: --- Summary: Support observer reads for WebHDFS Key: HDFS-13152 URL: https://issues.apache.org/jira/browse/HDFS-13152 Project: Hadoop HDFS Issue Type: Sub-task Components: ha, namenode, webhdfs Reporter: Chao Sun In the case of WebHDFS, a designated DN will launch a DFSClient and stream data to the WebHDFS client. The DFSClient will read the local configuration files (e.g., hdfs-site.xml) on the DN. So, in this case changes may be required for WebHDFS to enable observer reads.
[jira] [Created] (HDFS-13145) SBN crash when transition to ANN with in-progress edit tailing enabled
Chao Sun created HDFS-13145: --- Summary: SBN crash when transition to ANN with in-progress edit tailing enabled Key: HDFS-13145 URL: https://issues.apache.org/jira/browse/HDFS-13145 Project: Hadoop HDFS Issue Type: Bug Components: ha, namenode Reporter: Chao Sun Assignee: Chao Sun With in-progress edit log tailing enabled, {{QuorumOutputStream}} will send two batches to the JNs, one normal edit batch followed by a dummy batch to update the commit ID on the JNs.
{code}
QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
    segmentTxId, firstTxToFlush,
    numReadyTxns, data);
loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");

// Since we successfully wrote this batch, let the loggers know. Any future
// RPCs will thus let the loggers know of the most recent transaction, even
// if a logger has fallen behind.
loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);

// If we don't have this dummy send, committed TxId might be one-batch
// stale on the Journal Nodes
if (updateCommittedTxId) {
  QuorumCall<AsyncLogger, Void> fakeCall = loggers.sendEdits(
      segmentTxId, firstTxToFlush,
      0, new byte[0]);
  loggers.waitForWriteQuorum(fakeCall, writeTimeoutMs, "sendEdits");
}
{code}
Between batches, it will wait for the JNs to reach a quorum. However, if the ANN crashes in between, the SBN will crash while transitioning to ANN:
{code}
java.lang.IllegalStateException: Cannot start writing at txid 24312595802 when there is a stream available for read: .. at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:329) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1196) at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1839) at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61) at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49) at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1707) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1622) at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:851) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:794) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2490) 2018-02-13 00:43:20,728 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
This is because without the dummy batch, the {{commitTxnId}} will lag behind the {{endTxId}}, which causes the check in {{openForWrite}} to fail:
{code}
List<EditLogInputStream> streams = new ArrayList<>();
journalSet.selectInputStreams(streams, segmentTxId, true, false);
if (!streams.isEmpty()) {
  String error = String.format("Cannot start writing at txid %s " +
      "when there is a stream available for read: %s",
      segmentTxId, streams.get(0));
  IOUtils.cleanupWithLogger(LOG,
      streams.toArray(new EditLogInputStream[0]));
  throw new IllegalStateException(error);
}
{code}
In our environment, this can be reproduced pretty consistently, leaving the cluster with no running namenodes. Even though we are using a 2.8.2 backport, I believe the same issue also exists in 3.0.x.
[jira] [Created] (HDFS-12932) Confusing LOG message for block replication
Chao Sun created HDFS-12932: --- Summary: Confusing LOG message for block replication Key: HDFS-12932 URL: https://issues.apache.org/jira/browse/HDFS-12932 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 2.8.3 Reporter: Chao Sun Assignee: Chao Sun Priority: Minor In our cluster we see a large number of log messages such as the following: {code} 2017-12-15 22:55:54,603 INFO org.apache.hadoop.hdfs.server.namenode.FSDirectory: Increasing replication from 3 to 3 for {code} This is a little confusing since "from 3 to 3" is not "increasing". Digging into it, it seems related to this piece of code:
{code}
if (oldBR != -1) {
  if (oldBR > targetReplication) {
    FSDirectory.LOG.info("Decreasing replication from {} to {} for {}",
        oldBR, targetReplication, iip.getPath());
  } else {
    FSDirectory.LOG.info("Increasing replication from {} to {} for {}",
        oldBR, targetReplication, iip.getPath());
  }
}
{code}
Perhaps a {{oldBR == targetReplication}} case is missing?
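The missing case could be handled along these lines; the method below is a simplified, hypothetical stand-in for the FSDirectory logging code, not the actual fix:

```java
// Sketch: treat oldBR == targetReplication as its own case so the log
// never claims to "increase" replication from 3 to 3.
class ReplicationLogSketch {
  static String message(int oldBR, int targetReplication, String path) {
    if (oldBR == targetReplication) {
      return String.format("Replication unchanged at %d for %s",
          targetReplication, path);
    } else if (oldBR > targetReplication) {
      return String.format("Decreasing replication from %d to %d for %s",
          oldBR, targetReplication, path);
    } else {
      return String.format("Increasing replication from %d to %d for %s",
          oldBR, targetReplication, path);
    }
  }

  public static void main(String[] args) {
    System.out.println(message(3, 3, "/user/foo")); // unchanged, not "increasing"
    System.out.println(message(2, 3, "/user/foo")); // genuinely increasing
  }
}
```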
[jira] [Created] (HDFS-12836) startTxId could be greater than endTxId when tailing in-progress edit log
Chao Sun created HDFS-12836: --- Summary: startTxId could be greater than endTxId when tailing in-progress edit log Key: HDFS-12836 URL: https://issues.apache.org/jira/browse/HDFS-12836 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Reporter: Chao Sun Assignee: Chao Sun When {{dfs.ha.tail-edits.in-progress}} is true, the edit log tailer will also tail in-progress edit log segments. However, in the following code:
{code}
if (onlyDurableTxns && inProgressOk) {
  endTxId = Math.min(endTxId, committedTxnId);
}

EditLogInputStream elis = EditLogFileInputStream.fromUrl(
    connectionFactory, url, remoteLog.getStartTxId(),
    endTxId, remoteLog.isInProgress());
{code}
it is possible that {{remoteLog.getStartTxId()}} is greater than {{endTxId}}, which will cause the following error: {code} 2017-11-17 19:55:41,165 ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: Error replaying edit log at offset 1048576. Expected transaction ID was 87 Recent opcode offsets: 1048576 org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 86; expected file to go up to 85 at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:197) at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85) at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:189) at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:205) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863) at
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:393) 2017-11-17 19:55:41,165 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Error while reading edits from disk. Will try again. org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 1048576. Expected transaction ID was 87 Recent opcode offsets: 1048576 at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:218) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:393) Caused by: 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 86; expected file to go up to 85 at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:197) at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85) at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:189) at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:205) ... 9
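The guard implied by the report above can be sketched as follows: after capping {{endTxId}} at the durable {{committedTxnId}}, a segment whose start transaction is already past the cap should be skipped rather than turned into an input stream. The method and names below are illustrative assumptions, not the actual QuorumJournalManager code:

```java
// Sketch of a range check for in-progress edit log tailing.
class TxIdRangeSketch {
  // True if [startTxId, endTxId] still contains something worth reading
  // once the durability cap has been applied.
  static boolean shouldRead(long startTxId, long endTxId,
      long committedTxnId, boolean inProgress, boolean onlyDurableTxns) {
    if (inProgress && onlyDurableTxns) {
      endTxId = Math.min(endTxId, committedTxnId);
    }
    return startTxId <= endTxId;
  }

  public static void main(String[] args) {
    // Segment starts at 87 but only txns up to 85 are durable: skip it.
    System.out.println(shouldRead(87, 90, 85, true, true)); // false
    // Fully durable segment: read it.
    System.out.println(shouldRead(80, 85, 85, true, true)); // true
  }
}
```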
[jira] [Created] (HDFS-12669) Implement toString() for EditLogInputStream
Chao Sun created HDFS-12669: --- Summary: Implement toString() for EditLogInputStream Key: HDFS-12669 URL: https://issues.apache.org/jira/browse/HDFS-12669 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Chao Sun Priority: Minor Currently {{EditLogInputStream}} has {{getName()}} but doesn't implement {{toString()}}. The latter could be useful in debugging. Right now it just prints out messages like: {code} 2017-10-16 20:41:13,456 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@1eb6749b expecting start txid #8137 {code}