[jira] [Resolved] (HDFS-16686) GetJournalEditServlet fails to authorize valid Kerberos request

2022-09-13 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-16686.
-
Fix Version/s: 3.3.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

> GetJournalEditServlet fails to authorize valid Kerberos request
> ---
>
> Key: HDFS-16686
> URL: https://issues.apache.org/jira/browse/HDFS-16686
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node
>Affects Versions: 3.4.0, 3.3.9
> Environment: Running in Kubernetes using Java 11 in an HA 
> configuration.  JournalNodes run on separate pods and have their own Kerberos 
> principal "jn/@".
>Reporter: Steve Vaughan
>Assignee: Steve Vaughan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> GetJournalEditServlet uses request.getRemoteUser() to determine the 
> remoteShortName for Kerberos authorization, which fails to match when the 
> JournalNode uses its own Kerberos principal (e.g. jn/@).
> This can be fixed by using the UserGroupInformation provided by the base 
> DfsServlet class using the getUGI(request, conf) call.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-4043) Namenode Kerberos Login does not use proper hostname for host qualified hdfs principal name.

2022-08-17 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-4043.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Namenode Kerberos Login does not use proper hostname for host qualified hdfs 
> principal name.
> 
>
> Key: HDFS-4043
> URL: https://issues.apache.org/jira/browse/HDFS-4043
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 2.0.3-alpha, 
> 3.4.0, 3.3.9
> Environment: CDH4U1 on Ubuntu 12.04
>Reporter: Ahad Rana
>Assignee: Steve Vaughan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>   Original Estimate: 24h
>  Time Spent: 50m
>  Remaining Estimate: 23h 10m
>
> The NameNode uses the loginAsNameNodeUser method in NameNode.java to log in 
> using the hdfs principal. This method in turn invokes SecurityUtil.login with 
> a hostname (last parameter) obtained via a call to InetAddress.getHostName. 
> This call does not always return the fully qualified host name, which causes 
> the NameNode login to fail because Kerberos cannot find a matching hdfs 
> principal in the hdfs.keytab file. It should use 
> InetAddress.getCanonicalHostName instead, which is consistent with what 
> SecurityUtil.java uses internally to log in other services, such as the 
> DataNode. 
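
The difference between the two lookups can be sketched with the JDK alone. HostPrincipal and substituteHost below are hypothetical names used only for illustration, not Hadoop APIs. The "_HOST" placeholder and the lower-casing are assumptions modelled on how host-qualified principals are commonly configured, and what either lookup returns depends on the local resolver setup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical helper: a simplified stand-in for host substitution in a
// host-qualified principal such as "hdfs/_HOST@EXAMPLE.COM".
final class HostPrincipal {

    // Replaces the "_HOST" placeholder with the lower-cased host name.
    static String substituteHost(String principalPattern, String hostname) {
        return principalPattern.replace("_HOST", hostname.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws UnknownHostException {
        InetAddress local = InetAddress.getLocalHost();
        String pattern = "hdfs/_HOST@EXAMPLE.COM"; // example realm, an assumption

        // getHostName() may return a short name, depending on the resolver.
        System.out.println(substituteHost(pattern, local.getHostName()));

        // getCanonicalHostName() asks for the fully qualified name, which is
        // what a keytab entry is normally keyed on.
        System.out.println(substituteHost(pattern, local.getCanonicalHostName()));
    }
}
```

If the two printed principals differ on a given host, only the second is likely to match a keytab entry created for the fully qualified name.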






[jira] [Resolved] (HDFS-16702) MiniDFSCluster should report cause of exception in assertion error

2022-08-11 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-16702.
-
Hadoop Flags: Reviewed
  Resolution: Fixed

> MiniDFSCluster should report cause of exception in assertion error
> --
>
> Key: HDFS-16702
> URL: https://issues.apache.org/jira/browse/HDFS-16702
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
> Environment: Tests running in the Hadoop dev environment image.
>Reporter: Steve Vaughan
>Assignee: Steve Vaughan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> When the MiniDFSCluster detects that an exception caused an exit, it should 
> include that exception as the cause for the AssertionError that it throws.  
> The current AssertionError simply reports the message "Test resulted in an 
> unexpected exit" and provides a stack trace to the location of the check for 
> an exit exception.
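
A minimal sketch of the requested behaviour, using only the JDK. ExitCheck and assertNoUnexpectedExit are hypothetical names, not MiniDFSCluster methods; the point is the two-argument AssertionError constructor, which attaches the original exception as the cause.

```java
// Hypothetical illustration of chaining the exit exception as the cause.
final class ExitCheck {

    // Throws an AssertionError whose cause is the exception behind the exit,
    // so the test report shows where the exit actually originated.
    static void assertNoUnexpectedExit(Throwable exitException) {
        if (exitException != null) {
            throw new AssertionError(
                "Test resulted in an unexpected exit", exitException);
        }
    }

    public static void main(String[] args) {
        try {
            assertNoUnexpectedExit(new IllegalStateException("simulated exit"));
        } catch (AssertionError e) {
            // The cause now carries the original stack trace.
            System.out.println(e.getMessage() + " / cause: " + e.getCause());
        }
    }
}
```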






[jira] [Resolved] (HDFS-16507) [SBN read] Avoid purging edit log which is in progress

2022-03-31 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-16507.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> [SBN read] Avoid purging edit log which is in progress
> --
>
> Key: HDFS-16507
> URL: https://issues.apache.org/jira/browse/HDFS-16507
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: tomscut
>Assignee: tomscut
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> We introduced the [Standby Read] feature in branch-3.1.0, but found a FATAL 
> exception. It appears that an edit log that is still in progress is being 
> purged.
> According to the analysis, I suspect that the in-progress edit log being 
> purged (after the SNN checkpoint) is not finalized (see HDFS-14317) before the 
> ANN itself rolls the edit log. 
> The stack:
> {code:java}
> java.lang.Thread.getStackTrace(Thread.java:1552)
>     org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
>     org.apache.hadoop.hdfs.server.namenode.FileJournalManager.purgeLogsOlderThan(FileJournalManager.java:185)
>     org.apache.hadoop.hdfs.server.namenode.JournalSet$5.apply(JournalSet.java:623)
>     org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:388)
>     org.apache.hadoop.hdfs.server.namenode.JournalSet.purgeLogsOlderThan(JournalSet.java:620)
>     org.apache.hadoop.hdfs.server.namenode.FSEditLog.purgeLogsOlderThan(FSEditLog.java:1512)
>     org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:177)
>     org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:1249)
>     org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:617)
>     org.apache.hadoop.hdfs.server.namenode.ImageServlet$2.run(ImageServlet.java:516)
>     java.security.AccessController.doPrivileged(Native Method)
>     javax.security.auth.Subject.doAs(Subject.java:422)
>     org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:515)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     org.eclipse.jetty.server.Server.handle(Server.java:539)
>     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>     org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>     org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>     
> 

[jira] [Resolved] (HDFS-16410) Insecure Xml parsing in OfflineEditsXmlLoader

2022-01-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-16410.
-
Fix Version/s: 3.4.0
   3.3.2
   Resolution: Fixed

> Insecure Xml parsing in OfflineEditsXmlLoader 
> --
>
> Key: HDFS-16410
> URL: https://issues.apache.org/jira/browse/HDFS-16410
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.1
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Minor
>  Labels: pull-request-available, security
> Fix For: 3.4.0, 3.3.2
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Insecure Xml parsing in OfflineEditsXmlLoader 
> [https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineEditsViewer/OfflineEditsXmlLoader.java#L88]
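
For readers unfamiliar with the class of problem, the following is a minimal sketch of a hardened SAX reader built with the JDK's own parser. It is not necessarily the change made for this issue; HardenedXml, newHardenedReader and parses are hypothetical names, and the sample documents are invented for the demonstration. The feature URIs are the standard JAXP/Xerces ones for disabling DOCTYPE declarations and external entities.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of XXE-resistant SAX parsing.
final class HardenedXml {

    // Builds an XMLReader that rejects DOCTYPE declarations and never
    // resolves external entities.
    static XMLReader newHardenedReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            factory.setFeature(
                "http://xml.org/sax/features/external-general-entities", false);
            factory.setFeature(
                "http://xml.org/sax/features/external-parameter-entities", false);
            return factory.newSAXParser().getXMLReader();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("parser cannot be hardened", e);
        }
    }

    // Returns true if the document parses, false if the parser rejects it.
    static boolean parses(String xml) {
        XMLReader reader = newHardenedReader();
        DefaultHandler handler = new DefaultHandler();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler); // fatal errors are rethrown
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<EDITS><RECORD/></EDITS>"));
        System.out.println(parses(
            "<!DOCTYPE EDITS [<!ENTITY x \"y\">]><EDITS>&x;</EDITS>"));
    }
}
```

With these features set, a document that carries a DOCTYPE is rejected before any entity can be expanded, while plain documents parse as before.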






[jira] [Resolved] (HDFS-15754) Create packet metrics for DataNode

2021-01-07 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-15754.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Create packet metrics for DataNode
> --
>
> Key: HDFS-15754
> URL: https://issues.apache.org/jira/browse/HDFS-15754
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> In BlockReceiver, slowness in writeToMirror, writeToDisk and writeToOsCache 
> is currently only dumped to the debug log. In practice we have found these 
> to be quite useful signals for detecting issues in the DataNode, so it would 
> be great if these metrics could be exposed via JMX.
> We also introduced a totalPacket received count so that a percentage can be 
> used as the signal for a potentially underperforming datanode, since 
> datanodes across one HDFS cluster may receive different total numbers of 
> packets.
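
The percentage idea can be sketched as follows. PacketStats, recordPacket and the threshold are hypothetical names and values, not the DataNode metrics added by this issue; the sketch only shows why a ratio against the total packet count is comparable across nodes with different load.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration: count slow mirror writes against all packets.
final class PacketStats {
    private final LongAdder totalPackets = new LongAdder();
    private final LongAdder slowMirrorWrites = new LongAdder();
    private final long slowThresholdMs;

    PacketStats(long slowThresholdMs) {
        this.slowThresholdMs = slowThresholdMs;
    }

    // Records one received packet and whether its mirror write was slow.
    void recordPacket(long mirrorWriteMs) {
        totalPackets.increment();
        if (mirrorWriteMs > slowThresholdMs) {
            slowMirrorWrites.increment();
        }
    }

    // Percentage of packets whose mirror write exceeded the threshold.
    double slowMirrorPercent() {
        long total = totalPackets.sum();
        return total == 0 ? 0.0 : 100.0 * slowMirrorWrites.sum() / total;
    }

    public static void main(String[] args) {
        PacketStats stats = new PacketStats(300); // threshold is an assumption
        long[] samples = {100, 400, 50, 500};
        for (long ms : samples) {
            stats.recordPacket(ms);
        }
        System.out.println(stats.slowMirrorPercent());
    }
}
```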






[jira] [Resolved] (HDFS-15708) TestURLConnectionFactory fails by NoClassDefFoundError in branch-3.3 and branch-3.2

2020-12-04 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-15708.
-
Fix Version/s: 3.3.1
   3.2.2
 Hadoop Flags: Reviewed
   Resolution: Fixed

> TestURLConnectionFactory fails by NoClassDefFoundError in branch-3.3 and 
> branch-3.2
> ---
>
> Key: HDFS-15708
> URL: https://issues.apache.org/jira/browse/HDFS-15708
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Chao Sun
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 3.2.2, 3.3.1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> TestURLConnectionFactory#testSSLFactoryCleanup fails:
> {noformat}
> [ERROR] testSSLFactoryCleanup(org.apache.hadoop.hdfs.web.TestURLConnectionFactory)  Time elapsed: 0.28 s  <<< ERROR!
> java.lang.NoClassDefFoundError: org/bouncycastle/x509/X509V1CertificateGenerator
>     at org.apache.hadoop.security.ssl.KeyStoreTestUtil.generateCertificate(KeyStoreTestUtil.java:86)
>     at org.apache.hadoop.security.ssl.KeyStoreTestUtil.setupSSLConfig(KeyStoreTestUtil.java:273)
>     at org.apache.hadoop.security.ssl.KeyStoreTestUtil.setupSSLConfig(KeyStoreTestUtil.java:228)
>     at org.apache.hadoop.hdfs.web.TestURLConnectionFactory.testSSLFactoryCleanup(TestURLConnectionFactory.java:83)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>     at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>     at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>     at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>     at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>     at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: java.lang.ClassNotFoundException: org.bouncycastle.x509.X509V1CertificateGenerator
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>     ... 29 more
> {noformat}






[jira] [Resolved] (HDFS-15690) Add lz4-java as hadoop-hdfs test dependency

2020-11-21 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-15690.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. I'll backport this to 3.3 branch together with HADOOP-17292 
later.

> Add lz4-java as hadoop-hdfs test dependency
> ---
>
> Key: HDFS-15690
> URL: https://issues.apache.org/jira/browse/HDFS-15690
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: L. C. Hsieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> TestFSImage.testNativeCompression fails with "java.lang.NoClassDefFoundError: 
> net/jpountz/lz4/LZ4Factory":
> https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/305/testReport/junit/org.apache.hadoop.hdfs.server.namenode/TestFSImage/testNativeCompression/
> We need to add lz4-java to hadoop-hdfs test dependency.






[jira] [Created] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature

2020-09-25 Thread Chao Sun (Jira)
Chao Sun created HDFS-15601:
---

 Summary: Batch listing: gracefully fallback to use non-batched 
listing when NameNode doesn't support the feature
 Key: HDFS-15601
 URL: https://issues.apache.org/jira/browse/HDFS-15601
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs
Reporter: Chao Sun


HDFS-13616 requires both server-side and client-side changes. However, it is 
common for users to use a newer client to talk to an older HDFS (say 2.10). 
Currently the client will simply fail in this scenario. A better approach, 
perhaps, is to have the client fall back to non-batched listing on the input 
directories.
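
The fallback proposed above could look roughly like the sketch below. ListingClient and FallbackLister are hypothetical types, not the HDFS client API, and the use of UnsupportedOperationException to signal "the server does not know this call" is an assumption; the real client would have to detect whatever error an older NameNode actually returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client interface used only for this illustration.
interface ListingClient {
    // Assumed to throw UnsupportedOperationException on an older server.
    List<String> batchedListing(List<String> dirs);

    // The per-directory call that every server version supports.
    List<String> listing(String dir);
}

final class FallbackLister {

    // Tries the batched call first, then falls back to one call per directory.
    static List<String> listAll(ListingClient client, List<String> dirs) {
        try {
            return client.batchedListing(dirs);
        } catch (UnsupportedOperationException e) {
            List<String> result = new ArrayList<>();
            for (String dir : dirs) {
                result.addAll(client.listing(dir));
            }
            return result;
        }
    }
}
```

The caller sees the same result either way; only the number of round trips differs.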






[jira] [Resolved] (HDFS-15014) RBF: WebHdfs chooseDatanode shouldn't call getDatanodeReport

2020-07-29 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-15014.
-
Resolution: Duplicate

> RBF: WebHdfs chooseDatanode shouldn't call getDatanodeReport 
> -
>
> Key: HDFS-15014
> URL: https://issues.apache.org/jira/browse/HDFS-15014
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: Chao Sun
>Priority: Major
>
> Currently the {{chooseDatanode}} call (which is shared by {{open}}, 
> {{create}}, {{append}} and {{getFileChecksum}}) in RBF WebHDFS calls 
> {{getDatanodeReport}} from ALL downstream namenodes:
> {code}
>   private DatanodeInfo chooseDatanode(final Router router,
>       final String path, final HttpOpParam.Op op, final long openOffset,
>       final String excludeDatanodes) throws IOException {
>     // We need to get the DNs as a privileged user
>     final RouterRpcServer rpcServer = getRPCServer(router);
>     UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
>     RouterRpcServer.setCurrentUser(loginUser);
>     DatanodeInfo[] dns = null;
>     try {
>       dns = rpcServer.getDatanodeReport(DatanodeReportType.LIVE);
>     } catch (IOException e) {
>       LOG.error("Cannot get the datanodes from the RPC server", e);
>     } finally {
>       // Reset ugi to remote user for remaining operations.
>       RouterRpcServer.resetCurrentUser();
>     }
>     HashSet excludes = new HashSet();
>     if (excludeDatanodes != null) {
>       Collection collection =
>           getTrimmedStringCollection(excludeDatanodes);
>       for (DatanodeInfo dn : dns) {
>         if (collection.contains(dn.getName())) {
>           excludes.add(dn);
>         }
>       }
>     }
>     ...
> {code}
> The {{getDatanodeReport}} call is very expensive (particularly in a large 
> cluster) as it needs to lock the {{DatanodeManager}}, which is also shared by 
> calls such as processing heartbeats. Check HDFS-14366 for a similar issue.






[jira] [Resolved] (HDFS-15465) Support WebHDFS accesses to the data stored in secure Datanode through insecure Namenode

2020-07-27 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-15465.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

> Support WebHDFS accesses to the data stored in secure Datanode through 
> insecure Namenode
> 
>
> Key: HDFS-15465
> URL: https://issues.apache.org/jira/browse/HDFS-15465
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: federation, webhdfs
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: webhdfs-federation.pdf
>
>
> We're federating a secure HDFS cluster with an insecure cluster.
> Using HDFS RPC, we can access the data managed by insecure Namenode and 
> stored in secure Datanode.
> However, it does not work for WebHDFS due to HadoopIllegalArgumentException.
> {code}
> $ curl -i "http://:/webhdfs/v1/?op=OPEN"
> HTTP/1.1 307 TEMPORARY_REDIRECT
> (omitted)
> Location: http://:/webhdfs/v1/?op=OPEN==0
> $ curl -i "http://:/webhdfs/v1/?op=OPEN==0"
> HTTP/1.1 400 Bad Request
> (omitted)
> {"RemoteException":{"exception":"HadoopIllegalArgumentException","javaClassName":"org.apache.hadoop.HadoopIllegalArgumentException","message":"Invalid argument, newValue is null"}}
> {code}
> This is because secure Datanode expects a delegation token, but insecure 
> Namenode does not return it to a client.
> - org.apache.hadoop.security.token.Token.decodeWritable
> {code}
>   private static void decodeWritable(Writable obj,
>       String newValue) throws IOException {
>     if (newValue == null) {
>       throw new HadoopIllegalArgumentException(
>           "Invalid argument, newValue is null");
>     }
> {code}
> This issue proposes supporting the same access through WebHDFS.
> The attached PDF file [^webhdfs-federation.pdf] depicts our current 
> architecture and proposal.






[jira] [Created] (HDFS-15423) RBF: WebHDFS create shouldn't choose DN from all sub-clusters

2020-06-19 Thread Chao Sun (Jira)
Chao Sun created HDFS-15423:
---

 Summary: RBF: WebHDFS create shouldn't choose DN from all 
sub-clusters
 Key: HDFS-15423
 URL: https://issues.apache.org/jira/browse/HDFS-15423
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Reporter: Chao Sun


In {{RouterWebHdfsMethods}}, for a {{CREATE}} call, {{chooseDatanode}} first 
gets all DNs via {{getDatanodeReport}}, and then randomly picks one from the 
list via {{getRandomDatanode}}. This logic doesn't seem correct, as it should 
pick a DN from the specific cluster(s) of the input {{path}}.
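
One way to picture the intended behaviour is to filter the candidates by sub-cluster before picking at random. The sketch below is purely illustrative: Datanode, SubclusterPicker and the sub-cluster labels are hypothetical, and resolving a path to its sub-cluster (for example through the mount table) is assumed to have happened already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical value type: a datanode name plus the sub-cluster it belongs to.
final class Datanode {
    final String name;
    final String subcluster;

    Datanode(String name, String subcluster) {
        this.name = name;
        this.subcluster = subcluster;
    }
}

final class SubclusterPicker {

    // Picks a random datanode from the given sub-cluster only, or returns
    // null when that sub-cluster has no candidates.
    static Datanode pick(List<Datanode> all, String subcluster, Random random) {
        List<Datanode> candidates = new ArrayList<>();
        for (Datanode dn : all) {
            if (dn.subcluster.equals(subcluster)) {
                candidates.add(dn);
            }
        }
        if (candidates.isEmpty()) {
            return null;
        }
        return candidates.get(random.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        List<Datanode> all = Arrays.asList(
            new Datanode("dn1", "ns1"),
            new Datanode("dn2", "ns2"),
            new Datanode("dn3", "ns1"));
        // Only dn1 or dn3 can be chosen for a path that lives in ns1.
        System.out.println(pick(all, "ns1", new Random()).name);
    }
}
```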






[jira] [Created] (HDFS-15335) Report top N metrics for files in get listing ops

2020-05-05 Thread Chao Sun (Jira)
Chao Sun created HDFS-15335:
---

 Summary: Report top N metrics for files in get listing ops
 Key: HDFS-15335
 URL: https://issues.apache.org/jira/browse/HDFS-15335
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs, metrics
Reporter: Chao Sun


Currently HDFS has the {{filesInGetListingOps}} metric, which reports the total 
number of files in all listing ops. However, it would be useful to also report 
the top N users who contribute most to this number. This can help identify 
potentially abusive users and stop abuse of the NameNode.
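
A minimal sketch of such a report, assuming per-user counts are accumulated in memory. ListingTopUsers, record and topUsers are hypothetical names, not an existing NameNode metrics API, and a real implementation would also need windowing and bounded memory.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration: track files returned by listing ops per user.
final class ListingTopUsers {
    private final Map<String, Long> filesByUser = new HashMap<>();

    // Adds the number of files returned by one listing op for this user.
    synchronized void record(String user, long filesListed) {
        filesByUser.merge(user, filesListed, Long::sum);
    }

    // Returns up to n user names, ordered by total files listed, descending.
    synchronized List<String> topUsers(int n) {
        return filesByUser.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(
                Comparator.reverseOrder()))
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ListingTopUsers top = new ListingTopUsers();
        top.record("alice", 10);
        top.record("bob", 500);
        top.record("carol", 70);
        top.record("alice", 100);
        System.out.println(top.topUsers(2));
    }
}
```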






[jira] [Created] (HDFS-15029) RBF: Supporting batched listing

2019-12-02 Thread Chao Sun (Jira)
Chao Sun created HDFS-15029:
---

 Summary: RBF: Supporting batched listing
 Key: HDFS-15029
 URL: https://issues.apache.org/jira/browse/HDFS-15029
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: rbf
Reporter: Chao Sun


After the work for batched listing in HDFS is implemented, we should also 
support the API for RBF.






[jira] [Created] (HDFS-15015) Backport HDFS-5040 to branch-2

2019-11-26 Thread Chao Sun (Jira)
Chao Sun created HDFS-15015:
---

 Summary: Backport HDFS-5040 to branch-2
 Key: HDFS-15015
 URL: https://issues.apache.org/jira/browse/HDFS-15015
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: logging
Reporter: Chao Sun
Assignee: Chao Sun


HDFS-5040 added audit logging for several admin commands, which is useful for 
diagnosing and debugging. For instance, {{getDatanodeReport}} is an expensive 
call and can be invoked by components such as RBF for metrics and other 
purposes. It's better to track these calls in the audit log.






[jira] [Created] (HDFS-15014) [RBF] WebHdfs chooseDatanode shouldn't call getDatanodeReport

2019-11-26 Thread Chao Sun (Jira)
Chao Sun created HDFS-15014:
---

 Summary: [RBF] WebHdfs chooseDatanode shouldn't call 
getDatanodeReport 
 Key: HDFS-15014
 URL: https://issues.apache.org/jira/browse/HDFS-15014
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Reporter: Chao Sun


Currently the {{chooseDatanode}} call (which is shared by {{open}}, {{create}}, 
{{append}} and {{getFileChecksum}}) in RBF WebHDFS calls {{getDatanodeReport}} 
from ALL downstream namenodes:

{code}
  private DatanodeInfo chooseDatanode(final Router router,
      final String path, final HttpOpParam.Op op, final long openOffset,
      final String excludeDatanodes) throws IOException {
    // We need to get the DNs as a privileged user
    final RouterRpcServer rpcServer = getRPCServer(router);
    UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
    RouterRpcServer.setCurrentUser(loginUser);

    DatanodeInfo[] dns = null;
    try {
      dns = rpcServer.getDatanodeReport(DatanodeReportType.LIVE);
    } catch (IOException e) {
      LOG.error("Cannot get the datanodes from the RPC server", e);
    } finally {
      // Reset ugi to remote user for remaining operations.
      RouterRpcServer.resetCurrentUser();
    }

    HashSet excludes = new HashSet();
    if (excludeDatanodes != null) {
      Collection collection =
          getTrimmedStringCollection(excludeDatanodes);
      for (DatanodeInfo dn : dns) {
        if (collection.contains(dn.getName())) {
          excludes.add(dn);
        }
      }
    }
    ...
{code}

The {{getDatanodeReport}} call is very expensive (particularly in a large 
cluster) as it needs to lock the {{DatanodeManager}}, which is also shared by 
calls such as processing heartbeats. Check HDFS-14366 for a similar issue.
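
One possible mitigation, sketched here only as an illustration and not as the change the project adopted, is to reuse a recent report instead of fetching a fresh one on every request. CachedReport is a hypothetical class; the loader stands in for the expensive call and the injected clock exists only to make the expiry testable.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical illustration: cache an expensive result for a fixed time.
final class CachedReport<T> {
    private final Supplier<T> loader;
    private final long ttlMillis;
    private final LongSupplier clock;
    private T value;
    private long loadedAt;
    private boolean loaded;

    CachedReport(Supplier<T> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached value, reloading it once the TTL has elapsed.
    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAt >= ttlMillis) {
            value = loader.get(); // the expensive call happens only here
            loadedAt = now;
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        CachedReport<String> report = new CachedReport<String>(
            () -> "report@" + System.nanoTime(),
            10_000L,
            System::currentTimeMillis);
        // Both calls fall inside the TTL, so the loader runs once.
        System.out.println(report.get().equals(report.get()));
    }
}
```

The trade-off is staleness: a datanode that dies within the TTL can still be chosen, so callers need to tolerate a failed redirect.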








[jira] [Created] (HDFS-15005) Backport HDFS-12300 to branch-2

2019-11-21 Thread Chao Sun (Jira)
Chao Sun created HDFS-15005:
---

 Summary: Backport HDFS-12300 to branch-2
 Key: HDFS-15005
 URL: https://issues.apache.org/jira/browse/HDFS-15005
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Reporter: Chao Sun
Assignee: Chao Sun


Having delegation token (DT) related information in the audit log is very 
useful. This tracks the effort to backport HDFS-12300 to branch-2.






[jira] [Reopened] (HDFS-14034) Support getQuotaUsage API in WebHDFS

2019-08-07 Thread Chao Sun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reopened HDFS-14034:
-

Re-opening this for backporting to branch-2.

> Support getQuotaUsage API in WebHDFS
> 
>
> Key: HDFS-14034
> URL: https://issues.apache.org/jira/browse/HDFS-14034
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: fs, webhdfs
>Reporter: Erik Krogen
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14034-branch-2.000.patch, 
> HDFS-14034-branch-2.001.patch, HDFS-14034.000.patch, HDFS-14034.001.patch, 
> HDFS-14034.002.patch, HDFS-14034.004.patch
>
>
> HDFS-8898 added support for a new API, {{getQuotaUsage}} which can fetch 
> quota usage on a directory with significantly lower impact than the similar 
> {{getContentSummary}}. This JIRA is to track adding support for this API to 
> WebHDFS. 






[jira] [Resolved] (HDFS-14671) WebHDFS: Add erasureCodingPolicy to ContentSummary

2019-07-31 Thread Chao Sun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-14671.
-
Resolution: Duplicate

> WebHDFS: Add erasureCodingPolicy to ContentSummary
> --
>
> Key: HDFS-14671
> URL: https://issues.apache.org/jira/browse/HDFS-14671
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: webhdfs
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> HDFS-11647 added {{erasureCodingPolicy}} to {{ContentSummary}}. We should add 
> this info to the result from WebHDFS {{getContentSummary}} call as well.






[jira] [Created] (HDFS-14671) WebHDFS: Add erasureCodingPolicy to ContentSummary

2019-07-25 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14671:
---

 Summary: WebHDFS: Add erasureCodingPolicy to ContentSummary
 Key: HDFS-14671
 URL: https://issues.apache.org/jira/browse/HDFS-14671
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: webhdfs
Reporter: Chao Sun
Assignee: Chao Sun


HDFS-11647 added {{erasureCodingPolicy}} to {{ContentSummary}}. We should add 
this info to the result from WebHDFS {{getContentSummary}} call as well.






[jira] [Resolved] (HDFS-14110) NPE when serving http requests while NameNode is starting up

2019-07-18 Thread Chao Sun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-14110.
-
Resolution: Duplicate

> NPE when serving http requests while NameNode is starting up
> 
>
> Key: HDFS-14110
> URL: https://issues.apache.org/jira/browse/HDFS-14110
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.8.2
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>
> In 2.8.2 we saw this exception when a security-enabled NameNode is still 
> loading edits:
> {code:java}
> 2018-11-28 00:21:02,909 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2068ms
> GC pool 'ParNew' had collection(s): count=1 time=2325ms
> 2018-11-28 00:21:05,768 WARN org.apache.hadoop.hdfs.web.resources.ExceptionHandler: INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.common.JspHelper.getTokenUGI(JspHelper.java:283)
>     at org.apache.hadoop.hdfs.server.common.JspHelper.getUGI(JspHelper.java:226)
>     at org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:54)
>     at org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:42)
>     at com.sun.jersey.server.impl.inject.InjectableValuesProvider.getInjectableValues(InjectableValuesProvider.java:46)
>     at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$EntityParamInInvoker.getParams(AbstractResourceMethodDispatchProvider.java:153)
>     at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:203)
>     at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>     at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
>     at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>     at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>     at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
>     at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>     at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:87)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>     at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1353)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>     at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>     at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>     at org.mortbay.jetty.Server.handle(Server.java:326)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)

[jira] [Created] (HDFS-14660) [SBN Read] ObserverNameNode should throw StandbyException for requests not from ObserverProxyProvider

2019-07-17 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14660:
---

 Summary: [SBN Read] ObserverNameNode should throw StandbyException 
for requests not from ObserverProxyProvider
 Key: HDFS-14660
 URL: https://issues.apache.org/jira/browse/HDFS-14660
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chao Sun
Assignee: Chao Sun


In an HDFS HA cluster with consistent reads enabled (HDFS-12943), clients could 
be using either {{ObserverReadProxyProvider}}, {{ConfiguredFailoverProxyProvider}}, or 
something else. Since an observer is just a special type of SBN and we allow 
transitions between them, a client NOT using {{ObserverReadProxyProvider}} will 
need to have {{dfs.ha.namenodes.}} include all NameNodes in the 
cluster, and therefore it may send requests to an observer node.

For this case, we should check whether the {{stateId}} in the incoming RPC 
header is set, and throw a {{StandbyException}} when it is not.
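
A minimal sketch of the proposed check, for illustration only. {{RpcHeader}}, {{checkRequest}} and the local {{StandbyException}} class are simplified stand-ins invented for this example; they are not the actual Hadoop RPC classes, and the real check would live wherever the NameNode inspects the incoming header.

```java
import java.io.IOException;

final class ObserverRequestGuard {
  /** Stand-in for Hadoop's StandbyException, defined locally for this sketch. */
  static final class StandbyException extends IOException {
    StandbyException(String msg) {
      super(msg);
    }
  }

  /** Hypothetical, simplified RPC header; stateId is null when the client did not set it. */
  static final class RpcHeader {
    final Long stateId;

    RpcHeader(Long stateId) {
      this.stateId = stateId;
    }

    boolean hasStateId() {
      return stateId != null;
    }
  }

  /** Rejects requests that reach an observer without a stateId in the header. */
  static void checkRequest(boolean isObserver, RpcHeader header)
      throws StandbyException {
    if (isObserver && !header.hasStateId()) {
      throw new StandbyException(
          "Observer received a request without stateId; client should fail over");
    }
  }
}
```

A client that does not use {{ObserverReadProxyProvider}} would then treat the observer like any other standby and move on to the next configured NameNode.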






[jira] [Resolved] (HDFS-13189) Standby NameNode should roll active edit log when checkpointing

2019-05-01 Thread Chao Sun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-13189.
-
Resolution: Duplicate

> Standby NameNode should roll active edit log when checkpointing
> ---
>
> Key: HDFS-13189
> URL: https://issues.apache.org/jira/browse/HDFS-13189
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Reporter: Chao Sun
>Priority: Minor
>
> When the SBN is checkpointing, it holds the {{cpLock}}. In the 
> current implementation, the edit log tailer thread first checks and 
> rolls the active edit log, and then tails and applies edits. During 
> checkpointing, it is blocked on the {{cpLock}} and does not roll the 
> edit log.
> There seems to be no dependency between rolling the edit log and tailing edits, 
> so a better approach may be to do these in separate threads. This would help 
> people who use the observer feature without in-progress edit log tailing. 
> An alternative is to configure 
> {{dfs.namenode.edit.log.autoroll.multiplier.threshold}} and 
> {{dfs.namenode.edit.log.autoroll.check.interval.ms}} to let the ANN roll its own 
> log more frequently in case the SBN is stuck on the lock.






[jira] [Created] (HDFS-14415) Backport HDFS-13799 to branch-2

2019-04-04 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14415:
---

 Summary: Backport HDFS-13799 to branch-2
 Key: HDFS-14415
 URL: https://issues.apache.org/jira/browse/HDFS-14415
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chao Sun
Assignee: Chao Sun


As multi-SBN feature is already backported to branch-2, this is a follow-up to 
backport HDFS-13799.






[jira] [Created] (HDFS-14399) Backport HDFS-10536 to branch-2

2019-03-29 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14399:
---

 Summary: Backport HDFS-10536 to branch-2
 Key: HDFS-14399
 URL: https://issues.apache.org/jira/browse/HDFS-14399
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chao Sun
Assignee: Chao Sun


As multi-SBN feature is already backported to branch-2, this is a follow-up to 
backport HDFS-10536.








[jira] [Created] (HDFS-14397) Backport HADOOP-15684 to branch-2

2019-03-28 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14397:
---

 Summary: Backport HADOOP-15684 to branch-2
 Key: HDFS-14397
 URL: https://issues.apache.org/jira/browse/HDFS-14397
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chao Sun
Assignee: Chao Sun


As multi-SBN feature is already backported to branch-2, this is a follow-up to 
backport HADOOP-15684.








[jira] [Created] (HDFS-14392) Backport HDFS-9787 to branch-2

2019-03-27 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14392:
---

 Summary: Backport HDFS-9787 to branch-2
 Key: HDFS-14392
 URL: https://issues.apache.org/jira/browse/HDFS-14392
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Reporter: Chao Sun
Assignee: Chao Sun


As multi-SBN feature is already backported to branch-2, this is a follow-up to 
backport HDFS-9787.








[jira] [Created] (HDFS-14391) Backport HDFS-9659 to branch-2

2019-03-27 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14391:
---

 Summary: Backport HDFS-9659 to branch-2
 Key: HDFS-14391
 URL: https://issues.apache.org/jira/browse/HDFS-14391
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chao Sun
Assignee: Chao Sun


As multi-SBN feature is already backported to branch-2, this is a follow-up to 
backport HDFS-9659.






[jira] [Created] (HDFS-14366) Improve HDFS append performance

2019-03-12 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14366:
---

 Summary: Improve HDFS append performance
 Key: HDFS-14366
 URL: https://issues.apache.org/jira/browse/HDFS-14366
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Reporter: Chao Sun
Assignee: Chao Sun


In our HDFS cluster we observed that the {{append}} operation can hold the write 
lock up to 10X longer than other write operations. By collecting a flame graph on 
the namenode (see attachment), we found that most of the time in the append call 
is spent in {{getNumLiveDataNodes()}}:

{code}
  /** @return the number of live datanodes. */
  public int getNumLiveDataNodes() {
    int numLive = 0;
    synchronized (this) {
      for(DatanodeDescriptor dn : datanodeMap.values()) {
        if (!isDatanodeDead(dn) ) {
          numLive++;
        }
      }
    }
    return numLive;
  }
{code}
This method synchronizes on the {{DatanodeManager}}, which is particularly 
expensive in large clusters since {{datanodeMap}} is modified in many 
places, such as when processing DN heartbeats.

For the {{append}} operation, {{getNumLiveDataNodes()}} is invoked in 
{{isSufficientlyReplicated}}:
{code}
  /**
   * Check if a block is replicated to at least the minimum replication.
   */
  public boolean isSufficientlyReplicated(BlockInfo b) {
    // Compare against the lesser of the minReplication and number of live DNs.
    final int replication =
        Math.min(minReplication, getDatanodeManager().getNumLiveDataNodes());
    return countNodes(b).liveReplicas() >= replication;
  }
{code}

The way {{replication}} is calculated is suboptimal: it calls 
{{getNumLiveDataNodes()}} every time, even though {{minReplication}} is usually 
much smaller than the number of live DataNodes.
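
One way to avoid the call in the common case is to compare against {{minReplication}} first and consult the live DataNode count only when that comparison fails. The sketch below is a standalone illustration under that assumption, not the patch for this issue: {{ReplicationCheck}} is a made-up class, and the expensive lookup is modelled as an {{IntSupplier}} so the example runs without Hadoop.

```java
import java.util.function.IntSupplier;

final class ReplicationCheck {
  /**
   * Equivalent to liveReplicas >= min(minReplication, numLiveDataNodes),
   * but evaluates the (expensive) live-DataNode count only when needed.
   */
  static boolean isSufficientlyReplicated(
      int liveReplicas, int minReplication, IntSupplier numLiveDataNodes) {
    if (liveReplicas >= minReplication) {
      // Already meets the configured minimum; skip the synchronized lookup.
      return true;
    }
    // Fewer replicas than minReplication: the block is only sufficiently
    // replicated if the cluster has no more live DataNodes than replicas.
    return liveReplicas >= numLiveDataNodes.getAsInt();
  }
}
```

The two forms agree because, once {{liveReplicas < minReplication}}, the minimum can only be reached through the live DataNode count.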




 






[jira] [Created] (HDFS-14346) EditLogTailer loses precision for sub-second edit log tailing and rolling interval

2019-03-07 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14346:
---

 Summary: EditLogTailer loses precision for sub-second edit log 
tailing and rolling interval
 Key: HDFS-14346
 URL: https://issues.apache.org/jira/browse/HDFS-14346
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Chao Sun
Assignee: Chao Sun


{{EditLogTailer}} currently uses the following:
{code}
logRollPeriodMs = conf.getTimeDuration(
    DFSConfigKeys.DFS_HA_LOGROLL_PERIOD_KEY,
    DFSConfigKeys.DFS_HA_LOGROLL_PERIOD_DEFAULT, TimeUnit.SECONDS) * 1000;

sleepTimeMs = conf.getTimeDuration(
    DFSConfigKeys.DFS_HA_TAILEDITS_PERIOD_KEY,
    DFSConfigKeys.DFS_HA_TAILEDITS_PERIOD_DEFAULT, TimeUnit.SECONDS) * 1000;
{code}
to determine the edit log roll and tail frequency. However, if a user specifies 
a sub-second period, such as {{100ms}}, it loses precision and becomes 0s. 
This is not ideal for some scenarios such as standby reads (HDFS-12943).
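
The truncation can be reproduced with plain {{java.util.concurrent.TimeUnit}}, without Hadoop on the classpath. The sketch below is only an illustration: {{toMillisViaSeconds}} mimics the pattern of converting to whole seconds and multiplying by 1000, and {{toMillisDirect}} shows one possible alternative (converting straight to milliseconds). Both helper names are made up for this example.

```java
import java.util.concurrent.TimeUnit;

final class TailPeriodPrecision {
  /** Mimics the quoted pattern: convert to whole seconds, then scale to ms. */
  static long toMillisViaSeconds(long value, TimeUnit unit) {
    return TimeUnit.SECONDS.convert(value, unit) * 1000;
  }

  /** Possible alternative: convert directly to milliseconds. */
  static long toMillisDirect(long value, TimeUnit unit) {
    return TimeUnit.MILLISECONDS.convert(value, unit);
  }

  public static void main(String[] args) {
    // A user-configured period of 100ms.
    System.out.println(toMillisViaSeconds(100, TimeUnit.MILLISECONDS)); // 0
    System.out.println(toMillisDirect(100, TimeUnit.MILLISECONDS));     // 100
  }
}
```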






[jira] [Created] (HDFS-14305) Serial number in BlockTokenSecretManager could overlap between different namenodes

2019-02-21 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14305:
---

 Summary: Serial number in BlockTokenSecretManager could overlap 
between different namenodes
 Key: HDFS-14305
 URL: https://issues.apache.org/jira/browse/HDFS-14305
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: security
Reporter: Chao Sun
Assignee: Chao Sun


Currently, a {{BlockTokenSecretManager}} starts with a random integer as the 
initial serial number, and then uses this formula to rotate it:
{code:java}
this.intRange = Integer.MAX_VALUE / numNNs;
this.nnRangeStart = intRange * nnIndex;
this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
{code}
where {{numNNs}} is the total number of NameNodes in the cluster, and 
{{nnIndex}} is the index of the current NameNode specified in the configuration 
{{dfs.ha.namenodes.}}.

However, with this approach, different NameNodes could have overlapping ranges 
for serial numbers. For simplicity, let's assume {{Integer.MAX_VALUE}} is 100, 
and we have 2 NameNodes {{nn1}} and {{nn2}} in configuration. Then the ranges 
for these two are:
{code}
nn1 -> [-49, 49]
nn2 -> [1, 99]
{code}
This is because the initial serial number is a random integer that can be 
negative, and Java's {{%}} operator keeps the sign of the dividend.

Moreover, when the keys are updated, the serial number will again be updated 
with the formula:
{code}
this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
{code}
which means the new serial number could be updated to a range that belongs to a 
different NameNode, thus increasing the chance of collision again.

When a collision happens, DataNodes could overwrite an existing key, which 
causes clients to fail with an {{InvalidToken}} error.
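
The overlap is easy to reproduce in isolation. In the sketch below, {{rotate}} is the formula quoted above with the integer maximum passed in as a parameter, so the toy value 100 can be used. {{rotateNonNegative}} is only an illustrative variant based on {{Math.floorMod}}, which keeps each NameNode inside its own non-negative range; it is an assumption for this example and not necessarily the fix adopted for this issue.

```java
final class SerialNoRanges {
  /** The rotation formula from the issue, with a configurable maximum. */
  static int rotate(int serialNo, int maxValue, int numNNs, int nnIndex) {
    int intRange = maxValue / numNNs;
    int nnRangeStart = intRange * nnIndex;
    return (serialNo % intRange) + nnRangeStart;
  }

  /** Illustrative variant: floorMod is never negative for a positive modulus. */
  static int rotateNonNegative(int serialNo, int maxValue, int numNNs, int nnIndex) {
    int intRange = maxValue / numNNs;
    int nnRangeStart = intRange * nnIndex;
    return Math.floorMod(serialNo, intRange) + nnRangeStart;
  }

  public static void main(String[] args) {
    // Java's % keeps the sign of the dividend, so -7 % 50 is -7.
    System.out.println(rotate(-7, 100, 2, 0));   // nn1: -7
    System.out.println(rotate(57, 100, 2, 0));   // nn1: 7
    System.out.println(rotate(-43, 100, 2, 1));  // nn2: 7, same value as nn1
    System.out.println(rotateNonNegative(-43, 100, 2, 1)); // nn2: 57
  }
}
```

With the {{floorMod}} variant and the toy maximum of 100, nn1 stays in [0, 49] and nn2 in [50, 99].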






[jira] [Created] (HDFS-14250) [Standby Reads] msync should sync with active NameNode to fetch the latest stateID

2019-01-31 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14250:
---

 Summary: [Standby Reads] msync should sync with active NameNode to 
fetch the latest stateID
 Key: HDFS-14250
 URL: https://issues.apache.org/jira/browse/HDFS-14250
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Chao Sun
Assignee: Chao Sun


Currently the {{msync}} call is a dummy operation sent to the observer without 
really syncing. Instead, it should:
 # Get the latest stateID from active NN.
 # Use the stateID to talk to observer NN and make sure it is synced.
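
The two steps above can be modelled with a toy example. Everything in this sketch is hypothetical ({{ActiveNN}}, {{ObserverNN}}, {{Client}} and their fields are invented for illustration and do not correspond to real Hadoop classes); it only shows how a stateID fetched from the active NN can gate reads from an observer that is lagging behind.

```java
final class MsyncSketch {
  /** Toy active NameNode: tracks the id of its latest write. */
  static final class ActiveNN {
    long lastWrittenStateId;
  }

  /** Toy observer NameNode: tracks how far it has applied edits. */
  static final class ObserverNN {
    long lastAppliedStateId;

    boolean isCaughtUpTo(long stateId) {
      return lastAppliedStateId >= stateId;
    }
  }

  /** Toy client: remembers the newest stateID it has seen. */
  static final class Client {
    long lastSeenStateId;

    /** Step 1: get the latest stateID from the active NN. */
    void msync(ActiveNN active) {
      lastSeenStateId = Math.max(lastSeenStateId, active.lastWrittenStateId);
    }

    /** Step 2: only read from an observer that has reached that stateID. */
    boolean canReadFrom(ObserverNN observer) {
      return observer.isCaughtUpTo(lastSeenStateId);
    }
  }
}
```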






[jira] [Resolved] (HDFS-14168) Fix TestWebHdfsTimeouts

2018-12-31 Thread Chao Sun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-14168.
-
Resolution: Duplicate

> Fix TestWebHdfsTimeouts
> ---
>
> Key: HDFS-14168
> URL: https://issues.apache.org/jira/browse/HDFS-14168
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: webhdfs
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> The test TestWebHdfsTimeouts keeps failing.






[jira] [Created] (HDFS-14168) Fix TestWebHdfsTimeouts

2018-12-22 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14168:
---

 Summary: Fix TestWebHdfsTimeouts
 Key: HDFS-14168
 URL: https://issues.apache.org/jira/browse/HDFS-14168
 Project: Hadoop HDFS
  Issue Type: Test
  Components: webhdfs
Reporter: Chao Sun
Assignee: Chao Sun


The test TestWebHdfsTimeouts keeps failing.






[jira] [Created] (HDFS-14154) Add explanation for dfs.ha.tail-edits.period in user guide.

2018-12-17 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14154:
---

 Summary: Add explanation for dfs.ha.tail-edits.period in user 
guide.
 Key: HDFS-14154
 URL: https://issues.apache.org/jira/browse/HDFS-14154
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: documentation
Reporter: Chao Sun
Assignee: Chao Sun


We should document {{dfs.ha.tail-edits.period}} in the user guide. The default 
value is too large for {{ObserverReadProxyProvider}} and we should recommend a 
value.

We can also address some remaining issues in HDFS-14131.






[jira] [Created] (HDFS-14146) Handle exception from internalQueueCall

2018-12-12 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14146:
---

 Summary: Handle exception from internalQueueCall
 Key: HDFS-14146
 URL: https://issues.apache.org/jira/browse/HDFS-14146
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ipc
Reporter: Chao Sun
Assignee: Chao Sun


When we re-queue an RPC call, {{internalQueueCall}} may throw exceptions 
(e.g., RPC backoff), which are then swallowed. This causes the RPC to be 
silently discarded without a response to the client.
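
A rough sketch of the intended behavior: catch the failure at the re-queue site and turn it into an error response instead of dropping it. Only the name {{internalQueueCall}} comes from the description above; the {{CallQueue}} interface, the {{String}} call type and the {{errorResponder}} callback are simplifications invented for this example.

```java
import java.io.IOException;
import java.util.function.BiConsumer;

final class RequeueHandler {
  /** Hypothetical queue abstraction; only the method name comes from the issue. */
  interface CallQueue {
    void internalQueueCall(String call) throws IOException;
  }

  /**
   * Re-queues a call. If queueing fails, the failure is reported through
   * errorResponder rather than being silently dropped.
   *
   * @return true if the call was re-queued, false if an error was sent back
   */
  static boolean requeue(CallQueue queue, String call,
      BiConsumer<String, IOException> errorResponder) {
    try {
      queue.internalQueueCall(call);
      return true;
    } catch (IOException e) {
      errorResponder.accept(call, e);
      return false;
    }
  }
}
```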






[jira] [Created] (HDFS-14110) NPE when serving http requests while NameNode is starting up

2018-11-28 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14110:
---

 Summary: NPE when serving http requests while NameNode is starting 
up
 Key: HDFS-14110
 URL: https://issues.apache.org/jira/browse/HDFS-14110
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.8.2
Reporter: Chao Sun
Assignee: Chao Sun


In 2.8.2 we saw this exception when a security-enabled NameNode is still 
loading edits:
{code:java}
2018-11-28 00:21:02,909 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2068ms
GC pool 'ParNew' had collection(s): count=1 time=2325ms
2018-11-28 00:21:05,768 WARN org.apache.hadoop.hdfs.web.resources.ExceptionHandler: INTERNAL_SERVER_ERROR
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.common.JspHelper.getTokenUGI(JspHelper.java:283)
    at org.apache.hadoop.hdfs.server.common.JspHelper.getUGI(JspHelper.java:226)
    at org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:54)
    at org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:42)
    at com.sun.jersey.server.impl.inject.InjectableValuesProvider.getInjectableValues(InjectableValuesProvider.java:46)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$EntityParamInInvoker.getParams(AbstractResourceMethodDispatchProvider.java:153)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:203)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:87)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1353)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
{code}
Looking at the code, this is where the NPE happened (the line with 

[jira] [Created] (HDFS-14067) Allow manual failover between standby and observer

2018-11-12 Thread Chao Sun (JIRA)
Chao Sun created HDFS-14067:
---

 Summary: Allow manual failover between standby and observer
 Key: HDFS-14067
 URL: https://issues.apache.org/jira/browse/HDFS-14067
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Chao Sun
Assignee: Chao Sun


Currently, if automatic failover is enabled in an HA environment, the transition 
from standby to observer is blocked:
{code}
[hdfs@*** hadoop-3.3.0-SNAPSHOT]$ bin/hdfs haadmin -transitionToObserver ha2
Automatic failover is enabled for NameNode at 
Refusing to manually manage HA state, since it may cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please
specify the --forcemanual flag.
{code}

We should allow manual transition between standby and observer in this case.






[jira] [Created] (HDFS-13931) Backport HDFS-6440 to branch 2.8.2

2018-09-21 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13931:
---

 Summary: Backport HDFS-6440 to branch 2.8.2
 Key: HDFS-13931
 URL: https://issues.apache.org/jira/browse/HDFS-13931
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chao Sun
Assignee: Chao Sun


Currently HDFS-6440 is only in branch-3. This aims at backporting it to 
branch-2.8.2.






[jira] [Created] (HDFS-13924) Handle BlockMissingException when reading from observer

2018-09-17 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13924:
---

 Summary: Handle BlockMissingException when reading from observer
 Key: HDFS-13924
 URL: https://issues.apache.org/jira/browse/HDFS-13924
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Chao Sun


Internally we found that reading from an ObserverNode may result in 
{{BlockMissingException}}. This may happen when the observer sees a smaller 
number of DNs than the active (maybe due to communication issues with those 
DNs), or (we guess) because of late block reports from some DNs to the 
observer. This error happens in 
[DFSInputStream#chooseDataNode|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L846],
 when no valid DN can be found for the {{LocatedBlock}} obtained from the NN side.

One potential solution (although a little hacky) is to ask the 
{{DFSInputStream}} to retry the active when this happens. The retry logic is 
already present in the code; we just have to dynamically set a flag to ask the 
{{ObserverReadProxyProvider}} to try the active in this case.

cc [~shv], [~xkrogen], [~vagarychen], [~zero45] for discussion.






[jira] [Created] (HDFS-13898) Throw retriable exception for getBlockLocations when ObserverNameNode is in safemode

2018-09-06 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13898:
---

 Summary: Throw retriable exception for getBlockLocations when 
ObserverNameNode is in safemode
 Key: HDFS-13898
 URL: https://issues.apache.org/jira/browse/HDFS-13898
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Chao Sun


When ObserverNameNode is in safe mode, {{getBlockLocations}} may throw a safe 
mode exception if a block of the given file has no locations yet.

{code}
try {
  checkOperation(OperationCategory.READ);
  res = FSDirStatAndListingOp.getBlockLocations(
      dir, pc, srcArg, offset, length, true);
  if (isInSafeMode()) {
    for (LocatedBlock b : res.blocks.getLocatedBlocks()) {
      // if safemode & no block locations yet then throw safemodeException
      if ((b.getLocations() == null) || (b.getLocations().length == 0)) {
        SafeModeException se = newSafemodeException(
            "Zero blocklocations for " + srcArg);
        if (haEnabled && haContext != null &&
            haContext.getState().getServiceState() ==
                HAServiceState.ACTIVE) {
          throw new RetriableException(se);
        } else {
          throw se;
        }
      }
    }
  }
{code}

It only throws {{RetriableException}} on the active NN, so requests to an 
observer may simply fail.
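
A minimal sketch of the change the title suggests, assuming the observer should be treated like the active NN for this check. The enum and {{RetriableException}} below are local stand-ins so the example compiles on its own, and {{wrapIfRetriable}} is a made-up helper rather than an existing NameNode method.

```java
import java.io.IOException;

final class SafeModeRetryPolicy {
  /** Local stand-in for the HA states mentioned in the issue. */
  enum HAServiceState { ACTIVE, STANDBY, OBSERVER }

  /** Local stand-in for Hadoop's RetriableException. */
  static final class RetriableException extends IOException {
    RetriableException(Throwable cause) {
      super(cause);
    }
  }

  /**
   * Wraps the safe-mode error as retriable when HA is enabled and the
   * NameNode is either active or an observer; otherwise returns it as is.
   */
  static IOException wrapIfRetriable(
      boolean haEnabled, HAServiceState state, IOException safeModeException) {
    boolean retriableState =
        state == HAServiceState.ACTIVE || state == HAServiceState.OBSERVER;
    if (haEnabled && retriableState) {
      return new RetriableException(safeModeException);
    }
    return safeModeException;
  }
}
```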






[jira] [Created] (HDFS-13814) Remove super user privilege requirement for HAServiceProtocol#getServiceStatus

2018-08-09 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13814:
---

 Summary: Remove super user privilege requirement for 
HAServiceProtocol#getServiceStatus
 Key: HDFS-13814
 URL: https://issues.apache.org/jira/browse/HDFS-13814
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chao Sun
Assignee: Chao Sun


See details [in the discussion of 
HDFS-13749|https://issues.apache.org/jira/browse/HDFS-13749?focusedCommentId=16568693&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16568693].
 Currently {{HAServiceProtocol#getServiceStatus}} requires super user privilege, 
which doesn't seem necessary. For comparison, {{DFSAdmin#report}} and 
{{SAFEMODE_GET}} don't require super user privilege.






[jira] [Created] (HDFS-13792) Fix FSN read/write lock metrics name

2018-08-03 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13792:
---

 Summary: Fix FSN read/write lock metrics name
 Key: HDFS-13792
 URL: https://issues.apache.org/jira/browse/HDFS-13792
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: documentation, metrics
 Environment: The metrics name for FSN read/write lock should be in the 
format:

{code}
FSN(Read|Write)Lock`*OperationName*`NanosNumOps
{code}

not
{code}
FSN(Read|Write)Lock`*OperationName*`NumOps
{code}

Reporter: Chao Sun
Assignee: Chao Sun









[jira] [Created] (HDFS-13749) Implement a new client protocol method to get NameNode state

2018-07-19 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13749:
---

 Summary: Implement a new client protocol method to get NameNode 
state
 Key: HDFS-13749
 URL: https://issues.apache.org/jira/browse/HDFS-13749
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Chao Sun
Assignee: Chao Sun


Currently {{HAServiceProtocol#getServiceStatus}} requires super user privilege. 
Therefore, as a temporary solution, in HDFS-12976 we discover NameNode state by 
calling {{reportBadBlocks}}. Here, we'll properly implement this by adding a 
new method in client protocol to get the NameNode state.






[jira] [Created] (HDFS-13735) Make QJM HTTP URL connection timeout configurable

2018-07-13 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13735:
---

 Summary: Make QJM HTTP URL connection timeout configurable
 Key: HDFS-13735
 URL: https://issues.apache.org/jira/browse/HDFS-13735
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: qjm
Reporter: Chao Sun
Assignee: Chao Sun


We've seen "connect timed out" errors internally when QJM tries to open HTTP 
connections to JNs. QJM currently uses {{newDefaultURLConnectionFactory}}, which 
applies a default timeout of 60s that is not configurable.

It would be better for this to be configurable, especially for ObserverNameNode 
(HDFS-12943), where latency is important, and 60s may not be a good value.
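
A minimal sketch of the idea, not the actual patch: the timeout is read from configuration rather than hard-coded. The {{TimeoutSketch}} class and the property name {{example.qjm.http.timeout.ms}} are hypothetical and chosen only for illustration; {{URLConnection.setConnectTimeout}} and {{setReadTimeout}} are standard JDK calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.Properties;

class TimeoutSketch {
    // Hypothetical property name, used only for this illustration.
    static final String TIMEOUT_KEY = "example.qjm.http.timeout.ms";
    // Mirrors the 60s default mentioned above.
    static final int DEFAULT_TIMEOUT_MS = 60_000;

    /** Reads the timeout from the given properties, falling back to the default. */
    static int timeoutMs(Properties conf) {
        return Integer.parseInt(
            conf.getProperty(TIMEOUT_KEY, String.valueOf(DEFAULT_TIMEOUT_MS)));
    }

    /** Prepares (but does not connect) a URLConnection with the configured timeouts. */
    static URLConnection prepare(String url, Properties conf) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            int timeout = timeoutMs(conf);
            conn.setConnectTimeout(timeout);
            conn.setReadTimeout(timeout);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With no property set, the sketch keeps the 60s behavior; a latency-sensitive deployment could set a smaller value.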






[jira] [Created] (HDFS-13687) ConfiguredFailoverProxyProvider could direct requests to SBN

2018-06-15 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13687:
---

 Summary: ConfiguredFailoverProxyProvider could direct requests to 
SBN
 Key: HDFS-13687
 URL: https://issues.apache.org/jira/browse/HDFS-13687
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chao Sun
Assignee: Chao Sun


In case there are multiple SBNs, and {{dfs.ha.allow.stale.reads}} is set to 
true, failover could go to an SBN, which may then serve read requests from the 
client. This may not be the expected behavior. This issue arose while we were 
working on HDFS-12943 and HDFS-12976.

A better approach could be to check {{HAServiceState}} and find the active NN 
when performing failover. This can also reduce the number of failovers the 
client has to do in case of multiple SBNs.
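
As a rough sketch of that approach, assuming each NameNode can be asked for its HA state: the {{State}} enum, {{NameNodeHandle}} interface, and {{findActive}} method below are hypothetical stand-ins, not Hadoop classes.

```java
import java.util.List;
import java.util.OptionalInt;

class ActiveFinderSketch {
    // Stand-in for a NameNode's HA state; not the Hadoop enum itself.
    enum State { ACTIVE, STANDBY }

    /** Hypothetical handle that can report a NameNode's current HA state. */
    interface NameNodeHandle {
        State getState();
    }

    /** Returns the index of the first NameNode reporting ACTIVE, if any. */
    static OptionalInt findActive(List<NameNodeHandle> nameNodes) {
        for (int i = 0; i < nameNodes.size(); i++) {
            if (nameNodes.get(i).getState() == State.ACTIVE) {
                return OptionalInt.of(i);
            }
        }
        return OptionalInt.empty();
    }
}
```

Jumping straight to the index returned here, instead of trying each NN in turn, is what would save failovers when several SBNs are configured.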






[jira] [Created] (HDFS-13674) Improve documentation on Metrics

2018-06-13 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13674:
---

 Summary: Improve documentation on Metrics
 Key: HDFS-13674
 URL: https://issues.apache.org/jira/browse/HDFS-13674
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: documentation, metrics
Reporter: Chao Sun
Assignee: Chao Sun


There are a few confusing places in the [Hadoop Metrics 
page|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html].
For instance, there are duplicated entries such as {{FsImageLoadTime}}, some 
quantile metrics do not have corresponding entries, and the descriptions of 
some quantile metrics do not specify what the {{num}} variable in the metric 
name means.

This JIRA aims to improve this.






[jira] [Created] (HDFS-13664) Refactor ConfiguredFailoverProxyProvider to make inheritance easier

2018-06-07 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13664:
---

 Summary: Refactor ConfiguredFailoverProxyProvider to make 
inheritance easier
 Key: HDFS-13664
 URL: https://issues.apache.org/jira/browse/HDFS-13664
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Reporter: Chao Sun
Assignee: Chao Sun


In HDFS-12943 we'd like to introduce a new proxy provider that inherits from 
{{ConfiguredFailoverProxyProvider}}. Some refactoring is necessary to allow 
easier code sharing.






[jira] [Created] (HDFS-13641) Add metrics for edit log tailing

2018-05-30 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13641:
---

 Summary: Add metrics for edit log tailing
 Key: HDFS-13641
 URL: https://issues.apache.org/jira/browse/HDFS-13641
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Chao Sun









[jira] [Resolved] (HDFS-13600) Add toString() for RemoteMethod

2018-05-21 Thread Chao Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-13600.
-
Resolution: Duplicate

> Add toString() for RemoteMethod
> ---
>
> Key: HDFS-13600
> URL: https://issues.apache.org/jira/browse/HDFS-13600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>
> Saw messages like:
> {code}
> 2018-05-21 18:23:19,011 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient: Invocation 
> to "XXX" for 
> "org.apache.hadoop.hdfs.server.federation.router.RemoteMethod@390c38d2" timed 
> out
> {code}
> I think {{RemoteMethod}} needs a {{toString}} method.
>  






[jira] [Created] (HDFS-13600) Add toString() for RemoteMethod

2018-05-21 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13600:
---

 Summary: Add toString() for RemoteMethod
 Key: HDFS-13600
 URL: https://issues.apache.org/jira/browse/HDFS-13600
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Chao Sun
Assignee: Chao Sun


Saw messages like:
{code}
2018-05-21 18:23:19,011 ERROR 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient: Invocation to 
"XXX" for 
"org.apache.hadoop.hdfs.server.federation.router.RemoteMethod@390c38d2" timed 
out
{code}

I think {{RemoteMethod}} needs a {{toString}} method.
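
A minimal sketch of what such an override could look like. {{RemoteMethodSketch}} and its two fields are hypothetical; the real {{RemoteMethod}} class may hold different state.

```java
import java.util.Arrays;

class RemoteMethodSketch {
    private final String methodName;
    private final Object[] params;

    RemoteMethodSketch(String methodName, Object... params) {
        this.methodName = methodName;
        this.params = params;
    }

    @Override
    public String toString() {
        // Without this override, Object.toString() yields ClassName@hash,
        // which is what shows up in the log message above.
        return methodName + Arrays.toString(params);
    }
}
```

With this in place, a log line would name the method and its arguments instead of an object hash.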

 






[jira] [Created] (HDFS-13578) Add ReadOnly annotation to methods in ClientProtocol

2018-05-16 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13578:
---

 Summary: Add ReadOnly annotation to methods in ClientProtocol
 Key: HDFS-13578
 URL: https://issues.apache.org/jira/browse/HDFS-13578
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Chao Sun
Assignee: Chao Sun


We may want to mark the read-only methods in {{ClientProtocol}} with a 
{{@ReadOnly}} annotation, and then check for it in the proxy provider for the 
observer.
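
One way this could work, sketched with a locally defined annotation and a two-method stand-in interface (both hypothetical, not the Hadoop types): the annotation is retained at runtime so that a proxy provider can look it up through reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class ReadOnlySketch {
    /** Illustrative marker annotation; runtime retention is needed for reflection. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface ReadOnly {}

    /** Hypothetical two-method protocol standing in for ClientProtocol. */
    interface ProtocolSketch {
        @ReadOnly
        String getFileInfo(String path);

        boolean delete(String path);
    }

    /** True if the named method is marked read-only and could go to an observer. */
    static boolean isReadOnly(Class<?> protocol, String name, Class<?>... paramTypes) {
        try {
            Method m = protocol.getMethod(name, paramTypes);
            return m.isAnnotationPresent(ReadOnly.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

A proxy provider could run a check like {{isReadOnly}} on each invoked method and send only the marked ones to the observer.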






[jira] [Created] (HDFS-13286) Add haadmin commands to transition between standby and observer

2018-03-14 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13286:
---

 Summary: Add haadmin commands to transition between standby and 
observer
 Key: HDFS-13286
 URL: https://issues.apache.org/jira/browse/HDFS-13286
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Chao Sun
Assignee: Chao Sun


As discussed in HDFS-12975, we should allow explicit transitions between standby 
and observer through an haadmin command, such as:
{code}
haadmin -transitionToObserver
{code}

Initially we should support transition from observer to standby, and standby to 
observer.






[jira] [Resolved] (HDFS-13266) Fix TestWebHdfsTimeouts

2018-03-12 Thread Chao Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-13266.
-
Resolution: Duplicate

> Fix TestWebHdfsTimeouts
> ---
>
> Key: HDFS-13266
> URL: https://issues.apache.org/jira/browse/HDFS-13266
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test, webhdfs
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> {{TestWebHdfsTimeouts}} fails on Linux, in my case, 
> {code}
> Linux version 4.4.38 (jenkins@debbuilder02-sjc1) (gcc version 4.9.2 (Debian 
> 4.9.2-10) ) #1 SMP Mon Dec 12 09:01:31 UTC 2016)
> {code}
> However, the test succeeds on Mac. It seems this is due to a change in the 
> backlog queue implementation since Linux 2.2. See 
> http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. 
> Therefore, 
> [{{consumeConnectionBacklog}}|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsTimeouts.java#L353]
>  doesn't work as intended under Linux. We should figure out a way to fix this 
> or disable the relevant tests.






[jira] [Resolved] (HDFS-13267) Fix TestWebHdfsTimeouts

2018-03-12 Thread Chao Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved HDFS-13267.
-
Resolution: Duplicate

Oops. Created this twice because of JIRA lag. Closing this one now.

> Fix TestWebHdfsTimeouts
> ---
>
> Key: HDFS-13267
> URL: https://issues.apache.org/jira/browse/HDFS-13267
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test, webhdfs
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> {{TestWebHdfsTimeouts}} fails on Linux, in my case, 
> {code}
> Linux version 4.4.38 (jenkins@debbuilder02-sjc1) (gcc version 4.9.2 (Debian 
> 4.9.2-10) ) #1 SMP Mon Dec 12 09:01:31 UTC 2016)
> {code}
> However, the test succeeds on Mac. It seems this is due to a change in the 
> backlog queue implementation since Linux 2.2. See 
> http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. 
> Therefore, 
> [{{consumeConnectionBacklog}}|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsTimeouts.java#L353]
>  doesn't work as intended under Linux. We should figure out a way to fix this 
> or disable the relevant tests.






[jira] [Created] (HDFS-13267) Fix TestWebHdfsTimeouts

2018-03-12 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13267:
---

 Summary: Fix TestWebHdfsTimeouts
 Key: HDFS-13267
 URL: https://issues.apache.org/jira/browse/HDFS-13267
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test, webhdfs
Reporter: Chao Sun
Assignee: Chao Sun


{{TestWebHdfsTimeouts}} fails on Linux, in my case, 

{code}
Linux version 4.4.38 (jenkins@debbuilder02-sjc1) (gcc version 4.9.2 (Debian 
4.9.2-10) ) #1 SMP Mon Dec 12 09:01:31 UTC 2016)
{code}

However, the test succeeds on Mac. It seems this is due to a change in the 
backlog queue implementation since Linux 2.2. See 
http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. 
Therefore, 
[{{consumeConnectionBacklog}}|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsTimeouts.java#L353]
 doesn't work as intended under Linux. We should figure out a way to fix this 
or disable the relevant tests.






[jira] [Created] (HDFS-13266) Fix TestWebHdfsTimeouts

2018-03-12 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13266:
---

 Summary: Fix TestWebHdfsTimeouts
 Key: HDFS-13266
 URL: https://issues.apache.org/jira/browse/HDFS-13266
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test, webhdfs
Reporter: Chao Sun
Assignee: Chao Sun


{{TestWebHdfsTimeouts}} fails on Linux, in my case, 

{code}
Linux version 4.4.38 (jenkins@debbuilder02-sjc1) (gcc version 4.9.2 (Debian 
4.9.2-10) ) #1 SMP Mon Dec 12 09:01:31 UTC 2016)
{code}

However, the test succeeds on Mac. It seems this is due to a change in the 
backlog queue implementation since Linux 2.2. See 
http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html. 
Therefore, 
[{{consumeConnectionBacklog}}|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHdfsTimeouts.java#L353]
 doesn't work as intended under Linux. We should figure out a way to fix this 
or disable the relevant tests.






[jira] [Created] (HDFS-13202) Fix javadoc in HAUtil and small refactoring

2018-02-27 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13202:
---

 Summary: Fix javadoc in HAUtil and small refactoring
 Key: HDFS-13202
 URL: https://issues.apache.org/jira/browse/HDFS-13202
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chao Sun
Assignee: Chao Sun


There are a few outdated javadocs in {{HAUtil}}.






[jira] [Created] (HDFS-13189) Standby NameNode should roll active edit log when checkpointing

2018-02-23 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13189:
---

 Summary: Standby NameNode should roll active edit log when 
checkpointing
 Key: HDFS-13189
 URL: https://issues.apache.org/jira/browse/HDFS-13189
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Chao Sun


While the SBN is checkpointing, it holds the {{cpLock}}. The current edit log 
tailer thread first checks whether to roll the active edit log, and then tails 
and applies edits. During a checkpoint, the thread is blocked on the 
{{cpLock}} and does not roll the edit log.

It seems there is no dependency between the edit log roll and tailing edits, so 
a better approach may be to do these in separate threads. This would help 
people who use the observer feature without in-progress edit log tailing. 

An alternative is to configure 
{{dfs.namenode.edit.log.autoroll.multiplier.threshold}} and 
{{dfs.namenode.edit.log.autoroll.check.interval.ms}} to let ANN roll its own 
log more frequently in case SBN is stuck on the lock.
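
The separate-threads idea can be illustrated with a toy model. Everything in it is an assumption made for illustration: {{RollTailSketch}} is not Hadoop code, the roll is modeled as needing no lock at all, and the tailer skips its turn with {{tryLock}} where the real thread would block.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class RollTailSketch {
    final ReentrantLock cpLock = new ReentrantLock();
    final AtomicInteger rolls = new AtomicInteger();
    final AtomicInteger tails = new AtomicInteger();

    /** Rolling does not take cpLock in this sketch, so a checkpoint cannot block it. */
    void rollOnce() {
        rolls.incrementAndGet();
    }

    /** Tailing needs cpLock; here it is skipped while a checkpoint holds the lock. */
    boolean tailOnce() {
        if (!cpLock.tryLock()) {
            return false;
        }
        try {
            tails.incrementAndGet();
            return true;
        } finally {
            cpLock.unlock();
        }
    }

    /** Simulates a checkpoint in progress and returns {rolls, tails} seen meanwhile. */
    static int[] runDuringCheckpoint() {
        RollTailSketch s = new RollTailSketch();
        s.cpLock.lock(); // the calling thread plays the checkpointer
        try {
            Thread roller = new Thread(s::rollOnce);
            Thread tailer = new Thread(s::tailOnce);
            roller.start();
            tailer.start();
            roller.join();
            tailer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            s.cpLock.unlock();
        }
        return new int[] {s.rolls.get(), s.tails.get()};
    }
}
```

In this model the roll still happens while the checkpoint holds the lock, and only tailing is deferred.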








[jira] [Created] (HDFS-13182) Allow Observer to participate in NameNode failover

2018-02-22 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13182:
---

 Summary: Allow Observer to participate in NameNode failover
 Key: HDFS-13182
 URL: https://issues.apache.org/jira/browse/HDFS-13182
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, namenode
Reporter: Chao Sun


As discussed in the design doc, when there is no SBN available, Observer should 
be eligible for namenode failover. See HDFS-12975 for some preliminary findings 
in this effort.






[jira] [Created] (HDFS-13152) Support observer reads for WebHDFS

2018-02-15 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13152:
---

 Summary: Support observer reads for WebHDFS 
 Key: HDFS-13152
 URL: https://issues.apache.org/jira/browse/HDFS-13152
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: ha, namenode, webhdfs
Reporter: Chao Sun


In the case of WebHDFS, a designated DN will launch a DFSClient and stream data 
to the WebHDFS client. The DFSClient will read the local configuration files 
(e.g., hdfs-site.xml) on the DN. So, in this case changes may be required for 
WebHDFS to enable observer reads.






[jira] [Created] (HDFS-13145) SBN crash when transition to ANN with in-progress edit tailing enabled

2018-02-13 Thread Chao Sun (JIRA)
Chao Sun created HDFS-13145:
---

 Summary: SBN crash when transition to ANN with in-progress edit 
tailing enabled
 Key: HDFS-13145
 URL: https://issues.apache.org/jira/browse/HDFS-13145
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, namenode
Reporter: Chao Sun
Assignee: Chao Sun


With in-progress edit log tailing enabled, {{QuorumOutputStream}} will send two 
batches to JNs, one normal edit batch followed by a dummy batch to update the 
commit ID on JNs.

{code}
QuorumCall qcall = loggers.sendEdits(
    segmentTxId, firstTxToFlush,
    numReadyTxns, data);
loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");

// Since we successfully wrote this batch, let the loggers know. Any future
// RPCs will thus let the loggers know of the most recent transaction, even
// if a logger has fallen behind.
loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);

// If we don't have this dummy send, committed TxId might be one-batch
// stale on the Journal Nodes
if (updateCommittedTxId) {
  QuorumCall fakeCall = loggers.sendEdits(
      segmentTxId, firstTxToFlush,
      0, new byte[0]);
  loggers.waitForWriteQuorum(fakeCall, writeTimeoutMs, "sendEdits");
}
{code}

Between the two batches, it waits for the JNs to reach a quorum. However, if the 
ANN crashes in between, the SBN will crash while transitioning to ANN:

{code}
java.lang.IllegalStateException: Cannot start writing at txid 24312595802 when 
there is a stream available for read: ..
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:329)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1196)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1839)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1707)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1622)
at 
org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at 
org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:851)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:794)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2490)
2018-02-13 00:43:20,728 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1
{code}

This is because without the dummy batch, the {{commitTxnId}} will lag behind 
the {{endTxId}}, which causes the check in {{openForWrite}} to fail:
{code}
List<EditLogInputStream> streams = new ArrayList<EditLogInputStream>();
journalSet.selectInputStreams(streams, segmentTxId, true, false);
if (!streams.isEmpty()) {
  String error = String.format("Cannot start writing at txid %s " +
      "when there is a stream available for read: %s",
      segmentTxId, streams.get(0));
  IOUtils.cleanupWithLogger(LOG,
      streams.toArray(new EditLogInputStream[0]));
  throw new IllegalStateException(error);
}
{code}

In our environment, this can be reproduced pretty consistently, and it leaves 
the cluster with no running namenodes. Even though we are using a 2.8.2 
backport, I believe the same issue also exists in 3.0.x. 
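
The one-batch lag can be shown with a toy model, assuming a journal learns the committed txid only from the value carried on the next request. The {{Journal}} and {{Writer}} classes below are hypothetical simplifications, not the Hadoop implementation.

```java
class CommittedTxIdSketch {
    /** Toy JournalNode: learns the committed txid only from incoming requests. */
    static class Journal {
        long lastWrittenTxId = 0;
        long committedTxId = 0;

        void sendEdits(long firstTxId, int numTxns, long writerCommittedTxId) {
            committedTxId = Math.max(committedTxId, writerCommittedTxId);
            if (numTxns > 0) {
                lastWrittenTxId = firstTxId + numTxns - 1;
            }
        }
    }

    /** Toy writer: flushes one batch, optionally followed by an empty dummy batch. */
    static class Writer {
        long committedTxId = 0;

        void flush(Journal jn, long firstTxId, int numTxns, boolean dummySend) {
            // The request carries the committed txid known before this batch.
            jn.sendEdits(firstTxId, numTxns, committedTxId);
            committedTxId = firstTxId + numTxns - 1; // batch acknowledged
            if (dummySend) {
                // Empty batch whose only purpose is to carry the new committed txid.
                jn.sendEdits(firstTxId, 0, committedTxId);
            }
        }
    }
}
```

Calling {{flush}} with {{dummySend}} set to false models a writer that crashes between the two sends: the journal has written the batch, but its committed txid still points before it.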






[jira] [Created] (HDFS-12932) Confusing LOG message for block replication

2017-12-15 Thread Chao Sun (JIRA)
Chao Sun created HDFS-12932:
---

 Summary: Confusing LOG message for block replication
 Key: HDFS-12932
 URL: https://issues.apache.org/jira/browse/HDFS-12932
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 2.8.3
Reporter: Chao Sun
Assignee: Chao Sun
Priority: Minor


In our cluster we see a large number of log messages such as the following:
{code}
2017-12-15 22:55:54,603 INFO 
org.apache.hadoop.hdfs.server.namenode.FSDirectory: Increasing replication from 
3 to 3 for 
{code}

This is a little confusing since "from 3 to 3" is not "increasing". Digging 
into it, it seems related to this piece of code:
{code}
if (oldBR != -1) {
  if (oldBR > targetReplication) {
    FSDirectory.LOG.info("Decreasing replication from {} to {} for {}",
                         oldBR, targetReplication, iip.getPath());
  } else {
    FSDirectory.LOG.info("Increasing replication from {} to {} for {}",
                         oldBR, targetReplication, iip.getPath());
  }
}
{code}
Perhaps an {{oldBR == targetReplication}} case is missing?
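
One possible shape for the missing case, as a standalone sketch. The {{describe}} helper is hypothetical, and staying silent when the value is unchanged is just one option; logging an "unchanged" message would be another.

```java
class ReplicationLogSketch {
    /** Returns the log message for a replication change, or null if nothing changed. */
    static String describe(short oldBR, short targetReplication, String path) {
        if (oldBR == -1 || oldBR == targetReplication) {
            return null; // nothing to report
        }
        String verb = oldBR > targetReplication ? "Decreasing" : "Increasing";
        return String.format("%s replication from %d to %d for %s",
            verb, oldBR, targetReplication, path);
    }
}
```

With this check, a "from 3 to 3" request produces no message, while real increases and decreases are reported as before.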







[jira] [Created] (HDFS-12836) startTxId could be greater than endTxId when tailing in-progress edit log

2017-11-17 Thread Chao Sun (JIRA)
Chao Sun created HDFS-12836:
---

 Summary: startTxId could be greater than endTxId when tailing 
in-progress edit log
 Key: HDFS-12836
 URL: https://issues.apache.org/jira/browse/HDFS-12836
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Reporter: Chao Sun
Assignee: Chao Sun


When {{dfs.ha.tail-edits.in-progress}} is true, the edit log tailer will also 
tail in-progress edit log segments. However, in the following code:

{code}
if (onlyDurableTxns && inProgressOk) {
  endTxId = Math.min(endTxId, committedTxnId);
}

EditLogInputStream elis = EditLogFileInputStream.fromUrl(
connectionFactory, url, remoteLog.getStartTxId(),
endTxId, remoteLog.isInProgress());
{code}

it is possible for {{remoteLog.getStartTxId()}} to be greater than 
{{endTxId}}, which causes the following error:

{code}
2017-11-17 19:55:41,165 ERROR org.apache.hadoop.hdfs.server.namenode.FSImage: 
Error replaying edit log at offset 1048576.  Expected transaction ID was 87
Recent opcode offsets: 1048576
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException:
 got premature end-of-file at txid 86; expected file to go up to 85
at 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:197)
at 
org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:189)
at 
org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:205)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:393)
2017-11-17 19:55:41,165 WARN 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Error while reading 
edits from disk. Will try again.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying 
edit log at offset 1048576.  Expected transaction ID was 87
Recent opcode offsets: 1048576
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:218)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397)
at 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:393)
Caused by: 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException:
 got premature end-of-file at txid 86; expected file to go up to 85
at 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:197)
at 
org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:189)
at 
org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:205)
... 9 

[jira] [Created] (HDFS-12669) Implement toString() for EditLogInputStream

2017-10-16 Thread Chao Sun (JIRA)
Chao Sun created HDFS-12669:
---

 Summary: Implement toString() for EditLogInputStream
 Key: HDFS-12669
 URL: https://issues.apache.org/jira/browse/HDFS-12669
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Chao Sun
Priority: Minor


Currently {{EditLogInputStream}} has {{getName()}} but doesn't implement 
{{toString()}}. The latter could be useful in debugging. Currently it just 
prints out messages like:

{code}
2017-10-16 20:41:13,456 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: 
Reading 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@1eb6749b 
expecting start txid #8137
{code}
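
A minimal sketch of one option, assuming {{toString()}} simply delegates to the existing {{getName()}}. {{NamedStreamSketch}} is a hypothetical stand-in for the stream class.

```java
abstract class NamedStreamSketch {
    /** Human-readable name of the stream, e.g. a file path or URL. */
    abstract String getName();

    @Override
    public String toString() {
        // Delegating to getName() avoids the default ClassName@hash output.
        return getName();
    }
}
```

Any log statement that concatenates the stream into a message would then show its name without further changes at the call site.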


