[jira] [Work logged] (HDFS-16279) Print detail datanode info when process first storage report
[ https://issues.apache.org/jira/browse/HDFS-16279?focusedWorklogId=669890&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669890 ]

ASF GitHub Bot logged work on HDFS-16279:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 26/Oct/21 05:15
Start Date: 26/Oct/21 05:15
Worklog Time Spent: 10m

Work Description: tomscut commented on pull request #3564:
URL: https://github.com/apache/hadoop/pull/3564#issuecomment-951568519

Hi @jojochuang @ayushtkn @tasanuma, could you please take a look at this? Thanks.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 669890)
Time Spent: 1h 40m (was: 1.5h)

> Print detail datanode info when process first storage report
> ------------------------------------------------------------
>
> Key: HDFS-16279
> URL: https://issues.apache.org/jira/browse/HDFS-16279
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: tomscut
> Assignee: tomscut
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2021-10-19-20-37-55-850.png
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> Print detailed datanode info when processing the block report.
> !image-2021-10-19-20-37-55-850.png|width=547,height=98!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-16281) Fix flaky unit tests failed due to timeout
[ https://issues.apache.org/jira/browse/HDFS-16281?focusedWorklogId=669891&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669891 ]

ASF GitHub Bot logged work on HDFS-16281:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 26/Oct/21 05:15
Start Date: 26/Oct/21 05:15
Worklog Time Spent: 10m

Work Description: tomscut commented on pull request #3574:
URL: https://github.com/apache/hadoop/pull/3574#issuecomment-951568712

Hi @jojochuang @tasanuma @ferhui, could you please take a look at this? Thanks.

Issue Time Tracking
-------------------
Worklog Id: (was: 669891)
Time Spent: 1h 40m (was: 1.5h)

> Fix flaky unit tests failed due to timeout
> ------------------------------------------
>
> Key: HDFS-16281
> URL: https://issues.apache.org/jira/browse/HDFS-16281
> Project: Hadoop HDFS
> Issue Type: Wish
> Reporter: tomscut
> Assignee: tomscut
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> I found that the unit test *_TestViewFileSystemOverloadSchemeWithHdfsScheme_*
> failed several times due to timeout. Can we change the timeout for some
> methods from _*3s*_ to *_30s_* to be consistent with the other methods?
> {code:java}
> [ERROR] Tests run: 19, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 65.39 s <<< FAILURE!
> - in org.apache.hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS
> [ERROR] testNflyRepair(org.apache.hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS)
> Time elapsed: 4.132 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3000 milliseconds
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:59)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1577)
> at org.apache.hadoop.ipc.Client.call(Client.java:1535)
> at org.apache.hadoop.ipc.Client.call(Client.java:1432)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
> at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
> at com.sun.proxy.$Proxy26.setTimes(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setTimes(ClientNamenodeProtocolTranslatorPB.java:1059)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
> at com.sun.proxy.$Proxy27.setTimes(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.setTimes(DFSClient.java:2658)
> at org.apache.hadoop.hdfs.DistributedFileSystem$37.doCall(DistributedFileSystem.java:1978)
> at org.apache.hadoop.hdfs.DistributedFileSystem$37.doCall(DistributedFileSystem.java:1975)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.setTimes(DistributedFileSystem.java:1988)
> at org.apache.hadoop.fs.FilterFileSystem.setTimes(FilterFileSystem.java:542)
> at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.setTimes(ChRootedFileSystem.java:328)
> at org.apache.hadoop.fs.viewfs.NflyFSystem$NflyOutputStream.commit(NflyFSystem.java:439)
> at org.apache.hadoop.fs.viewfs.NflyFSystem$NflyOutputStream.close(NflyFSystem.java:395)
> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
> at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
> at ...
> {code}
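The proposal above is simply to raise the per-method JUnit timeout on the affected tests from 3s to 30s (i.e. `@Test(timeout = 3000)` becomes `@Test(timeout = 30000)`). As a rough plain-Java illustration of why a tight deadline turns slow-but-correct work into a spurious failure, here is a sketch that mimics JUnit's timeout semantics with an `ExecutorService`; the class and method names are illustrative, not from the Hadoop test suite:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutDemo {
    // Runs `task` under a deadline, mimicking JUnit's @Test(timeout = ...):
    // a result that arrives after the deadline counts as a failure even if correct.
    static boolean runWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(task).get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;                 // finished in time
        } catch (TimeoutException e) {
            return false;                // spurious "flaky" failure
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    // Stand-in for an RPC that is correct but occasionally slow (~200 ms here).
    static Callable<String> slowRpc() {
        return () -> { Thread.sleep(200); return "ok"; };
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(slowRpc(), 50));     // deadline too tight
        System.out.println(runWithTimeout(slowRpc(), 30_000)); // generous deadline
    }
}
```

The same correct operation passes under the generous deadline and fails under the tight one, which is the flakiness the jira describes.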
[jira] [Work logged] (HDFS-16270) Improve NNThroughputBenchmark#printUsage() related to block size
[ https://issues.apache.org/jira/browse/HDFS-16270?focusedWorklogId=669879&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669879 ]

ASF GitHub Bot logged work on HDFS-16270:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 26/Oct/21 02:23
Start Date: 26/Oct/21 02:23
Worklog Time Spent: 10m

Work Description: jianghuazhu closed pull request #3547:
URL: https://github.com/apache/hadoop/pull/3547

Issue Time Tracking
-------------------
Worklog Id: (was: 669879)
Time Spent: 1h 20m (was: 1h 10m)

> Improve NNThroughputBenchmark#printUsage() related to block size
> ----------------------------------------------------------------
>
> Key: HDFS-16270
> URL: https://issues.apache.org/jira/browse/HDFS-16270
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: benchmarks, namenode
> Reporter: JiangHua Zhu
> Assignee: JiangHua Zhu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> When running the NNThroughputBenchmark with incorrect usage, we get a prompt
> such as:
> '
> If connecting to a remote NameNode with -fs option,
> dfs.namenode.fs-limits.min-block-size should be set to 16.
> 21/10/13 11:55:32 INFO util.ExitUtil: Exiting with status -1: ExitException
> '
> That prompt is useful in itself. However, even when 'dfs.blocksize' has
> already been set before execution, for example:
> conf.setInt(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, 16);
> we still get the above prompt, which is wrong.
> The hint is also misleading: it should refer to 'dfs.blocksize' rather than
> 'dfs.namenode.fs-limits.min-block-size', because the NNThroughputBenchmark
> constructor already sets 'dfs.namenode.fs-limits.min-block-size' to 0 in
> advance.
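The intent of the improvement above can be sketched in plain Java, with a `Map` standing in for Hadoop's `Configuration` and illustrative key defaults (this is not the actual benchmark code): when the configured `dfs.blocksize` is below the minimum, the hint should name `dfs.blocksize`, since the benchmark already pins `dfs.namenode.fs-limits.min-block-size` itself.

```java
import java.util.Map;

public class BlockSizeCheckDemo {
    // Stand-in for the benchmark's validation: returns null if OK, otherwise
    // a hint naming the key the user should actually change (dfs.blocksize).
    static String checkBlockSize(Map<String, Long> conf) {
        long blockSize = conf.getOrDefault("dfs.blocksize", 134_217_728L);
        long minBlockSize =
            conf.getOrDefault("dfs.namenode.fs-limits.min-block-size", 1_048_576L);
        if (blockSize < minBlockSize) {
            // Name dfs.blocksize in the hint, not the min-block-size limit.
            return "dfs.blocksize (" + blockSize
                + ") must be at least dfs.namenode.fs-limits.min-block-size ("
                + minBlockSize + ")";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(checkBlockSize(Map.of("dfs.blocksize", 16L)));
    }
}
```

With `dfs.namenode.fs-limits.min-block-size` pinned to 0 as the jira describes, a `dfs.blocksize` of 16 passes the check and no prompt should be printed.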
[jira] [Work logged] (HDFS-16266) Add remote port information to HDFS audit log
[ https://issues.apache.org/jira/browse/HDFS-16266?focusedWorklogId=669861&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669861 ]

ASF GitHub Bot logged work on HDFS-16266:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 26/Oct/21 00:41
Start Date: 26/Oct/21 00:41
Worklog Time Spent: 10m

Work Description: tomscut commented on pull request #3538:
URL: https://github.com/apache/hadoop/pull/3538#issuecomment-951452758

Previously, I didn't think appending the port to the IP field would have much impact if the feature were made configurable, but users might need to change their parsing rules. Considering compatibility, @tasanuma suggests adding a new field, and @jojochuang suggests putting the port in the CallerContext. Putting the port in the CallerContext will not affect field parsing, and the content of the CallerContext is dynamic, which is more flexible. Thank you all for your advice and help. I will update the PR later.

Issue Time Tracking
-------------------
Worklog Id: (was: 669861)
Time Spent: 4h 20m (was: 4h 10m)

> Add remote port information to HDFS audit log
> ---------------------------------------------
>
> Key: HDFS-16266
> URL: https://issues.apache.org/jira/browse/HDFS-16266
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: tomscut
> Assignee: tomscut
> Priority: Major
> Labels: pull-request-available
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> In our production environment, we occasionally encounter a problem where a
> user submits an abnormal computation task, causing a sudden flood of
> requests that drives the Namenode's queueTime and processingTime very high
> and creates a large backlog of tasks.
> We usually locate and kill specific Spark, Flink, or MapReduce tasks based on
> metrics and audit logs. Currently, IP and UGI are recorded in audit logs, but
> there is no port information, so it is sometimes difficult to locate specific
> processes. Therefore, I propose adding port information to the audit log so
> that we can easily track the upstream process.
> Some projects, such as HBase and Alluxio, already include port information in
> their audit logs. I think it is also necessary for HDFS audit logs.
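A rough sketch of the CallerContext idea discussed above, in plain Java (the real change would go through Hadoop's `CallerContext.Builder`; the `clientPort` field name and comma separator here are assumptions for illustration): the client port is appended to the caller-context string as an extra key-value field, so existing audit-log parsers that ignore unknown context fields keep working.

```java
public class CallerContextDemo {
    // Append a clientPort field to an existing caller-context string,
    // similar in spirit to how the Router attaches clientIp.
    static String withClientPort(String context, int port) {
        String field = "clientPort:" + port;
        return (context == null || context.isEmpty()) ? field : context + "," + field;
    }

    public static void main(String[] args) {
        System.out.println(withClientPort("clientIp:10.0.0.5", 39422));
        // clientIp:10.0.0.5,clientPort:39422
    }
}
```

Because the port rides in the context rather than replacing the audit log's IP field, no existing field changes shape; a parser that wants the port opts in by reading the context.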
[jira] [Work logged] (HDFS-16266) Add remote port information to HDFS audit log
[ https://issues.apache.org/jira/browse/HDFS-16266?focusedWorklogId=669856&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669856 ]

ASF GitHub Bot logged work on HDFS-16266:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 26/Oct/21 00:30
Start Date: 26/Oct/21 00:30
Worklog Time Spent: 10m

Work Description: tomscut commented on pull request #3538:
URL: https://github.com/apache/hadoop/pull/3538#issuecomment-951446234

> The API is declared Public, Evolving. If it stays in Hadoop 3.4.0 I am fine with it.
>
> We used to have an audit logger (Cloudera Navigator) that extends the AuditLogger interface. But we've moved away from that.
>
> Performance: It would have a slight performance penalty because every audit log op will always convert InetAddress to a string, regardless of whether the audit logger is off (audit log level = debug, or dfs.namenode.audit.log.debug.cmdlist has the excluded op). It's probably acceptable since audit is logged outside of the namenode lock.
>
> CallerContext: the caller context is probably a better option when you want to do fine-grained post-mortem anyway. Maybe we can modify the caller context to attach the remote port so that it doesn't break API compatibility. Just a thought.

Thanks @jojochuang for your careful consideration and advice. I think adding the remote port to the CallerContext is a very good idea; it will not affect the compatibility @tasanuma mentioned. Once the user enables the CallerContext, we add clientPort to it, similar to how the Router sets clientIp in the CallerContext.

Issue Time Tracking
-------------------
Worklog Id: (was: 669856)
Time Spent: 4h 10m (was: 4h)
[jira] [Work logged] (HDFS-16266) Add remote port information to HDFS audit log
[ https://issues.apache.org/jira/browse/HDFS-16266?focusedWorklogId=669855&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669855 ]

ASF GitHub Bot logged work on HDFS-16266:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 26/Oct/21 00:28
Start Date: 26/Oct/21 00:28
Worklog Time Spent: 10m

Work Description: tomscut commented on pull request #3538:
URL: https://github.com/apache/hadoop/pull/3538#issuecomment-951445364

> I haven't gone through the entire discussion/code, just the question of whether we should modify the existing field or add a new one. Technically both are correct, and I don't see any serious issue with either (not thinking too deep). But I feel it might be a little easier for parsers to adapt if there were a new field, rather than trying to figure out whether the existing field has a port or not. Just my thoughts; I am OK with whichever way most people tend to agree. In any case, whatever we do should be optional and guarded by a config.

Thanks @ayushtkn for your comments and suggestions.

Issue Time Tracking
-------------------
Worklog Id: (was: 669855)
Time Spent: 4h (was: 3h 50m)
[jira] [Created] (HDFS-16283) RBF: improve renewLease() to call only a specific NameNode rather than make fan-out calls
Aihua Xu created HDFS-16283:
----------------------------

Summary: RBF: improve renewLease() to call only a specific NameNode rather than make fan-out calls
Key: HDFS-16283
URL: https://issues.apache.org/jira/browse/HDFS-16283
Project: Hadoop HDFS
Issue Type: Sub-task
Components: rbf
Reporter: Aihua Xu
Assignee: Aihua Xu

Currently, renewLease() against a router fans out to all the NameNodes. Since renewLease() is called so frequently, if one of the NameNodes is slow, the router queues eventually become blocked by renewLease() calls and the router degrades. We will change the client side to keep track of the NameNode Id in addition to the current fileId, so routers know which NameNodes the client is renewing its lease against.
[jira] [Commented] (HDFS-16266) Add remote port information to HDFS audit log
[ https://issues.apache.org/jira/browse/HDFS-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433702#comment-17433702 ]

tomscut commented on HDFS-16266:
--------------------------------

[~weichiu] Thank you very much for your comments. Currently, FairCallQueue is not enabled in our cluster. Indeed, without long-running connections it is hard to track tasks based on ports, so this should only be used as an auxiliary method.

In our production environment, we can get application IDs (e.g. MR or Spark) for certain tasks from the CallerContext and then trace the relevant users; this is very efficient. But sometimes users perform HDFS operations in user-defined tasks, and in such scenarios the CallerContext may be empty. Then we may find these tasks by "ip:port".
[jira] [Commented] (HDFS-16275) [HDFS] Enable considerLoad for localWrite
[ https://issues.apache.org/jira/browse/HDFS-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433687#comment-17433687 ]

Janus Chow commented on HDFS-16275:
-----------------------------------

[~ayushtkn] Thank you for your quick explanation. In fact, I was curious and confused about the default "false" here. I thought considerLoad was the same kind of condition as the current checks in `isGoodDatanode`, such as excluding stale or slow nodes; those make no distinction between local and non-local writes.

For configuration, since considerLoad comes from the config "dfs.namenode.replication.considerLoad", would "dfs.namenode.replication.locality.considerLoad" be a choice?

> [HDFS] Enable considerLoad for localWrite
> -----------------------------------------
>
> Key: HDFS-16275
> URL: https://issues.apache.org/jira/browse/HDFS-16275
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Janus Chow
> Assignee: Janus Chow
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, when a client is on the same machine as a datanode, it will try to
> write to the local machine regardless of the load of that datanode, that is,
> its xceiverCount.
> In our production cluster, the datanode and NodeManager run on the same
> server, so when heavy jobs run on a labeled queue, the corresponding
> datanodes have higher xceiverCounts than other datanodes. When other clients
> try to write, the exception "could only be replicated to 0 nodes" is thrown.
> This ticket is to enable considerLoad to avoid hot local writes.
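The considerLoad check under discussion is, roughly, a comparison of a datanode's xceiver count against a multiple of the cluster average. A plain-Java sketch of applying that same test to the local node follows; the factor value and names are illustrative, loosely following the shape of the check in BlockPlacementPolicyDefault rather than quoting it:

```java
public class LocalWriteLoadDemo {
    // Returns true if the (local) node is considered too busy to write to:
    // its active transceiver count exceeds factor * cluster average.
    static boolean isOverloaded(int nodeXceiverCount, double avgXceiverCount,
                                double factor) {
        return nodeXceiverCount > factor * avgXceiverCount;
    }

    public static void main(String[] args) {
        // With an illustrative factor of 2.0: a node at 50 xceivers in a
        // cluster averaging 10 would be skipped even for a local write.
        System.out.println(isOverloaded(50, 10.0, 2.0)); // true
        System.out.println(isOverloaded(15, 10.0, 2.0)); // false
    }
}
```

The jira's proposal is simply to apply this test to the local candidate too, instead of exempting it, so a hot colocated datanode is cooled down rather than flooded.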
[jira] [Commented] (HDFS-13514) BenchmarkThroughput.readLocalFile hangs with misconfigured BUFFER_SIZE
[ https://issues.apache.org/jira/browse/HDFS-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433680#comment-17433680 ]

Ayush Saxena commented on HDFS-13514:
-------------------------------------

Not sure which PR is the latest. Can you close the duplicate jiras and PRs, and keep only the active PR open? If possible, please extend a test as well.

> BenchmarkThroughput.readLocalFile hangs with misconfigured BUFFER_SIZE
> ----------------------------------------------------------------------
>
> Key: HDFS-13514
> URL: https://issues.apache.org/jira/browse/HDFS-13514
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Affects Versions: 2.5.0
> Reporter: John Doe
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When BUFFER_SIZE is configured to be 0, the while loop in the
> BenchmarkThroughput.readLocalFile function hangs endlessly.
> This is because when data.length (i.e., BUFFER_SIZE) is 0, in.read(data)
> always returns 0, so size never becomes -1.
> Here is the code snippet.
> {code:java}
> // hangs when dfsthroughput.buffer.size is configured to be 0
> BUFFER_SIZE = conf.getInt("dfsthroughput.buffer.size", 4 * 1024);
>
> private void readLocalFile(Path path, String name, Configuration conf)
>     throws IOException {
>   System.out.print("Reading " + name);
>   resetMeasurements();
>   InputStream in = new FileInputStream(new File(path.toString()));
>   byte[] data = new byte[BUFFER_SIZE];
>   long size = 0;
>   while (size >= 0) {
>     size = in.read(data);
>   }
>   in.close();
>   printMeasurements();
> }
> {code}
> A similar case is HDFS-13513.
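A minimal sketch of a guard for the hang described above, in plain Java with an in-memory byte array standing in for the local file (class and method names are illustrative, not the Hadoop code): `InputStream.read(byte[], ...)` returns 0 when the requested length is 0 and only returns -1 at end of stream, so with a zero-length buffer the loop can never terminate; rejecting a non-positive buffer size up front prevents that.

```java
import java.io.ByteArrayInputStream;

public class ReadLoopDemo {
    // Drains the input, guarding against the zero-length-buffer case in which
    // read(...) returns 0 forever and the while loop never terminates.
    static long readAll(byte[] input, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException(
                "dfsthroughput.buffer.size must be > 0, got " + bufferSize);
        }
        ByteArrayInputStream in = new ByteArrayInputStream(input);
        byte[] data = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(data, 0, data.length)) >= 0) {
            total += n;  // n is 0 or more until EOF, which is -1
        }
        return total;
    }
}
```

Without the guard, passing `bufferSize = 0` makes every `read` return 0, reproducing the endless loop from the jira.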
[jira] [Commented] (HDFS-16275) [HDFS] Enable considerLoad for localWrite
[ https://issues.apache.org/jira/browse/HDFS-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433678#comment-17433678 ]

Ayush Saxena commented on HDFS-16275:
-------------------------------------

Ohhk. By any chance, have you explored AvailableSpaceBlockPlacementPolicy? It has an optimisation for the local node as well, in the form of the config {{dfs.namenode.available-space-block-placement-policy.balance-local-node}}. I haven't gone through the code, but the proposed change should be configurable and turned off by default, for backward compatibility.
[jira] [Commented] (HDFS-16275) [HDFS] Enable considerLoad for localWrite
[ https://issues.apache.org/jira/browse/HDFS-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433671#comment-17433671 ]

Janus Chow commented on HDFS-16275:
-----------------------------------

[~ayushtkn] Thanks for the comment. I think we do want data locality, just not on a node that is too hot. IMHO, when the node is not too hot, locality should boost performance. The change from the default "false" is mainly about cooling the node down.
[jira] [Work logged] (HDFS-16266) Add remote port information to HDFS audit log
[ https://issues.apache.org/jira/browse/HDFS-16266?focusedWorklogId=669400&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669400 ]

ASF GitHub Bot logged work on HDFS-16266:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 25/Oct/21 08:11
Start Date: 25/Oct/21 08:11
Worklog Time Spent: 10m

Work Description: jojochuang commented on pull request #3538:
URL: https://github.com/apache/hadoop/pull/3538#issuecomment-950643667

The API is declared Public, Evolving. If it stays in Hadoop 3.4.0 I am fine with it.

We used to have an audit logger (Cloudera Navigator) that extends the AuditLogger interface. But we've moved away from that.

Performance: It would have a slight performance penalty because every audit log op will always convert InetAddress to a string, regardless of whether the audit logger is off (audit log level = debug, or dfs.namenode.audit.log.debug.cmdlist has the excluded op). It's probably acceptable since audit is logged outside of the namenode lock.

CallerContext: the caller context is probably a better option when you want to do fine-grained post-mortem anyway. Maybe we can modify the caller context to attach the remote port so that it doesn't break API compatibility. Just a thought.

Issue Time Tracking
-------------------
Worklog Id: (was: 669400)
Time Spent: 3h 50m (was: 3h 40m)
[jira] [Commented] (HDFS-16266) Add remote port information to HDFS audit log
[ https://issues.apache.org/jira/browse/HDFS-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433608#comment-17433608 ]

Wei-Chiu Chuang commented on HDFS-16266:
----------------------------------------

Thanks for reporting the issue and submitting the PR. I have a few design/architectural comments beyond the code attached in the PR.

1. For the abusive users, have you tried enabling FairCallQueue to punish those bad users? If so, did you find it insufficient to combat the resource usage problem? Is it because the users issued recursive commands like 'du' (contentSummary) calls?

2. The current audit logger supports CallerContext. Applications (e.g. Hive) that support this semantics can attach a signature that is then passed from the application to the namenode. IMO this is a more explicit way to do a post-mortem.