[jira] [Work logged] (HDFS-16279) Print detail datanode info when process first storage report

2021-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16279?focusedWorklogId=669890&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669890
 ]

ASF GitHub Bot logged work on HDFS-16279:
-

Author: ASF GitHub Bot
Created on: 26/Oct/21 05:15
Start Date: 26/Oct/21 05:15
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3564:
URL: https://github.com/apache/hadoop/pull/3564#issuecomment-951568519


   Hi @jojochuang @ayushtkn @tasanuma, could you please take a look at this? 
Thanks.




Issue Time Tracking
---

Worklog Id: (was: 669890)
Time Spent: 1h 40m  (was: 1.5h)

> Print detail datanode info when process first storage report
> 
>
> Key: HDFS-16279
> URL: https://issues.apache.org/jira/browse/HDFS-16279
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2021-10-19-20-37-55-850.png
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Print detailed datanode info when processing the block report.
> !image-2021-10-19-20-37-55-850.png|width=547,height=98!
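For readers without the screenshot, the gist is to log the datanode itself rather than just an opaque ID. A minimal sketch of such a statement; the names here are assumptions for illustration, not taken from the actual patch:

{code:java}
// Hypothetical sketch (storageInfo/node are assumed names, not the patch's):
// logging the DatanodeDescriptor itself prints ip:port, hostname, uuid, etc.,
// instead of only a storage or datanode ID.
LOG.info("Processing first storage report for {} from datanode {}",
    storageInfo.getStorageID(), node);
{code}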






[jira] [Work logged] (HDFS-16281) Fix flaky unit tests failed due to timeout

2021-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16281?focusedWorklogId=669891&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669891
 ]

ASF GitHub Bot logged work on HDFS-16281:
-

Author: ASF GitHub Bot
Created on: 26/Oct/21 05:15
Start Date: 26/Oct/21 05:15
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3574:
URL: https://github.com/apache/hadoop/pull/3574#issuecomment-951568712


   Hi @jojochuang @tasanuma @ferhui, could you please take a look at this? 
Thanks.




Issue Time Tracking
---

Worklog Id: (was: 669891)
Time Spent: 1h 40m  (was: 1.5h)

> Fix flaky unit tests failed due to timeout
> --
>
> Key: HDFS-16281
> URL: https://issues.apache.org/jira/browse/HDFS-16281
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> I found that the unit test *_TestViewFileSystemOverloadSchemeWithHdfsScheme_* 
> failed several times due to timeouts. Can we change the timeout for some 
> methods from _*3s*_ to *_30s_*, to be consistent with the other methods? (A 
> sketch of the change follows the stack trace below.)
> {code:java}
> [ERROR] Tests run: 19, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 65.39 s <<< FAILURE! - in org.apache.hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS
> [ERROR] testNflyRepair(org.apache.hadoop.fs.viewfs.TestViewFSOverloadSchemeWithMountTableConfigInHDFS)  Time elapsed: 4.132 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3000 milliseconds
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:502)
>   at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:59)
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1577)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1535)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1432)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
>   at com.sun.proxy.$Proxy26.setTimes(Unknown Source)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setTimes(ClientNamenodeProtocolTranslatorPB.java:1059)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>   at com.sun.proxy.$Proxy27.setTimes(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.setTimes(DFSClient.java:2658)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$37.doCall(DistributedFileSystem.java:1978)
>   at org.apache.hadoop.hdfs.DistributedFileSystem$37.doCall(DistributedFileSystem.java:1975)
>   at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at org.apache.hadoop.hdfs.DistributedFileSystem.setTimes(DistributedFileSystem.java:1988)
>   at org.apache.hadoop.fs.FilterFileSystem.setTimes(FilterFileSystem.java:542)
>   at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.setTimes(ChRootedFileSystem.java:328)
>   at org.apache.hadoop.fs.viewfs.NflyFSystem$NflyOutputStream.commit(NflyFSystem.java:439)
>   at org.apache.hadoop.fs.viewfs.NflyFSystem$NflyOutputStream.close(NflyFSystem.java:395)
>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>   at 
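For reference, the proposed fix is simply a larger per-test timeout. A minimal sketch, assuming the JUnit 4 annotation style these Hadoop tests use (the class name here is hypothetical):

{code:java}
import org.junit.Test;

public class TestNflyTimeoutSketch {
  // Raised from timeout = 3000 to 30000 ms, consistent with the other
  // test methods in the class.
  @Test(timeout = 30000)
  public void testNflyRepair() throws Exception {
    // test body elided
  }
}
{code}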

[jira] [Work logged] (HDFS-16270) Improve NNThroughputBenchmark#printUsage() related to block size

2021-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16270?focusedWorklogId=669879&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669879
 ]

ASF GitHub Bot logged work on HDFS-16270:
-

Author: ASF GitHub Bot
Created on: 26/Oct/21 02:23
Start Date: 26/Oct/21 02:23
Worklog Time Spent: 10m 
  Work Description: jianghuazhu closed pull request #3547:
URL: https://github.com/apache/hadoop/pull/3547


   




Issue Time Tracking
---

Worklog Id: (was: 669879)
Time Spent: 1h 20m  (was: 1h 10m)

> Improve NNThroughputBenchmark#printUsage() related to block size
> 
>
> Key: HDFS-16270
> URL: https://issues.apache.org/jira/browse/HDFS-16270
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks, namenode
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When using the NNThroughputBenchmark test, if the usage is not correct, we 
> will get a prompt message, e.g.:
> '
> If connecting to a remote NameNode with -fs option, 
> dfs.namenode.fs-limits.min-block-size should be set to 16.
> 21/10/13 11:55:32 INFO util.ExitUtil: Exiting with status -1: ExitException
> '
> That in itself is good.
> However, even when the setting of 'dfs.blocksize' has been completed before 
> execution, for example:
> conf.setInt(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, 16);
> we will still get the above prompt, which is wrong.
> It should also be explained that the hint should refer not to 
> 'dfs.namenode.fs-limits.min-block-size' but to 'dfs.blocksize', because the 
> NNThroughputBenchmark constructor has already set 
> 'dfs.namenode.fs-limits.min-block-size' to 0 in advance.
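To make the interplay concrete, a hedged sketch of the two settings involved, assuming the usual Configuration API and the standard DFSConfigKeys constants:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;

Configuration conf = new HdfsConfiguration();
// The NNThroughputBenchmark constructor already presets the NameNode-side
// limit to 0, which is exactly what makes a tiny test block size legal:
conf.setLong(DFSConfigKeys.DFS_NAMENODE_MIN_BLOCK_SIZE_KEY, 0);
// ...so a caller who sets dfs.blocksize up front should not be warned:
conf.setInt(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, 16);
{code}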






[jira] [Work logged] (HDFS-16266) Add remote port information to HDFS audit log

2021-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16266?focusedWorklogId=669861&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669861
 ]

ASF GitHub Bot logged work on HDFS-16266:
-

Author: ASF GitHub Bot
Created on: 26/Oct/21 00:41
Start Date: 26/Oct/21 00:41
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3538:
URL: https://github.com/apache/hadoop/pull/3538#issuecomment-951452758


   Previously, I didn't think appending the port to the IP field would have 
much impact if the feature were made configurable, but users might need to 
change their parsing rules.
   
   Considering compatibility, @tasanuma suggests adding a new field, and 
@jojochuang suggests putting the port in the CallerContext. If we put the 
port in the CallerContext, it will not affect field parsing, and the content 
of the CallerContext is dynamic, which is more flexible.
   
   Thank you all for your advice and help. I will update the PR later. 
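To illustrate the two options being weighed, hypothetical audit lines (all field values invented for illustration) might differ like this:

{code}
ip=/10.1.2.3:45678  cmd=create  ...                                      <- port appended to the ip field
ip=/10.1.2.3        cmd=create  ...  callerContext=...,clientPort:45678  <- port carried in the CallerContext
{code}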




Issue Time Tracking
---

Worklog Id: (was: 669861)
Time Spent: 4h 20m  (was: 4h 10m)

> Add remote port information to HDFS audit log
> -
>
> Key: HDFS-16266
> URL: https://issues.apache.org/jira/browse/HDFS-16266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> In our production environment, we occasionally encounter a problem where a 
> user submits an abnormal computation task that causes a sudden flood of 
> requests; the queueTime and processingTime of the NameNode then rise very 
> high, creating a large backlog of tasks.
> We usually locate and kill the specific Spark, Flink, or MapReduce tasks 
> based on metrics and audit logs. Currently, the IP and UGI are recorded in 
> the audit logs, but there is no port information, so it is sometimes 
> difficult to locate the specific process. Therefore, I propose that we add 
> port information to the audit log, so that we can easily track the upstream 
> process.
> Some projects, such as HBase and Alluxio, already include port information 
> in their audit logs. I think it is also necessary to add port information 
> to the HDFS audit log.






[jira] [Work logged] (HDFS-16266) Add remote port information to HDFS audit log

2021-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16266?focusedWorklogId=669856&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669856
 ]

ASF GitHub Bot logged work on HDFS-16266:
-

Author: ASF GitHub Bot
Created on: 26/Oct/21 00:30
Start Date: 26/Oct/21 00:30
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3538:
URL: https://github.com/apache/hadoop/pull/3538#issuecomment-951446234


   > The API is declared Public, Evolving. If it stays in Hadoop 3.4.0, I am 
fine with it.
   > 
   > We used to have an audit logger (Cloudera Navigator) that extended the 
AuditLogger interface, but we've moved away from that.
   > 
   > Performance: It would have a slight performance penalty, because every 
audit log op will always convert the InetAddress to a string, regardless of 
whether the audit logger is off (audit log level = debug, or 
dfs.namenode.audit.log.debug.cmdlist has the excluded op). It's probably 
acceptable, since the audit is logged outside of the namenode lock.
   > 
   > CallerContext: the caller context is probably a better option when you 
want to do a fine-grained post-mortem anyway. Maybe we can modify the caller 
context to attach the remote port so that it doesn't break API compatibility. 
Just a thought.
   
   Thanks @jojochuang for your careful consideration and advice.
   
   I think it's a very good idea to add the remote port to the CallerContext; 
that will not affect the compatibility @tasanuma mentioned. Once the user 
enables the CallerContext, we add a clientPort entry to it, similar to how 
the Router sets clientIp in the CallerContext. 
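A hedged sketch of that idea on the client side. It assumes the CallerContext.Builder#append API that the Router path uses for clientIp; the "clientPort" field name and the remotePort variable are illustrative only, not the PR's actual code:

{code:java}
import org.apache.hadoop.ipc.CallerContext;

// Attach the remote port to the current caller context, mirroring how the
// RBF Router appends clientIp. remotePort is an illustrative local variable.
CallerContext current = CallerContext.getCurrent();
CallerContext.Builder builder = new CallerContext.Builder(
    current == null ? null : current.getContext());
builder.append("clientPort", String.valueOf(remotePort));
CallerContext.setCurrent(builder.build());
{code}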




Issue Time Tracking
---

Worklog Id: (was: 669856)
Time Spent: 4h 10m  (was: 4h)

> Add remote port information to HDFS audit log
> -
>
> Key: HDFS-16266
> URL: https://issues.apache.org/jira/browse/HDFS-16266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> In our production environment, we occasionally encounter a problem where a 
> user submits an abnormal computation task that causes a sudden flood of 
> requests; the queueTime and processingTime of the NameNode then rise very 
> high, creating a large backlog of tasks.
> We usually locate and kill the specific Spark, Flink, or MapReduce tasks 
> based on metrics and audit logs. Currently, the IP and UGI are recorded in 
> the audit logs, but there is no port information, so it is sometimes 
> difficult to locate the specific process. Therefore, I propose that we add 
> port information to the audit log, so that we can easily track the upstream 
> process.
> Some projects, such as HBase and Alluxio, already include port information 
> in their audit logs. I think it is also necessary to add port information 
> to the HDFS audit log.






[jira] [Work logged] (HDFS-16266) Add remote port information to HDFS audit log

2021-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16266?focusedWorklogId=669855&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669855
 ]

ASF GitHub Bot logged work on HDFS-16266:
-

Author: ASF GitHub Bot
Created on: 26/Oct/21 00:28
Start Date: 26/Oct/21 00:28
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3538:
URL: https://github.com/apache/hadoop/pull/3538#issuecomment-951445364


   > The API is declared Public, Evolving. If it stays in Hadoop 3.4.0, I am 
fine with it.
   > 
   > We used to have an audit logger (Cloudera Navigator) that extended the 
AuditLogger interface, but we've moved away from that.
   > 
   > Performance: It would have a slight performance penalty, because every 
audit log op will always convert the InetAddress to a string, regardless of 
whether the audit logger is off (audit log level = debug, or 
dfs.namenode.audit.log.debug.cmdlist has the excluded op). It's probably 
acceptable, since the audit is logged outside of the namenode lock.
   > 
   > CallerContext: the caller context is probably a better option when you 
want to do a fine-grained post-mortem anyway. Maybe we can modify the caller 
context to attach the remote port so that it doesn't break API compatibility. 
Just a thought.
   
   > I haven't gone through the entire discussion/code, just whether we 
should modify the existing field or add a new one. Technically both are 
correct, and I don't see any serious issue with either (not thinking too 
deep). But I feel that a new field might be a little easier for the parsers 
to adapt to, rather than trying to figure out whether the existing field has 
a port or not. Just my thoughts; I am OK with whichever way most people tend 
to agree. Anyway, whatever we do should be optional and guarded by a config.
   
   Thanks @ayushtkn for your comments and suggestions.




Issue Time Tracking
---

Worklog Id: (was: 669855)
Time Spent: 4h  (was: 3h 50m)

> Add remote port information to HDFS audit log
> -
>
> Key: HDFS-16266
> URL: https://issues.apache.org/jira/browse/HDFS-16266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> In our production environment, we occasionally encounter a problem where a 
> user submits an abnormal computation task that causes a sudden flood of 
> requests; the queueTime and processingTime of the NameNode then rise very 
> high, creating a large backlog of tasks.
> We usually locate and kill the specific Spark, Flink, or MapReduce tasks 
> based on metrics and audit logs. Currently, the IP and UGI are recorded in 
> the audit logs, but there is no port information, so it is sometimes 
> difficult to locate the specific process. Therefore, I propose that we add 
> port information to the audit log, so that we can easily track the upstream 
> process.
> Some projects, such as HBase and Alluxio, already include port information 
> in their audit logs. I think it is also necessary to add port information 
> to the HDFS audit log.






[jira] [Created] (HDFS-16283) RBF: improve renewLease() to call only a specific NameNode rather than make fan-out calls

2021-10-25 Thread Aihua Xu (Jira)
Aihua Xu created HDFS-16283:
---

 Summary: RBF: improve renewLease() to call only a specific 
NameNode rather than make fan-out calls
 Key: HDFS-16283
 URL: https://issues.apache.org/jira/browse/HDFS-16283
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: rbf
Reporter: Aihua Xu
Assignee: Aihua Xu


Currently, renewLease() against a Router fans out to all the NameNodes. Since 
renewLease() is called very frequently, if one of the NameNodes is slow, the 
Router's queues eventually get blocked by renewLease() calls, degrading the 
Router.

We will make a change on the client side to keep track of the NameNode ID in 
addition to the current fileId, so that Routers know which NameNodes the 
client is renewing its lease against.
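A hedged sketch of the protocol-level shape this could take; the signature is hypothetical, not the committed API:

{code:java}
import java.io.IOException;
import java.util.List;

public interface RenewLeaseSketch {
  // Hypothetical namespace-aware variant: the client reports which
  // namespaces (NameNode IDs) it holds leases under, so a Router can renew
  // against only those NameNodes instead of fanning out to all of them.
  void renewLease(String clientName, List<String> namenodeIds)
      throws IOException;
}
{code}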







[jira] [Commented] (HDFS-16266) Add remote port information to HDFS audit log

2021-10-25 Thread tomscut (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433702#comment-17433702
 ] 

tomscut commented on HDFS-16266:


[~weichiu] Thank you very much for your comments.

Currently, FairCallQueue is not enabled in our cluster. Indeed, without 
long-running connections, it is really hard to track tasks based on ports, so 
this should only be used as an auxiliary method.

In our production environment, we can indeed get the application IDs (e.g. MR 
or Spark) of certain tasks from the CallerContext and then trace the relevant 
users, which is very efficient. But sometimes users perform HDFS operations 
in user-defined tasks; in such scenarios, the CallerContext may be empty, so 
we may have to find these tasks by "ip:port".

> Add remote port information to HDFS audit log
> -
>
> Key: HDFS-16266
> URL: https://issues.apache.org/jira/browse/HDFS-16266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> In our production environment, we occasionally encounter a problem where a 
> user submits an abnormal computation task that causes a sudden flood of 
> requests; the queueTime and processingTime of the NameNode then rise very 
> high, creating a large backlog of tasks.
> We usually locate and kill the specific Spark, Flink, or MapReduce tasks 
> based on metrics and audit logs. Currently, the IP and UGI are recorded in 
> the audit logs, but there is no port information, so it is sometimes 
> difficult to locate the specific process. Therefore, I propose that we add 
> port information to the audit log, so that we can easily track the upstream 
> process.
> Some projects, such as HBase and Alluxio, already include port information 
> in their audit logs. I think it is also necessary to add port information 
> to the HDFS audit log.






[jira] [Commented] (HDFS-16275) [HDFS] Enable considerLoad for localWrite

2021-10-25 Thread Janus Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433687#comment-17433687
 ] 

Janus Chow commented on HDFS-16275:
---

[~ayushtkn] Thank you for your quick explanation.

In fact, I was curious and confused about the default "false" here. I had 
thought considerLoad was the same kind of check as the existing conditions in 
`isGoodDatanode`, such as excluding stale or slow nodes, which make no 
distinction between local and non-local targets.

For the configuration, since considerLoad comes from the config 
"dfs.namenode.replication.considerLoad", would 
"dfs.namenode.replication.locality.considerLoad" be an option?
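For concreteness, a sketch of the existing key next to the suggested one; note that the locality-specific key is only a proposal in this thread, not a shipped config:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

Configuration conf = new HdfsConfiguration();
// Existing behaviour: consider datanode load (xceiverCount) when choosing
// remote targets.
conf.setBoolean("dfs.namenode.replication.considerLoad", true);
// Proposed in this thread (hypothetical key): apply the same load check to
// the local-write fast path as well.
conf.setBoolean("dfs.namenode.replication.locality.considerLoad", true);
{code}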

> [HDFS] Enable considerLoad for localWrite
> -
>
> Key: HDFS-16275
> URL: https://issues.apache.org/jira/browse/HDFS-16275
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Janus Chow
>Assignee: Janus Chow
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, when a client is on the same machine as a datanode, it will try 
> to write to the local machine regardless of the load of that datanode, i.e. 
> its xceiverCount.
> In our production cluster, the datanode and the NodeManager run on the same 
> server, so when heavy jobs are running on a labeled queue, the corresponding 
> datanodes have higher xceiverCounts than the other datanodes. When other 
> clients then try to write, the exception "could only be replicated to 0 
> nodes" is thrown.
> This ticket is to enable considerLoad to avoid hot local writes.






[jira] [Commented] (HDFS-13514) BenchmarkThroughput.readLocalFile hangs with misconfigured BUFFER_SIZE

2021-10-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433680#comment-17433680
 ] 

Ayush Saxena commented on HDFS-13514:
-

Not sure which PR is the latest. Can you close the duplicate Jiras and PRs, 
and keep only the active PR open? If possible, please extend a test as well.

> BenchmarkThroughput.readLocalFile hangs with misconfigured BUFFER_SIZE
> --
>
> Key: HDFS-13514
> URL: https://issues.apache.org/jira/browse/HDFS-13514
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 2.5.0
>Reporter: John Doe
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When BUFFER_SIZE is configured to be 0, the while loop in the 
> BenchmarkThroughput.readLocalFile function hangs endlessly.
> This is because when the buffer length (i.e., BUFFER_SIZE) is 0, 
> in.read(data) always returns 0 instead of -1, so size never goes negative 
> and the loop never exits.
> Here is the code snippet.
> {code:java}
>   // when dfsthroughput.buffer.size is configured to be 0, data below
>   // becomes a zero-length array
>   BUFFER_SIZE = conf.getInt("dfsthroughput.buffer.size", 4 * 1024);
>   private void readLocalFile(Path path, String name, Configuration conf)
>       throws IOException {
>     System.out.print("Reading " + name);
>     resetMeasurements();
>     InputStream in = new FileInputStream(new File(path.toString()));
>     byte[] data = new byte[BUFFER_SIZE];
>     long size = 0;
>     while (size >= 0) {
>       size = in.read(data); // returns 0 forever for a zero-length buffer
>     }
>     in.close();
>     printMeasurements();
>   }
> {code}
> A similar case is HDFS-13513.
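One possible guard, sketched here as an illustration rather than the actual patch, is to reject a non-positive buffer size up front:

{code:java}
// Illustrative guard, not the committed fix: InputStream.read(byte[0])
// returns 0 (never -1), so a zero-size buffer would loop forever.
int bufferSize = conf.getInt("dfsthroughput.buffer.size", 4 * 1024);
if (bufferSize <= 0) {
  throw new IllegalArgumentException(
      "dfsthroughput.buffer.size must be > 0, got " + bufferSize);
}
{code}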






[jira] [Commented] (HDFS-16275) [HDFS] Enable considerLoad for localWrite

2021-10-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433678#comment-17433678
 ] 

Ayush Saxena commented on HDFS-16275:
-

Okay. By any chance, have you explored AvailableSpaceBlockPlacementPolicy? 
That has an optimisation available for the local node as well, in the form of 
the config 
{{dfs.namenode.available-space-block-placement-policy.balance-local-node}}.

I haven't gone through the code, but the proposed change should be 
configurable and turned off by default, for backward compatibility.
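For reference, a hedged sketch of opting into that policy together with the local-node knob; the keys are as quoted above, and the replicator classname key is assumed to be dfs.block.replicator.classname:

{code:java}
Configuration conf = new HdfsConfiguration();
// Swap in the space-aware placement policy...
conf.set("dfs.block.replicator.classname",
    "org.apache.hadoop.hdfs.server.blockmanagement."
        + "AvailableSpaceBlockPlacementPolicy");
// ...and also let it balance the local node instead of always choosing it.
conf.setBoolean(
    "dfs.namenode.available-space-block-placement-policy.balance-local-node",
    true);
{code}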

> [HDFS] Enable considerLoad for localWrite
> -
>
> Key: HDFS-16275
> URL: https://issues.apache.org/jira/browse/HDFS-16275
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Janus Chow
>Assignee: Janus Chow
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, when a client is on the same machine as a datanode, it will try 
> to write to the local machine regardless of the load of that datanode, i.e. 
> its xceiverCount.
> In our production cluster, the datanode and the NodeManager run on the same 
> server, so when heavy jobs are running on a labeled queue, the corresponding 
> datanodes have higher xceiverCounts than the other datanodes. When other 
> clients then try to write, the exception "could only be replicated to 0 
> nodes" is thrown.
> This ticket is to enable considerLoad to avoid hot local writes.






[jira] [Commented] (HDFS-16275) [HDFS] Enable considerLoad for localWrite

2021-10-25 Thread Janus Chow (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433671#comment-17433671
 ] 

Janus Chow commented on HDFS-16275:
---

[~ayushtkn] Thanks for the comment.

I think we do want data locality, just not on a node that is too hot. IMHO, 
when the node is not too hot, locality should boost performance. The change 
from the default "false" is mainly for cooling the node down.

> [HDFS] Enable considerLoad for localWrite
> -
>
> Key: HDFS-16275
> URL: https://issues.apache.org/jira/browse/HDFS-16275
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Janus Chow
>Assignee: Janus Chow
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, when a client is on the same machine as a datanode, it will try 
> to write to the local machine regardless of the load of that datanode, i.e. 
> its xceiverCount.
> In our production cluster, the datanode and the NodeManager run on the same 
> server, so when heavy jobs are running on a labeled queue, the corresponding 
> datanodes have higher xceiverCounts than the other datanodes. When other 
> clients then try to write, the exception "could only be replicated to 0 
> nodes" is thrown.
> This ticket is to enable considerLoad to avoid hot local writes.






[jira] [Work logged] (HDFS-16266) Add remote port information to HDFS audit log

2021-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16266?focusedWorklogId=669400&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-669400
 ]

ASF GitHub Bot logged work on HDFS-16266:
-

Author: ASF GitHub Bot
Created on: 25/Oct/21 08:11
Start Date: 25/Oct/21 08:11
Worklog Time Spent: 10m 
  Work Description: jojochuang commented on pull request #3538:
URL: https://github.com/apache/hadoop/pull/3538#issuecomment-950643667


   The API is declared Public, Evolving. If it stays in Hadoop 3.4.0, I am 
fine with it.
   
   We used to have an audit logger (Cloudera Navigator) that extended the 
AuditLogger interface, but we've moved away from that.
   
   Performance:
   It would have a slight performance penalty, because every audit log op 
will always convert the InetAddress to a string, regardless of whether the 
audit logger is off (audit log level = debug, or 
dfs.namenode.audit.log.debug.cmdlist has the excluded op). It's probably 
acceptable, since the audit is logged outside of the namenode lock.
   
   CallerContext:
   The caller context is probably a better option when you want to do a 
fine-grained post-mortem anyway. Maybe we can modify the caller context to 
attach the remote port so that it doesn't break API compatibility. Just a 
thought.




Issue Time Tracking
---

Worklog Id: (was: 669400)
Time Spent: 3h 50m  (was: 3h 40m)

> Add remote port information to HDFS audit log
> -
>
> Key: HDFS-16266
> URL: https://issues.apache.org/jira/browse/HDFS-16266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> In our production environment, we occasionally encounter a problem where a 
> user submits an abnormal computation task that causes a sudden flood of 
> requests; the queueTime and processingTime of the NameNode then rise very 
> high, creating a large backlog of tasks.
> We usually locate and kill the specific Spark, Flink, or MapReduce tasks 
> based on metrics and audit logs. Currently, the IP and UGI are recorded in 
> the audit logs, but there is no port information, so it is sometimes 
> difficult to locate the specific process. Therefore, I propose that we add 
> port information to the audit log, so that we can easily track the upstream 
> process.
> Some projects, such as HBase and Alluxio, already include port information 
> in their audit logs. I think it is also necessary to add port information 
> to the HDFS audit log.






[jira] [Commented] (HDFS-16266) Add remote port information to HDFS audit log

2021-10-25 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433608#comment-17433608
 ] 

Wei-Chiu Chuang commented on HDFS-16266:


Thanks for reporting the issue and submitting the PR.

I have a few design/architectural comments beyond the code attached in the PR.

1. For the abusive users, have you tried enabling FairCallQueue to punish 
those bad users (see the sketch after this list)? If so, did you find it 
insufficient to combat the resource usage problem? Is it because the users 
issued recursive commands like 'du' (contentSummary) calls?
2. The current audit logger supports CallerContext. Applications (e.g. Hive) 
that support this semantic can attach a signature that is then passed from 
the application to the namenode. IMO this is a more explicit way to do a 
post-mortem.
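For readers unfamiliar with it, FairCallQueue is enabled per RPC port; a minimal sketch, assuming 8020 is the NameNode's client RPC port:

{code:java}
// Enable FairCallQueue with the decaying scheduler on the NameNode RPC port.
// The ipc.<port>.* prefix must match the actual client RPC port (8020 here).
conf.set("ipc.8020.callqueue.impl", "org.apache.hadoop.ipc.FairCallQueue");
conf.set("ipc.8020.scheduler.impl", "org.apache.hadoop.ipc.DecayRpcScheduler");
{code}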

> Add remote port information to HDFS audit log
> -
>
> Key: HDFS-16266
> URL: https://issues.apache.org/jira/browse/HDFS-16266
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> In our production environment, we occasionally encounter a problem where a 
> user submits an abnormal computation task that causes a sudden flood of 
> requests; the queueTime and processingTime of the NameNode then rise very 
> high, creating a large backlog of tasks.
> We usually locate and kill the specific Spark, Flink, or MapReduce tasks 
> based on metrics and audit logs. Currently, the IP and UGI are recorded in 
> the audit logs, but there is no port information, so it is sometimes 
> difficult to locate the specific process. Therefore, I propose that we add 
> port information to the audit log, so that we can easily track the upstream 
> process.
> Some projects, such as HBase and Alluxio, already include port information 
> in their audit logs. I think it is also necessary to add port information 
> to the HDFS audit log.


