[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688874#comment-17688874 ]

ASF GitHub Bot commented on HDFS-16922:
---------------------------------------

hfutatzhanghb commented on PR #5398:
URL: https://github.com/apache/hadoop/pull/5398#issuecomment-1430888618

> Requires a UT which can reproduce the said issue

Hi @ayushtkn, I have updated the description of this issue. Please take a look, thanks a lot.

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing
> blocks when cluster is busy.
>
> Key: HDFS-16922
> URL: https://issues.apache.org/jira/browse/HDFS-16922
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: ZhangHB
> Priority: Major
> Labels: pull-request-available
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could
> lead to missing blocks when datanodes in the pipeline are I/O busy.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-16921) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZhangHB resolved HDFS-16921.
----------------------------
Resolution: Duplicate

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing
> blocks when cluster is busy.
>
> Key: HDFS-16921
> URL: https://issues.apache.org/jira/browse/HDFS-16921
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.3.4
> Reporter: ZhangHB
> Priority: Critical
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could
> lead to missing blocks when datanodes in the pipeline are I/O busy.
[jira] [Resolved] (HDFS-16920) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZhangHB resolved HDFS-16920.
----------------------------
Resolution: Duplicate

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing
> blocks when cluster is busy.
>
> Key: HDFS-16920
> URL: https://issues.apache.org/jira/browse/HDFS-16920
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.3.4
> Reporter: ZhangHB
> Priority: Critical
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could
> lead to missing blocks when datanodes in the pipeline are I/O busy.
[jira] [Resolved] (HDFS-16919) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZhangHB resolved HDFS-16919.
----------------------------
Resolution: Duplicate

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing
> blocks when cluster is busy.
>
> Key: HDFS-16919
> URL: https://issues.apache.org/jira/browse/HDFS-16919
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 3.3.4
> Reporter: ZhangHB
> Priority: Critical
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could
> lead to missing blocks when datanodes in the pipeline are I/O busy.
[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688863#comment-17688863 ]

ASF GitHub Bot commented on HDFS-16922:
---------------------------------------

hfutatzhanghb commented on PR #5398:
URL: https://github.com/apache/hadoop/pull/5398#issuecomment-1430844228

Hello @ayushtkn. OK, I will try to construct a UT to reproduce this issue, and I will try to describe the issue on this page.

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing
> blocks when cluster is busy.
>
> Key: HDFS-16922
> URL: https://issues.apache.org/jira/browse/HDFS-16922
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: ZhangHB
> Priority: Major
> Labels: pull-request-available
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could
> lead to missing blocks when datanodes in the pipeline are I/O busy.
[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688861#comment-17688861 ]

ASF GitHub Bot commented on HDFS-16922:
---------------------------------------

hfutatzhanghb opened a new pull request, #5398:
URL: https://github.com/apache/hadoop/pull/5398

The current logic of the IncrementalBlockReportManager#addRDBI method could lead to missing blocks when datanodes in the pipeline are I/O busy.

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing
> blocks when cluster is busy.
>
> Key: HDFS-16922
> URL: https://issues.apache.org/jira/browse/HDFS-16922
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: ZhangHB
> Priority: Major
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could
> lead to missing blocks when datanodes in the pipeline are I/O busy.
[jira] [Updated] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-16922:
----------------------------------
Labels: pull-request-available (was: )

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing
> blocks when cluster is busy.
>
> Key: HDFS-16922
> URL: https://issues.apache.org/jira/browse/HDFS-16922
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: ZhangHB
> Priority: Major
> Labels: pull-request-available
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could
> lead to missing blocks when datanodes in the pipeline are I/O busy.
[jira] [Commented] (HDFS-16914) Add some logs for updateBlockForPipeline RPC.
[ https://issues.apache.org/jira/browse/HDFS-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688859#comment-17688859 ]

ASF GitHub Bot commented on HDFS-16914:
---------------------------------------

tomscut commented on code in PR #5381:
URL: https://github.com/apache/hadoop/pull/5381#discussion_r1106710999

hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
@@ -5943,6 +5943,8 @@ LocatedBlock bumpBlockGenerationStamp(ExtendedBlock block,
     }
     // Ensure we record the new generation stamp
     getEditLog().logSync();
+    LOG.info("bumpBlockGenerationStamp({}, client={}) success",
+        locatedBlock.getBlock(), clientName);

Review Comment:
> @tomscut We record block information, will there be a lot of logs? Is it changed to debug?

I was worried about this before, but through offline discussion and testing with @hfutatzhanghb, we found that the logs were not too frequent.

> Add some logs for updateBlockForPipeline RPC.
> ---------------------------------------------
>
> Key: HDFS-16914
> URL: https://issues.apache.org/jira/browse/HDFS-16914
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.3.4
> Reporter: ZhangHB
> Assignee: ZhangHB
> Priority: Minor
> Labels: pull-request-available
>
> Recently, we received a phone alarm about missing blocks. We found logs like
> the following in one of the datanodes where the block was placed:
>
> {code:java}
> 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231832415 src:
> /clientAddress:44638 dest: /localAddress:50010 of size 45733720
> 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231826462 src:
> /upStreamDatanode:60316 dest: /localAddress:50010 of size 45733720
> {code}
> The datanode received the same block with different generation stamps because
> of a socket timeout exception. blk_1305044966_231826462 was received from the
> upstream datanode in a pipeline that has two datanodes;
> blk_1305044966_231832415 was received directly from the client.
>
> We searched all log info about blk_1305044966 in the namenode and the three
> datanodes of the original pipeline, but could not find any helpful message
> about generation stamp 231826462. After diving into the source code, it turns
> out the stamp is assigned in NameNodeRpcServer#updateBlockForPipeline, which
> is invoked from DataStreamer#setupPipelineInternal. The updateBlockForPipeline
> RPC does not log anything, so we should add some logs to this RPC.
[jira] [Created] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
ZhangHB created HDFS-16922:
------------------------------
Summary: The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
Key: HDFS-16922
URL: https://issues.apache.org/jira/browse/HDFS-16922
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Reporter: ZhangHB

The current logic of the IncrementalBlockReportManager#addRDBI method could lead to missing blocks when datanodes in the pipeline are I/O busy.
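The hazard reported here can be illustrated with a simplified, hypothetical model (the class and method names below are illustrative only, not the actual IncrementalBlockReportManager code): if pending incremental block reports are keyed by block ID, queuing a newer report can silently replace an older report that was never sent, e.g. when a busy datanode delays the send long enough for pipeline recovery to bump the generation stamp.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-storage pending-IBR bookkeeping. This is NOT
// the real Hadoop code; it only shows how keying pending reports by block ID
// lets a newer entry displace an unsent older one.
public class PendingIbrModel {
    // blockId -> generation stamp of the pending (not yet sent) report
    private final Map<Long, Long> pending = new HashMap<>();

    // Queue a report; returns the generation stamp it displaced, or null.
    public Long add(long blockId, long genStamp) {
        return pending.put(blockId, genStamp);
    }

    public int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        PendingIbrModel m = new PendingIbrModel();
        m.add(1305044966L, 231826462L);  // queued, but send delayed (busy DN)
        Long dropped = m.add(1305044966L, 231832415L); // recovery bumps stamp
        // The older report was replaced before it was ever sent.
        System.out.println("dropped=" + dropped + " pending=" + m.pendingCount());
    }
}
```

If the namenode never learns about the replica under either stamp, the block can end up reported missing, which matches the symptom described in the issue title.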
[jira] [Created] (HDFS-16921) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
ZhangHB created HDFS-16921:
------------------------------
Summary: The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
Key: HDFS-16921
URL: https://issues.apache.org/jira/browse/HDFS-16921
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 3.3.4
Reporter: ZhangHB

The current logic of the IncrementalBlockReportManager#addRDBI method could lead to missing blocks when datanodes in the pipeline are I/O busy.
[jira] [Created] (HDFS-16920) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
ZhangHB created HDFS-16920:
------------------------------
Summary: The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
Key: HDFS-16920
URL: https://issues.apache.org/jira/browse/HDFS-16920
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 3.3.4
Reporter: ZhangHB

The current logic of the IncrementalBlockReportManager#addRDBI method could lead to missing blocks when datanodes in the pipeline are I/O busy.
[jira] [Created] (HDFS-16919) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
ZhangHB created HDFS-16919:
------------------------------
Summary: The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
Key: HDFS-16919
URL: https://issues.apache.org/jira/browse/HDFS-16919
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 3.3.4
Reporter: ZhangHB

The current logic of the IncrementalBlockReportManager#addRDBI method could lead to missing blocks when datanodes in the pipeline are I/O busy.
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688857#comment-17688857 ]

ASF GitHub Bot commented on HDFS-16918:
---------------------------------------

virajjasani commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430828452

> If the datanode is connected to observer namenode, it can serve requests, why we need to shutdown

The observer namenode is a different case. I was actually thinking about making this include the observer namenode too, i.e. if the datanode has not received a heartbeat from the observer or active namenode in the last e.g. 30s or so, then it should shut down. This is an option, no issues with it.

> Even if it is connected to standby, a failover happens and it will be in good shape, else if you restart a bunch of datanodes, the new namenode will be flooded by block reports and just increasing problems.

This problem would occur only if we select a fairly low value. The recommendation for this config value is high enough to include the extra time needed for a namenode failover.

> If something gets messed up with Active namenode, you shutdown all, the BR are already heavy, you forced all other namenodes to handle them again, making failover more difficult. and if it is some faulty datanodes which lost connection, you didn't get that alarmed, and all Standby and Observers will keep on getting flooded by BRs, so in case Active NN literally dies and tries to failover to any of the Namenode which these Datanodes were connected, will be fed with unnecessary loads of BlockReports. (BR has an option of initial delay as well, it isn't like all bombard at once and you are sorted in 5-10 mins)

The moment the active namenode becomes messy, or dies, is exactly when the availability of the HDFS service can be impacted. So either the observer namenode takes care of read requests in the meantime, or the failover needs to happen. If neither of those happens, the datanode is not really useful by staying in the cluster for a longer duration. Say the namenode goes bad and failover does take time; the new active one is going to take time processing BRs anyway, right?

> If something got messed with the datanode, that is why it isn't able to connect to Active. If something is in Memory not persisted to disk, or some JMX parameter or N/W parameters which can be used to figure out things gets lost.

Do you mean an hsync vs hflush kind of thing for in-progress files? Is that not already taken care of?

> That is the reason most cluster administrator in not so cool situations, show XYZ datanode is unhealthy or not, if in some case they don't it should be handled over there.

A response from cluster admin applications would take time. Why not get auto-healed by the datanode itself? Also, this change is not going to terminate the datanode abruptly; it is going to shut down properly.

> In case of shared datanodes in a federated setup, say it is connected to Active for one Namespace and has completely lost touch with another, then? Restart to get both working? Don't restart so that at least one stays working? Both are correct in their own ways and situations and the datanode shouldn't be in a state to decide its fate for such reasons.

IMO any namespace that is not connected to the active namenode cannot serve requests from the active namenode, and hence is not in a good state. I got your point, but shouldn't the health of a datanode in a federated setup be determined by whether all BPs are connected to an active namenode? Is that not the real factor determining the health of the datanode?

> Making anything configurable doesn't justify having it in. if we are letting any user to use this via any config as well, then we should be sure enough it is necessary and good thing to do, we can not say ohh you configured it, now it is your problem...

I am not making the claim only because this is a configurable feature; it is reasonable to determine the best course of action for a given situation. The only recommendation I have is: the user should be able to let the datanode decide whether it should shut down gracefully when it has not heard anything from the active or observer namenode for the past x sec (50/60s or so).

I have tried my best to answer the questions above. Please also take a look at the Jira/PR description where this idea was taken from. We have seen issues with specific infra where, until manually shutting down datanodes, we saw no hope of improving availability; this has happened multiple times. Please keep in mind that cluster administrators in cloud-native environments do not have access to JMX metrics due to security constraints. Really appreciate all your points and suggestions Ayush, please take a look again.
[jira] [Updated] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-16917:
----------------------------------
Labels: pull-request-available (was: )

> Add transfer rate quantile metrics for DataNode reads
> -----------------------------------------------------
>
> Key: HDFS-16917
> URL: https://issues.apache.org/jira/browse/HDFS-16917
> Project: Hadoop HDFS
> Issue Type: Task
> Components: datanode
> Reporter: Ravindra Dingankar
> Priority: Minor
> Labels: pull-request-available
>
> Currently we have the following metrics for datanode reads.
> |BytesRead|Total number of bytes read from DataNode|
> |BlocksRead|Total number of blocks read from DataNode|
> |TotalReadTime|Total number of milliseconds spent on read operations|
> We would like to add a new quantile metric calculating the distribution of
> the data transfer rate for datanode reads.
[jira] [Commented] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688856#comment-17688856 ]

ASF GitHub Bot commented on HDFS-16917:
---------------------------------------

rdingankar opened a new pull request, #5397:
URL: https://github.com/apache/hadoop/pull/5397

### Description of PR
The transfer rate metric for datanode reads will be calculated as the rate at which bytes are read (bytes per ms). With quantiles we will get a distribution of this rate, which will be helpful in identifying slow datanodes.

### How was this patch tested?

### For code changes:
- [Y] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [NA] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [NA] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [NA] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

> Add transfer rate quantile metrics for DataNode reads
> -----------------------------------------------------
>
> Key: HDFS-16917
> URL: https://issues.apache.org/jira/browse/HDFS-16917
> Project: Hadoop HDFS
> Issue Type: Task
> Components: datanode
> Reporter: Ravindra Dingankar
> Priority: Minor
>
> Currently we have the following metrics for datanode reads.
> |BytesRead|Total number of bytes read from DataNode|
> |BlocksRead|Total number of blocks read from DataNode|
> |TotalReadTime|Total number of milliseconds spent on read operations|
> We would like to add a new quantile metric calculating the distribution of
> the data transfer rate for datanode reads.
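The proposed metric (bytes per ms, summarized as quantiles) can be sketched outside of Hadoop's metrics2 machinery. The class below is an illustration only, not the PR's actual implementation; the real patch would presumably register a quantile metric in the datanode's metrics system rather than sort samples by hand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of a read-transfer-rate metric: for each read,
// rate = bytes / elapsed millis; the distribution of rates is then
// summarized with a simple nearest-rank quantile. Hypothetical names.
public class TransferRateSketch {
    private final List<Double> samples = new ArrayList<>();

    // Record one read: bytes transferred over elapsedMs milliseconds.
    public void record(long bytes, long elapsedMs) {
        if (elapsedMs > 0) {
            samples.add((double) bytes / elapsedMs); // bytes per ms
        }
    }

    // Nearest-rank quantile over recorded samples, q in (0, 1].
    public double quantile(double q) {
        List<Double> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(q * sorted.size()) - 1;
        return sorted.get(Math.max(0, idx));
    }

    public static void main(String[] args) {
        TransferRateSketch s = new TransferRateSketch();
        s.record(45_733_720L, 500);   // a fast read
        s.record(45_733_720L, 5_000); // a 10x slower read
        System.out.println("p50 rate (bytes/ms): " + s.quantile(0.5));
    }
}
```

A low tail (e.g. p10) of this distribution is what would flag slow datanodes, which is the stated goal of the Jira.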
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688840#comment-17688840 ]

ASF GitHub Bot commented on HDFS-16918:
---------------------------------------

ayushtkn commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430800093

By Admin I mean cluster administrator services; they can keep track of datanodes and decide what needs to be done to a datanode. If those services can shoot a restart when the datanode is shut down, they can track in which situations the datanode needs to be restarted. Not checking the code, but comments:

- If the datanode is connected to the observer namenode, it can serve requests, so why do we need to shut down?
- Even if it is connected to a standby, a failover happens and it will be in good shape; whereas if you restart a bunch of datanodes, the new namenode will be flooded by block reports, just increasing problems.
- If something gets messed up with the active namenode and you shut down all datanodes, the BRs are already heavy; you force all other namenodes to handle them again, making failover more difficult. And if it is some faulty datanodes that lost connection, you don't get alarmed about that, and all standbys and observers keep getting flooded by BRs; so if the active NN literally dies and tries to fail over to any of the namenodes these datanodes were connected to, that namenode will be fed unnecessary loads of block reports. (BR has an option of initial delay as well; it isn't like all bombard at once and you are sorted in 5-10 mins.)
- If something got messed up with the datanode, that is why it isn't able to connect to the active. If something is in memory and not persisted to disk, or some JMX or N/W parameters which could be used to figure things out get lost.
- That is the reason most cluster administrators, in not-so-cool situations, show whether XYZ datanode is unhealthy or not; if in some case they don't, it should be handled over there.
- In case of shared datanodes in a federated setup, say it is connected to the active for one namespace and has completely lost touch with another, then what? Restart to get both working? Don't restart so that at least one stays working? Both are correct in their own ways and situations, and the datanode shouldn't be in a state to decide its fate for such reasons.

We do terminate the namenode in a bunch of conditions, for sure. I don't want to get deep into those reasons; it is more or less a preventive measure to terminate the namenode if something serious has happened. By the architecture of HDFS itself, this doesn't look very valid for HDFS.

PS. Making anything configurable doesn't justify having it in. If we are letting any user use this via a config, then we should be sure enough that it is a necessary and good thing to do; we cannot say "ohh, you configured it, now it is your problem"... I would say it is just pulling those cluster-administrator responsibilities into the datanode, like what Cloudera Manager or maybe Ambari should do. Not in favour of this...

> Optionally shut down datanode if it does not stay connected to active namenode
> ------------------------------------------------------------------------------
>
> Key: HDFS-16918
> URL: https://issues.apache.org/jira/browse/HDFS-16918
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
>
> While deploying HDFS on an Envoy proxy setup, depending on the socket timeout
> configured at Envoy, network connection issues or packet loss can be
> observed. All of the Envoys basically form a transparent communication mesh
> in which each app can send and receive packets to and from localhost and is
> unaware of the network topology.
> The primary purpose of Envoy is to make the network transparent to
> applications, in order to identify network issues reliably. However,
> sometimes such a proxy-based setup can result in socket connection issues
> between datanode and namenode.
> Many deployment frameworks provide auto-start functionality when any of the
> hadoop daemons are stopped. If a given datanode does not stay connected to
> the active namenode in the cluster, i.e. does not receive a heartbeat
> response in time from the active namenode (even though the active namenode is
> not terminated), it is not of much use. We should be able to provide
> configurable behavior such that if a given datanode cannot receive a
> heartbeat response from the active namenode within a configurable time
> duration, it terminates itself to avoid impacting the availability SLA. This
> is specifically helpful when the underlying deployment or observability
> framework (e.g. K8s) can start the datanode automatically upon its shutdown
> (unless it is being restarted as part of a rolling upgrade).
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688827#comment-17688827 ]

ASF GitHub Bot commented on HDFS-16918:
---------------------------------------

virajjasani commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430775234

For this change, the entire behavior is optional:

```
<property>
  <name>dfs.datanode.health.activennconnect.timeout</name>
  <value>0</value>
  <description>
    If the value is greater than 0, each datanode would try to determine if it
    is healthy, i.e. all block pools are correctly initialized and able to
    heartbeat to the active namenode. At any given time, if the datanode loses
    its connection to the active namenode for the duration of milliseconds
    represented by the value of this config, it will attempt to shut itself
    down. If the value is 0, the datanode does not perform any such checks.
  </description>
</property>
```

Without a non-default value for this config, this behavior does not take any effect.

> Optionally shut down datanode if it does not stay connected to active namenode
> ------------------------------------------------------------------------------
>
> Key: HDFS-16918
> URL: https://issues.apache.org/jira/browse/HDFS-16918
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
>
> While deploying HDFS on an Envoy proxy setup, depending on the socket timeout
> configured at Envoy, network connection issues or packet loss can be
> observed. All of the Envoys basically form a transparent communication mesh
> in which each app can send and receive packets to and from localhost and is
> unaware of the network topology.
> The primary purpose of Envoy is to make the network transparent to
> applications, in order to identify network issues reliably. However,
> sometimes such a proxy-based setup can result in socket connection issues
> between datanode and namenode.
> Many deployment frameworks provide auto-start functionality when any of the
> hadoop daemons are stopped. If a given datanode does not stay connected to
> the active namenode in the cluster, i.e. does not receive a heartbeat
> response in time from the active namenode (even though the active namenode is
> not terminated), it is not of much use. We should be able to provide
> configurable behavior such that if a given datanode cannot receive a
> heartbeat response from the active namenode within a configurable time
> duration, it terminates itself to avoid impacting the availability SLA. This
> is specifically helpful when the underlying deployment or observability
> framework (e.g. K8s) can start the datanode automatically upon its shutdown
> (unless it is being restarted as part of a rolling upgrade) and help the
> newly brought up datanode (in case of K8s, a new pod with dynamically
> changing nodes) establish new socket connections to the active and standby
> namenodes. This should be an opt-in behavior and not the default one.
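The opt-in timeout check described by this config can be sketched as a small watchdog (the class and method names below are hypothetical and not taken from the actual patch): record the time of the last heartbeat response from the active namenode, and ask for a graceful shutdown once the configured window elapses, with 0 disabling the check entirely.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the opt-in watchdog discussed in HDFS-16918:
// if no heartbeat response from an active namenode arrives within the
// configured window, the datanode initiates its own graceful shutdown.
public class ActiveNnWatchdog {
    private final long timeoutMs; // 0 disables the check (the default)
    private final AtomicLong lastActiveHeartbeatMs;

    public ActiveNnWatchdog(long timeoutMs) {
        this.timeoutMs = timeoutMs;
        this.lastActiveHeartbeatMs = new AtomicLong(System.currentTimeMillis());
    }

    // Called whenever a heartbeat response from the ACTIVE namenode arrives.
    public void onActiveHeartbeat(long nowMs) {
        lastActiveHeartbeatMs.set(nowMs);
    }

    // Polled periodically; true means the datanode should shut itself down.
    public boolean shouldShutdown(long nowMs) {
        return timeoutMs > 0 && nowMs - lastActiveHeartbeatMs.get() > timeoutMs;
    }
}
```

A real implementation would also need to handle the federated case debated above (one watchdog per block-pool service, or a policy over all of them), which is exactly the design question the reviewers disagree on.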
[jira] [Commented] (HDFS-16896) HDFS Client hedged read has increased failure rate than without hedged read
[ https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688828#comment-17688828 ]

ASF GitHub Bot commented on HDFS-16896:
---------------------------------------

hadoop-yetus commented on PR #5322:
URL: https://github.com/apache/hadoop/pull/5322#issuecomment-1430775324

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 46s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 15m 25s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 31m 2s | | trunk passed |
| +1 :green_heart: | compile | 6m 12s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 5m 46s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 21s | | trunk passed |
| +1 :green_heart: | mvnsite | 2m 31s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 51s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 2m 18s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 5m 55s | | trunk passed |
| +1 :green_heart: | shadedclient | 25m 40s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 41s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 2m 21s | | the patch passed |
| +1 :green_heart: | compile | 5m 52s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 5m 52s | | the patch passed |
| +1 :green_heart: | compile | 5m 37s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 5m 37s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 1m 4s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/4/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 1 new + 31 unchanged - 0 fixed = 32 total (was 31) |
| +1 :green_heart: | mvnsite | 2m 13s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 27s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 2m 3s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 5m 54s | | the patch passed |
| +1 :green_heart: | shadedclient | 25m 56s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 2m 26s | | hadoop-hdfs-client in the patch passed. |
| -1 :x: | unit | 206m 23s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 51s | | The patch does not generate ASF License warnings. |
| | | 360m 1s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogs |
| | hadoop.hdfs.TestPread |
| | hadoop.hdfs.server.namenode.TestAuditLogger |
| | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport |
| | hadoop.hdfs.server.namenode.TestFsck |
| | hadoop.hdfs.server.datanode.TestDirectoryScanner |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5322 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux c1560d884f2a 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688826#comment-17688826 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430773119 In a large fleet of datanodes, any datanode that does not stay connected to the active namenode due to a connectivity issue can choose to shut itself down rather than impact availability; that is a wise thing for the datanode to do, not mandatorily but as an opt-in behavior. While an admin can do this manually, no human intervention is fast enough to take action within a matter of seconds in a large-scale cluster. > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. 
We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon its shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
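The opt-in shutdown behavior described above (terminate the datanode when no heartbeat response has arrived from the active namenode within a configurable window) could be sketched as a small monitor. This is a hypothetical illustration only; the class and method names are not the actual HDFS-16918 patch.

```java
// Hypothetical sketch of the opt-in behavior discussed above: track the time
// of the last heartbeat response from the active namenode and signal shutdown
// once a configurable silence window is exceeded. Illustrative names only.
public class HeartbeatExitMonitor {
    private final long maxSilenceMs;   // configurable tolerance window
    private volatile long lastAckMs;   // time of last heartbeat response

    public HeartbeatExitMonitor(long maxSilenceMs, long nowMs) {
        this.maxSilenceMs = maxSilenceMs;
        this.lastAckMs = nowMs;
    }

    /** Record a heartbeat response from the active namenode. */
    public void recordAck(long nowMs) {
        lastAckMs = nowMs;
    }

    /** True when the datanode should terminate itself (opt-in only). */
    public boolean shouldExit(long nowMs) {
        return nowMs - lastAckMs > maxSilenceMs;
    }
}
```

A deployment framework such as Kubernetes would then restart the process automatically, letting the new instance establish fresh connections to the active and standby namenodes.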
[jira] [Updated] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-16918: -- Labels: pull-request-available (was: ) > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. 
K8S) can start up the > datanode automatically upon its shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688810#comment-17688810 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani opened a new pull request, #5396: URL: https://github.com/apache/hadoop/pull/5396 While deploying Hdfs on Envoy proxy setup, depending on the socket timeout configured at envoy, the network connection issues or packet loss could be observed. All of envoys basically form a transparent communication mesh in which each app can send and receive packets to and from localhost and is unaware of the network topology. The primary purpose of Envoy is to make the network transparent to applications, in order to identify network issues reliably. However, sometimes such proxy based setup could result into socket connection issues b/ datanode and namenode. Many deployment frameworks provide auto-start functionality when any of the hadoop daemons are stopped. If a given datanode does not stay connected to active namenode in the cluster i.e. does not receive heartbeat response in time from active namenode (even though active namenode is not terminated), it would not be much useful. We should be able to provide configurable behavior such that if a given datanode cannot receive heartbeat response from active namenode in configurable time duration, it should terminate itself to avoid impacting the availability SLA. This is specifically helpful when the underlying deployment or observability framework (e.g. K8S) can start up the datanode automatically upon it's shutdown (unless it is being restarted as part of rolling upgrade) and help the newly brought up datanode (in case of k8s, a new pod with dynamically changing nodes) establish new socket connection to active and standby namenodes. This should be an opt-in behavior and not default one. 
> Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
Viraj Jasani created HDFS-16918: --- Summary: Optionally shut down datanode if it does not stay connected to active namenode Key: HDFS-16918 URL: https://issues.apache.org/jira/browse/HDFS-16918 Project: Hadoop HDFS Issue Type: New Feature Reporter: Viraj Jasani Assignee: Viraj Jasani While deploying Hdfs on Envoy proxy setup, depending on the socket timeout configured at envoy, the network connection issues or packet loss could be observed. All of envoys basically form a transparent communication mesh in which each app can send and receive packets to and from localhost and is unaware of the network topology. The primary purpose of Envoy is to make the network transparent to applications, in order to identify network issues reliably. However, sometimes such proxy based setup could result into socket connection issues b/ datanode and namenode. Many deployment frameworks provide auto-start functionality when any of the hadoop daemons are stopped. If a given datanode does not stay connected to active namenode in the cluster i.e. does not receive heartbeat response in time from active namenode (even though active namenode is not terminated), it would not be much useful. We should be able to provide configurable behavior such that if a given datanode cannot receive heartbeat response from active namenode in configurable time duration, it should terminate itself to avoid impacting the availability SLA. This is specifically helpful when the underlying deployment or observability framework (e.g. K8S) can start up the datanode automatically upon it's shutdown (unless it is being restarted as part of rolling upgrade) and help the newly brought up datanode (in case of k8s, a new pod with dynamically changing nodes) establish new socket connection to active and standby namenodes. This should be an opt-in behavior and not default one. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down
[ https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688787#comment-17688787 ] ASF GitHub Bot commented on HDFS-16761: --- ayushtkn commented on PR #5390: URL: https://github.com/apache/hadoop/pull/5390#issuecomment-1430673981 Was just going to hit the merge button, so I thought of trying to reproduce this issue locally as well. But it didn't repro: https://user-images.githubusercontent.com/25608848/218913802-65e390ed-4f9e-445b-846a-f9dd1e8542d7.png The dead datanode row has one less column red, but it isn't going to the startup page as the original ticket mentioned. Anyone with any pointers? The change in this PR is still relevant, I think, but I am curious whether the reported issue is correct, so we can change the title and then merge. > Namenode UI for Datanodes page not loading if any data node is down > --- > > Key: HDFS-16761 > URL: https://issues.apache.org/jira/browse/HDFS-16761 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.2 >Reporter: Krishna Reddy >Assignee: Zita Dombi >Priority: Major > Labels: pull-request-available > > Steps to reproduce: > - Install the hadoop components and add 3 datanodes > - Enable namenode HA > - Open Namenode UI and check datanode page > - check all datanodes will display > - Now make one datanode down > - wait for 10 minutes time as heartbeat expires > - Refresh namenode page and check > > Actual Result: It is showing error message "NameNode is still loading. > Redirecting to the Startup Progress page." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down
[ https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-16761: Fix Version/s: (was: 3.2.2) > Namenode UI for Datanodes page not loading if any data node is down > --- > > Key: HDFS-16761 > URL: https://issues.apache.org/jira/browse/HDFS-16761 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.2 >Reporter: Krishna Reddy >Assignee: Zita Dombi >Priority: Major > Labels: pull-request-available > > Steps to reproduce: > - Install the hadoop components and add 3 datanodes > - Enable namenode HA > - Open Namenode UI and check datanode page > - check all datanodes will display > - Now make one datanode down > - wait for 10 minutes time as heartbeat expires > - Refresh namenode page and check > > Actual Result: It is showing error message "NameNode is still loading. > Redirecting to the Startup Progress page." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
Ravindra Dingankar created HDFS-16917: - Summary: Add transfer rate quantile metrics for DataNode reads Key: HDFS-16917 URL: https://issues.apache.org/jira/browse/HDFS-16917 Project: Hadoop HDFS Issue Type: Task Components: datanode Reporter: Ravindra Dingankar Currently we have the following metrics for datanode reads:

| Metric | Description |
|---|---|
| BytesRead | Total number of bytes read from DataNode |
| BlocksRead | Total number of blocks read from DataNode |
| TotalReadTime | Total number of milliseconds spent on read operations |

We would like to add a new quantile metric calculating the distribution of data transfer rate for datanode reads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
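From the two existing counters listed above (bytes read and read time), a per-read transfer rate could be derived and fed into a quantile estimator such as Hadoop's MutableQuantiles. The helper below is a minimal sketch with hypothetical names, not the actual HDFS-16917 change.

```java
// Minimal sketch: derive a transfer rate (bytes/second) for one read from the
// counters the issue lists; each resulting sample would then be added to a
// quantile metric. Class and method names are hypothetical.
public class TransferRate {
    /** Bytes per second for a single read; 0 for zero-duration reads. */
    public static long bytesPerSecond(long bytesRead, long durationMs) {
        if (durationMs <= 0) {
            return 0L; // avoid division by zero on sub-millisecond reads
        }
        return bytesRead * 1000L / durationMs;
    }
}
```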
[jira] [Commented] (HDFS-16914) Add some logs for updateBlockForPipeline RPC.
[ https://issues.apache.org/jira/browse/HDFS-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688751#comment-17688751 ] ASF GitHub Bot commented on HDFS-16914: --- hfutatzhanghb commented on code in PR #5381: URL: https://github.com/apache/hadoop/pull/5381#discussion_r1106495693 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java: ## @@ -5943,6 +5943,8 @@ LocatedBlock bumpBlockGenerationStamp(ExtendedBlock block, } // Ensure we record the new generation stamp getEditLog().logSync(); +LOG.info("bumpBlockGenerationStamp({}, client={}) success", +locatedBlock.getBlock(), clientName); Review Comment: @slfan1989 hi, thanks for your review, the frequency of bumpBlockGenerationStamp logs is approximately equal to the frequency of updatePipeline. So, i think we can use INFO level here. > Add some logs for updateBlockForPipeline RPC. > - > > Key: HDFS-16914 > URL: https://issues.apache.org/jira/browse/HDFS-16914 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Affects Versions: 3.3.4 >Reporter: ZhangHB >Assignee: ZhangHB >Priority: Minor > Labels: pull-request-available > > Recently,we received an phone alarm about missing blocks. We found logs in > one datanode where the block was placed on like below: > > {code:java} > 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231832415 src: > /clientAddress:44638 dest: /localAddress:50010 of size 45733720 > 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231826462 src: > /upStreamDatanode:60316 dest: /localAddress:50010 of size 45733720 {code} > the datanode received the same block with different generation stamp because > of socket timeout exception. blk_1305044966_231826462 is received from > upstream datanode in pipeline which has two datanodes. 
> blk_1305044966_231832415 is received from client directly. > > We have searched all log info about blk_1305044966 in the namenode and the three > datanodes in the original pipeline, but we could not obtain any helpful message > about generation stamp 231826462. After diving into the source code, it > was assigned in NameNodeRpcServer#updateBlockForPipeline, which is invoked from > DataStreamer#setupPipelineInternal. The updateBlockForPipeline RPC does not > log any information, so I think we should add some logs to this RPC. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16914) Add some logs for updateBlockForPipeline RPC.
[ https://issues.apache.org/jira/browse/HDFS-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688749#comment-17688749 ] ASF GitHub Bot commented on HDFS-16914: --- slfan1989 commented on code in PR #5381: URL: https://github.com/apache/hadoop/pull/5381#discussion_r1106493967 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java: ## @@ -5943,6 +5943,8 @@ LocatedBlock bumpBlockGenerationStamp(ExtendedBlock block, } // Ensure we record the new generation stamp getEditLog().logSync(); +LOG.info("bumpBlockGenerationStamp({}, client={}) success", +locatedBlock.getBlock(), clientName); Review Comment: @tomscut We record block information, will there be a lot of logs? Is it changed to debug? > Add some logs for updateBlockForPipeline RPC. > - > > Key: HDFS-16914 > URL: https://issues.apache.org/jira/browse/HDFS-16914 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namanode >Affects Versions: 3.3.4 >Reporter: ZhangHB >Assignee: ZhangHB >Priority: Minor > Labels: pull-request-available > > Recently,we received an phone alarm about missing blocks. We found logs in > one datanode where the block was placed on like below: > > {code:java} > 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231832415 src: > /clientAddress:44638 dest: /localAddress:50010 of size 45733720 > 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231826462 src: > /upStreamDatanode:60316 dest: /localAddress:50010 of size 45733720 {code} > the datanode received the same block with different generation stamp because > of socket timeout exception. blk_1305044966_231826462 is received from > upstream datanode in pipeline which has two datanodes. > blk_1305044966_231832415 is received from client directly. 
> > We have searched all log info about blk_1305044966 in the namenode and the three > datanodes in the original pipeline, but we could not obtain any helpful message > about generation stamp 231826462. After diving into the source code, it > was assigned in NameNodeRpcServer#updateBlockForPipeline, which is invoked from > DataStreamer#setupPipelineInternal. The updateBlockForPipeline RPC does not > log any information, so I think we should add some logs to this RPC. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16896) HDFS Client hedged read has increased failure rate than without hedged read
[ https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688746#comment-17688746 ] ASF GitHub Bot commented on HDFS-16896: --- mccormickt12 commented on PR #5322: URL: https://github.com/apache/hadoop/pull/5322#issuecomment-1430524690 > The change generally looks good to me. > > One concern is that when `refetchLocation` is called inside `hedgedFetchBlockByteRange` we could remove a node from the ignoredList that is already part of the futures array. This would lead to multiple reads to the same node. > > What I'm thinking of is > > 1. Node A is added to futures array > 2. getFirstToComplete throws an InterruptedException when doing hedgedService.take() > 3. We call retchLocation, which remove node A from ignored list. > 4. The while loop re-adds Node A to the futures list. > > I don't know if this actually can happen. Even if it is technically possible, it may not be an issue. Thoughts? yes @simbadzina I agree this was an issue. I've resolved this now. As you pointed out the future is removed in `getFirstToComplete`, so now we let it spin in the while loop, each time a future will be removed, and then once futures is empty and refetch is needed we will clear the ignore list > HDFS Client hedged read has increased failure rate than without hedged read > --- > > Key: HDFS-16896 > URL: https://issues.apache.org/jira/browse/HDFS-16896 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Reporter: Tom McCormick >Assignee: Tom McCormick >Priority: Major > Labels: pull-request-available > > When hedged read is enabled by HDFS client, we see an increased failure rate > on reads. 
> *stacktrace* > > {code:java} > Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain > block: BP-1183972111-10.197.192.88-1590025572374:blk_17114848218_16043459722 > file=/data/tracking/streaming/AdImpressionEvent/daily/2022/07/18/compaction_1/part-r-1914862.1658217125623.1362294472.orc > at > org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1077) > at > org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1060) > at > org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1039) > at > org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1365) > at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1572) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1535) > at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121) > at > org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112) > at > org.apache.hadoop.fs.RetryingInputStream.lambda$readFully$3(RetryingInputStream.java:172) > at org.apache.hadoop.fs.RetryPolicy.lambda$run$0(RetryPolicy.java:137) > at org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) > at org.apache.hadoop.fs.RetryPolicy.run(RetryPolicy.java:136) > at > org.apache.hadoop.fs.RetryingInputStream.readFully(RetryingInputStream.java:168) > at > org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112) > at > org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112) > at > io.trino.plugin.hive.orc.HdfsOrcDataSource.readInternal(HdfsOrcDataSource.java:76) > ... 46 more > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688529#comment-17688529 ] ASF GitHub Bot commented on HDFS-16915: --- hfutatzhanghb commented on PR #5392: URL: https://github.com/apache/hadoop/pull/5392#issuecomment-1429855095 > @Hexiaoqiao, hi, could you please take a look at this? For consistency with the existing metric names, I kept the names as they were. > Optimize metrics for operations hold lock times of FsDatasetImpl > > > Key: HDFS-16915 > URL: https://issues.apache.org/jira/browse/HDFS-16915 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.3.4 >Reporter: ZhangHB >Priority: Major > Labels: pull-request-available > > The current calculation method also includes the time spent waiting for the lock. So I > think we should optimize how we compute the metrics for the lock hold times of > FsDatasetImpl operations. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
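The optimization this issue proposes, excluding lock-wait time from the reported hold time by starting the timer only after the lock is acquired, might look roughly as follows. The helper name is illustrative, not the actual FsDatasetImpl change.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: the timer starts only after lock() returns, so the
// measured duration covers the time the lock is held, not the time spent
// waiting to acquire it.
public class LockHoldTimer {
    public static long timedHoldNanos(ReentrantLock lock, Runnable op) {
        lock.lock();                    // waiting for the lock happens here...
        long start = System.nanoTime(); // ...and is excluded from the metric
        try {
            op.run();
            return System.nanoTime() - start; // lock-hold time only
        } finally {
            lock.unlock();
        }
    }
}
```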
[jira] [Commented] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688520#comment-17688520 ] ASF GitHub Bot commented on HDFS-16915: --- hadoop-yetus commented on PR #5392: URL: https://github.com/apache/hadoop/pull/5392#issuecomment-1429817139 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 40s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 43m 55s | | trunk passed | | +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 7s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 30s | | trunk passed | | +1 :green_heart: | javadoc | 1m 8s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 32s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 25s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 18s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 29s | | the patch passed | | +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 20s | | the patch passed | | +1 :green_heart: | compile | 1m 19s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 19s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 53s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5392/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 52 unchanged - 0 fixed = 53 total (was 52) | | +1 :green_heart: | mvnsite | 1m 22s | | the patch passed | | +1 :green_heart: | javadoc | 0m 51s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 35s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 26s | | the patch passed | | +1 :green_heart: | shadedclient | 25m 46s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 213m 21s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5392/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 50s | | The patch does not generate ASF License warnings. 
| | | | 332m 23s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5392/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5392 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 9d84f26e215a 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 0620f38049cfeeac016452c84324c2e05625c234 | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
[jira] [Created] (HDFS-16916) Improve the use of JUnit Test in DFSClient
Hualong Zhang created HDFS-16916: Summary: Improve the use of JUnit Test in DFSClient Key: HDFS-16916 URL: https://issues.apache.org/jira/browse/HDFS-16916 Project: Hadoop HDFS Issue Type: Improvement Components: dfsclient Affects Versions: 3.4.0 Reporter: Hualong Zhang Improve the use of JUnit Test in DFSClient -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down
[ https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688463#comment-17688463 ]

ASF GitHub Bot commented on HDFS-16761:
---

hadoop-yetus commented on PR #5390:
URL: https://github.com/apache/hadoop/pull/5390#issuecomment-1429581205

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 39s | | Docker mode activated. |
| | _ Prechecks _ | | | |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| | _ trunk Compile Tests _ | | | |
| +0 :ok: | mvndep | 15m 31s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 31m 46s | | trunk passed |
| +1 :green_heart: | shadedclient | 72m 32s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +0 :ok: | mvndep | 0m 29s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 2m 20s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | shadedclient | 25m 20s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. |
| | | | 104m 2s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5390/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5390 |
| Optional Tests | dupname asflicense shadedclient codespell detsecrets xmllint |
| uname | Linux d3bf8d0f5cca 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 249046de348e4e82924e0a4f3c54b21156505cdc |
| Max. process+thread count | 566 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5390/2/console |
| versions | git=2.25.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> Namenode UI for Datanodes page not loading if any data node is down
> ---
>
> Key: HDFS-16761
> URL: https://issues.apache.org/jira/browse/HDFS-16761
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.2.2
> Reporter: Krishna Reddy
> Assignee: Zita Dombi
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.2.2
>
> Steps to reproduce:
> - Install the hadoop components and add 3 datanodes
> - Enable namenode HA
> - Open the Namenode UI and check the datanode page
> - Check that all datanodes are displayed
> - Now take one datanode down
> - Wait 10 minutes for the heartbeat to expire
> - Refresh the namenode page and check
>
> Actual Result: It shows the error message "NameNode is still loading. Redirecting to the Startup Progress page."
[jira] [Commented] (HDFS-16673) Fix usage of chown
[ https://issues.apache.org/jira/browse/HDFS-16673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688459#comment-17688459 ]

ASF GitHub Bot commented on HDFS-16673:
---

GuoPhilipse closed pull request #4602: HDFS-16673. Fix usage of chown
URL: https://github.com/apache/hadoop/pull/4602

> Fix usage of chown
> --
>
> Key: HDFS-16673
> URL: https://issues.apache.org/jira/browse/HDFS-16673
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: documentation
> Affects Versions: 3.3.3
> Reporter: guophilipse
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Actually, the `chown` command can be used by the owner of the files or the super user; we need to correct the doc.
[jira] [Commented] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down
[ https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688398#comment-17688398 ]

ASF GitHub Bot commented on HDFS-16761:
---

dombizita commented on PR #5390:
URL: https://github.com/apache/hadoop/pull/5390#issuecomment-1429432336

Thanks for the review @goiri and @ayushtkn, I updated my patch with the suggested change in `federationhealth.html`.

> Namenode UI for Datanodes page not loading if any data node is down
> ---
>
> Key: HDFS-16761
> URL: https://issues.apache.org/jira/browse/HDFS-16761
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.2.2
> Reporter: Krishna Reddy
> Assignee: Zita Dombi
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.2.2
>
> Steps to reproduce:
> - Install the hadoop components and add 3 datanodes
> - Enable namenode HA
> - Open the Namenode UI and check the datanode page
> - Check that all datanodes are displayed
> - Now take one datanode down
> - Wait 10 minutes for the heartbeat to expire
> - Refresh the namenode page and check
>
> Actual Result: It shows the error message "NameNode is still loading. Redirecting to the Startup Progress page."
[jira] [Commented] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688367#comment-17688367 ]

ASF GitHub Bot commented on HDFS-16915:
---

hfutatzhanghb commented on PR #5392:
URL: https://github.com/apache/hadoop/pull/5392#issuecomment-1429338146

Hi @Hexiaoqiao, could you please take a look at this?

> Optimize metrics for operations hold lock times of FsDatasetImpl
>
> Key: HDFS-16915
> URL: https://issues.apache.org/jira/browse/HDFS-16915
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.3.4
> Reporter: ZhangHB
> Priority: Major
> Labels: pull-request-available
>
> The current calculation method also includes the time spent waiting for the lock. So, I think we should optimize how the lock hold time metrics of FsDatasetImpl operations are computed.
[jira] [Updated] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-16915:
--
Labels: pull-request-available (was: )

> Optimize metrics for operations hold lock times of FsDatasetImpl
>
> Key: HDFS-16915
> URL: https://issues.apache.org/jira/browse/HDFS-16915
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.3.4
> Reporter: ZhangHB
> Priority: Major
> Labels: pull-request-available
>
> The current calculation method also includes the time spent waiting for the lock. So, I think we should optimize how the lock hold time metrics of FsDatasetImpl operations are computed.
[jira] [Commented] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688364#comment-17688364 ]

ASF GitHub Bot commented on HDFS-16915:
---

hfutatzhanghb opened a new pull request, #5392:
URL: https://github.com/apache/hadoop/pull/5392

JIRA: https://issues.apache.org/jira/browse/HDFS-16915

The current calculation method also includes the time spent waiting for the lock. So, I think we should optimize how the lock hold time metrics of FsDatasetImpl operations are computed.

> Optimize metrics for operations hold lock times of FsDatasetImpl
>
> Key: HDFS-16915
> URL: https://issues.apache.org/jira/browse/HDFS-16915
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.3.4
> Reporter: ZhangHB
> Priority: Major
>
> The current calculation method also includes the time spent waiting for the lock. So, I think we should optimize how the lock hold time metrics of FsDatasetImpl operations are computed.
[jira] [Created] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl
ZhangHB created HDFS-16915:
--
Summary: Optimize metrics for operations hold lock times of FsDatasetImpl
Key: HDFS-16915
URL: https://issues.apache.org/jira/browse/HDFS-16915
Project: Hadoop HDFS
Issue Type: Improvement
Affects Versions: 3.3.4
Reporter: ZhangHB

The current calculation method also includes the time spent waiting for the lock. So, I think we should optimize how the lock hold time metrics of FsDatasetImpl operations are computed.
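The HDFS-16915 description above comes down to where the timer starts relative to lock acquisition: timing from before the lock is taken measures wait time plus hold time, while timing from after acquisition measures hold time alone. A minimal standalone Java sketch of that difference (the `LockHoldTimer` class and its method names are invented for illustration; FsDatasetImpl's actual locking and metrics code is not shown in this thread):

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustration only: contrasts two ways of timing a locked operation.
public class LockHoldTimer {
    private final ReentrantLock lock = new ReentrantLock();
    private volatile long lastNanos;

    // What the issue describes: the timer starts before lock(), so the
    // recorded value also includes time spent *waiting* for the lock.
    public void opTimedFromBeforeLock() {
        long start = System.nanoTime(); // includes lock wait time
        lock.lock();
        try {
            // ... guarded work would go here ...
        } finally {
            lock.unlock();
            lastNanos = System.nanoTime() - start;
        }
    }

    // What the issue proposes: start the timer only once the lock is
    // held, so the metric reflects pure hold time.
    public void opTimedFromAfterLock() {
        lock.lock();
        long start = System.nanoTime(); // excludes lock wait time
        try {
            // ... guarded work would go here ...
        } finally {
            long held = System.nanoTime() - start;
            lock.unlock();
            lastNanos = held;
        }
    }

    public long lastNanos() {
        return lastNanos;
    }

    public static void main(String[] args) {
        LockHoldTimer t = new LockHoldTimer();
        t.opTimedFromBeforeLock();
        t.opTimedFromAfterLock();
        // Both measurements are non-negative; under contention the
        // "before lock" variant is systematically larger.
        System.out.println(t.lastNanos() >= 0);
    }
}
```

With no contention both variants record nearly the same value, which is why such a skew in the metric would mainly show up when the datanode is busy.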
[jira] [Commented] (HDFS-16914) Add some logs for updateBlockForPipeline RPC.
[ https://issues.apache.org/jira/browse/HDFS-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688350#comment-17688350 ]

ASF GitHub Bot commented on HDFS-16914:
---

hadoop-yetus commented on PR #5381:
URL: https://github.com/apache/hadoop/pull/5381#issuecomment-1429286505

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 53s | | Docker mode activated. |
| | _ Prechecks _ | | | |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | _ trunk Compile Tests _ | | | |
| +1 :green_heart: | mvninstall | 47m 10s | | trunk passed |
| +1 :green_heart: | compile | 1m 28s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | compile | 1m 21s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 7s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 31s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 7s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 32s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 34s | | trunk passed |
| +1 :green_heart: | shadedclient | 28m 57s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +1 :green_heart: | mvninstall | 1m 28s | | the patch passed |
| +1 :green_heart: | compile | 1m 23s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javac | 1m 23s | | the patch passed |
| +1 :green_heart: | compile | 1m 16s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 1m 16s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 54s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 21s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 52s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 |
| +1 :green_heart: | javadoc | 1m 26s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 3m 31s | | the patch passed |
| +1 :green_heart: | shadedclient | 28m 39s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| -1 :x: | unit | 227m 34s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5381/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 43s | | The patch does not generate ASF License warnings. |
| | | | 354m 51s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5381/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5381 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 4c42e84abc99 4.15.0-197-generic #208-Ubuntu SMP Tue Nov 1 17:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 26a24245b3bd80b8f4fde78ad252e0a96f2fb865 |
| Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5381/4/testReport/ |
| Max. process+thread count | 2122 (vs. ulimit of 5500) |
| modules | C: