[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop
[ https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=524282&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-524282 ] ASF GitHub Bot logged work on HDFS-15624: - Author: ASF GitHub Bot Created on: 15/Dec/20 07:08 Start Date: 15/Dec/20 07:08 Worklog Time Spent: 10m Work Description: huangtianhua commented on pull request #2377: URL: https://github.com/apache/hadoop/pull/2377#issuecomment-745100099 @ayushtkn Hi, maybe you can help to review this? :) Thanks very much. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 524282) Time Spent: 7h 20m (was: 7h 10m) > Fix the SetQuotaByStorageTypeOp problem after updating hadoop > --- > > Key: HDFS-15624 > URL: https://issues.apache.org/jira/browse/HDFS-15624 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.4.0 >Reporter: YaYun Wang >Priority: Major > Labels: pull-request-available, release-blocker > Time Spent: 7h 20m > Remaining Estimate: 0h > > HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() values of the StorageType enum. Setting a quota by storage type depends on those ordinal() values, so quota settings may become invalid after an upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
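The failure mode HDFS-15624 describes is the general enum-ordinal compatibility trap. The enums below are simplified, hypothetical stand-ins (not the real org.apache.hadoop.fs.StorageType) sketching how a quota persisted by ordinal silently attaches to a different storage type once a new constant is inserted mid-enum:

```java
// Hypothetical before/after enums; only the insertion position matters.
enum OldStorageType { RAM_DISK, SSD, DISK, ARCHIVE }

// A new constant inserted anywhere but the end shifts every ordinal after it.
enum NewStorageType { RAM_DISK, NVDIMM, SSD, DISK, ARCHIVE }

public class OrdinalDemo {
    public static void main(String[] args) {
        // A quota recorded against SSD under the old layout stores ordinal 1...
        int persisted = OldStorageType.SSD.ordinal();
        // ...which decodes as NVDIMM under the new layout: the quota silently moves.
        NewStorageType decoded = NewStorageType.values()[persisted];
        System.out.println(decoded); // prints NVDIMM
    }
}
```

One robust pattern is to append new constants only at the end of the enum, or better, to persist the enum name rather than its ordinal.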
[jira] [Commented] (HDFS-15170) EC: Block gets marked as CORRUPT in case of failover and pipeline recovery
[ https://issues.apache.org/jira/browse/HDFS-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249504#comment-17249504 ] Ayush Saxena commented on HDFS-15170: - Thanx [~weichiu] for the commit. This broke the compilation in 3.2 and 3.1, since the conf {{DFS_NAMENODE_CORRUPT_BLOCK_DELETE_IMMEDIATELY_ENABLED}} wasn't present there. Have quickly backported HDFS-15200 and HDFS-15187; things seem to be working (just ran the related UTs). > EC: Block gets marked as CORRUPT in case of failover and pipeline recovery > -- > > Key: HDFS-15170 > URL: https://issues.apache.org/jira/browse/HDFS-15170 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15170-01.patch, HDFS-15170-02.patch, > HDFS-15170-03.patch, HDFS-15170-04.patch > > > Steps to repro: > 1. Start writing an EC file. > 2. After more than one stripe has been written, stop one datanode. > 3. Post pipeline recovery, keep on writing the data. > 4. Close the file. > 5. Transition the namenode to standby and back to active. > 6. Turn on the datanode shut down in step 2. > The BR from datanode 2 will mark the block corrupt, and invalidate block won't remove it, since post failover the blocks would be on stale storage.
[jira] [Commented] (HDFS-15200) Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage
[ https://issues.apache.org/jira/browse/HDFS-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249502#comment-17249502 ] Ayush Saxena commented on HDFS-15200: - Backported to 3.2 and 3.1 > Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage > - > > Key: HDFS-15200 > URL: https://issues.apache.org/jira/browse/HDFS-15200 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Fix For: 3.3.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15200-01.patch, HDFS-15200-02.patch, > HDFS-15200-03.patch, HDFS-15200-04.patch, HDFS-15200-05.patch > > > Presently, {{invalidateBlock(..)}}, before adding a replica to the invalidates list, checks whether any replica of the block is on stale storage; if so, it postpones deletion of the replica. > Here: > {code:java} >// Check how many copies we have of the block > if (nr.replicasOnStaleNodes() > 0) { > blockLog.debug("BLOCK* invalidateBlocks: postponing " + > "invalidation of {} on {} because {} replica(s) are located on " + > "nodes with potentially out-of-date block reports", b, dn, > nr.replicasOnStaleNodes()); > postponeBlock(b.getCorrupted()); > return false; > {code} > > In the case of a corrupt replica, we can skip this logic and delete the corrupt replica immediately, as a corrupt replica can't be corrected. > One outcome of the present behavior is the namenodes showing different block states post failover: > If a replica is marked corrupt, the Active NN will mark it as corrupt, mark it for deletion, and remove it from corruptReplicas and the excessRedundancyMap. > If failover happens before the replica is deleted, the Standby Namenode will mark all the storages as stale and then start processing IBRs. Since the replicas would be on stale storage, it will skip the deletion and the removal from corruptReplicas. > Hence the two namenodes show different numbers and different corrupt replicas.
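The behavioral change HDFS-15200 proposes reduces to one predicate. Below is a hedged, minimal sketch, with plain parameters standing in for Hadoop's NumberReplicas and corrupt-replica tracking; it is not the actual BlockManager code, just the decision it argues for:

```java
// Sketch of the decision inside invalidateBlock(..): a corrupt replica
// bypasses the stale-storage postponement; everything else keeps it.
public class InvalidateSketch {
    /**
     * @param replicaIsCorrupt     whether this replica is marked CORRUPT
     * @param replicasOnStaleNodes replicas on storages with potentially
     *                             out-of-date block reports
     * @return true if the replica should be queued for deletion now
     */
    public static boolean shouldInvalidateNow(boolean replicaIsCorrupt,
                                              int replicasOnStaleNodes) {
        // Old behaviour: any replica on a stale node postpones invalidation.
        if (!replicaIsCorrupt && replicasOnStaleNodes > 0) {
            return false; // postponed until block reports catch up
        }
        // A corrupt replica can never be corrected, so delete it immediately.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldInvalidateNow(true, 2));  // prints true
        System.out.println(shouldInvalidateNow(false, 2)); // prints false
    }
}
```

Deleting immediately keeps the Active and Standby corrupt-replica sets consistent across a failover, which is exactly the divergence the issue describes.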
[jira] [Updated] (HDFS-15200) Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage
[ https://issues.apache.org/jira/browse/HDFS-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15200: Fix Version/s: 3.2.3 3.1.5 > Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage > - > > Key: HDFS-15200 > URL: https://issues.apache.org/jira/browse/HDFS-15200 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Fix For: 3.3.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15200-01.patch, HDFS-15200-02.patch, > HDFS-15200-03.patch, HDFS-15200-04.patch, HDFS-15200-05.patch
[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15187: Fix Version/s: 3.2.3 3.1.5 > CORRUPT replica mismatch between namenodes after failover > - > > Key: HDFS-15187 > URL: https://issues.apache.org/jira/browse/HDFS-15187 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Fix For: 3.3.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15187-01.patch, HDFS-15187-02.patch, > HDFS-15187-03.patch > > > A corrupt replica identified by the Active Namenode isn't identified by the other Namenode after it fails over to Active, in case the replica was marked corrupt due to updatePipeline. > Scenario to repro: > 1. Create a file; while writing, turn one datanode down to trigger update pipeline. > 2. Write some more data. > 3. Close the file. > 4. Turn on the shutdown datanode. > 5. The replica on that datanode will be identified as CORRUPT and the corrupt count will be 1. > 6. Failover to the other Namenode. > 7. Wait for all pending IBR processing. > 8. The corrupt count will not be the same, and FSCK won't show the corrupt replica. > 9. Failover back to the first namenode. > 10. Corrupt count and corrupt replica will be there. > Both Namenodes show different state.
[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249500#comment-17249500 ] Ayush Saxena commented on HDFS-15187: - Cherry-picked to 3.2 and 3.1 > CORRUPT replica mismatch between namenodes after failover > - > > Key: HDFS-15187 > URL: https://issues.apache.org/jira/browse/HDFS-15187 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Fix For: 3.3.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15187-01.patch, HDFS-15187-02.patch, > HDFS-15187-03.patch
[jira] [Commented] (HDFS-15728) Update description of dfs.datanode.handler.count in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-15728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249479#comment-17249479 ] Ayush Saxena commented on HDFS-15728: - Merged PR to trunk > Update description of dfs.datanode.handler.count in hdfs-default.xml > > > Key: HDFS-15728 > URL: https://issues.apache.org/jira/browse/HDFS-15728 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration >Reporter: liuyan >Assignee: liuyan >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 50m > Remaining Estimate: 0h > > For dfs.namenode.handler.count we document the following description: > _"The number of Namenode RPC server threads that listen to requests from > clients. If dfs.namenode.servicerpc-address is not configured then Namenode > RPC server threads listen to requests from all nodes."_ > However, for dfs.datanode.handler.count the description is only: > _"The number of server threads for the datanode."_ > > The purpose of this Jira is to update the description of > dfs.datanode.handler.count to > _"The number of Datanode RPC server threads that listen to requests from > client."_ to make it more readable to users.
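For context, the description being updated lives in hdfs-default.xml, and operators override the property in hdfs-site.xml. A minimal fragment using the wording proposed in the Jira (the value shown is illustrative, not a tuning recommendation):

```xml
<!-- Illustrative hdfs-site.xml fragment; the value is an example only. -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>10</value>
  <description>The number of Datanode RPC server threads that listen to
    requests from client.</description>
</property>
```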
[jira] [Resolved] (HDFS-15728) Update description of dfs.datanode.handler.count in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-15728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena resolved HDFS-15728. - Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed > Update description of dfs.datanode.handler.count in hdfs-default.xml > > > Key: HDFS-15728 > URL: https://issues.apache.org/jira/browse/HDFS-15728 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration >Reporter: liuyan >Assignee: liuyan >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 50m > Remaining Estimate: 0h
[jira] [Updated] (HDFS-15170) EC: Block gets marked as CORRUPT in case of failover and pipeline recovery
[ https://issues.apache.org/jira/browse/HDFS-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-15170: --- Fix Version/s: 3.2.3 3.1.5 3.4.0 3.3.1 Resolution: Fixed Status: Resolved (was: Patch Available) Thanks Ayush for the patch. The change is merged into trunk all the way to branch-3.1 > EC: Block gets marked as CORRUPT in case of failover and pipeline recovery > -- > > Key: HDFS-15170 > URL: https://issues.apache.org/jira/browse/HDFS-15170 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3 > > Attachments: HDFS-15170-01.patch, HDFS-15170-02.patch, > HDFS-15170-03.patch, HDFS-15170-04.patch
[jira] [Work logged] (HDFS-15728) Update description of dfs.datanode.handler.count in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-15728?focusedWorklogId=524239&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-524239 ] ASF GitHub Bot logged work on HDFS-15728: - Author: ASF GitHub Bot Created on: 15/Dec/20 04:42 Start Date: 15/Dec/20 04:42 Worklog Time Spent: 10m Work Description: ayushtkn edited a comment on pull request #2534: URL: https://github.com/apache/hadoop/pull/2534#issuecomment-745047847 Merged to trunk. @liuyanpunk Thanx for the contribution, and next time please raise a PR against the trunk branch, not master. Anyway, this time I have sorted it out. Thanx @jojochuang for the review!!! Issue Time Tracking --- Worklog Id: (was: 524239) Time Spent: 50m (was: 40m) > Update description of dfs.datanode.handler.count in hdfs-default.xml > > > Key: HDFS-15728 > URL: https://issues.apache.org/jira/browse/HDFS-15728 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration >Reporter: liuyan >Assignee: liuyan >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h
[jira] [Work logged] (HDFS-15728) Update description of dfs.datanode.handler.count in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-15728?focusedWorklogId=524238&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-524238 ] ASF GitHub Bot logged work on HDFS-15728: - Author: ASF GitHub Bot Created on: 15/Dec/20 04:41 Start Date: 15/Dec/20 04:41 Worklog Time Spent: 10m Work Description: ayushtkn commented on pull request #2534: URL: https://github.com/apache/hadoop/pull/2534#issuecomment-745047847 Merged to trunk. @liuyanpunk Thanx for the contribution, and next time please raise a PR against the trunk branch, not master. Anyway, this time I have it sorted out. Thanx @jojochuang for the review!!! Issue Time Tracking --- Worklog Id: (was: 524238) Time Spent: 40m (was: 0.5h) > Update description of dfs.datanode.handler.count in hdfs-default.xml > > > Key: HDFS-15728 > URL: https://issues.apache.org/jira/browse/HDFS-15728 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration >Reporter: liuyan >Assignee: liuyan >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h
[jira] [Work logged] (HDFS-15728) Update description of dfs.datanode.handler.count in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-15728?focusedWorklogId=524234&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-524234 ] ASF GitHub Bot logged work on HDFS-15728: - Author: ASF GitHub Bot Created on: 15/Dec/20 04:21 Start Date: 15/Dec/20 04:21 Worklog Time Spent: 10m Work Description: ayushtkn merged pull request #2534: URL: https://github.com/apache/hadoop/pull/2534 Issue Time Tracking --- Worklog Id: (was: 524234) Time Spent: 0.5h (was: 20m) > Update description of dfs.datanode.handler.count in hdfs-default.xml > > > Key: HDFS-15728 > URL: https://issues.apache.org/jira/browse/HDFS-15728 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration >Reporter: liuyan >Assignee: liuyan >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h
[jira] [Updated] (HDFS-15728) Update description of dfs.datanode.handler.count in hdfs-default.xml
[ https://issues.apache.org/jira/browse/HDFS-15728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15728: Summary: Update description of dfs.datanode.handler.count in hdfs-default.xml (was: Updating definition to dfs.datanode.handler.count documentation description.) > Update description of dfs.datanode.handler.count in hdfs-default.xml > > > Key: HDFS-15728 > URL: https://issues.apache.org/jira/browse/HDFS-15728 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration >Reporter: liuyan >Assignee: liuyan >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h
[jira] [Assigned] (HDFS-15728) Updating definition to dfs.datanode.handler.count documentation description.
[ https://issues.apache.org/jira/browse/HDFS-15728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena reassigned HDFS-15728: --- Assignee: liuyan > Updating definition to dfs.datanode.handler.count documentation description. > > > Key: HDFS-15728 > URL: https://issues.apache.org/jira/browse/HDFS-15728 > Project: Hadoop HDFS > Issue Type: Improvement > Components: configuration >Reporter: liuyan >Assignee: liuyan >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h
[jira] [Commented] (HDFS-15729) Show progress of Balancer in Namenode UI
[ https://issues.apache.org/jira/browse/HDFS-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249471#comment-17249471 ] Srinivasu Majeti commented on HDFS-15729: - Sure [~weichiu], any UI that's useful for tracking progress :) . Thank you. > Show progress of Balancer in Namenode UI > > > Key: HDFS-15729 > URL: https://issues.apache.org/jira/browse/HDFS-15729 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover >Affects Versions: 3.1.4 >Reporter: Srinivasu Majeti >Priority: Major > > It would be nice to have tracking of the Balancer process in the Namenode UI, to show whether something is running and its current progress/status. This would be similar to the Namenode startup progress.
[jira] [Commented] (HDFS-15170) EC: Block gets marked as CORRUPT in case of failover and pipeline recovery
[ https://issues.apache.org/jira/browse/HDFS-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249431#comment-17249431 ] Hadoop QA commented on HDFS-15170: -- (x) **-1 overall**

| Vote | Subsystem | Runtime | Comment |
|:----:|:----------|--------:|:--------|
| 0 | reexec | 0m 47s | Docker mode activated. |
| _Prechecks_ | | | |
| +1 | dupname | 0m 0s | No case conflicting files found. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| _trunk Compile Tests_ | | | |
| +1 | mvninstall | 33m 10s | trunk passed |
| +1 | compile | 1m 17s | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 | compile | 1m 11s | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +1 | checkstyle | 0m 48s | trunk passed |
| +1 | mvnsite | 1m 18s | trunk passed |
| +1 | shadedclient | 16m 55s | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 52s | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 | javadoc | 1m 28s | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| 0 | spotbugs | 3m 7s | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 3m 5s | trunk passed |
| _Patch Compile Tests_ | | | |
| +1 | mvninstall | 1m 15s | the patch passed |
| +1 | compile | 1m 11s | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 | javac | 1m 11s | the patch passed |
| +1 | compile | 1m 6s | the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 |
| +1 | javac | 1m 6s | the patch passed |
| +1 | checkstyle | 0m 38s | the patch passed |
| +1 | mvnsite | 1m 14s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 50s | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 45s | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 |
| +1 | javadoc | 1m 21s | the
[jira] [Work logged] (HDFS-15704) Mitigate lease monitor's rapid infinite loop
[ https://issues.apache.org/jira/browse/HDFS-15704?focusedWorklogId=524201&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-524201 ] ASF GitHub Bot logged work on HDFS-15704: - Author: ASF GitHub Bot Created on: 15/Dec/20 01:37 Start Date: 15/Dec/20 01:37 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2511: URL: https://github.com/apache/hadoop/pull/2511#issuecomment-744977311 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 27s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | | 0m 0s | [test4tests](test4tests) | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 35m 51s | | trunk passed | | +1 :green_heart: | compile | 1m 24s | | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 | | +1 :green_heart: | compile | 1m 12s | | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | +1 :green_heart: | checkstyle | 0m 48s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 20s | | trunk passed | | +1 :green_heart: | shadedclient | 19m 20s | | branch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 0m 55s | | trunk passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 | | +1 :green_heart: | javadoc | 1m 26s | | trunk passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | +0 :ok: | spotbugs | 3m 16s | | Used deprecated FindBugs config; considering switching to SpotBugs. 
| | +1 :green_heart: | findbugs | 3m 12s | | trunk passed | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 10s | | the patch passed | | +1 :green_heart: | compile | 1m 14s | | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 | | +1 :green_heart: | javac | 1m 14s | | hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 generated 0 new + 599 unchanged - 3 fixed = 599 total (was 602) | | +1 :green_heart: | compile | 1m 6s | | the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | +1 :green_heart: | javac | 1m 6s | | hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 generated 0 new + 583 unchanged - 3 fixed = 583 total (was 586) | | +1 :green_heart: | checkstyle | 0m 41s | | hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 26 unchanged - 1 fixed = 26 total (was 27) | | +1 :green_heart: | mvnsite | 1m 14s | | the patch passed | | +1 :green_heart: | whitespace | 0m 0s | | The patch has no whitespace issues. | | +1 :green_heart: | shadedclient | 16m 50s | | patch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 0m 50s | | the patch passed with JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 | | +1 :green_heart: | javadoc | 1m 22s | | the patch passed with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | +1 :green_heart: | findbugs | 3m 17s | | the patch passed | _ Other Tests _ | | +1 :green_heart: | unit | 117m 8s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 38s | | The patch does not generate ASF License warnings. 
| | | | 214m 23s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2511/5/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2511 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux a60018bc009b 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 3234e5eaf36 | | Default Java | Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.18.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_275-8u275-b01-0ubuntu1~18.04-b01 | | Test
[jira] [Commented] (HDFS-15170) EC: Block gets marked as CORRUPT in case of failover and pipeline recovery
[ https://issues.apache.org/jira/browse/HDFS-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249354#comment-17249354 ] Wei-Chiu Chuang commented on HDFS-15170: I'll commit the patch. The 04 patch is essentially the 03 patch with one word removed in the comment. Attached here for future reference. > EC: Block gets marked as CORRUPT in case of failover and pipeline recovery > -- > > Key: HDFS-15170 > URL: https://issues.apache.org/jira/browse/HDFS-15170 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Attachments: HDFS-15170-01.patch, HDFS-15170-02.patch, > HDFS-15170-03.patch, HDFS-15170-04.patch > > > Steps to Repro : > 1. Start writing a EC file. > 2. After more than one stripe has been written, stop one datanode. > 3. Post pipeline recovery, keep on writing the data. > 4.Close the file. > 5. transition the namenode to standby and back to active. > 6. Turn on the shutdown datanode in step 2 > The BR from datanode 2 will make the block corrupt and during invalidate > block won't remove it, since post failover the blocks would be on stale > storage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15170) EC: Block gets marked as CORRUPT in case of failover and pipeline recovery
[ https://issues.apache.org/jira/browse/HDFS-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei-Chiu Chuang updated HDFS-15170: --- Attachment: HDFS-15170-04.patch > EC: Block gets marked as CORRUPT in case of failover and pipeline recovery > -- > > Key: HDFS-15170 > URL: https://issues.apache.org/jira/browse/HDFS-15170 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Attachments: HDFS-15170-01.patch, HDFS-15170-02.patch, > HDFS-15170-03.patch, HDFS-15170-04.patch > > > Steps to Repro : > 1. Start writing a EC file. > 2. After more than one stripe has been written, stop one datanode. > 3. Post pipeline recovery, keep on writing the data. > 4.Close the file. > 5. transition the namenode to standby and back to active. > 6. Turn on the shutdown datanode in step 2 > The BR from datanode 2 will make the block corrupt and during invalidate > block won't remove it, since post failover the blocks would be on stale > storage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15170) EC: Block gets marked as CORRUPT in case of failover and pipeline recovery
[ https://issues.apache.org/jira/browse/HDFS-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249350#comment-17249350 ] Wei-Chiu Chuang commented on HDFS-15170: We should move forward with the fix... even though HDFS-15200 mostly covered it in trunk, we should fix it too in lower releases. > EC: Block gets marked as CORRUPT in case of failover and pipeline recovery > -- > > Key: HDFS-15170 > URL: https://issues.apache.org/jira/browse/HDFS-15170 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Attachments: HDFS-15170-01.patch, HDFS-15170-02.patch, > HDFS-15170-03.patch > > > Steps to Repro : > 1. Start writing a EC file. > 2. After more than one stripe has been written, stop one datanode. > 3. Post pipeline recovery, keep on writing the data. > 4.Close the file. > 5. transition the namenode to standby and back to active. > 6. Turn on the shutdown datanode in step 2 > The BR from datanode 2 will make the block corrupt and during invalidate > block won't remove it, since post failover the blocks would be on stale > storage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15716) TestUpgradeDomainBlockPlacementPolicy flaky
[ https://issues.apache.org/jira/browse/HDFS-15716?focusedWorklogId=524136=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-524136 ] ASF GitHub Bot logged work on HDFS-15716: - Author: ASF GitHub Bot Created on: 14/Dec/20 21:59 Start Date: 14/Dec/20 21:59 Worklog Time Spent: 10m Work Description: amahussein commented on pull request #2528: URL: https://github.com/apache/hadoop/pull/2528#issuecomment-744736690 > The list of failed unit tests in the last few days is getting worse and worse. > @amahussein, you've been making lots of fixes in the last month; any idea why is this suddenly getting so bad? Thanks @goiri. I took a look at the build latest https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/355/ ```bash Test Result (23 failures / -45) org.apache.hadoop.hdfs.server.balancer.TestBalancer.testBalancerWithStripedFile org.apache.hadoop.hdfs.server.balancer.TestBalancer.testBalancerWithIncludeListWithPorts org.apache.hadoop.hdfs.server.balancer.TestBalancer.testBalancerWithSortTopNodes org.apache.hadoop.hdfs.server.balancer.TestBalancer.testBalancerCliWithIncludeListWithPorts org.apache.hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks.testSetRepIncWithUnderReplicatedBlocks org.apache.hadoop.hdfs.server.namenode.TestFsck.testECFsck org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeDecommision org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testUpdateAppStateXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testAppQueueXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testAppTimeoutsXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testGetContainersXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testUpdateAppPriorityXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testUpdateAppQueueXML 
org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testAppTimeoutXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testAppAttemptXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testGetAppAttemptXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testAppXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testAppStateXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testAppPriorityXML org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST.testGetAppsMultiThread org.apache.hadoop.tools.dynamometer.TestDynamometerInfra.org.apache.hadoop.tools.dynamometer.TestDynamometerInfra org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithOpportunisticContainers org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithEnforceExecutionType ``` - TestFsck: could be failing after the installation of intel-ISA. - TestUnderReplicatedBlocks: I remember seeing that unit test fail before. - TestBalancer: Interesting that there are several failures. I haven't looked into that yet; I guess there is a race condition somewhere in the code path. - TestRouterWebServicesREST, TestDynamometerInfra, TestDistributedShell: have been failing for some time now. By the way, I found that TestDistributedShell does not clean up at all. The problem is that the two failing unit tests leave several processes running for some time. It could be one of the reasons the system crashes, as the background containers are consuming memory and CPU resources. I am going to address that sometime soon. Hopefully this will enhance the stability of the overall Yetus execution. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 524136) Time Spent: 1h 50m (was: 1h 40m) > TestUpgradeDomainBlockPlacementPolicy flaky > --- > > Key: HDFS-15716 > URL: https://issues.apache.org/jira/browse/HDFS-15716 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode, test >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > In some slow runs
[jira] [Commented] (HDFS-15729) Show progress of Balancer in Namenode UI
[ https://issues.apache.org/jira/browse/HDFS-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249306#comment-17249306 ] Wei-Chiu Chuang commented on HDFS-15729: Balancer runs as a separate process, so it is probably not a good idea to add its progress to the NN web UI. That said, we recently made the balancer a daemon, which exposes JMX. Maybe we can add a UI to the balancer and let it show progress. > Show progress of Balancer in Namenode UI > > > Key: HDFS-15729 > URL: https://issues.apache.org/jira/browse/HDFS-15729 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover >Affects Versions: 3.1.4 >Reporter: Srinivasu Majeti >Priority: Major > > It would be nice to have tracking of the Balancer process in the Namenode UI, > to show whether it is running and its current progress. This would be similar > to the Namenode startup progress. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
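A rough illustration of the suggestion above: if a balancer UI did surface progress, it could derive a percentage from counters such as bytes already moved versus bytes left to move. This is only a hedged sketch; the class, the method name, and the figures are illustrative assumptions, not the Balancer's actual JMX attributes.

```java
// Hypothetical sketch of deriving a progress figure for a balancer UI.
// The metric names and values are assumptions for illustration only.
public class BalancerProgressSketch {
    // Fraction of work done, given bytes moved so far and bytes remaining.
    static double progress(long bytesAlreadyMoved, long bytesLeftToMove) {
        long total = bytesAlreadyMoved + bytesLeftToMove;
        // An idle/finished balancer (nothing left to move) reads as complete.
        return total == 0 ? 1.0 : (double) bytesAlreadyMoved / total;
    }

    public static void main(String[] args) {
        // e.g. 750 MiB moved, 250 MiB still to move
        System.out.printf("%.0f%%%n", 100 * progress(750L << 20, 250L << 20)); // prints 75%
    }
}
```

The actual attribute names exposed by the daemonized Balancer's JMX endpoint would need to be confirmed against the running version.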
[jira] [Work logged] (HDFS-15704) Mitigate lease monitor's rapid infinite loop
[ https://issues.apache.org/jira/browse/HDFS-15704?focusedWorklogId=524104=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-524104 ] ASF GitHub Bot logged work on HDFS-15704: - Author: ASF GitHub Bot Created on: 14/Dec/20 20:50 Start Date: 14/Dec/20 20:50 Worklog Time Spent: 10m Work Description: jbrennan333 commented on a change in pull request #2511: URL: https://github.com/apache/hadoop/pull/2511#discussion_r542744744 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java ## @@ -92,21 +91,15 @@ private long lastHolderUpdateTime; private String internalLeaseHolder; + // // Used for handling lock-leases // Mapping: leaseHolder -> Lease - private final SortedMap<String, Lease> leases = new TreeMap<>(); - // Set of: Lease - private final NavigableSet<Lease> sortedLeases = new TreeSet<>( - new Comparator<Lease>() { -@Override -public int compare(Lease o1, Lease o2) { - if (o1.getLastUpdate() != o2.getLastUpdate()) { -return Long.signum(o1.getLastUpdate() - o2.getLastUpdate()); - } else { -return o1.holder.compareTo(o2.holder); -} - }); + // TreeMap has O(log(n)) complexity but it is more space efficient + // compared to HashMap. Therefore, replacing TreeMap with a + // HashMap can be considered to get faster O(1) time complexity + // on the expense of 30% memory waste. + // Review comment: This explanation belongs in the Jira, not in a comment in the code. When looking at current code, it's not really clear why you are talking about TreeMap at all. Also, I think this comment misses the point of the TreeMap. The reason a TreeMap was used here was to maintain a sorted order, which allowed the checkLeases() to exit the while loop as soon as it hit an unexpired lease. The new design removes the need for the TreeMap by pruning the list it passes to checkLeases(). 
## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java ## @@ -541,10 +550,10 @@ public void run() { fsnamesystem.getEditLog().logSync(); } } - - Thread.sleep(fsnamesystem.getLeaseRecheckIntervalMs()); } catch(InterruptedException ie) { - LOG.debug("{} is interrupted", name, ie); + if (LOG.isDebugEnabled()) { Review comment: This LOG.isDebugEnabled() check is not needed anymore. ## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java ## @@ -515,6 +508,13 @@ public void setLeasePeriod(long softLimit, long hardLimit) { this.softLimit = softLimit; this.hardLimit = hardLimit; } + + private synchronized Collection<Lease> getExpiredCandidateLeases() { +final long now = Time.monotonicNow(); +return leases.values().stream() +.filter(lease -> lease.expiredHardLimit(now)) +.collect(Collectors.toCollection(HashSet::new)); + } Review comment: I much prefer the loop in @daryn-sharp's original code. Collection<Lease> expired = new HashSet<>(); for (Lease lease : leases) { if (lease.expiredHardLimit(now)) { expired.add(lease); } } This streams code will have to change if we want to pull this back to branch 2. I think @daryn-sharp also said that stream()'s are more expensive. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 524104) Time Spent: 1h (was: 50m) > Mitigate lease monitor's rapid infinite loop > > > Key: HDFS-15704 > URL: https://issues.apache.org/jira/browse/HDFS-15704 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > [~daryn] reported that the lease monitor goes into a rapid infinite loop if > an exception occurs during a lease recovery. The two main issues are: > # lease monitor thread does not sleep if an exception occurs before looping > again > # the loop peeks at the first element of a sorted tree set so when an > exception occurs, the "bad" lease remains as the first element preventing > recovery of
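The early-exit property discussed in the review above can be sketched in isolation. This is a minimal stand-in, not Hadoop's actual LeaseManager: the nested Lease class, the timestamps, and the limits are illustrative assumptions. The point is that with leases sorted oldest-first, the scan can stop at the first unexpired lease, since everything after it is newer.

```java
import java.util.*;

// Sketch (assumed names, not Hadoop's code) of why a sorted lease set lets
// an expiry scan exit early instead of visiting every lease.
public class EarlyExitSketch {
    static final class Lease {
        final String holder;
        final long lastUpdate;
        Lease(String holder, long lastUpdate) {
            this.holder = holder;
            this.lastUpdate = lastUpdate;
        }
    }

    // Same ordering idea as the removed TreeSet comparator: by lastUpdate,
    // breaking ties by holder so distinct leases never compare equal.
    static NavigableSet<Lease> newSortedLeases() {
        return new TreeSet<>(Comparator.comparingLong((Lease l) -> l.lastUpdate)
                .thenComparing(l -> l.holder));
    }

    // Collect expired holders; break on the first unexpired lease, because
    // oldest-first iteration guarantees no later lease can be expired.
    static List<String> expiredHolders(NavigableSet<Lease> sortedLeases,
                                       long now, long hardLimitMs) {
        List<String> expired = new ArrayList<>();
        for (Lease l : sortedLeases) {
            if (now - l.lastUpdate <= hardLimitMs) {
                break;
            }
            expired.add(l.holder);
        }
        return expired;
    }

    public static void main(String[] args) {
        NavigableSet<Lease> leases = newSortedLeases();
        leases.add(new Lease("a", 850));
        leases.add(new Lease("b", 870));
        leases.add(new Lease("c", 950));
        System.out.println(expiredHolders(leases, 1000, 100)); // prints [a, b]
    }
}
```

The new design drops the sorted set and instead pre-filters to an expired subset, trading the early-exit guarantee for a simpler unordered scan over fewer elements.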
[jira] [Updated] (HDFS-15725) Lease Recovery never completes for a committed block which the DNs never finalize
[ https://issues.apache.org/jira/browse/HDFS-15725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDFS-15725: - Resolution: Fixed Status: Resolved (was: Patch Available) > Lease Recovery never completes for a committed block which the DNs never > finalize > - > > Key: HDFS-15725 > URL: https://issues.apache.org/jira/browse/HDFS-15725 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.4.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Fix For: 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3 > > Attachments: HDFS-15725.001.patch, HDFS-15725.002.patch, > HDFS-15725.003.patch, HDFS-15725.branch-2.10.001.patch, > HDFS-15725.branch-3.2.001.patch, lease_recovery_2_10.patch > > > In a very rare condition, the HDFS client process can get killed right at the > time it is completing a block / file. > The client sends the "complete" call to the namenode, moving the block into a > committed state, but it dies before it can send the final packet to the > Datanodes telling them to finalize the block. > This means the blocks are stuck on the datanodes in RBW state and nothing > will ever tell them to move out of that state. > The namenode / lease manager will retry forever to close the file, but it > will always complain it is waiting for blocks to reach minimal replication. > I have a simple test and patch to fix this, but I think it warrants some > discussion on whether this is the correct thing to do, or if I need to put > the fix behind a config switch. > My idea is that if lease recovery occurs, and the block is still waiting on > "minimal replication", just put the file back to UNDER_CONSTRUCTION so that > on the next lease recovery attempt, BLOCK RECOVERY will happen, close the > file and move the replicas to FINALIZED. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
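The recovery rule proposed in HDFS-15725 above can be sketched as a small state transition. This is a hedged illustration only: the class, enum, and method names are assumptions for clarity, not the names used in the actual patch.

```java
// Sketch (assumed names) of the HDFS-15725 proposal: a COMMITTED file whose
// replicas never reached minimal replication is reverted to
// UNDER_CONSTRUCTION, so the next lease recovery attempt runs block
// recovery and can finalize the stuck RBW replicas.
public class LeaseRecoverySketch {
    enum FileState { COMMITTED, UNDER_CONSTRUCTION, CLOSED }

    static FileState recoverStep(FileState state, boolean minReplicationMet) {
        if (state == FileState.COMMITTED) {
            // Normal path: enough finalized replicas, the file can close.
            if (minReplicationMet) {
                return FileState.CLOSED;
            }
            // Proposed fix: instead of retrying forever, force block
            // recovery by moving the file back to UNDER_CONSTRUCTION.
            return FileState.UNDER_CONSTRUCTION;
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println(recoverStep(FileState.COMMITTED, false)); // prints UNDER_CONSTRUCTION
        System.out.println(recoverStep(FileState.COMMITTED, true));  // prints CLOSED
    }
}
```

Whether this transition should sit behind a config switch is exactly the open question raised in the issue description.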
[jira] [Updated] (HDFS-15725) Lease Recovery never completes for a committed block which the DNs never finalize
[ https://issues.apache.org/jira/browse/HDFS-15725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDFS-15725: - Fix Version/s: 3.2.3 2.10.2 3.1.5 3.4.0 3.3.1 > Lease Recovery never completes for a committed block which the DNs never > finalize > - > > Key: HDFS-15725 > URL: https://issues.apache.org/jira/browse/HDFS-15725 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.4.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Fix For: 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3 > > Attachments: HDFS-15725.001.patch, HDFS-15725.002.patch, > HDFS-15725.003.patch, HDFS-15725.branch-2.10.001.patch, > HDFS-15725.branch-3.2.001.patch, lease_recovery_2_10.patch > > > It a very rare condition, the HDFS client process can get killed right at the > time it is completing a block / file. > The client sends the "complete" call to the namenode, moving the block into a > committed state, but it dies before it can send the final packet to the > Datanodes telling them to finalize the block. > This means the blocks are stuck on the datanodes in RBW state and nothing > will ever tell them to move out of that state. > The namenode / lease manager will retry forever to close the file, but it > will always complain it is waiting for blocks to reach minimal replication. > I have a simple test and patch to fix this, but I think it warrants some > discussion on whether this is the correct thing to do, or if I need to put > the fix behind a config switch. > My idea, is that if lease recovery occurs, and the block is still waiting on > "minimal replication", just put the file back to UNDER_CONSTRUCTION so that > on the next lease recovery attempt, BLOCK RECOVERY will happen, close the > file and move the replicas to FINALIZED. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15725) Lease Recovery never completes for a committed block which the DNs never finalize
[ https://issues.apache.org/jira/browse/HDFS-15725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248969#comment-17248969 ] Stephen O'Donnell commented on HDFS-15725: -- Committed this all the way down to branch-2.10. Thanks for the reviews [~kihwal] and [~szetszwo]. > Lease Recovery never completes for a committed block which the DNs never > finalize > - > > Key: HDFS-15725 > URL: https://issues.apache.org/jira/browse/HDFS-15725 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.4.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Fix For: 3.3.1, 3.4.0, 3.1.5, 2.10.2, 3.2.3 > > Attachments: HDFS-15725.001.patch, HDFS-15725.002.patch, > HDFS-15725.003.patch, HDFS-15725.branch-2.10.001.patch, > HDFS-15725.branch-3.2.001.patch, lease_recovery_2_10.patch > > > It a very rare condition, the HDFS client process can get killed right at the > time it is completing a block / file. > The client sends the "complete" call to the namenode, moving the block into a > committed state, but it dies before it can send the final packet to the > Datanodes telling them to finalize the block. > This means the blocks are stuck on the datanodes in RBW state and nothing > will ever tell them to move out of that state. > The namenode / lease manager will retry forever to close the file, but it > will always complain it is waiting for blocks to reach minimal replication. > I have a simple test and patch to fix this, but I think it warrants some > discussion on whether this is the correct thing to do, or if I need to put > the fix behind a config switch. > My idea, is that if lease recovery occurs, and the block is still waiting on > "minimal replication", just put the file back to UNDER_CONSTRUCTION so that > on the next lease recovery attempt, BLOCK RECOVERY will happen, close the > file and move the replicas to FINALIZED. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13965) hadoop.security.kerberos.ticket.cache.path setting is not honored when KMS encryption is enabled.
[ https://issues.apache.org/jira/browse/HDFS-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248902#comment-17248902 ] Arun Prabu commented on HDFS-13965: --- Thanks for the suggestion. We are planning to use a ticket cache file specific to our software, as that would solve the issue. We will validate this completely and get back. > hadoop.security.kerberos.ticket.cache.path setting is not honored when KMS > encryption is enabled. > - > > Key: HDFS-13965 > URL: https://issues.apache.org/jira/browse/HDFS-13965 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client, kms >Affects Versions: 2.7.3, 2.7.7 >Reporter: LOKESKUMAR VIJAYAKUMAR >Assignee: Kitti Nanasi >Priority: Major > > _We use the *+hadoop.security.kerberos.ticket.cache.path+* setting to provide > a custom kerberos cache path for all hadoop operations to be run as specified > user. But this setting is not honored when KMS encryption is enabled._ > _The below program to read a file works when KMS encryption is not enabled, > but it fails when the KMS encryption is enabled._ > _Looks like *hadoop.security.kerberos.ticket.cache.path* setting is not > honored by *createConnection on KMSClientProvider.java.*_ > > HadoopTest.java (CLASSPATH needs to be set to compile and run) > > import java.io.InputStream; > import java.net.URI; > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.Path; > > public class HadoopTest { > public static int runRead(String[] args) throws Exception{ > if (args.length < 3) { > System.err.println("HadoopTest hadoop_file_path > hadoop_user kerberos_cache"); > return 1; > } > Path inputPath = new Path(args[0]); > Configuration conf = new Configuration(); > URI defaultURI = FileSystem.getDefaultUri(conf); > > conf.set("hadoop.security.kerberos.ticket.cache.path",args[2]); > FileSystem fs = > FileSystem.newInstance(defaultURI,conf,args[1]); > InputStream is = fs.open(inputPath); > byte[] buffer = new byte[4096]; > int nr = is.read(buffer); > while (nr != -1) > { > System.out.write(buffer, 0, nr); > nr = is.read(buffer); > } > return 0; > } > public static void main( String[] args ) throws Exception { > int returnCode = HadoopTest.runRead(args); > System.exit(returnCode); > } > } > > > > [root@lstrost3 testhadoop]# pwd > /testhadoop > > [root@lstrost3 testhadoop]# ls > HadoopTest.java > > [root@lstrost3 testhadoop]# export CLASSPATH=`hadoop classpath --glob`:. > > [root@lstrost3 testhadoop]# javac HadoopTest.java > > [root@lstrost3 testhadoop]# java HadoopTest > HadoopTest hadoop_file_path hadoop_user kerberos_cache > > [root@lstrost3 testhadoop]# java HadoopTest /loki/loki.file loki > /tmp/krb5cc_1006 > 18/09/27 23:23:20 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 18/09/27 23:23:21 WARN shortcircuit.DomainSocketFactory: The short-circuit > local reads feature cannot be used because libhadoop cannot be loaded. 
> Exception in thread "main" java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > GSSException: *{color:#FF}No valid credentials provided (Mechanism level: > Failed to find any Kerberos tgt){color}* > at > {color:#FF}*org.apache.hadoop.crypto.key.kms.KMSClientProvider.createConnection(KMSClientProvider.java:551)*{color} > at > org.apache.hadoop.crypto.key.kms.KMSClientProvider.decryptEncryptedKey(KMSClientProvider.java:831) > at > org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388) > at > org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1393) > at > org.apache.hadoop.hdfs.DFSClient.createWrappedInputStream(DFSClient.java:1463) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:333) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:340) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786) >