[jira] [Created] (HDFS-17539) TestFileChecksum should not spin up a MiniDFSCluster for every test
Felix N created HDFS-17539: -- Summary: TestFileChecksum should not spin up a MiniDFSCluster for every test Key: HDFS-17539 URL: https://issues.apache.org/jira/browse/HDFS-17539 Project: Hadoop HDFS Issue Type: Improvement Reporter: Felix N Assignee: Felix N TestFileChecksum has 34 tests. Add its sibling, the parameterized COMPOSITE_CRC version, and that's 68 times a cluster is spun up and then shut down when twice would be enough (or maybe even once, but 2 is not too bad). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
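The shared-fixture idea behind this ticket can be sketched in plain Java (illustrative only; the real change would use JUnit's class-level setup with an actual MiniDFSCluster, which is not shown here, and all names below are hypothetical):

```java
// Sketch of a class-level shared fixture: the expensive resource (standing in
// for a MiniDFSCluster) is created once for the whole test class instead of
// once per test method.
final class SharedFixture {
    private static int startCount = 0; // how many times the "cluster" started
    private static Object cluster;     // stand-in for the shared cluster

    // Called from every test; only the first call pays the startup cost
    // (this is the role JUnit's @BeforeClass / @BeforeAll would play).
    static synchronized Object getCluster() {
        if (cluster == null) {
            cluster = new Object(); // expensive startup happens here, once
            startCount++;
        }
        return cluster;
    }

    static synchronized int getStartCount() {
        return startCount;
    }
}
```

Every test then calls `getCluster()` and reuses the same instance, so 68 test methods pay for one startup instead of 68.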
[jira] [Commented] (HDFS-17474) [FGL] Make INodeMap thread safe
[ https://issues.apache.org/jira/browse/HDFS-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849863#comment-17849863 ] Felix N commented on HDFS-17474: Hi [~zhanghaobo] , more detailed descriptions will be in the PR once I'm done cleaning up the code (the code is roughly done for now but still needs some tidying up), but here's the rough idea: the underlying data structure used in INodeMap (and BlocksMap also) is a GSet, and it should be made thread-safe (i.e. ThreadSafeGSet). The simplest option is a ConcurrentHashMap, but a HashMap occupies way more memory than the current LightWeightGSet and that makes it not viable as a solution. I have 3 options in mind: * Just a LightWeightGSet but synchronize all its operations * LightWeightGSet with a lock for each element, and operations on an element can synchronize on the lock assigned to that element * LightWeightGSet with a lock for a group of elements (grouped in powers of 2), and operations on an element can synchronize on the lock assigned to that element. Or use a ReentrantReadWriteLock instead of synchronizing on a lock object. I'm still benchmarking, but the implementation I'm leaning towards right now is the 3rd option with lock objects. > [FGL] Make INodeMap thread safe > --- > > Key: HDFS-17474 > URL: https://issues.apache.org/jira/browse/HDFS-17474 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: ZanderXu >Assignee: Felix N >Priority: Major > > Operations related to INodeMap should be handled by the namenode safely, since > operations may access or update INodeMap concurrently.
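The third option above (lock striping over groups of elements, grouped in powers of 2) can be sketched as follows. This is a minimal illustration, not the actual Hadoop code; the class and method names are made up for the example:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of lock striping: one ReentrantReadWriteLock per group of elements,
// with the group count a power of two so the stripe index is a cheap bit mask
// of the element's hash rather than a modulo.
final class StripedLocks {
    private final ReentrantReadWriteLock[] stripes;

    StripedLocks(int stripeCountPowerOfTwo) {
        stripes = new ReentrantReadWriteLock[stripeCountPowerOfTwo];
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
    }

    // Masking with (count - 1) works only for power-of-two counts, and also
    // avoids the negative results that % gives for negative hash codes.
    static int stripeFor(int hash, int stripeCount) {
        return hash & (stripeCount - 1);
    }

    // All elements hashing to the same stripe share one read/write lock.
    ReentrantReadWriteLock lockFor(Object key) {
        return stripes[stripeFor(key.hashCode(), stripes.length)];
    }
}
```

An operation on an element would take `lockFor(key).readLock()` or `.writeLock()` around the GSet access, trading memory (few locks) against contention (elements sharing a stripe serialize their writes).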
[jira] [Updated] (HDFS-17529) RBF: Improve router state store cache entry deletion
[ https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix N updated HDFS-17529: --- Summary: RBF: Improve router state store cache entry deletion (was: Improve router state store cache entry deletion) > RBF: Improve router state store cache entry deletion > > > Key: HDFS-17529 > URL: https://issues.apache.org/jira/browse/HDFS-17529 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, rbf >Reporter: Felix N >Assignee: Felix N >Priority: Major > Labels: pull-request-available > > Current implementation for router state store update is quite inefficient, so > much that when routers are removed and a lot of NameNodeMembership records > are deleted in a short burst, the deletions triggered a router safemode in > our cluster and caused a lot of trouble. > This ticket aims to improve the deletion process for ZK state store > implementation. > See HDFS-17532 for the other half of this improvement
[jira] [Created] (HDFS-17532) Allow router state store cache update to overwrite and delete in parallel
Felix N created HDFS-17532: -- Summary: Allow router state store cache update to overwrite and delete in parallel Key: HDFS-17532 URL: https://issues.apache.org/jira/browse/HDFS-17532 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs, rbf Reporter: Felix N Assignee: Felix N Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of trouble. This ticket aims to allow the overwrite part and delete part of org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords to run in parallel. See HDFS-17529 for the other half of this improvement.
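The proposal above amounts to running the two phases of the cache refresh concurrently. A minimal sketch of that shape (illustrative only; the phase bodies are placeholders for the overwrite loop and the expired-record deletion in CachedRecordStore#overrideExpiredRecords, and the class name is made up):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: submit the overwrite phase and the delete phase to a small pool so
// a slow burst of deletions no longer blocks the overwrite phase (or vice
// versa), then wait for both before declaring the refresh complete.
final class ParallelRefresh {
    static boolean refresh(Runnable overwrite, Runnable delete) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<?> o = pool.submit(overwrite); // overwrite expired records
            Future<?> d = pool.submit(delete);    // delete removed records
            o.get(); // propagate failures from either phase
            d.get();
            return true;
        } catch (InterruptedException | ExecutionException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdown();
        }
    }
}
```

This only helps when the two phases touch disjoint records, which is the case here: one set is being overwritten, the other deleted.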
[jira] [Updated] (HDFS-17529) Improve router state store cache entry deletion
[ https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix N updated HDFS-17529: --- Description: Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of trouble. This ticket aims to improve the deletion process for ZK state store implementation. See HDFS-17532 for the other half of this improvement was: Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of trouble. This ticket aims to improve the deletion process for ZK state store implementation. > Improve router state store cache entry deletion > --- > > Key: HDFS-17529 > URL: https://issues.apache.org/jira/browse/HDFS-17529 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, rbf >Reporter: Felix N >Assignee: Felix N >Priority: Major > Labels: pull-request-available > > Current implementation for router state store update is quite inefficient, so > much that when routers are removed and a lot of NameNodeMembership records > are deleted in a short burst, the deletions triggered a router safemode in > our cluster and caused a lot of trouble. > This ticket aims to improve the deletion process for ZK state store > implementation. > See HDFS-17532 for the other half of this improvement
[jira] [Updated] (HDFS-17529) Improve router state store cache entry deletion
[ https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix N updated HDFS-17529: --- Description: Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of trouble. This ticket aims to improve the deletion process for ZK state store implementation. was: Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of trouble. This ticket contains 2 parts: improving the deletion process for ZK state store implementation, and allowing the overwrite part and delete part of org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords to run in parallel. > Improve router state store cache entry deletion > --- > > Key: HDFS-17529 > URL: https://issues.apache.org/jira/browse/HDFS-17529 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, rbf >Reporter: Felix N >Assignee: Felix N >Priority: Major > Labels: pull-request-available > > Current implementation for router state store update is quite inefficient, so > much that when routers are removed and a lot of NameNodeMembership records > are deleted in a short burst, the deletions triggered a router safemode in > our cluster and caused a lot of trouble. > This ticket aims to improve the deletion process for ZK state store > implementation.
[jira] [Updated] (HDFS-17529) Improve router state store cache entry deletion
[ https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix N updated HDFS-17529: --- Summary: Improve router state store cache entry deletion (was: Improve router state store cache update) > Improve router state store cache entry deletion > --- > > Key: HDFS-17529 > URL: https://issues.apache.org/jira/browse/HDFS-17529 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, rbf >Reporter: Felix N >Assignee: Felix N >Priority: Major > Labels: pull-request-available > > Current implementation for router state store update is quite inefficient, so > much that when routers are removed and a lot of NameNodeMembership records > are deleted in a short burst, the deletions triggered a router safemode in > our cluster and caused a lot of trouble. > This ticket contains 2 parts: improving the deletion process for ZK state > store implementation, and allowing the overwrite part and delete part of > org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords > to run in parallel.
[jira] [Updated] (HDFS-17529) Improve router state store cache update
[ https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix N updated HDFS-17529: --- Description: Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of trouble. This ticket contains 2 parts: improving the deletion process for ZK state store implementation, and allowing the overwrite part and delete part of org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords to run in parallel. was: Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of trouble. This ticket contains 2 parts: improving the deletion process for ZK state store implementation, and allowing the overwrite part and delete part of > Improve router state store cache update > --- > > Key: HDFS-17529 > URL: https://issues.apache.org/jira/browse/HDFS-17529 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, rbf >Reporter: Felix N >Assignee: Felix N >Priority: Major > > Current implementation for router state store update is quite inefficient, so > much that when routers are removed and a lot of NameNodeMembership records > are deleted in a short burst, the deletions triggered a router safemode in > our cluster and caused a lot of trouble. > This ticket contains 2 parts: improving the deletion process for ZK state > store implementation, and allowing the overwrite part and delete part of > org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords > to run in parallel.
[jira] [Created] (HDFS-17529) Improve router state store cache update
Felix N created HDFS-17529: -- Summary: Improve router state store cache update Key: HDFS-17529 URL: https://issues.apache.org/jira/browse/HDFS-17529 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs, rbf Reporter: Felix N Assignee: Felix N Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of trouble. This ticket contains 2 parts: improving the deletion process for ZK state store implementation, and allowing the overwrite part and delete part of
[jira] [Updated] (HDFS-17492) [FGL] Abstract a INodeLockManager to manage acquiring and releasing locks in the directory-tree
[ https://issues.apache.org/jira/browse/HDFS-17492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix N updated HDFS-17492: --- Summary: [FGL] Abstract a INodeLockManager to manage acquiring and releasing locks in the directory-tree (was: [FGL] Abstract a INodeLockManager to mange acquiring and releasing locks in the directory-tree) > [FGL] Abstract a INodeLockManager to manage acquiring and releasing locks in > the directory-tree > --- > > Key: HDFS-17492 > URL: https://issues.apache.org/jira/browse/HDFS-17492 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > > Abstract a INodeLockManager to manage acquiring and releasing locks in the > directory-tree. > # Abstract a lock type to cover all cases in NN > # Acquire the full path lock for the input path based on the input lock type > # Acquire the full path lock for the input iNodeId based on the input lock > type > # Acquire the full path lock for some input paths, such as for rename, concat > > INodeLockManager should return an IIP which contains both iNodes and locks.
[jira] [Commented] (HDFS-17488) DN can fail IBRs with NPE when a volume is removed
[ https://issues.apache.org/jira/browse/HDFS-17488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839556#comment-17839556 ] Felix N commented on HDFS-17488: Hi [~zhanghaobo], thanks for letting me know. I think you can review my PR since it should contain your patch + some extra steps to prevent the situation from appearing + unit tests > DN can fail IBRs with NPE when a volume is removed > -- > > Key: HDFS-17488 > URL: https://issues.apache.org/jira/browse/HDFS-17488 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Felix N >Assignee: Felix N >Priority: Major > Labels: pull-request-available > > > Error logs > {code:java} > 2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 > heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode > (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool > BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid > 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246) > at > org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920) > at java.lang.Thread.run(Thread.java:748) {code} > The root cause is in BPOfferService#notifyNamenodeBlock; the NPE happens when it's > called on a block belonging to a volume that was already removed. 
Because the > volume was already removed > > {code:java} > private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status, > String delHint, String storageUuid, boolean isOnTransientStorage) { > checkBlock(block); > final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo( > block.getLocalBlock(), status, delHint); > final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid); > > // storage == null here because it's already removed earlier. > for (BPServiceActor actor : bpServices) { > actor.getIbrManager().notifyNamenodeBlock(info, storage, > isOnTransientStorage); > } > } {code} > so IBRs with a null storage are now pending. > The reason why notifyNamenodeBlock can trigger on such blocks is up in > DirectoryScanner#reconcile > {code:java} > public void reconcile() throws IOException { > LOG.debug("reconcile start DirectoryScanning"); > scan(); > // If a volume is removed here after scan() already finished running, > // diffs is stale and checkAndUpdate will run on a removed volume > // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too > // long > int loopCount = 0; > synchronized (diffs) { > for (final Map.Entry entry : diffs.getEntries()) { > dataset.checkAndUpdate(entry.getKey(), entry.getValue()); > ... > } {code} > Inside checkAndUpdate, memBlockInfo is null because all the block meta in > memory is removed during the volume removal, but diskFile still exists. Then > DataNode#notifyNamenodeDeletedBlock (and further down the line, > notifyNamenodeBlock) is called on this block.
[jira] [Created] (HDFS-17488) DN can fail IBRs with NPE when a volume is removed
Felix N created HDFS-17488: -- Summary: DN can fail IBRs with NPE when a volume is removed Key: HDFS-17488 URL: https://issues.apache.org/jira/browse/HDFS-17488 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Reporter: Felix N Assignee: Felix N Error logs {code:java} 2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977 java.lang.NullPointerException at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246) at org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920) at java.lang.Thread.run(Thread.java:748) {code} The root cause is in BPOfferService#notifyNamenodeBlock; the NPE happens when it's called on a block belonging to a volume that was already removed. Because the volume was already removed {code:java} private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status, String delHint, String storageUuid, boolean isOnTransientStorage) { checkBlock(block); final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo( block.getLocalBlock(), status, delHint); final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid); // storage == null here because it's already removed earlier. for (BPServiceActor actor : bpServices) { actor.getIbrManager().notifyNamenodeBlock(info, storage, isOnTransientStorage); } } {code} so IBRs with a null storage are now pending. 
The reason why notifyNamenodeBlock can trigger on such blocks is up in DirectoryScanner#reconcile {code:java} public void reconcile() throws IOException { LOG.debug("reconcile start DirectoryScanning"); scan(); // If a volume is removed here after scan() already finished running, // diffs is stale and checkAndUpdate will run on a removed volume // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too // long int loopCount = 0; synchronized (diffs) { for (final Map.Entry entry : diffs.getEntries()) { dataset.checkAndUpdate(entry.getKey(), entry.getValue()); ... } {code} Inside checkAndUpdate, memBlockInfo is null because all the block meta in memory is removed during the volume removal, but diskFile still exists. Then DataNode#notifyNamenodeDeletedBlock (and further down the line, notifyNamenodeBlock) is called on this block.
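A minimal sketch of the defensive side of the fix: drop the notification when the storage lookup comes back null because the volume is already gone. This is an illustration only (the class name is made up, and the actual PR also prevents the stale DirectoryScanner diff from being processed at all):

```java
// Sketch: guard between getStorage(storageUuid) and the IBR submission so a
// null DatanodeStorage never reaches blockReceivedAndDeleted and NPEs there.
final class IbrGuard {
    // Returns true when the namenode notification should be sent.
    static boolean shouldNotify(Object storage) {
        if (storage == null) {
            // The volume was removed between scan() and checkAndUpdate();
            // skipping here (ideally with a warning log) avoids queueing an
            // IBR that can never be sent.
            return false;
        }
        return true;
    }
}
```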
[jira] [Commented] (HDFS-17475) Add a command to check if files are readable
[ https://issues.apache.org/jira/browse/HDFS-17475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838119#comment-17838119 ] Felix N commented on HDFS-17475: Hi [~ayushtkn], the requirements for this feature did indeed come from our production users. While fsck can check if some blocks are missing, AFAIK a successful fsck doesn't guarantee that all blocks are readable. This feature aims to provide a method to verify whether a large number of files are readable without going through the full read pipeline for each file. > Add a command to check if files are readable > > > Key: HDFS-17475 > URL: https://issues.apache.org/jira/browse/HDFS-17475 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Reporter: Felix N >Assignee: Felix N >Priority: Minor > Labels: pull-request-available > > Sometimes a job can fail down the line due to one unreadable file caused by > missing replicas, dead DNs, or other reasons. This command should allow users > to check whether files are readable by checking for metadata on DNs without > executing full read pipelines of the files.
[jira] [Created] (HDFS-17475) Add a command to check if files are readable
Felix N created HDFS-17475: -- Summary: Add a command to check if files are readable Key: HDFS-17475 URL: https://issues.apache.org/jira/browse/HDFS-17475 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Reporter: Felix N Assignee: Felix N Fix For: 3.5.0 Sometimes a job can fail down the line due to one unreadable file caused by missing replicas, dead DNs, or other reasons. This command should allow users to check whether files are readable by checking for metadata on DNs without executing full read pipelines of the files.
[jira] [Commented] (HDFS-17459) [FGL] Summarize this feature
[ https://issues.apache.org/jira/browse/HDFS-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836396#comment-17836396 ] Felix N commented on HDFS-17459: I can help with this one. I assume it's documentation for this feature? > [FGL] Summarize this feature > - > > Key: HDFS-17459 > URL: https://issues.apache.org/jira/browse/HDFS-17459 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > > Write a doc to summarize this feature so we can merge it into the trunk.
[jira] [Created] (HDFS-17271) Web UI DN report shows random order when sorting with dead DNs
Felix N created HDFS-17271: -- Summary: Web UI DN report shows random order when sorting with dead DNs Key: HDFS-17271 URL: https://issues.apache.org/jira/browse/HDFS-17271 Project: Hadoop HDFS Issue Type: Bug Components: namenode, rbf, ui Affects Versions: 3.4.0 Reporter: Felix N Assignee: Felix N Fix For: 3.4.0 Attachments: image-2023-12-01-15-04-11-047.png When sorted by "last contact" in ascending order, dead nodes come up on top in a random order !image-2023-12-01-15-04-11-047.png|width=337,height=263!
[jira] [Commented] (HDFS-16539) RBF: Support refreshing/changing router fairness policy controller without rebooting router
[ https://issues.apache.org/jira/browse/HDFS-16539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528665#comment-17528665 ] Felix N commented on HDFS-16539: Might need to add some documentation somewhere since this patch makes use of the generic refresh and the feature seems to be underutilized for routers. > RBF: Support refreshing/changing router fairness policy controller without > rebooting router > --- > > Key: HDFS-16539 > URL: https://issues.apache.org/jira/browse/HDFS-16539 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Felix N >Assignee: Felix N >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Add support for refreshing/changing router fairness policy controller without > the need to reboot a router.
[jira] [Resolved] (HDFS-16539) RBF: Support refreshing/changing router fairness policy controller without rebooting router
[ https://issues.apache.org/jira/browse/HDFS-16539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix N resolved HDFS-16539. Fix Version/s: 3.4.0 Resolution: Fixed > RBF: Support refreshing/changing router fairness policy controller without > rebooting router > --- > > Key: HDFS-16539 > URL: https://issues.apache.org/jira/browse/HDFS-16539 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Felix N >Assignee: Felix N >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Add support for refreshing/changing router fairness policy controller without > the need to reboot a router.
[jira] [Commented] (HDFS-14750) RBF: Improved isolation for downstream name nodes. {Dynamic}
[ https://issues.apache.org/jira/browse/HDFS-14750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528060#comment-17528060 ] Felix N commented on HDFS-14750: Tried my hand at it since there seem to be no updates on this ticket. The rough idea is to utilize the metrics added by HDFS-16296 and HDFS-16302, spawn a background thread that resizes the semaphores periodically based on the traffic to the namespaces (determined from the metrics). > RBF: Improved isolation for downstream name nodes. {Dynamic} > > > Key: HDFS-14750 > URL: https://issues.apache.org/jira/browse/HDFS-14750 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: CR Hota >Assignee: CR Hota >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This Jira tracks the work around dynamic allocation of resources in routers > for downstream hdfs clusters.
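The "resize the semaphores periodically" idea can be sketched with the standard library. `java.util.concurrent.Semaphore` has no resize method, so permits are added with `release(n)` and removed through the protected `reducePermits(n)` via a small subclass; the class name here is illustrative, not the actual router code:

```java
import java.util.concurrent.Semaphore;

// Sketch: a semaphore whose permit pool a background thread can grow or
// shrink as observed traffic to a namespace changes.
final class ResizableSemaphore extends Semaphore {
    ResizableSemaphore(int permits) {
        super(permits);
    }

    // Adjust the pool from its current configured size to newSize.
    synchronized void resize(int currentSize, int newSize) {
        int delta = newSize - currentSize;
        if (delta > 0) {
            release(delta);        // add permits for a busier namespace
        } else if (delta < 0) {
            reducePermits(-delta); // remove permits; available count may go
                                   // negative until in-flight holders release
        }
    }
}
```

A background thread would read the per-namespace traffic metrics on a fixed period and call `resize` for each namespace's semaphore.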
[jira] [Updated] (HDFS-16539) RBF: Support refreshing/changing router fairness policy controller without rebooting router
[ https://issues.apache.org/jira/browse/HDFS-16539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix N updated HDFS-16539: --- Description: Add support for refreshing/changing router fairness policy controller without the need to reboot a router. > RBF: Support refreshing/changing router fairness policy controller without > rebooting router > --- > > Key: HDFS-16539 > URL: https://issues.apache.org/jira/browse/HDFS-16539 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Felix N >Assignee: Felix N >Priority: Minor > > Add support for refreshing/changing router fairness policy controller without > the need to reboot a router.
[jira] [Created] (HDFS-16539) RBF: Support refreshing/changing router fairness policy controller without rebooting router
Felix N created HDFS-16539: -- Summary: RBF: Support refreshing/changing router fairness policy controller without rebooting router Key: HDFS-16539 URL: https://issues.apache.org/jira/browse/HDFS-16539 Project: Hadoop HDFS Issue Type: Improvement Reporter: Felix N Assignee: Felix N
[jira] [Commented] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured
[ https://issues.apache.org/jira/browse/HDFS-16417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17470308#comment-17470308 ] Felix N commented on HDFS-16417: Hi [~elgoiri], sorry I was using the wrong formatter, hence the imports change. I have submitted a PR with that addressed. > RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if > concurrent ns handler count is configured > --- > > Key: HDFS-16417 > URL: https://issues.apache.org/jira/browse/HDFS-16417 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Reporter: Felix N >Priority: Minor > Labels: pull-request-available > Attachments: HDFS-16417.001.patch > > Time Spent: 10m > Remaining Estimate: 0h > > If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, > unassignedNS is thus empty and {{handlerCount % unassignedNS.size()}} will > throw a /0 exception. > Changed it to assigning extra handlers to concurrent ns in case it's > configured.
[jira] [Updated] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured
[ https://issues.apache.org/jira/browse/HDFS-16417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix N updated HDFS-16417: --- Attachment: HDFS-16417.001.patch Status: Patch Available (was: Open) > RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if > concurrent ns handler count is configured > --- > > Key: HDFS-16417 > URL: https://issues.apache.org/jira/browse/HDFS-16417 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Reporter: Felix N >Priority: Minor > Attachments: HDFS-16417.001.patch > > > If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, > unassignedNS is thus empty and {{handlerCount % unassignedNS.size()}} will > throw a /0 exception. > Changed it to assigning extra handlers to concurrent ns in case it's > configured.
[jira] [Created] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured
Felix N created HDFS-16417:
---
Summary: RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured
Key: HDFS-16417
URL: https://issues.apache.org/jira/browse/HDFS-16417
Project: Hadoop HDFS
Issue Type: Bug
Components: rbf
Reporter: Felix N

If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, unassignedNS ends up empty and {{handlerCount % unassignedNS.size()}} will throw a division-by-zero exception.

Changed it to assign the extra handlers to the concurrent ns when it is configured.
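The failure mode can be sketched with a minimal, self-contained model of the handler-assignment step. Note that the class and method names below are illustrative only, not the actual StaticRouterRpcFairnessPolicyController code: when the concurrent ns also has a configured handler count, no namespace is left unassigned, and the leftover-handler modulo divides by zero. The fix described in the issue hands the leftovers to the concurrent ns instead.

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerAssignmentSketch {
    // Distribute leftover handlers across namespaces that have no configured
    // handler count. Guards the case where that set is empty, which is what
    // happens when dfs.federation.router.fairness.handler.count.concurrent
    // is configured.
    static int assignLeftovers(int leftoverHandlers, List<String> unassignedNS) {
        if (unassignedNS.isEmpty()) {
            // Before the fix: leftoverHandlers % unassignedNS.size()
            // throws ArithmeticException (/ by zero).
            // After the fix: all leftovers go to the concurrent ns.
            return leftoverHandlers;
        }
        return leftoverHandlers / unassignedNS.size();
    }

    public static void main(String[] args) {
        // Concurrent handler count configured: every ns already has a count,
        // so the unassigned set stays empty.
        List<String> unassigned = new ArrayList<>();
        System.out.println(assignLeftovers(5, unassigned)); // prints 5
    }
}
```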
[jira] [Commented] (HDFS-15972) Fedbalance only copies data partially when there's existing opened file
[ https://issues.apache.org/jira/browse/HDFS-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319870#comment-17319870 ]

Felix N commented on HDFS-15972:

Hi [~LiJinglun], is this the expected behavior? During heavy write periods, this could lead to data loss.

> Fedbalance only copies data partially when there's existing opened file
> ---
>
> Key: HDFS-15972
> URL: https://issues.apache.org/jira/browse/HDFS-15972
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Felix N
> Priority: Major
>
> If there are opened files when fedbalance is run and data is being written to
> these files, fedbalance might skip the newly written data.
> Steps to recreate the issue:
> # Create a dummy file /test/file with some data: {{echo "start" | hdfs dfs -appendToFile /test/file}}
> # Start writing to the file: {{hdfs dfs -appendToFile /test/file}} but do not stop writing
> # Run fedbalance: {{hadoop fedbalance submit hdfs://ns1/test hdfs://ns2/test}}
> # Write something to the file while fedbalance is running, "end" for example, then stop writing
> # After fedbalance is done, {{hdfs://ns2/test/file}} should only contain "start" while {{hdfs://ns1/user/hadoop/.Trash/Current/test/file}} contains "start\nend"
>
> Fedbalance is run with default configs and arguments, so no diff should happen.
[jira] [Moved] (HDFS-15972) Fedbalance only copies data partially when there's existing opened file
[ https://issues.apache.org/jira/browse/HDFS-15972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix N moved HADOOP-17634 to HDFS-15972:
---
Key: HDFS-15972 (was: HADOOP-17634)
Project: Hadoop HDFS (was: Hadoop Common)

> Fedbalance only copies data partially when there's existing opened file
> ---
>
> Key: HDFS-15972
> URL: https://issues.apache.org/jira/browse/HDFS-15972
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Felix N
> Priority: Major
>
> If there are opened files when fedbalance is run and data is being written to
> these files, fedbalance might skip the newly written data.
> Steps to recreate the issue:
> # Create a dummy file /test/file with some data: {{echo "start" | hdfs dfs -appendToFile /test/file}}
> # Start writing to the file: {{hdfs dfs -appendToFile /test/file}} but do not stop writing
> # Run fedbalance: {{hadoop fedbalance submit hdfs://ns1/test hdfs://ns2/test}}
> # Write something to the file while fedbalance is running, "end" for example, then stop writing
> # After fedbalance is done, {{hdfs://ns2/test/file}} should only contain "start" while {{hdfs://ns1/user/hadoop/.Trash/Current/test/file}} contains "start\nend"
>
> Fedbalance is run with default configs and arguments, so no diff should happen.
[jira] [Commented] (HDFS-15294) Federation balance tool
[ https://issues.apache.org/jira/browse/HDFS-15294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17224549#comment-17224549 ]

Felix N commented on HDFS-15294:

Thanks for the awesome work! May I ask if there is any plan to backport this feature to 2.x?

> Federation balance tool
> ---
>
> Key: HDFS-15294
> URL: https://issues.apache.org/jira/browse/HDFS-15294
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Jinglun
> Assignee: Jinglun
> Priority: Major
> Fix For: 3.4.0
>
> Attachments: BalanceProcedureScheduler.png, HDFS-15294.001.patch, HDFS-15294.002.patch, HDFS-15294.003.patch, HDFS-15294.003.reupload.patch, HDFS-15294.004.patch, HDFS-15294.005.patch, HDFS-15294.006.patch, HDFS-15294.007.patch, distcp-balance.pdf, distcp-balance.v2.pdf
>
> This jira introduces a new HDFS federation balance tool to balance data across different federation namespaces. It uses Distcp to copy data from the source path to the target path.
> The process is:
> 1. Use distcp and snapshot diff to sync data between src and dst until they are the same.
> 2. Update the mount table in the Router if RBF mode is specified.
> 3. Deal with the src data: move it to trash, delete it, or skip it.
> The design of the fedbalance tool comes from the discussion in HDFS-15087.