[jira] [Commented] (HDFS-12969) DfsAdmin listOpenFiles should report files by type
[ https://issues.apache.org/jira/browse/HDFS-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156508#comment-17156508 ] hemanthboyina commented on HDFS-12969: -- The test failures are not related. Please review. > DfsAdmin listOpenFiles should report files by type > -- > > Key: HDFS-12969 > URL: https://issues.apache.org/jira/browse/HDFS-12969 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.1.0 >Reporter: Manoj Govindassamy >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-12969.001.patch, HDFS-12969.002.patch > > > HDFS-11847 introduced a new option, {{-blockingDecommission}}, to an > existing command, > {{dfsadmin -listOpenFiles}}. But the reporting done by the command doesn't > differentiate the files based on their type (like blocking decommission). In > order to change the reporting style, the proto format used for the base > command has to be updated to carry additional fields, and that is better done in a > new jira outside of HDFS-11847. This jira is to track the end-to-end > enhancements needed for the dfsadmin -listOpenFiles console output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12969) DfsAdmin listOpenFiles should report files by type
[ https://issues.apache.org/jira/browse/HDFS-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-12969: - Attachment: HDFS-12969.002.patch
[jira] [Commented] (HDFS-12969) DfsAdmin listOpenFiles should report files by type
[ https://issues.apache.org/jira/browse/HDFS-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156118#comment-17156118 ] hemanthboyina commented on HDFS-12969: -- Discussed with Manoj offline; thanks [~manojg] for letting me take up the jira, and thanks [~tasanuma] for the comments. {quote}but does it need to change listOpenFiles API? I'm a bit worried about one more deprecated listOpenFiles API {quote} We are adding an additional field to the return value of an existing API, so I don't think we need to deprecate any API. Attached a patch; please review.
[jira] [Updated] (HDFS-12969) DfsAdmin listOpenFiles should report files by type
[ https://issues.apache.org/jira/browse/HDFS-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-12969: - Attachment: HDFS-12969.001.patch Status: Patch Available (was: Open)
[jira] [Commented] (HDFS-15446) CreateSnapshotOp fails during edit log loading for /.reserved/raw/path with error java.io.FileNotFoundException: Directory does not exist: /.reserved/raw/path
[ https://issues.apache.org/jira/browse/HDFS-15446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151150#comment-17151150 ] hemanthboyina commented on HDFS-15446: -- [~ayushtkn], we can go ahead. > CreateSnapshotOp fails during edit log loading for /.reserved/raw/path with > error java.io.FileNotFoundException: Directory does not exist: > /.reserved/raw/path > --- > > Key: HDFS-15446 > URL: https://issues.apache.org/jira/browse/HDFS-15446 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.2.0, 3.3.0 >Reporter: Srinivasu Majeti >Assignee: Stephen O'Donnell >Priority: Major > Labels: reserved-word, snapshot > Attachments: HDFS-15446.001.patch, HDFS-15446.002.patch, > HDFS-15446.003.patch > > > After allowing snapshot creation for a path, say /app-logs, when we try to > create a snapshot on > /.reserved/raw/app-logs, the snapshot creation succeeds, but later, when the Standby Namenode is restarted and tries to load the edit record > OP_CREATE_SNAPSHOT, it fails and the Standby Namenode shuts down with > the exception "java.io.FileNotFoundException: Directory does not exist: > /.reserved/raw/app-logs". 
> Here are the steps to reproduce : > {code:java} > # hdfs dfs -ls /.reserved/raw/ > Found 15 items > drwxrwxrwt - yarn hadoop 0 2020-06-29 10:27 > /.reserved/raw/app-logs > drwxr-xr-x - hive hadoop 0 2020-06-29 10:29 /.reserved/raw/prod > ++ > [root@c3230-node2 ~]# hdfs dfsadmin -allowSnapshot /app-logs > Allowing snapshot on /app-logs succeeded > [root@c3230-node2 ~]# hdfs dfsadmin -allowSnapshot /prod > Allowing snapshot on /prod succeeded > ++ > # hdfs lsSnapshottableDir > drwxrwxrwt 0 yarn hadoop 0 2020-06-29 10:27 1 65536 /app-logs > drwxr-xr-x 0 hive hadoop 0 2020-06-29 10:29 1 65536 /prod > ++ > [root@c3230-node2 ~]# hdfs dfs -createSnapshot /.reserved/raw/app-logs testSS > Created snapshot /.reserved/raw/app-logs/.snapshot/testSS > {code} > Exception we see in the Standby namenode while loading the snapshot creation edit > record. > {code:java} > 2020-06-29 10:33:25,488 ERROR namenode.NameNode (NameNode.java:main(1715)) - > Failed to start namenode. > java.io.FileNotFoundException: Directory does not exist: > /.reserved/raw/app-logs > at > org.apache.hadoop.hdfs.server.namenode.INodeDirectory.valueOf(INodeDirectory.java:60) > at > org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.getSnapshottableRoot(SnapshotManager.java:259) > at > org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.createSnapshot(SnapshotManager.java:307) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:772) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:257) > {code}
[jira] [Commented] (HDFS-15446) CreateSnapshotOp fails during edit log loading for /.reserved/raw/path with error java.io.FileNotFoundException: Directory does not exist: /.reserved/raw/path
[ https://issues.apache.org/jira/browse/HDFS-15446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149591#comment-17149591 ] hemanthboyina commented on HDFS-15446: -- Thanks [~sodonnell] for your work. I had a small question on the latest patch: in the existing code we use getINodesInPath for getting the iip, and getINodesInPath internally calls checkTraverse(pc, iip, resolveLink). I think we are missing the functionality of this method in the latest patch; don't we still require it?
[jira] [Commented] (HDFS-15420) approx scheduled blocks not resetting over time
[ https://issues.apache.org/jira/browse/HDFS-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146956#comment-17146956 ] hemanthboyina commented on HDFS-15420: -- You can check timeoutReReplications (the number of timed-out block re-replications). > approx scheduled blocks not resetting over time > -- > > Key: HDFS-15420 > URL: https://issues.apache.org/jira/browse/HDFS-15420 > Project: Hadoop HDFS > Issue Type: Bug > Components: block placement >Affects Versions: 2.6.0, 3.0.0 > Environment: Our 2.6.0 environment is a 3 node cluster running > cdh5.15.0. > Our 3.0.0 environment is a 4 node cluster running cdh6.3.0. >Reporter: Max Mizikar >Priority: Minor > Attachments: Screenshot from 2020-06-18 09-29-57.png, Screenshot from > 2020-06-18 09-31-15.png > > > We have been experiencing large amounts of scheduled blocks that never get > cleared out. This is preventing blocks from being placed even when there is > plenty of space on the system. > Here is an example of the block growth over 24 hours on one of our systems > running 2.6.0 > !Screenshot from 2020-06-18 09-29-57.png! > Here is an example of the block growth over 24 hours on one of our systems > running 3.0.0 > !Screenshot from 2020-06-18 09-31-15.png! > https://issues.apache.org/jira/browse/HDFS-1172 appears to be the main issue > we were having on 2.6.0, so the growth has decreased since upgrading to 3.0.0; > however, there appears to still be a systemic growth in scheduled blocks over > time, and our systems will still need to restart the namenode on occasion to > reset this count. I have not determined what is causing the leaked blocks in > 3.0.0. > Looking into the issue, I discovered that the intention is for scheduled > blocks to slowly go back down to 0 after errors cause blocks to be leaked. > {code} > /** Increment the number of blocks scheduled. 
*/ > void incrementBlocksScheduled(StorageType t) { > currApproxBlocksScheduled.add(t, 1); > } > > /** Decrement the number of blocks scheduled. */ > void decrementBlocksScheduled(StorageType t) { > if (prevApproxBlocksScheduled.get(t) > 0) { > prevApproxBlocksScheduled.subtract(t, 1); > } else if (currApproxBlocksScheduled.get(t) > 0) { > currApproxBlocksScheduled.subtract(t, 1); > } > // its ok if both counters are zero. > } > > /** Adjusts curr and prev number of blocks scheduled every few minutes. */ > private void rollBlocksScheduled(long now) { > if (now - lastBlocksScheduledRollTime > BLOCKS_SCHEDULED_ROLL_INTERVAL) { > prevApproxBlocksScheduled.set(currApproxBlocksScheduled); > currApproxBlocksScheduled.reset(); > lastBlocksScheduledRollTime = now; > } > } > {code} > However, this code does not do what is intended if the system has a constant > flow of written blocks. If blocks make it into prevApproxBlocksScheduled, the > next scheduled block increments currApproxBlocksScheduled, and when it > completes, it decrements prevApproxBlocksScheduled, preventing the leaked > block from being removed from the approx count. So, for errors to be corrected, we > have to not write any data for the roll period of 10 minutes. The number of > blocks we write per 10 minutes is quite high. This allows the error on the > approx counts to grow to very large numbers. > The comments in the ticket for the original implementation suggest this > issue was known. https://issues.apache.org/jira/browse/HADOOP-3707. However, > it's not clear to me if the severity of it was known at the time. > > So if there are some blocks that are not reported back by the datanode, > > they will eventually get adjusted (usually 10 min; bit longer if datanode > > is continuously receiving blocks). > The comments suggest it will eventually get cleared out, but in our case, it > never gets cleared out. 
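The never-draining behavior described above can be reproduced with a tiny stand-alone model of the two counters. This is a sketch only: `ScheduledCounter` is an illustrative stand-in for the per-datanode curr/prev approx counts, not the actual Hadoop class.

```java
// Minimal model of the curr/prev "approx blocks scheduled" counters from the
// snippet above. increment() records a newly scheduled block, decrement()
// records a completed one (draining prev first, as in the real code), and
// roll() emulates rollBlocksScheduled() firing once per roll interval.
class ScheduledCounter {
  long curr, prev;

  void increment() { curr++; }

  void decrement() {
    if (prev > 0) prev--;
    else if (curr > 0) curr--;
  }

  void roll() { prev = curr; curr = 0; }

  long total() { return curr + prev; }

  public static void main(String[] args) {
    ScheduledCounter c = new ScheduledCounter();
    c.increment(); // one leaked block: scheduled but never reported back

    // Constant write flow: every roll interval, 100 blocks are scheduled and
    // 100 complete. Completions drain prev first, so the leaked unit simply
    // migrates between the two counters and never reaches zero.
    for (int interval = 0; interval < 5; interval++) {
      c.roll();
      for (int k = 0; k < 100; k++) c.increment();
      for (int k = 0; k < 100; k++) c.decrement();
    }
    System.out.println(c.total()); // prints 1: the leak persists

    // Two quiet roll intervals (no writes) discard the stale prev value --
    // the "eventually get adjusted" behavior HADOOP-3707 describes.
    c.roll();
    c.roll();
    System.out.println(c.total()); // prints 0
  }
}
```

Running the busy loop for any number of intervals leaves the count at 1, matching the report that the error only drains when no data is written for a full roll period.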
[jira] [Commented] (HDFS-15434) RBF: MountTableResolver#getDestinationForPath failing with AssertionError from localCache
[ https://issues.apache.org/jira/browse/HDFS-15434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143851#comment-17143851 ] hemanthboyina commented on HDFS-15434: -- Though we couldn't reproduce the issue, we suspect it occurred under high concurrency. Discussed with [~brahmareddy] offline; I think we can add a removalListener to the CacheBuilder to solve the problem. Any suggestions? > RBF: MountTableResolver#getDestinationForPath failing with AssertionError > from localCache > - > > Key: HDFS-15434 > URL: https://issues.apache.org/jira/browse/HDFS-15434 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Priority: Major > > {code:java} > org.apache.hadoop.ipc.RemoteException: java.lang.AssertionError > at > com.google.common.cache.LocalCache$Segment.evictEntries(LocalCache.java:2698) > at > com.google.common.cache.LocalCache$Segment.storeLoadedValue(LocalCache.java:3166) > > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2386) > at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2351) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > at com.google.common.cache.LocalCache.get(LocalCache.java:3965) > at > com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4764) > at > org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver.getDestinationForPath(MountTableResolver.java:382) > at > org.apache.hadoop.hdfs.server.federation.resolver.MultipleDestinationMountTableResolver.getDestinationForPath(MultipleDestinationMountTableResolver.java:87) > at > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getLocationsForPath(RouterRpcServer.java:1406) > at > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getLocationsForPath(RouterRpcServer.java:1389) > at > 
org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getFileInfo(RouterClientProtocol.java:741) > at > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:763) > {code}
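For reference, the removal-listener idea from the comment can be prototyped without Guava using only the JDK. `EvictionAwareCache` below is a hypothetical, simplified stand-in (an access-ordered `LinkedHashMap` that fires a callback on eviction), not the MountTableResolver's actual cache:

```java
import java.util.*;

// Simplified stand-in for a cache with an eviction callback, roughly what a
// Guava removalListener provides: an access-ordered LRU map that invokes a
// listener whenever an entry is evicted, so the caller can clean up
// per-entry state instead of relying on LocalCache internals.
// All names here are illustrative, not Hadoop or Guava APIs.
class EvictionAwareCache<K, V> extends LinkedHashMap<K, V> {
  interface RemovalListener<K, V> { void onRemoval(K key, V value); }

  private final int maxEntries;
  private final RemovalListener<K, V> listener;

  EvictionAwareCache(int maxEntries, RemovalListener<K, V> listener) {
    super(16, 0.75f, true); // access-order: true makes this an LRU cache
    this.maxEntries = maxEntries;
    this.listener = listener;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    if (size() > maxEntries) {
      listener.onRemoval(eldest.getKey(), eldest.getValue());
      return true; // evict the eldest entry
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> evicted = new ArrayList<>();
    EvictionAwareCache<String, String> cache =
        new EvictionAwareCache<>(2, (k, v) -> evicted.add(k));
    cache.put("/a", "ns0");
    cache.put("/b", "ns1");
    cache.put("/c", "ns2"); // capacity exceeded: "/a" is evicted
    System.out.println(evicted); // prints [/a]
  }
}
```

With Guava itself, the equivalent hook is `CacheBuilder.newBuilder().removalListener(...)`, which lets the resolver observe evictions rather than depending on `LocalCache` internals.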
[jira] [Created] (HDFS-15434) RBF: MountTableResolver#getDestinationForPath failing with AssertionError from localCache
hemanthboyina created HDFS-15434: Summary: RBF: MountTableResolver#getDestinationForPath failing with AssertionError from localCache Key: HDFS-15434 URL: https://issues.apache.org/jira/browse/HDFS-15434 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina
[jira] [Commented] (HDFS-15415) Reduce locking in Datanode DirectoryScanner
[ https://issues.apache.org/jira/browse/HDFS-15415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140699#comment-17140699 ] hemanthboyina commented on HDFS-15415: -- Thanks [~sodonnell] for your analysis. After taking the snapshot, if we do not acquire the lock, have you considered the scenario where blocks are being converted from RBW to FINALIZED? {quote}A finalized block could be appended. If that happens both the genstamp and length will change {quote} Agree with you; though the replica will be changed from FINALIZED to RBW, since we are getting only the finalized blocks anyway, it shouldn't be a problem. > Reduce locking in Datanode DirectoryScanner > --- > > Key: HDFS-15415 > URL: https://issues.apache.org/jira/browse/HDFS-15415 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.4.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-15415.001.patch > > > In HDFS-15406, we have a small change to greatly reduce the runtime and > locking time of the datanode DirectoryScanner. There may be room for further > improvement here: > 1. These lines of code in DirectoryScanner#scan() obtain a snapshot of the > finalized blocks from memory, and then sort them, under the DN lock. However > the blocks are stored in a sorted structure (FoldedTreeSet) and hence the > sort should be unnecessary. > {code} > final List bl = dataset.getFinalizedBlocks(bpid); > Collections.sort(bl); // Sort based on blockId > {code} > 2. From the scan step, we have captured a snapshot of what is on disk. After > calling `dataset.getFinalizedBlocks(bpid);` as above we have taken a snapshot > of in memory. The two snapshots are never 100% in sync as things are always > changing as the disk is scanned. > We are only comparing finalized blocks, so they should not really change: > * If a block is deleted after our snapshot, our snapshot will not see it and > that is OK. 
> * A finalized block could be appended. If that happens both the genstamp and > length will change, but that should be handled by reconcile when it calls > `FSDatasetImpl.checkAndUpdate()`, and there is nothing stopping blocks being > appended after they have been scanned from disk, but before they have been > compared with memory. > My suspicion is that we can do all the comparison work outside of the lock > and checkAndUpdate() re-checks any differences later under the lock on a > block by block basis.
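The "compare outside the lock" idea in the description amounts to diffing two snapshots sorted by block id and then re-checking only the differences under the lock. A minimal sketch follows; the types and names are simplified placeholders, not the real DirectoryScanner/ScanInfo classes:

```java
import java.util.*;

// Simplified diff of two snapshots sorted by block id: one taken from disk,
// one from the in-memory volume map. No lock is needed for the merge because
// both inputs are immutable snapshots; only the ids that differ would later
// be re-checked (and genstamp/length verified) under the dataset lock.
class SnapshotDiff {
  static List<Long> diffSortedIds(long[] onDisk, long[] inMemory) {
    List<Long> suspects = new ArrayList<>();
    int i = 0, j = 0;
    while (i < onDisk.length && j < inMemory.length) {
      if (onDisk[i] == inMemory[j]) {          // present in both snapshots
        i++; j++;
      } else if (onDisk[i] < inMemory[j]) {    // on disk, missing in memory
        suspects.add(onDisk[i++]);
      } else {                                 // in memory, missing on disk
        suspects.add(inMemory[j++]);
      }
    }
    while (i < onDisk.length) suspects.add(onDisk[i++]);
    while (j < inMemory.length) suspects.add(inMemory[j++]);
    return suspects;
  }

  public static void main(String[] args) {
    // Block 3 exists only on disk; block 5 exists only in memory.
    System.out.println(diffSortedIds(new long[]{1, 2, 3, 4},
                                     new long[]{1, 2, 4, 5}));
    // prints [3, 5]
  }
}
```

The merge is a single O(n) pass, so moving it outside the lock keeps the locked portion proportional to the (usually tiny) suspect list rather than to the full block count.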
[jira] [Commented] (HDFS-15420) approx scheduled blocks not resetting over time
[ https://issues.apache.org/jira/browse/HDFS-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17139468#comment-17139468 ] hemanthboyina commented on HDFS-15420: -- Thanks [~maxmzkr] for providing the report. A quick question: are there any pending reconstruction requests that have timed out?
[jira] [Commented] (HDFS-15416) The addStorageLocations() method in the DataStorage class is not perfect.
[ https://issues.apache.org/jira/browse/HDFS-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138634#comment-17138634 ] hemanthboyina commented on HDFS-15416: -- Thanks for filing the issue, [~jianghuazhu]. successLocations is a list, so you can check successLocations.isEmpty() and return early: {code:java} final List successLocations = loadDataStorage( datanode, nsInfo, dataDirs, startOpt, executor); {code} > The addStorageLocations() method in the DataStorage class is not perfect. > - > > Key: HDFS-15416 > URL: https://issues.apache.org/jira/browse/HDFS-15416 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.1.0, 3.1.1 >Reporter: jianghua zhu >Priority: Major > > successLocations is a list; when it is empty, loadBlockPoolSliceStorage() does not need to be > executed. > code : > try > { > final List successLocations = loadDataStorage( datanode, > nsInfo, dataDirs, startOpt, executor); > return loadBlockPoolSliceStorage( datanode, nsInfo, successLocations, > startOpt, executor); } > finally > { executor.shutdown(); } 
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138627#comment-17138627 ] hemanthboyina commented on HDFS-15406: -- Thanks [~sodonnell]. The test failure was related to this change; I updated the patch as per your suggestion. Please review. > Improve the speed of Datanode Block Scan > > > Key: HDFS-15406 > URL: https://issues.apache.org/jira/browse/HDFS-15406 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15406.001.patch, HDFS-15406.002.patch > > > In our customer cluster we have approx 10M blocks in one datanode. > For the Datanode to scan all the blocks, it has taken nearly 5 mins. > {code:java} > 2020-06-10 12:17:06,869 | INFO | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: > 11149530, missing metadata files:472, missing block files:472, missing blocks > in memory:0, mismatched blocks:0 | DirectoryScanner.java:473 > 2020-06-10 12:17:06,869 | WARN | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | Lock held time above threshold: lock identifier: > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl > lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. 
The stack trace is: > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148) > org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186) > org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133) > org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) > org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > java.lang.Thread.run(Thread.java:748) > | InstrumentedLock.java:143 {code}
[jira] [Updated] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15406: - Attachment: HDFS-15406.002.patch
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138275#comment-17138275 ] hemanthboyina commented on HDFS-15406: -- Thanks [~sodonnell] for the comment.
{quote}If you don't want to investigate that as part of this Jira, we can create a sub-jira for it. It might be a trivial change to improve things a bit further, perhaps saving about another 5 seconds. {quote}
We can create a sub-Jira to track that change.
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136730#comment-17136730 ] hemanthboyina commented on HDFS-15406: -- Thanks [~sodonnell] for the comment.
{quote} # After you started caching `getBaseURI()` did it improve the runtime of both the getDiskReport() step and compare with in-memory step?{quote}
Yes. On caching getBaseURI(), the time taken improved in both places; more details below.
{quote}2. Looking at the code on trunk, I don't think we create any scanInfo objects under the lock in the compare sections unless there is a difference. If this change improved your runtime under the lock from 6m -> 52 seconds, is this because there is a large number of differences between disk and memory on your cluster for some reason? {quote}
I think you are referring to the creation of ScanInfo objects here, because for creating a ScanInfo object we use vol.getBaseURI(). Even when we do not create ScanInfo objects, we internally call getBaseURI() three times inside the lock, via info.getBlockFile(), info.getGenStamp() and memBlock.compareWith(info). So if there is a large number of differences between disk and in-memory blocks, the calls to getBaseURI() are even more frequent. By caching getBaseURI() we saved at least three calls inside the lock and one outside it, so there was a huge decrease in the lock hold time. We tried creating 11M blocks in our independent cluster: the lock was held for 3 min 20 sec before caching, and after caching getBaseURI() the lock time was 52 sec.
{quote}3. Did you capture any profiles (flame chart or debug log messages) to see how long each part of the code under the lock runs for? I am interested in these lines in DirectoryScanner#scan(): {quote}
I didn't capture profiles, but I could see dataset.getFinalizedBlocks(bpid) in the stack trace, as it has taken more than 600 ms on every iteration.
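The caching discussed in this comment could be sketched roughly as below. This is an illustrative memoization of the base-URI conversion, not the actual Hadoop patch: the class and field names (CachedVolume, cachedBaseURI) are hypothetical, while the `new File(currentDir.getParent()).toURI()` expression matches the getBaseURI() implementation quoted in this thread.

```java
import java.io.File;
import java.net.URI;

// Hypothetical sketch of the optimization discussed above: compute the
// volume's base URI once and reuse it, instead of rebuilding a File and
// URI on every call made under the dataset lock.
class CachedVolume {
    private final File currentDir;
    private volatile URI cachedBaseURI; // computed lazily, then reused

    CachedVolume(File currentDir) {
        this.currentDir = currentDir;
    }

    // Before the change: a new File and URI were created on every call.
    URI getBaseURIUncached() {
        return new File(currentDir.getParent()).toURI();
    }

    // After the change: the conversion happens once per volume; later
    // calls return the cached instance.
    URI getBaseURI() {
        URI uri = cachedBaseURI;
        if (uri == null) {
            uri = new File(currentDir.getParent()).toURI();
            cachedBaseURI = uri;
        }
        return uri;
    }
}
```

The point of the cache is that the File-to-URI conversion runs once per volume instead of at least three times per differing block while the lock is held.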
[jira] [Updated] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15406: - Attachment: HDFS-15406.001.patch Status: Patch Available (was: Open)
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135767#comment-17135767 ] hemanthboyina commented on HDFS-15406: -- Thanks [~brahmareddy] for the comment.
{quote}Not sure, whether HDFS-9668 will address the same. {quote}
The locking contention is being handled through HDFS-15150 and HDFS-15160 by introducing a read/write lock. However, those do not reduce the time the lock is held, which is what this Jira aims to solve; by caching getBaseURI(), the lock time was reduced to 52 sec.
[jira] [Commented] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135756#comment-17135756 ] hemanthboyina commented on HDFS-15406: -- Discussed with [~pilchard] offline; the major drawback in his report was the configuration: "dfs.datanode.directoryscan.threads" was set to 1. Two major points here:
*) If we have more volumes, the thread count (dfs.datanode.directoryscan.threads) impacts the time taken by getDiskReport(), aka getVolumeReports(), as each volume is scanned by its own thread here. If we increase the thread count, the time taken by getDiskReport() will be less.
*) Next, we acquire the lock and compare the report with the in-memory data. For creating a ScanInfo object we use vol.getBaseURI():
{code:java}
FsVolumeSpi#ScanInfo
public ScanInfo(long blockId, File blockFile, File metaFile, FsVolumeSpi vol) {
  String condensedVolPath =
      (vol == null || vol.getBaseURI() == null) ? null :
          getCondensedPath(new File(vol.getBaseURI()).getAbsolutePath());
{code}
We addDifference if there is any mismatch in blockId or blockLength; for that we call getMetaFile() and getBlockFile(), where we again use vol.getBaseURI():
{code:java}
public File getMetaFile() {
  return new File(new File(volume.getBaseURI()).getAbsolutePath(), metaSuffix);
{code}
So if a DN has more blocks, the calls to getBaseURI() are more frequent, and each time we call getBaseURI() we are converting currentDir.getParent() to a URI, which takes time. We can cache this here:
{code:java}
public URI getBaseURI() {
  return new File(currentDir.getParent()).toURI();
}
{code}
On caching this, the lock time reduced to 52 sec.
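Since each volume is scanned by its own thread, raising the scanner thread count is the configuration-level mitigation discussed above. A sketch of the hdfs-site.xml change; the value 4 is purely illustrative, not a recommended default:

```xml
<!-- hdfs-site.xml: raise the DirectoryScanner thread count so data
     volumes are scanned in parallel. 4 is an example value; size it
     relative to the number of data volumes on the DataNode. -->
<property>
  <name>dfs.datanode.directoryscan.threads</name>
  <value>4</value>
</property>
```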
[jira] [Commented] (HDFS-15403) NPE in FileIoProvider#transferToSocketFully
[ https://issues.apache.org/jira/browse/HDFS-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134907#comment-17134907 ] hemanthboyina commented on HDFS-15403: -- {quote}Shouldn't we call onFailure() when e.getMessage() is null? {quote} That's a valid point, [~tasanuma]. I have updated the patch; please review. > NPE in FileIoProvider#transferToSocketFully > --- > > Key: HDFS-15403 > URL: https://issues.apache.org/jira/browse/HDFS-15403 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15403.001.patch, HDFS-15403.002.patch > > > {code:java} > [DataXceiver for client at /127.0.0.1:41904 [Sending block > BP-293397713-127.0.1.1-1591535936877:blk_-9223372036854775786_1001]] ERROR > datanode.DataNode (DataXceiver.java:run(324)) - 127.0.0.1:34789:DataXceiver > error processing READ_BLOCK operation src: /127.0.0.1:41904 dst: > /127.0.0.1:34789[DataXceiver for client at /127.0.0.1:41904 [Sending block > BP-293397713-127.0.1.1-1591535936877:blk_-9223372036854775786_1001]] ERROR > datanode.DataNode (DataXceiver.java:run(324)) - 127.0.0.1:34789:DataXceiver > error processing READ_BLOCK operation src: /127.0.0.1:41904 dst: > /127.0.0.1:34789java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:283) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:614) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:809) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:756) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:610) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104 > at > 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:292) > at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15403) NPE in FileIoProvider#transferToSocketFully
[ https://issues.apache.org/jira/browse/HDFS-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15403: - Attachment: HDFS-15403.002.patch
[jira] [Commented] (HDFS-15403) NPE in FileIoProvider#transferToSocketFully
[ https://issues.apache.org/jira/browse/HDFS-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134700#comment-17134700 ] hemanthboyina commented on HDFS-15403: -- Thanks [~tasanuma] for the review.
{quote} we should call onFailure() when NPE {quote}
The NPE here occurs because e.getMessage() is null; it is not the exception thrown by the try block here.
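The hazard discussed in this thread is inspecting an exception's message without a null check. A minimal sketch, assuming the catch block matches on the message text; the method names below are illustrative, not the actual FileIoProvider code:

```java
import java.io.IOException;

// Illustrative only: an exception constructed without a message returns
// null from getMessage(), so calling String methods on it directly
// throws NullPointerException and masks the original error.
class MessageGuardDemo {
    // Unsafe: NPE whenever the exception carries no message.
    static boolean isBrokenPipeUnsafe(IOException e) {
        return e.getMessage().contains("Broken pipe");
    }

    // Safe: a null message is treated as "no match", so the caller
    // falls through to its normal failure handling (e.g. onFailure()).
    static boolean isBrokenPipeSafe(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.contains("Broken pipe");
    }
}
```

The safe variant preserves the original exception: instead of replacing it with an NPE from the handler itself, the message-less case simply takes the ordinary failure path.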
[jira] [Commented] (HDFS-15403) NPE in FileIoProvider#transferToSocketFully
[ https://issues.apache.org/jira/browse/HDFS-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134177#comment-17134177 ] hemanthboyina commented on HDFS-15403: -- Thanks for the comment, [~elgoiri]. It was a rare scenario; I tried to reproduce it, but couldn't.
[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133296#comment-17133296 ] hemanthboyina commented on HDFS-15391: -- {quote}The block size should be 108764672 in the first CloseOp(TXID=126060942290). When truncate is used, the block size is 63154347. The block used by CloseOp twice is the same instance, which causes the first CloseOp has wrong block size. When the second CloseOp(TXID=126060943585) is executed, the file is not in the UnderConstruction state, and SNN down. {quote} HDFS-15175 has reported a similar kind of issue. > Standby NameNode due loads the corruption edit log, the service exits and > cannot be restarted > - > > Key: HDFS-15391 > URL: https://issues.apache.org/jira/browse/HDFS-15391 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > > In the cluster version 3.2.0 production environment, > we found that due to edit log corruption, Standby NameNode could not > properly load the edit log, resulting in abnormal exit of the service and > failure to restart. > {noformat} > The specific scenario is that Flink writes to HDFS (replication file), and in > the case of an exception to the written file, the following operations are > performed : > 1.close file > 2.open file > 3.truncate file > 4.append file > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15408) Failed execution caused by SocketTimeoutException
[ https://issues.apache.org/jira/browse/HDFS-15408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133225#comment-17133225 ] hemanthboyina commented on HDFS-15408: -- thanks for the filing the issue [~echohlne] at present we have default socket timeout as 1min , do you think 1min is not sufficient to determine the http connection timeout ? > Failed execution caused by SocketTimeoutException > - > > Key: HDFS-15408 > URL: https://issues.apache.org/jira/browse/HDFS-15408 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.1.1 >Reporter: echohlne >Priority: Major > > When I execute command: hdfs fsck / > in the hadoop cluster to check the health of the cluster, It always report > an error execution failure like below: > {code} > Connecting to namenode via http://hadoop20:50070/fsck?ugi=hdfs=%2F > Exception in thread "main" java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735) > at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587) > at > sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492) > at org.apache.hadoop.hdfs.tools.DFSck.doWork(DFSck.java:359) > at org.apache.hadoop.hdfs.tools.DFSck.access$000(DFSck.java:72) > at org.apache.hadoop.hdfs.tools.DFSck$1.run(DFSck.java:159) > at org.apache.hadoop.hdfs.tools.DFSck$1.run(DFSck.java:156) > at 
java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.hdfs.tools.DFSck.run(DFSck.java:155) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at org.apache.hadoop.hdfs.tools.DFSck.main(DFSck.java:402) > {code} > We try to solve this problem by adding a new parameter: > {color:#de350b}*dfs.fsck.http.timeout.ms*{color} to control the > connectionTimeout and the readTimeout of the HttpConnection in DFSck.java. > Please check whether this is the right way to solve the problem? Thanks a lot!
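The proposal boils down to applying one configurable value to both timeouts of the fsck HTTP connection. A sketch under the assumption that the proposed key simply feeds both setters (the property name follows the proposal above; this is not the actual patch):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class FsckTimeoutSketch {
    // Sketch of the proposed change: apply one configured value (the proposed
    // dfs.fsck.http.timeout.ms key) to both the connect and the read timeout
    // of the fsck connection. With java.net, a timeout of 0 means "wait
    // forever", which is the kind of open-ended read the stack trace shows.
    public static HttpURLConnection open(URL url, int timeoutMs) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // no I/O yet
        conn.setConnectTimeout(timeoutMs);
        conn.setReadTimeout(timeoutMs);
        return conn;
    }

    public static void main(String[] args) throws Exception {
        URL fsck = new URL("http://hadoop20:50070/fsck?ugi=hdfs&path=%2F");
        HttpURLConnection conn = open(fsck, 60_000); // 1 min, today's default
        System.out.println(conn.getReadTimeout());   // prints 60000
    }
}
```

Whether 1 min is long enough for a full-namespace fsck on a large cluster is exactly the open question in the comment; making the value configurable sidesteps having to pick one default for everyone.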
[jira] [Commented] (HDFS-12969) DfsAdmin listOpenFiles should report files by type
[ https://issues.apache.org/jira/browse/HDFS-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130934#comment-17130934 ] hemanthboyina commented on HDFS-12969: -- any suggestions for this [~liuml07] [~elgoiri] > DfsAdmin listOpenFiles should report files by type > -- > > Key: HDFS-12969 > URL: https://issues.apache.org/jira/browse/HDFS-12969 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.1.0 >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy >Priority: Major > > HDFS-11847 has introduced a new option to {{-blockingDecommission}} to an > existing command > {{dfsadmin -listOpenFiles}}. But the reporting done by the command doesn't > differentiate the files based on the type (like blocking decommission). In > order to change the reporting style, the proto format used for the base > command has to be updated to carry additional fields and better be done in a > new jira outside of HDFS-11847. This jira is to track the end-to-end > enhancements needed for dfsadmin -listOpenFiles console output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130927#comment-17130927 ] hemanthboyina commented on HDFS-15351: -- yes [~elgoiri] , i think we can go ahead > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15351.001.patch, HDFS-15351.002.patch, > HDFS-15351.003.patch > > > On truncate and append we remove the blocks from Reconstruction Queue > On removing the blocks from pending reconstruction , we need to decrement > Blocks Scheduled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15372) Files in snapshots no longer see attribute provider permissions
[ https://issues.apache.org/jira/browse/HDFS-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130923#comment-17130923 ] hemanthboyina commented on HDFS-15372: -- thanks for the patch [~sodonnell] , the changes looks good > Files in snapshots no longer see attribute provider permissions > --- > > Key: HDFS-15372 > URL: https://issues.apache.org/jira/browse/HDFS-15372 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-15372.001.patch, HDFS-15372.002.patch, > HDFS-15372.003.patch, HDFS-15372.004.patch, HDFS-15372.005.patch > > > Given a cluster with an authorization provider configured (eg Sentry) and the > paths covered by the provider are snapshotable, there was a change in > behaviour in how the provider permissions and ACLs are applied to files in > snapshots between the 2.x branch and Hadoop 3.0. > Eg, if we have the snapshotable path /data, which is Sentry managed. The ACLs > below are provided by Sentry: > {code} > hadoop fs -getfacl -R /data > # file: /data > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/tab1 > # owner: hive > # group: hive > user::rwx > group::--- > group:flume:rwx > user:hive:rwx > group:hive:rwx > group:testgroup:rwx > mask::rwx > other::--x > /data/tab1 > {code} > After taking a snapshot, the files in the snapshot do not see the provider > permissions: > {code} > hadoop fs -getfacl -R /data/.snapshot > # file: /data/.snapshot > # owner: > # group: > user::rwx > group::rwx > other::rwx > # file: /data/.snapshot/snap1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/.snapshot/snap1/tab1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > {code} > However pre-Hadoop 3.0 (when the attribute provider etc was extensively > refactored) snapshots did get the provider permissions. 
> The reason is this code in FSDirectory.java which ultimately calls the > attribute provider and passes the path we want permissions for: > {code} > INodeAttributes getAttributes(INodesInPath iip) > throws IOException { > INode node = FSDirectory.resolveLastINode(iip); > int snapshot = iip.getPathSnapshotId(); > INodeAttributes nodeAttrs = node.getSnapshotINode(snapshot); > UserGroupInformation ugi = NameNode.getRemoteUser(); > INodeAttributeProvider ap = this.getUserFilteredAttributeProvider(ugi); > if (ap != null) { > // permission checking sends the full components array including the > // first empty component for the root. however file status > // related calls are expected to strip out the root component according > // to TestINodeAttributeProvider. > byte[][] components = iip.getPathComponents(); > components = Arrays.copyOfRange(components, 1, components.length); > nodeAttrs = ap.getAttributes(components, nodeAttrs); > } > return nodeAttrs; > } > {code} > The line: > {code} > INode node = FSDirectory.resolveLastINode(iip); > {code} > Picks the last resolved Inode and if you then call node.getPathComponents, > for a path like '/data/.snapshot/snap1/tab1' it will return /data/tab1. It > resolves the snapshot path to its original location, but its still the > snapshot inode. > However the logic passes 'iip.getPathComponents' which returns > "/user/.snapshot/snap1/tab" to the provider. > The pre Hadoop 3.0 code passes the inode directly to the provider, and hence > it only ever sees the path as "/user/data/tab1". > It is debatable which path should be passed to the provider - > /user/.snapshot/snap1/tab or /data/tab1 in the case of snapshots. However as > the behaviour has changed I feel we should ensure the old behaviour is > retained. > It would also be fairly easy to provide a config switch so the provider gets > the full snapshot path or the resolved path. 
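The difference between the two behaviours is easiest to see on the raw path components. A self-contained illustration mirroring the Arrays.copyOfRange call in the snippet above (plain Java, not the FSDirectory code itself):

```java
import java.util.Arrays;

public class PathComponentsDemo {
    // Mirrors the stripping step in getAttributes(): drop the leading empty
    // root component before handing the path to the attribute provider.
    public static String providerPath(String[] components) {
        String[] stripped = Arrays.copyOfRange(components, 1, components.length);
        return "/" + String.join("/", stripped);
    }

    public static void main(String[] args) {
        // Post-3.0: iip.getPathComponents() hands the provider the snapshot path as-is...
        String[] snapshotPath = {"", "data", ".snapshot", "snap1", "tab1"};
        System.out.println(providerPath(snapshotPath)); // /data/.snapshot/snap1/tab1

        // ...whereas pre-3.0 the provider only ever saw the resolved inode path.
        String[] resolvedPath = {"", "data", "tab1"};
        System.out.println(providerPath(resolvedPath)); // /data/tab1
    }
}
```

A provider like Sentry keyed on /data/tab1 finds no rule for /data/.snapshot/snap1/tab1, which is why the snapshot files fall back to the plain inode permissions in the listing above.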
[jira] [Commented] (HDFS-15160) ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl methods should use datanode readlock
[ https://issues.apache.org/jira/browse/HDFS-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130521#comment-17130521 ] hemanthboyina commented on HDFS-15160: -- though the scanner takes more time to scan all the blocks, filed HDFS-15406 to improve it > ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl > methods should use datanode readlock > --- > > Key: HDFS-15160 > URL: https://issues.apache.org/jira/browse/HDFS-15160 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-15160.001.patch, HDFS-15160.002.patch, > HDFS-15160.003.patch, HDFS-15160.004.patch, HDFS-15160.005.patch, > HDFS-15160.006.patch, image-2020-04-10-17-18-08-128.png, > image-2020-04-10-17-18-55-938.png > > > Now we have HDFS-15150, we can start to move some DN operations to use the > read lock rather than the write lock to improve concurrence. The first step > is to make the changes to ReplicaMap, as many other methods make calls to it. > This Jira switches read operations against the volume map to use the readLock > rather than the write lock. > Additionally, some methods make a call to replicaMap.replicas() (eg > getBlockReports, getFinalizedBlocks, deepCopyReplica) and only use the result > in a read only fashion, so they can also be switched to using a readLock. > Next is the directory scanner and disk balancer, which only require a read > lock. > Finally (for this Jira) are various "low hanging fruit" items in BlockSender > and fsdatasetImpl where is it fairly obvious they only need a read lock. > For now, I have avoided changing anything which looks too risky, as I think > its better to do any larger refactoring or risky changes each in their own > Jira.
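The concurrency win this Jira is after is the standard reader/writer split. A minimal sketch of the pattern (plain java.util.concurrent, not the actual FsDatasetImpl change):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DatanodeLockSketch {
    // Minimal sketch of the pattern this Jira applies: read-only operations
    // (directory scans, block reports, replica lookups) take the shared read
    // lock and can run concurrently with each other; mutations still take
    // the exclusive write lock, so they see a consistent volume map.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long replicaCount = 0;

    public long scan() {                      // e.g. DirectoryScanner
        lock.readLock().lock();
        try {
            return replicaCount;
        } finally {
            lock.readLock().unlock();
        }
    }

    public void addReplica() {                // e.g. finalizing a new block
        lock.writeLock().lock();
        try {
            replicaCount++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        DatanodeLockSketch dataset = new DatanodeLockSketch();
        dataset.addReplica();
        dataset.addReplica();
        System.out.println(dataset.scan()); // prints 2
    }
}
```

Under this scheme a long-running scan no longer blocks other readers, though — as the comment notes — a scan over ~10M blocks still holds the read lock for its full duration, which is what HDFS-15406 goes after.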
[jira] [Updated] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15406: - Description: In our customer cluster we have approx 10M blocks in one datanode the Datanode to scans all the blocks , it has taken nearly 5mins {code:java} 2020-06-10 12:17:06,869 | INFO | java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: 11149530, missing metadata files:472, missing block files:472, missing blocks in memory:0, mismatched blocks:0 | DirectoryScanner.java:473 2020-06-10 12:17:06,869 | WARN | java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty queue] | Lock held time above threshold: lock identifier: org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148) org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186) org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133) org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475) org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375) org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang.Thread.run(Thread.java:748) | InstrumentedLock.java:143 {code} was: In our customer cluster we have approx 10M blocks in one datanode When Datanode scans all the blocks , it has taken more time {code:java} 2020-06-10 12:17:06,869 | INFO | java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: 11149530, missing metadata files:472, missing block files:472, missing blocks in memory:0, mismatched blocks:0 | DirectoryScanner.java:473 2020-06-10 12:17:06,869 | WARN | java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty queue] | Lock held time above threshold: lock identifier: org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. 
The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148) org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186) org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133) org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475) org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375) org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang.Thread.run(Thread.java:748) | InstrumentedLock.java:143 {code} > Improve the speed of Datanode Block Scan > > > Key: HDFS-15406 > URL: https://issues.apache.org/jira/browse/HDFS-15406 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > > In our customer cluster we have approx 10M blocks in one datanode > the Datanode to scans all the blocks , it has
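The lockHeldTimeMs=329854 warning quoted in this description comes from hold-time instrumentation around the dataset lock: record when the lock was taken, and on release compare the hold time against a threshold. A simplified sketch of that mechanism (not the real InstrumentedLock):

```java
import java.util.concurrent.locks.ReentrantLock;

public class HeldTimeLockSketch {
    // Simplified take on what InstrumentedLock does: remember when the lock
    // was acquired and, on unlock, report whether the hold time crossed the
    // warning threshold (the quoted log shows a 329854 ms hold for the scan).
    private final ReentrantLock lock = new ReentrantLock();
    private final long thresholdMs;
    private long acquiredAtNanos;

    public HeldTimeLockSketch(long thresholdMs) { this.thresholdMs = thresholdMs; }

    public static boolean exceedsThreshold(long heldMs, long thresholdMs) {
        return heldMs > thresholdMs;
    }

    public void lock() {
        lock.lock();
        acquiredAtNanos = System.nanoTime();
    }

    /** Returns true when the hold time would have triggered the warning. */
    public boolean unlock() {
        long heldMs = (System.nanoTime() - acquiredAtNanos) / 1_000_000;
        lock.unlock();
        return exceedsThreshold(heldMs, thresholdMs);
    }

    public static void main(String[] args) {
        // The hold from the log (329854 ms) against an assumed 300 s threshold:
        System.out.println(exceedsThreshold(329_854, 300_000)); // prints true
    }
}
```

The warning itself is only the symptom; the improvement this Jira asks for is shrinking the time the scan spends under the lock at all.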
[jira] [Updated] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15406: - Description: In our customer cluster we have approx 10M blocks in one datanode When Datanode scans all the blocks , it has taken more time > Improve the speed of Datanode Block Scan > > > Key: HDFS-15406 > URL: https://issues.apache.org/jira/browse/HDFS-15406 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > > In our customer cluster we have approx 10M blocks in one datanode > When Datanode scans all the blocks , it has taken more time -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15406) Improve the speed of Datanode Block Scan
[ https://issues.apache.org/jira/browse/HDFS-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15406: - Description: In our customer cluster we have approx 10M blocks in one datanode When Datanode scans all the blocks , it has taken more time {code:java} 2020-06-10 12:17:06,869 | INFO | java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: 11149530, missing metadata files:472, missing block files:472, missing blocks in memory:0, mismatched blocks:0 | DirectoryScanner.java:473 2020-06-10 12:17:06,869 | WARN | java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty queue] | Lock held time above threshold: lock identifier: org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148) org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186) org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133) org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475) org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375) org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang.Thread.run(Thread.java:748) | InstrumentedLock.java:143 {code} was: In our customer cluster we have approx 10M blocks in one datanode When Datanode scans all the blocks , it has taken more time > Improve the speed of Datanode Block Scan > > > Key: HDFS-15406 > URL: https://issues.apache.org/jira/browse/HDFS-15406 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > > In our customer cluster we have approx 10M blocks in one datanode > When Datanode scans all the blocks , it has taken more time > {code:java} > 2020-06-10 12:17:06,869 | INFO | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | BlockPool BP-1104115233-**.**.**.**-1571300215588 Total blocks: > 11149530, missing metadata files:472, missing block files:472, missing blocks > in memory:0, mismatched blocks:0 | DirectoryScanner.java:473 > 2020-06-10 12:17:06,869 | WARN | > java.util.concurrent.ThreadPoolExecutor$Worker@3b4bea70[State = -1, empty > queue] | Lock held time above threshold: lock identifier: > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl > lockHeldTimeMs=329854 ms. Suppressed 0 lock warnings. 
The stack trace is: > java.lang.Thread.getStackTrace(Thread.java:1559) > org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032) > org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148) > org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186) > org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133) > org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) > org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:475) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:375) > org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:320) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >
[jira] [Created] (HDFS-15406) Improve the speed of Datanode Block Scan
hemanthboyina created HDFS-15406: Summary: Improve the speed of Datanode Block Scan Key: HDFS-15406 URL: https://issues.apache.org/jira/browse/HDFS-15406 Project: Hadoop HDFS Issue Type: Improvement Reporter: hemanthboyina Assignee: hemanthboyina -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15160) ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl methods should use datanode readlock
[ https://issues.apache.org/jira/browse/HDFS-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130497#comment-17130497 ] hemanthboyina edited comment on HDFS-15160 at 6/10/20, 10:48 AM: - thanks [~pilchard] for the report , HDFS-15150 has introduced read and write lock in datanode With HDFS-15160 we acquire read lock for scanner , so the write wont be blocked was (Author: hemanthboyina): thanks [~pilchard] for the report , HDFS-15150 has introduced read and write lock in datanode > ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl > methods should use datanode readlock > --- > > Key: HDFS-15160 > URL: https://issues.apache.org/jira/browse/HDFS-15160 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-15160.001.patch, HDFS-15160.002.patch, > HDFS-15160.003.patch, HDFS-15160.004.patch, HDFS-15160.005.patch, > HDFS-15160.006.patch, image-2020-04-10-17-18-08-128.png, > image-2020-04-10-17-18-55-938.png > > > Now we have HDFS-15150, we can start to move some DN operations to use the > read lock rather than the write lock to improve concurrence. The first step > is to make the changes to ReplicaMap, as many other methods make calls to it. > This Jira switches read operations against the volume map to use the readLock > rather than the write lock. > Additionally, some methods make a call to replicaMap.replicas() (eg > getBlockReports, getFinalizedBlocks, deepCopyReplica) and only use the result > in a read only fashion, so they can also be switched to using a readLock. > Next is the directory scanner and disk balancer, which only require a read > lock. > Finally (for this Jira) are various "low hanging fruit" items in BlockSender > and fsdatasetImpl where is it fairly obvious they only need a read lock. 
> For now, I have avoided changing anything which looks too risky, as I think > its better to do any larger refactoring or risky changes each in their own > Jira. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15160) ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl methods should use datanode readlock
[ https://issues.apache.org/jira/browse/HDFS-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130497#comment-17130497 ] hemanthboyina commented on HDFS-15160: -- thanks [~pilchard] for the report , HDFS-15150 has introduced read and write lock in datanode > ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl > methods should use datanode readlock > --- > > Key: HDFS-15160 > URL: https://issues.apache.org/jira/browse/HDFS-15160 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-15160.001.patch, HDFS-15160.002.patch, > HDFS-15160.003.patch, HDFS-15160.004.patch, HDFS-15160.005.patch, > HDFS-15160.006.patch, image-2020-04-10-17-18-08-128.png, > image-2020-04-10-17-18-55-938.png > > > Now we have HDFS-15150, we can start to move some DN operations to use the > read lock rather than the write lock to improve concurrence. The first step > is to make the changes to ReplicaMap, as many other methods make calls to it. > This Jira switches read operations against the volume map to use the readLock > rather than the write lock. > Additionally, some methods make a call to replicaMap.replicas() (eg > getBlockReports, getFinalizedBlocks, deepCopyReplica) and only use the result > in a read only fashion, so they can also be switched to using a readLock. > Next is the directory scanner and disk balancer, which only require a read > lock. > Finally (for this Jira) are various "low hanging fruit" items in BlockSender > and fsdatasetImpl where is it fairly obvious they only need a read lock. > For now, I have avoided changing anything which looks too risky, as I think > its better to do any larger refactoring or risky changes each in their own > Jira. 
[jira] [Updated] (HDFS-15403) NPE in FileIoProvider#transferToSocketFully
[ https://issues.apache.org/jira/browse/HDFS-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15403: - Attachment: HDFS-15403.001.patch Status: Patch Available (was: Open) > NPE in FileIoProvider#transferToSocketFully > --- > > Key: HDFS-15403 > URL: https://issues.apache.org/jira/browse/HDFS-15403 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15403.001.patch > > > {code:java} > [DataXceiver for client at /127.0.0.1:41904 [Sending block > BP-293397713-127.0.1.1-1591535936877:blk_-9223372036854775786_1001]] ERROR > datanode.DataNode (DataXceiver.java:run(324)) - 127.0.0.1:34789:DataXceiver > error processing READ_BLOCK operation src: /127.0.0.1:41904 dst: > /127.0.0.1:34789[DataXceiver for client at /127.0.0.1:41904 [Sending block > BP-293397713-127.0.1.1-1591535936877:blk_-9223372036854775786_1001]] ERROR > datanode.DataNode (DataXceiver.java:run(324)) - 127.0.0.1:34789:DataXceiver > error processing READ_BLOCK operation src: /127.0.0.1:41904 dst: > /127.0.0.1:34789java.lang.NullPointerException at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:283) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:614) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:809) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:756) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:610) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:292) > at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To 
unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15403) NPE in FileIoProvider#transferToSocketFully
hemanthboyina created HDFS-15403: Summary: NPE in FileIoProvider#transferToSocketFully Key: HDFS-15403 URL: https://issues.apache.org/jira/browse/HDFS-15403 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina Assignee: hemanthboyina -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15403) NPE in FileIoProvider#transferToSocketFully
[ https://issues.apache.org/jira/browse/HDFS-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15403: - Description: {code:java} [DataXceiver for client at /127.0.0.1:41904 [Sending block BP-293397713-127.0.1.1-1591535936877:blk_-9223372036854775786_1001]] ERROR datanode.DataNode (DataXceiver.java:run(324)) - 127.0.0.1:34789:DataXceiver error processing READ_BLOCK operation src: /127.0.0.1:41904 dst: /127.0.0.1:34789[DataXceiver for client at /127.0.0.1:41904 [Sending block BP-293397713-127.0.1.1-1591535936877:blk_-9223372036854775786_1001]] ERROR datanode.DataNode (DataXceiver.java:run(324)) - 127.0.0.1:34789:DataXceiver error processing READ_BLOCK operation src: /127.0.0.1:41904 dst: /127.0.0.1:34789java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:283) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:614) at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:809) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:756) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:610) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:292) at java.lang.Thread.run(Thread.java:748) {code} > NPE in FileIoProvider#transferToSocketFully > --- > > Key: HDFS-15403 > URL: https://issues.apache.org/jira/browse/HDFS-15403 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > > {code:java} > [DataXceiver for client at /127.0.0.1:41904 [Sending block > BP-293397713-127.0.1.1-1591535936877:blk_-9223372036854775786_1001]] ERROR > datanode.DataNode (DataXceiver.java:run(324)) - 
127.0.0.1:34789:DataXceiver > error processing READ_BLOCK operation src: /127.0.0.1:41904 dst: > /127.0.0.1:34789[DataXceiver for client at /127.0.0.1:41904 [Sending block > BP-293397713-127.0.1.1-1591535936877:blk_-9223372036854775786_1001]] ERROR > datanode.DataNode (DataXceiver.java:run(324)) - 127.0.0.1:34789:DataXceiver > error processing READ_BLOCK operation src: /127.0.0.1:41904 dst: > /127.0.0.1:34789java.lang.NullPointerException at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:283) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:614) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:809) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:756) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:610) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:292) > at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12969) DfsAdmin listOpenFiles should report files by type
[ https://issues.apache.org/jira/browse/HDFS-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129660#comment-17129660 ] hemanthboyina commented on HDFS-12969: -- In the present code we can either list all open files, or list only the open files that are blocking an ongoing decommission. On calling dfsadmin -listOpenFiles -blockingDecommission we list only the files blocking decommission, but on calling dfsadmin -listOpenFiles we list all open files, and some of those open files can be blocking an ongoing decommission. So should listOpenFiles return the list grouped by type? > DfsAdmin listOpenFiles should report files by type > -- > > Key: HDFS-12969 > URL: https://issues.apache.org/jira/browse/HDFS-12969 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.1.0 >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy >Priority: Major > > HDFS-11847 has introduced a new option {{-blockingDecommission}} to an > existing command > {{dfsadmin -listOpenFiles}}. But the reporting done by the command doesn't > differentiate the files based on the type (like blocking decommission). In > order to change the reporting style, the proto format used for the base > command has to be updated to carry additional fields and better be done in a > new jira outside of HDFS-11847. This jira is to track the end-to-end > enhancements needed for dfsadmin -listOpenFiles console output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
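One way the type-aware reporting asked about above could be shaped is to bucket the open-file entries by type before printing. This is a sketch only: the `Entry` class and `FileType` names here are illustrative stand-ins, not the actual HDFS client classes.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class OpenFilesByType {
  // Hypothetical type tag mirroring the -blockingDecommission option.
  enum FileType { ALL_OPEN_FILES, BLOCKING_DECOMMISSION }

  // Hypothetical stand-in for an open-file report entry.
  static class Entry {
    final String path;
    final FileType type;
    Entry(String path, FileType type) { this.path = path; this.type = type; }
  }

  // Bucket open-file entries by type so each group can be printed
  // under its own heading in the dfsadmin console output.
  static Map<FileType, List<Entry>> groupByType(List<Entry> entries) {
    Map<FileType, List<Entry>> byType = new EnumMap<>(FileType.class);
    for (Entry e : entries) {
      byType.computeIfAbsent(e.type, t -> new ArrayList<>()).add(e);
    }
    return byType;
  }
}
```

With such a grouping in hand, the console output could print one section per type instead of a single undifferentiated list.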
[jira] [Commented] (HDFS-15372) Files in snapshots no longer see attribute provider permissions
[ https://issues.apache.org/jira/browse/HDFS-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128566#comment-17128566 ] hemanthboyina commented on HDFS-15372: -- Thanks for the work [~sodonnell], overall the code looks good. Some comments: 1) AFAIK only the snapshot INode will have the same id as its parent INode's id, so you could use something like iip.getINode(iip.length()-1).getId() != iip.getINode(iip.length()-1).getParent().getId() instead of checking !iip.isDotSnapshotDirPrefix() 2) In FSPermissionChecker we can get the inode's path components by using INodesInPath#fromINode, but this method requires rootDir, which you would have to obtain when FSDirectory calls FSPermissionChecker#checkTraverse (or in some better way); with that change you can do the same as you have done for FSDirectory#getAttributes. Kindly correct me if I am wrong, thanks. > Files in snapshots no longer see attribute provider permissions > --- > > Key: HDFS-15372 > URL: https://issues.apache.org/jira/browse/HDFS-15372 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-15372.001.patch, HDFS-15372.002.patch > > > Given a cluster with an authorization provider configured (eg Sentry) and the > paths covered by the provider are snapshotable, there was a change in > behaviour in how the provider permissions and ACLs are applied to files in > snapshots between the 2.x branch and Hadoop 3.0. > Eg, if we have the snapshotable path /data, which is Sentry managed. 
The ACLs > below are provided by Sentry: > {code} > hadoop fs -getfacl -R /data > # file: /data > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/tab1 > # owner: hive > # group: hive > user::rwx > group::--- > group:flume:rwx > user:hive:rwx > group:hive:rwx > group:testgroup:rwx > mask::rwx > other::--x > /data/tab1 > {code} > After taking a snapshot, the files in the snapshot do not see the provider > permissions: > {code} > hadoop fs -getfacl -R /data/.snapshot > # file: /data/.snapshot > # owner: > # group: > user::rwx > group::rwx > other::rwx > # file: /data/.snapshot/snap1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/.snapshot/snap1/tab1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > {code} > However pre-Hadoop 3.0 (when the attribute provider etc was extensively > refactored) snapshots did get the provider permissions. > The reason is this code in FSDirectory.java which ultimately calls the > attribute provider and passes the path we want permissions for: > {code} > INodeAttributes getAttributes(INodesInPath iip) > throws IOException { > INode node = FSDirectory.resolveLastINode(iip); > int snapshot = iip.getPathSnapshotId(); > INodeAttributes nodeAttrs = node.getSnapshotINode(snapshot); > UserGroupInformation ugi = NameNode.getRemoteUser(); > INodeAttributeProvider ap = this.getUserFilteredAttributeProvider(ugi); > if (ap != null) { > // permission checking sends the full components array including the > // first empty component for the root. however file status > // related calls are expected to strip out the root component according > // to TestINodeAttributeProvider. 
> byte[][] components = iip.getPathComponents(); > components = Arrays.copyOfRange(components, 1, components.length); > nodeAttrs = ap.getAttributes(components, nodeAttrs); > } > return nodeAttrs; > } > {code} > The line: > {code} > INode node = FSDirectory.resolveLastINode(iip); > {code} > Picks the last resolved Inode and if you then call node.getPathComponents, > for a path like '/data/.snapshot/snap1/tab1' it will return /data/tab1. It > resolves the snapshot path to its original location, but its still the > snapshot inode. > However the logic passes 'iip.getPathComponents' which returns > "/user/.snapshot/snap1/tab" to the provider. > The pre Hadoop 3.0 code passes the inode directly to the provider, and hence > it only ever sees the path as "/user/data/tab1". > It is debatable which path should be passed to the provider - > /user/.snapshot/snap1/tab or /data/tab1 in the case of snapshots. However as > the behaviour has changed I feel we should ensure the old behaviour is > retained. > It would also be fairly easy to provide a config switch so the provider gets > the full snapshot path or the resolved path. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
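The root-stripping step in `getAttributes` quoted above can be illustrated standalone. The sketch below is not the HDFS code itself: it mimics how a components array with a leading empty root component is trimmed with `Arrays.copyOfRange` before being handed to the attribute provider, which is why the provider sees the full `.snapshot` path rather than the resolved one.

```java
import java.util.Arrays;

public class PathComponentsDemo {
  // Split an absolute path into components the way HDFS stores them:
  // index 0 is the empty root component, e.g. "/data/tab1" -> ["", "data", "tab1"].
  static String[] toComponents(String path) {
    return path.split("/", -1);
  }

  // Strip the leading empty root component, as getAttributes does
  // before handing the array to the attribute provider.
  static String[] stripRoot(String[] components) {
    return Arrays.copyOfRange(components, 1, components.length);
  }

  static String join(String[] components) {
    return String.join("/", components);
  }
}
```

For a snapshot path like `/data/.snapshot/snap1/tab1`, the stripped components still contain the `.snapshot` element, which matches the issue report: the provider receives the snapshot path, not the resolved `/data/tab1`.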
[jira] [Commented] (HDFS-15390) client fails forever when namenode ipaddr changed
[ https://issues.apache.org/jira/browse/HDFS-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128243#comment-17128243 ] hemanthboyina commented on HDFS-15390: -- [~seanlook] you can click on the More option under the Jira title; in More you can select Move. > client fails forever when namenode ipaddr changed > - > > Key: HDFS-15390 > URL: https://issues.apache.org/jira/browse/HDFS-15390 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.0, 2.9.2, 3.2.1 >Reporter: Sean Chow >Priority: Major > Attachments: HDFS-15390.01.patch > > > For machine replacement, I replace my standby namenode with a new ipaddr and > keep the same hostname. Also update the client's hosts to make it resolve > correctly > When I try to run failover to transite the new namenode(let's say nn2), the > client will fail to read or write forever until it's restarted. > That make yarn nodemanager in sick state. Even the new tasks will encounter > this exception too. Until all nodemanager restart. > > {code:java} > 20/06/02 15:12:25 WARN ipc.Client: Address change detected. 
Old: > nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000 > 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to > nn2-192-168-1-100/192.168.1.200:9000: Connection refused > java.net.ConnectException: Connection refused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707) > at > org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517) > at org.apache.hadoop.ipc.Client.call(Client.java:1440) > at org.apache.hadoop.ipc.Client.call(Client.java:1401) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy9.addBlock(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399) > at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > {code} > > We can see the client has {{Address change detected}}, but it still fails. I > find out that's because when method {{updateAddress()}} return true, the > {{handleConnectionFailure()}} thow an exception that break the next retry > with the right ipaddr. 
> Client.java: setupConnection() > {code:java} > } catch (ConnectTimeoutException toe) { > /* Check for an address change and update the local reference. >* Reset the failure counter if the address was changed >*/ > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > handleConnectionTimeout(timeoutFailures++, > maxRetriesOnSocketTimeouts, toe); > } catch (IOException ie) { > if (updateAddress()) { > timeoutFailures = ioFailures = 0; > } > // because the namenode ip changed in updateAddress(), the old namenode > ipaddress cannot be accessed now > // handleConnectionFailure will thow an exception, the next retry never have > a chance to use the right server updated in updateAddress() > handleConnectionFailure(ioFailures++, ie); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
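The behaviour the reporter describes — letting the retry that follows `updateAddress()` actually use the refreshed address instead of failing permanently — can be sketched with a toy connect loop. This is illustrative only; the real logic lives in `org.apache.hadoop.ipc.Client#setupConnection`, and the `Connector` interface here is an invented stand-in.

```java
import java.io.IOException;
import java.util.function.Supplier;

public class ReconnectSketch {
  // Hypothetical stand-in for the socket-connect step.
  interface Connector { void connect(String addr) throws IOException; }

  // Try to connect, re-resolving the address on each failure. When the
  // resolved address changes, reset the failure counter and retry with
  // the fresh address instead of treating the failure as fatal.
  static String connectWithAddressRefresh(Supplier<String> resolve,
      Connector connector, int maxFailures) throws IOException {
    String addr = resolve.get();
    int failures = 0;
    while (true) {
      try {
        connector.connect(addr);
        return addr;
      } catch (IOException e) {
        String fresh = resolve.get();
        if (!fresh.equals(addr)) {
          addr = fresh;  // address changed: use it on the next attempt
          failures = 0;  // and do not count the stale-address failure
          continue;
        }
        if (++failures >= maxFailures) {
          throw e;  // same address kept failing: give up
        }
      }
    }
  }
}
```

The key point mirrors the patch's intent: when the address has changed, the connection failure against the stale address must not consume a retry or abort the loop.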
[jira] [Commented] (HDFS-15372) Files in snapshots no longer see attribute provider permissions
[ https://issues.apache.org/jira/browse/HDFS-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127661#comment-17127661 ] hemanthboyina commented on HDFS-15372: -- thanks for the very clear explanation [~sodonnell] > Files in snapshots no longer see attribute provider permissions > --- > > Key: HDFS-15372 > URL: https://issues.apache.org/jira/browse/HDFS-15372 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-15372.001.patch > > > Given a cluster with an authorization provider configured (eg Sentry) and the > paths covered by the provider are snapshotable, there was a change in > behaviour in how the provider permissions and ACLs are applied to files in > snapshots between the 2.x branch and Hadoop 3.0. > Eg, if we have the snapshotable path /data, which is Sentry managed. The ACLs > below are provided by Sentry: > {code} > hadoop fs -getfacl -R /data > # file: /data > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/tab1 > # owner: hive > # group: hive > user::rwx > group::--- > group:flume:rwx > user:hive:rwx > group:hive:rwx > group:testgroup:rwx > mask::rwx > other::--x > /data/tab1 > {code} > After taking a snapshot, the files in the snapshot do not see the provider > permissions: > {code} > hadoop fs -getfacl -R /data/.snapshot > # file: /data/.snapshot > # owner: > # group: > user::rwx > group::rwx > other::rwx > # file: /data/.snapshot/snap1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/.snapshot/snap1/tab1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > {code} > However pre-Hadoop 3.0 (when the attribute provider etc was extensively > refactored) snapshots did get the provider permissions. 
> The reason is this code in FSDirectory.java which ultimately calls the > attribute provider and passes the path we want permissions for: > {code} > INodeAttributes getAttributes(INodesInPath iip) > throws IOException { > INode node = FSDirectory.resolveLastINode(iip); > int snapshot = iip.getPathSnapshotId(); > INodeAttributes nodeAttrs = node.getSnapshotINode(snapshot); > UserGroupInformation ugi = NameNode.getRemoteUser(); > INodeAttributeProvider ap = this.getUserFilteredAttributeProvider(ugi); > if (ap != null) { > // permission checking sends the full components array including the > // first empty component for the root. however file status > // related calls are expected to strip out the root component according > // to TestINodeAttributeProvider. > byte[][] components = iip.getPathComponents(); > components = Arrays.copyOfRange(components, 1, components.length); > nodeAttrs = ap.getAttributes(components, nodeAttrs); > } > return nodeAttrs; > } > {code} > The line: > {code} > INode node = FSDirectory.resolveLastINode(iip); > {code} > Picks the last resolved Inode and if you then call node.getPathComponents, > for a path like '/data/.snapshot/snap1/tab1' it will return /data/tab1. It > resolves the snapshot path to its original location, but its still the > snapshot inode. > However the logic passes 'iip.getPathComponents' which returns > "/user/.snapshot/snap1/tab" to the provider. > The pre Hadoop 3.0 code passes the inode directly to the provider, and hence > it only ever sees the path as "/user/data/tab1". > It is debatable which path should be passed to the provider - > /user/.snapshot/snap1/tab or /data/tab1 in the case of snapshots. However as > the behaviour has changed I feel we should ensure the old behaviour is > retained. > It would also be fairly easy to provide a config switch so the provider gets > the full snapshot path or the resolved path. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15378) TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on trunk
[ https://issues.apache.org/jira/browse/HDFS-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127656#comment-17127656 ] hemanthboyina commented on HDFS-15378: -- Thanks for the comment [~elgoiri]. I took some time to understand why exactly we are facing this issue, and have attached some logs from the test and from the points where the xceiver count and xmits count are read. The datanode (127.0.0.1:38635) has received the command for recovery {code:java} 2020-06-07 18:55:10,052 [Command processor] INFO datanode.DataNode (BPOfferService.java:processCommandFromActive(795)) - DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY 2020-06-07 18:55:10,086 [DataXceiver for client at /127.0.0.1:47330 [Receiving block BP-1804107793-127.0.1.1-1591536306390:blk_-9223372036854775787_1001]] INFO datanode.DataNode (DataXceiver.java:writeBlock(747)) - Receiving BP-1804107793-127.0.1.1-1591536306390:blk_-9223372036854775787_1001 src: /127.0.0.1:47330 dest: /127.0.0.1:38635 {code} In the test, the datanode (127.0.0.1:38635) was checked for its xceiver count and xmits count; as the xceiver count was 1, the test went on to the next line, where we assert the xmits count without a wait, while reconstruction work is happening in parallel {code:java} ##2020-06-07 18:55:10,575 [Listener at localhost/42653] WARN hdfs.TestReconstructStripedFile (TestReconstructStripedFile.java:testErasureCodingWorkerXmitsWeight(577)) - called test xciver count = 1 DatanodeRegistration(127.0.0.1:38635, datanodeUuid=d6f1f0ed-be4d-403a-bc00-32d950681c2a, infoPort=38759, infoSecurePort=0, ipcPort=35273, storageInfo=lv=-57;cid=testClusterID;nsid=1421432615;c=1591536306390) 2020-06-07 18:55:10,575 [Listener at localhost/42653] WARN datanode.DataNode (DataNode.java:getXceiverCount(2252)) - called xciver count on test call 1 2020-06-07 18:55:10,576 [Listener at localhost/42653] WARN hdfs.TestReconstructStripedFile (TestReconstructStripedFile.java:testErasureCodingWorkerXmitsWeight(580)) - called test 
xmits count 2020-06-07 18:55:10,576 [Listener at localhost/42653] WARN datanode.DataNode (DataNode.java:getXmitsInProgress(2292)) - called xmits on test call3 2020-06-07 18:55:10,637 [Listener at localhost/42653] WARN datanode.DataNode (DataNode.java:getXmitsInProgress(2292)) - called xmits on test call3 ##2020-06-07 18:55:10,662 [StripedBlockReconstruction-0] WARN datanode.DataNode (StripedBlockReconstructor.java:run(74)) - Reconstruction happened{code} upon waiting , the xmits has become zero and on completing the task the xciever count also become zero {code:java} 2020-06-07 18:55:10,663 [Block report processor] DEBUG BlockStateChange (BlockManager.java:processIncrementalBlockReport(4331)) - *BLOCK* NameNode.processIncrementalBlockReport: from 127.0.0.1:38635 receiving: 0, received: 1, deleted: 0 2020-06-07 18:55:10,665 [Listener at localhost/39641] WARN datanode.DataNode (DataNode.java:getXceiverCount(2252)) - called xciver count on test call 0 ##2020-06-07 18:55:10,665 [Listener at localhost/42653] WARN hdfs.TestReconstructStripedFile (TestReconstructStripedFile.java:testErasureCodingWorkerXmitsWeight(577)) - called test xciver count = 0 DatanodeRegistration(127.0.0.1:38635, datanodeUuid=d6f1f0ed-be4d-403a-bc00-32d950681c2a, infoPort=38759, infoSecurePort=0, ipcPort=35273, storageInfo=lv=-57;cid=testClusterID;nsid=1421432615;c=1591536306390){code} > TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on > trunk > - > > Key: HDFS-15378 > URL: https://issues.apache.org/jira/browse/HDFS-15378 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Priority: Major > Attachments: HDFS-15378.001.patch > > > [https://builds.apache.org/job/PreCommit-HDFS-Build/29377/#showFailuresLink] > [https://builds.apache.org/job/PreCommit-HDFS-Build/29368/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
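The race described in the logs above — asserting on the xmits count while reconstruction is still in flight — is the classic case for polling until the counter settles rather than asserting immediately. Hadoop tests typically use `GenericTestUtils.waitFor` for this; the sketch below is a generic stand-in, not the Hadoop utility itself.

```java
import java.util.function.BooleanSupplier;

public class WaitFor {
  // Poll the condition until it holds or the timeout elapses.
  // Returns true if the condition became true within the timeout.
  static boolean waitFor(BooleanSupplier condition, long intervalMs,
      long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (condition.getAsBoolean()) {
        return true;
      }
      Thread.sleep(intervalMs);
    }
    return condition.getAsBoolean();  // one last check at the deadline
  }
}
```

In the test this would mean polling until the datanode's xmits count reaches the expected value instead of reading it once right after the xceiver-count check.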
[jira] [Commented] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127413#comment-17127413 ] hemanthboyina commented on HDFS-15351: -- thanks [~belugabehr] for the review > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15351.001.patch, HDFS-15351.002.patch, > HDFS-15351.003.patch > > > On truncate and append we remove the blocks from Reconstruction Queue > On removing the blocks from pending reconstruction , we need to decrement > Blocks Scheduled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15372) Files in snapshots no longer see attribute provider permissions
[ https://issues.apache.org/jira/browse/HDFS-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127412#comment-17127412 ] hemanthboyina commented on HDFS-15372: -- Thanks for the good analysis [~sodonnell]. {quote}With the 001 patch in place, if you try to list /data/.snapshot/snapshot_1, the path seen by the attribute provider is: /user/snapshot_1 Before, it was: /user/.snapshot/snapshot1 When checking a path like /data/.snapshot/snap1 the provider will see /data/snap1, but on the branch-2, it would have seen /data/.snapshot/snap1. {quote} Is the path seen by the attribute provider the same for branch-2 and trunk? It was a bit confusing; can you put it all in one comment with an example for a snapshot path? If we list a path, the path will be resolved into inodes from INodesInPath, and the same inode components will be used by the provider, right? And INodesInPath handles the .snapshot part of a path. While creating a snapshot we add the inode directory as the root of the snapshot {code:java} DirectorySnapshottableFeature#createSnapshot public Snapshot addSnapshot(INodeDirectory snapshotRoot, int id, String name, final Snapshot s = new Snapshot(id, name, snapshotRoot); {code} While getting INodesInPath for a file in a snapshot we use the root of the snapshot to get the file; IMO that means if the file has an ACL, the file under the snapshot root should have the ACL too {code:java} if (isDotSnapshotDir(childName) && dir.isSnapshottable()) { final Snapshot s = dir.getSnapshot(components[count + 1]); else { curNode = s.getRoot(); snapshotId = s.getId(); } {code} Please correct me if I am missing something here. > Files in snapshots no longer see attribute provider permissions > --- > > Key: HDFS-15372 > URL: https://issues.apache.org/jira/browse/HDFS-15372 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-15372.001.patch > > > Given a cluster with an authorization provider configured 
(eg Sentry) and the > paths covered by the provider are snapshotable, there was a change in > behaviour in how the provider permissions and ACLs are applied to files in > snapshots between the 2.x branch and Hadoop 3.0. > Eg, if we have the snapshotable path /data, which is Sentry managed. The ACLs > below are provided by Sentry: > {code} > hadoop fs -getfacl -R /data > # file: /data > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/tab1 > # owner: hive > # group: hive > user::rwx > group::--- > group:flume:rwx > user:hive:rwx > group:hive:rwx > group:testgroup:rwx > mask::rwx > other::--x > /data/tab1 > {code} > After taking a snapshot, the files in the snapshot do not see the provider > permissions: > {code} > hadoop fs -getfacl -R /data/.snapshot > # file: /data/.snapshot > # owner: > # group: > user::rwx > group::rwx > other::rwx > # file: /data/.snapshot/snap1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > # file: /data/.snapshot/snap1/tab1 > # owner: hive > # group: hive > user::rwx > group::rwx > other::--x > {code} > However pre-Hadoop 3.0 (when the attribute provider etc was extensively > refactored) snapshots did get the provider permissions. > The reason is this code in FSDirectory.java which ultimately calls the > attribute provider and passes the path we want permissions for: > {code} > INodeAttributes getAttributes(INodesInPath iip) > throws IOException { > INode node = FSDirectory.resolveLastINode(iip); > int snapshot = iip.getPathSnapshotId(); > INodeAttributes nodeAttrs = node.getSnapshotINode(snapshot); > UserGroupInformation ugi = NameNode.getRemoteUser(); > INodeAttributeProvider ap = this.getUserFilteredAttributeProvider(ugi); > if (ap != null) { > // permission checking sends the full components array including the > // first empty component for the root. however file status > // related calls are expected to strip out the root component according > // to TestINodeAttributeProvider. 
> byte[][] components = iip.getPathComponents(); > components = Arrays.copyOfRange(components, 1, components.length); > nodeAttrs = ap.getAttributes(components, nodeAttrs); > } > return nodeAttrs; > } > {code} > The line: > {code} > INode node = FSDirectory.resolveLastINode(iip); > {code} > Picks the last resolved Inode and if you then call node.getPathComponents, > for a path like '/data/.snapshot/snap1/tab1' it will return /data/tab1. It > resolves the snapshot path to its original location, but its still the > snapshot inode. > However the logic passes 'iip.getPathComponents' which returns > "/user/.snapshot/snap1/tab" to the provider. >
[jira] [Commented] (HDFS-15246) ArrayIndexOfboundsException in BlockManager CreateLocatedBlock
[ https://issues.apache.org/jira/browse/HDFS-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125141#comment-17125141 ] hemanthboyina commented on HDFS-15246: -- Thanks for the review [~elgoiri]! Test failures are not related to this jira. > ArrayIndexOfboundsException in BlockManager CreateLocatedBlock > -- > > Key: HDFS-15246 > URL: https://issues.apache.org/jira/browse/HDFS-15246 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15246-testrepro.patch, HDFS-15246.001.patch, > HDFS-15246.002.patch, HDFS-15246.003.patch > > > java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 > > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1362) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1501) > at > org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:179) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2047) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:770) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
[ https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124998#comment-17124998 ] hemanthboyina commented on HDFS-15375: -- Test failures are not related. {quote}We can't remove {{pendingNum}} from here, it will create extra replication task if this count doesn't include pendingNum {quote} I think it does not create an extra replication task, because the pendingNum count is only used to select the priority level at which the block is added or updated in the priority queue. > Reconstruction Work should not happen for Corrupt Block > --- > > Key: HDFS-15375 > URL: https://issues.apache.org/jira/browse/HDFS-15375 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch > > > In BlockManager#updateNeededReconstructions , while updating the > NeededReconstruction we are adding Pendingreconstruction blocks to live > replicas > {code:java} > int pendingNum = pendingReconstruction.getNumReplicas(block); > int curExpectedReplicas = getExpectedRedundancyNum(block); > if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) { > neededReconstruction.update(block, repl.liveReplicas() + > pendingNum,{code} > But if two replicas were in pending reconstruction (due to corruption) , and > if the third replica is corrupted the block should be in > QUEUE_WITH_CORRUPT_BLOCKS but because of above logic it was getting added in > QUEUE_LOW_REDUNDANCY , this makes the RedudancyMonitor to reconstruct a > corrupted block , which is wrong -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
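The effect of counting pending replicas as live can be shown with a tiny classifier. The queue names are borrowed from the discussion above, but the logic is a deliberately simplified illustration, not the actual `LowRedundancyBlocks` implementation: with 0 live and 2 pending replicas, including `pendingNum` keeps the block out of the corrupt queue even though every stored replica is corrupt.

```java
public class PriorityLevelSketch {
  // Simplified level numbers, for illustration only.
  static final int QUEUE_LOW_REDUNDANCY = 2;
  static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;

  // Pick a queue from replica counts. If pending reconstructions are
  // counted as if they were live, a block with zero live replicas can
  // dodge the corrupt queue.
  static int priorityLevel(int liveReplicas, int pendingReplicas,
      boolean countPending) {
    int effective = liveReplicas + (countPending ? pendingReplicas : 0);
    if (effective == 0) {
      return QUEUE_WITH_CORRUPT_BLOCKS;  // no usable replica at all
    }
    return QUEUE_LOW_REDUNDANCY;
  }
}
```

This mirrors the report: `priorityLevel(0, 2, true)` lands in the low-redundancy queue (triggering reconstruction from corrupt data), while `priorityLevel(0, 2, false)` correctly lands in the corrupt queue.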
[jira] [Commented] (HDFS-15246) ArrayIndexOfboundsException in BlockManager CreateLocatedBlock
[ https://issues.apache.org/jira/browse/HDFS-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120463#comment-17120463 ] hemanthboyina commented on HDFS-15246: -- thanks for the review [~elgoiri] have updated the patch , please review > ArrayIndexOfboundsException in BlockManager CreateLocatedBlock > -- > > Key: HDFS-15246 > URL: https://issues.apache.org/jira/browse/HDFS-15246 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15246-testrepro.patch, HDFS-15246.001.patch, > HDFS-15246.002.patch, HDFS-15246.003.patch > > > java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 > > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1362) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1501) > at > org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:179) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2047) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:770) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15246) ArrayIndexOfboundsException in BlockManager CreateLocatedBlock
[ https://issues.apache.org/jira/browse/HDFS-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15246: - Attachment: HDFS-15246.003.patch > ArrayIndexOfboundsException in BlockManager CreateLocatedBlock > -- > > Key: HDFS-15246 > URL: https://issues.apache.org/jira/browse/HDFS-15246 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15246-testrepro.patch, HDFS-15246.001.patch, > HDFS-15246.002.patch, HDFS-15246.003.patch > > > java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 > > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1362) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1501) > at > org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:179) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2047) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:770) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
[ https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119811#comment-17119811 ] hemanthboyina commented on HDFS-15375: -- Thanks for the comment, [~surendrasingh]. We have a configuration, dfs.namenode.reconstruction.pending.timeout-sec, which defaults to 5 minutes. After 5 minutes, the blocks in pending reconstruction time out and are moved to needed reconstruction by the redundancy monitor thread. On moving to needed reconstruction, the block is kept in QUEUE_WITH_CORRUPT_BLOCKS, and even fsck uses this priority queue to get corrupt blocks via QUEUE_WITH_CORRUPT_BLOCKS, so a data mismatch would happen here too. > Reconstruction Work should not happen for Corrupt Block > --- > > Key: HDFS-15375 > URL: https://issues.apache.org/jira/browse/HDFS-15375 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch > > > In BlockManager#updateNeededReconstructions , while updating the > NeededReconstruction we are adding Pendingreconstruction blocks to live > replicas > {code:java} > int pendingNum = pendingReconstruction.getNumReplicas(block); > int curExpectedReplicas = getExpectedRedundancyNum(block); > if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) { > neededReconstruction.update(block, repl.liveReplicas() + > pendingNum,{code} > But if two replicas were in pending reconstruction (due to corruption) , and > if the third replica is corrupted the block should be in > QUEUE_WITH_CORRUPT_BLOCKS but because of above logic it was getting added in > QUEUE_LOW_REDUNDANCY , this makes the RedundancyMonitor reconstruct a > corrupted block , which is wrong -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
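The queue choice described above can be sketched as a toy model in plain Java. This is not Hadoop code: the method names and the queue-selection conditions are hypothetical stand-ins for the BlockManager logic the comment and the issue description discuss, reduced to the one scenario the bug report names.

```java
// Toy model of the priority-queue selection discussed in HDFS-15375.
// All names and conditions here are illustrative, not the real BlockManager.
public class QueueChoice {
    static final String QUEUE_WITH_CORRUPT_BLOCKS = "QUEUE_WITH_CORRUPT_BLOCKS";
    static final String QUEUE_LOW_REDUNDANCY = "QUEUE_LOW_REDUNDANCY";

    // Buggy variant: pending replicas are counted as if they were live,
    // so a fully corrupt block looks merely under-replicated.
    static String chooseQueueBuggy(int live, int pending, int corrupt, int expected) {
        int effective = live + pending; // repl.liveReplicas() + pendingNum
        return (effective == 0 && corrupt >= expected)
                ? QUEUE_WITH_CORRUPT_BLOCKS : QUEUE_LOW_REDUNDANCY;
    }

    // Fixed variant: pending replicas whose sources were corrupt must not
    // mask a block with no live, non-corrupt replica left.
    static String chooseQueueFixed(int live, int pending, int corrupt, int expected) {
        return (live == 0 && corrupt >= expected)
                ? QUEUE_WITH_CORRUPT_BLOCKS : QUEUE_LOW_REDUNDANCY;
    }

    public static void main(String[] args) {
        // Two replicas pending (from corrupt sources), third replica corrupt:
        System.out.println(chooseQueueBuggy(0, 2, 3, 3)); // QUEUE_LOW_REDUNDANCY (wrong)
        System.out.println(chooseQueueFixed(0, 2, 3, 3)); // QUEUE_WITH_CORRUPT_BLOCKS
    }
}
```

In the scenario from the description (two pending replicas due to corruption, third replica corrupt), the buggy variant reports low redundancy and would let the redundancy monitor reconstruct from corrupt data, while the fixed variant lands in the corrupt-blocks queue.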
[jira] [Commented] (HDFS-14901) RBF: Add Encryption Zone related ClientProtocol APIs
[ https://issues.apache.org/jira/browse/HDFS-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119660#comment-17119660 ] hemanthboyina commented on HDFS-14901: -- We have hit this issue again. {quote}Should we do the federated token approach {quote} Thanks, [~elgoiri], but I think this doesn't work out, as the Router doesn't know which name service the getDataEncryptionKey call was made for. We are thinking of adding the name service to the RPC header (an incompatible change), so the Router can get the key for that NS. Please share your opinion on this. > RBF: Add Encryption Zone related ClientProtocol APIs > > > Key: HDFS-14901 > URL: https://issues.apache.org/jira/browse/HDFS-14901 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14901.001.patch, HDFS-14901.002.patch, > HDFS-14901.003.patch > > > Currently listEncryptionZones, reencryptEncryptionZone, and listReencryptionStatus > APIs are not implemented in the Router. > This JIRA intends to implement the above-mentioned APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15246) ArrayIndexOfboundsException in BlockManager CreateLocatedBlock
[ https://issues.apache.org/jira/browse/HDFS-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119220#comment-17119220 ] hemanthboyina commented on HDFS-15246: -- Thanks, [~elgoiri], for the review. I have updated the patch; please review. > ArrayIndexOfboundsException in BlockManager CreateLocatedBlock > -- > > Key: HDFS-15246 > URL: https://issues.apache.org/jira/browse/HDFS-15246 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15246-testrepro.patch, HDFS-15246.001.patch, > HDFS-15246.002.patch > > > java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 > > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1362) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1501) > at > org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:179) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2047) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:770) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15246) ArrayIndexOfboundsException in BlockManager CreateLocatedBlock
[ https://issues.apache.org/jira/browse/HDFS-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15246: - Attachment: HDFS-15246.002.patch > ArrayIndexOfboundsException in BlockManager CreateLocatedBlock > -- > > Key: HDFS-15246 > URL: https://issues.apache.org/jira/browse/HDFS-15246 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15246-testrepro.patch, HDFS-15246.001.patch, > HDFS-15246.002.patch > > > java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 > > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1362) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1501) > at > org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:179) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2047) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:770) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15378) TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on trunk
[ https://issues.apache.org/jira/browse/HDFS-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15378: - Attachment: HDFS-15378.001.patch Status: Patch Available (was: Open) > TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on > trunk > - > > Key: HDFS-15378 > URL: https://issues.apache.org/jira/browse/HDFS-15378 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Priority: Major > Attachments: HDFS-15378.001.patch > > > [https://builds.apache.org/job/PreCommit-HDFS-Build/29377/#showFailuresLink] > [https://builds.apache.org/job/PreCommit-HDFS-Build/29368/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15378) TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on trunk
[ https://issues.apache.org/jira/browse/HDFS-15378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118963#comment-17118963 ] hemanthboyina commented on HDFS-15378: -- There was a slight delay in the test run; if we wait for curDn.getXmitsInProgress() == 0, the test case passes. > TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on > trunk > - > > Key: HDFS-15378 > URL: https://issues.apache.org/jira/browse/HDFS-15378 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Priority: Major > > [https://builds.apache.org/job/PreCommit-HDFS-Build/29377/#showFailuresLink] > [https://builds.apache.org/job/PreCommit-HDFS-Build/29368/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
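The fix described above follows the usual poll-until-condition pattern (in Hadoop tests this is typically done with GenericTestUtils.waitFor). A stdlib-only sketch of that pattern; the helper and its signature are illustrative, not the actual test code:

```java
import java.util.function.BooleanSupplier;

// Minimal poll-until-condition helper, a stand-in for the kind of wait
// used to let curDn.getXmitsInProgress() drain to 0 before asserting.
public class WaitFor {
    static boolean waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true; // condition met, e.g. xmitsInProgress == 0
            }
            Thread.sleep(intervalMs); // poll instead of asserting immediately
        }
        return false; // timed out; the test would fail here
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Simulate the in-progress xmits draining after ~200 ms.
        boolean ok = waitFor(() -> System.currentTimeMillis() - start > 200, 50, 5000);
        System.out.println(ok); // prints true
    }
}
```

Polling with a bounded timeout is what makes the test robust to scheduling delay: it asserts "eventually zero" rather than "zero right now".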
[jira] [Created] (HDFS-15378) TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on trunk
hemanthboyina created HDFS-15378: Summary: TestReconstructStripedFile#testErasureCodingWorkerXmitsWeight is failing on trunk Key: HDFS-15378 URL: https://issues.apache.org/jira/browse/HDFS-15378 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina [https://builds.apache.org/job/PreCommit-HDFS-Build/29377/#showFailuresLink] [https://builds.apache.org/job/PreCommit-HDFS-Build/29368/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117993#comment-17117993 ] hemanthboyina commented on HDFS-15351: -- {quote} I've always hated this list to array interface... [~belugabehr] you are the expert on these things; any alternative? {quote} Hi [~belugabehr], do you have any alternatives or suggestions for this? > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15351.001.patch, HDFS-15351.002.patch, > HDFS-15351.003.patch > > > On truncate and append we remove the blocks from the Reconstruction Queue. > On removing the blocks from pending reconstruction, we need to decrement > Blocks Scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
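The bookkeeping this issue asks for can be sketched with a toy counter in plain Java. The class and its fields are hypothetical, not the real DatanodeDescriptor/PendingReconstructionBlocks code; the point is only that every removal from the pending queue must be paired with a decrement of the scheduled count:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a per-datanode "blocks scheduled" counter (illustrative only).
public class ScheduledCount {
    final Map<String, Integer> scheduledPerNode = new HashMap<>();

    // Called when a reconstruction is scheduled to a target node.
    void schedule(String node) {
        scheduledPerNode.merge(node, 1, Integer::sum);
    }

    // The reported bug class: removing a block from the pending queue
    // (e.g. on truncate/append) without this decrement leaves the
    // scheduled count permanently inflated.
    void removePending(String node) {
        scheduledPerNode.merge(node, -1, Integer::sum);
    }

    int scheduled(String node) {
        return scheduledPerNode.getOrDefault(node, 0);
    }

    public static void main(String[] args) {
        ScheduledCount c = new ScheduledCount();
        c.schedule("dn1");
        c.schedule("dn1");
        c.removePending("dn1"); // truncate cancels one pending reconstruction
        System.out.println(c.scheduled("dn1")); // prints 1
    }
}
```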
[jira] [Updated] (HDFS-15246) ArrayIndexOfboundsException in BlockManager CreateLocatedBlock
[ https://issues.apache.org/jira/browse/HDFS-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15246: - Attachment: HDFS-15246.001.patch Status: Patch Available (was: Open) > ArrayIndexOfboundsException in BlockManager CreateLocatedBlock > -- > > Key: HDFS-15246 > URL: https://issues.apache.org/jira/browse/HDFS-15246 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15246-testrepro.patch, HDFS-15246.001.patch > > > java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 > > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1362) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1501) > at > org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:179) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2047) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:770) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
[ https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117980#comment-17117980 ] hemanthboyina commented on HDFS-15375: -- Ran the failing tests locally; they seem unrelated. org.apache.hadoop.hdfs.TestReconstructStripedFile.testErasureCodingWorkerXmitsWeight and org.apache.hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy.testErasureCodingWorkerXmitsWeight were failing even without this patch. Following up on these tests, I found they have been failing continuously: [https://builds.apache.org/job/PreCommit-HDFS-Build/29368/] [https://builds.apache.org/job/PreCommit-HDFS-Build/29366/|https://builds.apache.org/job/PreCommit-HDFS-Build/29366/#showFailuresLink] [https://builds.apache.org/job/PreCommit-HDFS-Build/29358/] > Reconstruction Work should not happen for Corrupt Block > --- > > Key: HDFS-15375 > URL: https://issues.apache.org/jira/browse/HDFS-15375 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch > > > In BlockManager#updateNeededReconstructions , while updating the > NeededReconstruction we are adding Pendingreconstruction blocks to live > replicas > {code:java} > int pendingNum = pendingReconstruction.getNumReplicas(block); > int curExpectedReplicas = getExpectedRedundancyNum(block); > if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) { > neededReconstruction.update(block, repl.liveReplicas() + > pendingNum,{code} > But if two replicas were in pending reconstruction (due to corruption) , and > if the third replica is corrupted the block should be in > QUEUE_WITH_CORRUPT_BLOCKS but because of above logic it was getting added in > QUEUE_LOW_REDUNDANCY , this makes the RedundancyMonitor reconstruct a > corrupted block , which is wrong -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org 
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15376) Update the error about command line POST in httpfs documentation
[ https://issues.apache.org/jira/browse/HDFS-15376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116921#comment-17116921 ] hemanthboyina commented on HDFS-15376: -- No , this is fine [~elgoiri] > Update the error about command line POST in httpfs documentation > > > Key: HDFS-15376 > URL: https://issues.apache.org/jira/browse/HDFS-15376 > Project: Hadoop HDFS > Issue Type: Improvement > Components: httpfs >Affects Versions: 3.2.1 >Reporter: bianqi >Assignee: bianqi >Priority: Major > Attachments: HDFS-15376.001.patch > > > In the official Hadoop documentation, there is an exception when executing > the following command. > {quote} {{curl -X POST > 'http://httpfs-host:14000/webhdfs/v1/user/foo/bar?op=MKDIRS=foo'}} > creates the HDFS {{/user/foo/bar}} directory. > {quote} > Command line returns results: > {quote} *{"RemoteException":{"message":"Invalid HTTP POST operation > [MKDIRS]","exception":"IOException","javaClassName":"java.io.IOException"}}* > {quote} > > I checked the source code and found that the way to create the file should > use PUT to submit the form. > I modified to execute the command in PUT mode and got the result as > follows > {quote} {{curl -X PUT > 'http://httpfs-host:14000/webhdfs/v1/user/foo/bar?op=MKDIRS=foo'}} > creates the HDFS {{/user/foo/bar}} directory. > {quote} > Command line returns results: > {"boolean":true} > . At the same time the folder is created successfully. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
[ https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15375: - Attachment: HDFS-15375.001.patch Status: Patch Available (was: Open) > Reconstruction Work should not happen for Corrupt Block > --- > > Key: HDFS-15375 > URL: https://issues.apache.org/jira/browse/HDFS-15375 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch > > > In BlockManager#updateNeededReconstructions , while updating the > NeededReconstruction we are adding Pendingreconstruction blocks to live > replicas > {code:java} > int pendingNum = pendingReconstruction.getNumReplicas(block); > int curExpectedReplicas = getExpectedRedundancyNum(block); > if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) { > neededReconstruction.update(block, repl.liveReplicas() + > pendingNum,{code} > But if two replicas were in pending reconstruction (due to corruption) , and > if the third replica is corrupted the block should be in > QUEUE_WITH_CORRUPT_BLOCKS but because of above logic it was getting added in > QUEUE_LOW_REDUNDANCY , this makes the RedudancyMonitor to reconstruct a > corrupted block , which is wrong -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14984) HDFS setQuota: Error message should be added for invalid input max range value to hdfs dfsadmin -setQuota command
[ https://issues.apache.org/jira/browse/HDFS-14984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116475#comment-17116475 ] hemanthboyina commented on HDFS-14984: -- Thanks for the interest, [~zhaoyim]. You can work on this issue and assign it to yourself. > HDFS setQuota: Error message should be added for invalid input max range > value to hdfs dfsadmin -setQuota command > - > > Key: HDFS-14984 > URL: https://issues.apache.org/jira/browse/HDFS-14984 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.1.2 >Reporter: Souryakanta Dwivedy >Priority: Minor > Attachments: image-2019-11-13-14-05-19-603.png, > image-2019-11-13-14-07-04-536.png > > > An error message should be added for invalid input max range value > "9223372036854775807" to the hdfs dfsadmin -setQuota command. > * Set quota for a directory with the invalid input value "9223372036854775807": > the command succeeds without displaying any result. The quota value is not set > for the directory internally, but from a usability point of view it would be > better if an error message were displayed for the invalid max range value > "9223372036854775807", as is done when the input value is "0". For example: "hdfs > dfsadmin -setQuota 9223372036854775807 /quota" > !image-2019-11-13-14-05-19-603.png! > > * Try to set quota for a directory with the invalid input value "0": it > throws the error message "setQuota: Invalid values for quota : 0 and > 9223372036854775807". For example: "hdfs dfsadmin -setQuota 0 /quota" > !image-2019-11-13-14-07-04-536.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
[ https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15375: - Description: In BlockManager#updateNeededReconstructions , while updating the NeededReconstruction we are adding Pendingreconstruction blocks to live replicas {code:java} int pendingNum = pendingReconstruction.getNumReplicas(block); int curExpectedReplicas = getExpectedRedundancyNum(block); if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) { neededReconstruction.update(block, repl.liveReplicas() + pendingNum,{code} But if two replicas were in pending reconstruction (due to corruption) , and if the third replica is corrupted the block should be in QUEUE_WITH_CORRUPT_BLOCKS but because of above logic it was getting added in QUEUE_LOW_REDUNDANCY , this makes the RedudancyMonitor to reconstruct a corrupted block , which is wrong > Reconstruction Work should not happen for Corrupt Block > --- > > Key: HDFS-15375 > URL: https://issues.apache.org/jira/browse/HDFS-15375 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15375-testrepro.patch > > > In BlockManager#updateNeededReconstructions , while updating the > NeededReconstruction we are adding Pendingreconstruction blocks to live > replicas > {code:java} > int pendingNum = pendingReconstruction.getNumReplicas(block); > int curExpectedReplicas = getExpectedRedundancyNum(block); > if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) { > neededReconstruction.update(block, repl.liveReplicas() + > pendingNum,{code} > But if two replicas were in pending reconstruction (due to corruption) , and > if the third replica is corrupted the block should be in > QUEUE_WITH_CORRUPT_BLOCKS but because of above logic it was getting added in > QUEUE_LOW_REDUNDANCY , this makes the RedudancyMonitor to reconstruct a > corrupted block , which is wrong -- This message was sent by Atlassian Jira 
(v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
[ https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15375: - Attachment: HDFS-15375-testrepro.patch > Reconstruction Work should not happen for Corrupt Block > --- > > Key: HDFS-15375 > URL: https://issues.apache.org/jira/browse/HDFS-15375 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15375-testrepro.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
hemanthboyina created HDFS-15375: Summary: Reconstruction Work should not happen for Corrupt Block Key: HDFS-15375 URL: https://issues.apache.org/jira/browse/HDFS-15375 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina Assignee: hemanthboyina -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15288) Add Available Space Rack Fault Tolerant BPP
[ https://issues.apache.org/jira/browse/HDFS-15288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115122#comment-17115122 ] hemanthboyina edited comment on HDFS-15288 at 5/24/20, 11:07 AM: - Good work, [~ayushtkn]. Just a small query: in AvailableSpaceRackFaultTolerantBlockPlacementPolicy#chooseDataNode, instead of chooseRandomWithStorageType can we use chooseRandomWithStorageTypeTwoTrial, as it was in AvailableSpaceBPP? was (Author: hemanthboyina): Good work [~ayushtkn] , just a small doubt > Add Available Space Rack Fault Tolerant BPP > --- > > Key: HDFS-15288 > URL: https://issues.apache.org/jira/browse/HDFS-15288 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15288-01.patch, HDFS-15288-02.patch, > HDFS-15288-03.patch > > > The present {{AvailableSpaceBlockPlacementPolicy}} extends the default Block > Placement policy, which makes it apt for replicated files, but not very > efficient for EC files, which by default use > {{BlockPlacementPolicyRackFaultTolerant}}. So propose to add a new BPP having > similar optimization as ASBPP while keeping the spread of blocks to max > racks, i.e. as RackFaultTolerantBPP. > This could extend {{BlockPlacementPolicyRackFaultTolerant}}, rather than > {{BlockPlacementPolicyDefault}} like ASBPP, and keep the other optimization > logic the same as ASBPP. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15288) Add Available Space Rack Fault Tolerant BPP
[ https://issues.apache.org/jira/browse/HDFS-15288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115122#comment-17115122 ] hemanthboyina commented on HDFS-15288: -- Good work [~ayushtkn] , just a small doubt > Add Available Space Rack Fault Tolerant BPP > --- > > Key: HDFS-15288 > URL: https://issues.apache.org/jira/browse/HDFS-15288 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15288-01.patch, HDFS-15288-02.patch, > HDFS-15288-03.patch > > > The Present {{AvailableSpaceBlockPlacementPolicy}} extends the Default Block > Placement policy, which makes it apt for Replicated files. But not very > efficient for EC files, which by default use. > {{BlockPlacementPolicyRackFaultTolerant}}. So propose a to add new BPP having > similar optimization as ASBPP where as keeping the spread of Blocks to max > racks, i.e as RackFaultTolerantBPP. > This could extend {{BlockPlacementPolicyRackFaultTolerant}}, rather than the > {{BlockPlacementPOlicyDefault}} like ASBPP and keep other logics of > optimization same as ASBPP -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15362) FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all distinct blocks
[ https://issues.apache.org/jira/browse/HDFS-15362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113450#comment-17113450 ] hemanthboyina commented on HDFS-15362: -- Thanks, [~elgoiri], for the review. I have updated the patch; please review. > FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all > distinct blocks > -- > > Key: HDFS-15362 > URL: https://issues.apache.org/jira/browse/HDFS-15362 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15362.001.patch, HDFS-15362.002.patch > > > FileWithSnapshotFeature#updateQuotaAndCollectBlocks uses a list to collect > blocks > {code:java} > List<BlockInfo> allBlocks = new ArrayList<>(); > if (file.getBlocks() != null) { > allBlocks.addAll(Arrays.asList(file.getBlocks())); > }{code} > INodeFile#storagespaceConsumedContiguous collects all distinct blocks by set > {code:java} > // Collect all distinct blocks > Set<BlockInfo> allBlocks = new HashSet<>(Arrays.asList(getBlocks())); > DiffList<FileDiff> diffs = sf.getDiffs().asList(); > for(FileDiff diff : diffs) { >BlockInfo[] diffBlocks = diff.getBlocks(); >if (diffBlocks != null) { > allBlocks.addAll(Arrays.asList(diffBlocks)); > } {code} > but on updating the reclaim context we subtract these both , so a wrong quota > value can be updated > {code:java} > QuotaCounts current = file.storagespaceConsumed(bsp); > reclaimContext.quotaDelta().add(oldCounts.subtract(current)); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
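Why the List-versus-Set distinction matters for this quota bug can be shown with a minimal stdlib-only example. The block IDs below are made up for illustration; real HDFS collects BlockInfo objects, but the counting effect is the same: a List double-counts a block that appears both in the current file and in a snapshot diff, a HashSet does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy demonstration of the distinct-blocks counting difference in HDFS-15362.
public class DistinctBlocks {
    // List-based collection: duplicates between file blocks and diff blocks kept.
    static long totalWithList(List<Long> fileBlocks, List<Long> diffBlocks) {
        List<Long> all = new ArrayList<>(fileBlocks);
        all.addAll(diffBlocks);
        return all.size();
    }

    // Set-based collection: shared blocks counted once, as in
    // storagespaceConsumedContiguous.
    static long totalWithSet(List<Long> fileBlocks, List<Long> diffBlocks) {
        Set<Long> all = new HashSet<>(fileBlocks);
        all.addAll(diffBlocks);
        return all.size();
    }

    public static void main(String[] args) {
        List<Long> file = Arrays.asList(1L, 2L);
        List<Long> diff = Arrays.asList(2L, 3L); // block 2 also in a snapshot diff
        System.out.println(totalWithList(file, diff)); // 4 (block 2 counted twice)
        System.out.println(totalWithSet(file, diff));  // 3
    }
}
```

Since the reclaim path subtracts the two computed usages from each other, having one side count duplicates and the other side not is exactly what produces the wrong quota delta.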
[jira] [Updated] (HDFS-15362) FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all distinct blocks
[ https://issues.apache.org/jira/browse/HDFS-15362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15362: - Attachment: HDFS-15362.002.patch > FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all > distinct blocks > -- > > Key: HDFS-15362 > URL: https://issues.apache.org/jira/browse/HDFS-15362 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15362.001.patch, HDFS-15362.002.patch > > > FileWithSnapshotFeature#updateQuotaAndCollectBlocks uses list to collect > blocks > {code:java} > List allBlocks = new ArrayList(); > if (file.getBlocks() != null) { > allBlocks.addAll(Arrays.asList(file.getBlocks())); > }{code} > INodeFile#storagespaceConsumedContiguous collects all distinct blocks by set > {code:java} > // Collect all distinct blocks > Set allBlocks = new HashSet<>(Arrays.asList(getBlocks())); > DiffList diffs = sf.getDiffs().asList(); > for(FileDiff diff : diffs) { >BlockInfo[] diffBlocks = diff.getBlocks(); >if (diffBlocks != null) { > allBlocks.addAll(Arrays.asList(diffBlocks)); > } {code} > but on updating the reclaim context we subtract these both , so wrong quota > value can be updated > {code:java} > QuotaCounts current = file.storagespaceConsumed(bsp); > reclaimContext.quotaDelta().add(oldCounts.subtract(current)); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15362) FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all distinct blocks
[ https://issues.apache.org/jira/browse/HDFS-15362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111421#comment-17111421 ] hemanthboyina commented on HDFS-15362: -- Thanks, [~elgoiri], for the review. We call updateQuotaAndCollectBlocks only once, at the end, so I have added the assert only there. > FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all > distinct blocks > -- > > Key: HDFS-15362 > URL: https://issues.apache.org/jira/browse/HDFS-15362 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15362.001.patch > > > FileWithSnapshotFeature#updateQuotaAndCollectBlocks uses a list to collect > blocks > {code:java} > List<BlockInfo> allBlocks = new ArrayList<>(); > if (file.getBlocks() != null) { > allBlocks.addAll(Arrays.asList(file.getBlocks())); > }{code} > INodeFile#storagespaceConsumedContiguous collects all distinct blocks by set > {code:java} > // Collect all distinct blocks > Set<BlockInfo> allBlocks = new HashSet<>(Arrays.asList(getBlocks())); > DiffList<FileDiff> diffs = sf.getDiffs().asList(); > for(FileDiff diff : diffs) { >BlockInfo[] diffBlocks = diff.getBlocks(); >if (diffBlocks != null) { > allBlocks.addAll(Arrays.asList(diffBlocks)); > } {code} > but on updating the reclaim context we subtract these both , so a wrong quota > value can be updated > {code:java} > QuotaCounts current = file.storagespaceConsumed(bsp); > reclaimContext.quotaDelta().add(oldCounts.subtract(current)); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15363) BlockPlacementPolicyWithNodeGroup should validate initialized NetworkTopology
[ https://issues.apache.org/jira/browse/HDFS-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15363: - Attachment: HDFS-15363.001.patch Status: Patch Available (was: Open) > BlockPlacementPolicyWithNodeGroup should validate initialized NetworkTopology > -- > > Key: HDFS-15363 > URL: https://issues.apache.org/jira/browse/HDFS-15363 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15363-testrepro.patch, HDFS-15363.001.patch > > > BlockPlacementPolicyWithNodeGroup type casts the initialized clusterMap > {code:java} > NetworkTopologyWithNodeGroup clusterMapNodeGroup = > (NetworkTopologyWithNodeGroup) clusterMap {code} > If clusterMap is an instance of DFSNetworkTopology we get a ClassCastException -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15363) BlockPlacementPolicyWithNodeGroup should validate initialized NetworkTopology
[ https://issues.apache.org/jira/browse/HDFS-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15363: - Attachment: HDFS-15363-testrepro.patch > BlockPlacementPolicyWithNodeGroup should validate initialized NetworkTopology > -- > > Key: HDFS-15363 > URL: https://issues.apache.org/jira/browse/HDFS-15363 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15363-testrepro.patch > > > BlockPlacementPolicyWithNodeGroup type casts the initialized clusterMap > {code:java} > NetworkTopologyWithNodeGroup clusterMapNodeGroup = > (NetworkTopologyWithNodeGroup) clusterMap {code} > If clusterMap is an instance of DFSNetworkTopology we get a ClassCastException -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15363) BlockPlacementPolicyWithNodeGroup should validate initialized NetworkTopology
[ https://issues.apache.org/jira/browse/HDFS-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15363: - Description: BlockPlacementPolicyWithNodeGroup type casts the initialized clusterMap {code:java} NetworkTopologyWithNodeGroup clusterMapNodeGroup = (NetworkTopologyWithNodeGroup) clusterMap {code} If clusterMap is an instance of DFSNetworkTopology we get a ClassCastException > BlockPlacementPolicyWithNodeGroup should validate initialized NetworkTopology > -- > > Key: HDFS-15363 > URL: https://issues.apache.org/jira/browse/HDFS-15363 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > > BlockPlacementPolicyWithNodeGroup type casts the initialized clusterMap > {code:java} > NetworkTopologyWithNodeGroup clusterMapNodeGroup = > (NetworkTopologyWithNodeGroup) clusterMap {code} > If clusterMap is an instance of DFSNetworkTopology we get a ClassCastException -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15363) BlockPlacementPolicyWithNodeGroup should validate initialized NetworkTopology
hemanthboyina created HDFS-15363: Summary: BlockPlacementPolicyWithNodeGroup should validate initialized NetworkTopology Key: HDFS-15363 URL: https://issues.apache.org/jira/browse/HDFS-15363 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina Assignee: hemanthboyina -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15362) FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all distinct blocks
[ https://issues.apache.org/jira/browse/HDFS-15362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15362: - Description: FileWithSnapshotFeature#updateQuotaAndCollectBlocks uses list to collect blocks {code:java} List allBlocks = new ArrayList(); if (file.getBlocks() != null) { allBlocks.addAll(Arrays.asList(file.getBlocks())); }{code} INodeFile#storagespaceConsumedContiguous collects all distinct blocks by set {code:java} // Collect all distinct blocks Set allBlocks = new HashSet<>(Arrays.asList(getBlocks())); DiffList diffs = sf.getDiffs().asList(); for(FileDiff diff : diffs) { BlockInfo[] diffBlocks = diff.getBlocks(); if (diffBlocks != null) { allBlocks.addAll(Arrays.asList(diffBlocks)); } {code} but on updating the reclaim context we subtract these both , so wrong quota value can be updated {code:java} QuotaCounts current = file.storagespaceConsumed(bsp); reclaimContext.quotaDelta().add(oldCounts.subtract(current)); {code} > FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all > distinct blocks > -- > > Key: HDFS-15362 > URL: https://issues.apache.org/jira/browse/HDFS-15362 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15362.001.patch > > > FileWithSnapshotFeature#updateQuotaAndCollectBlocks uses list to collect > blocks > {code:java} > List allBlocks = new ArrayList(); > if (file.getBlocks() != null) { > allBlocks.addAll(Arrays.asList(file.getBlocks())); > }{code} > INodeFile#storagespaceConsumedContiguous collects all distinct blocks by set > {code:java} > // Collect all distinct blocks > Set allBlocks = new HashSet<>(Arrays.asList(getBlocks())); > DiffList diffs = sf.getDiffs().asList(); > for(FileDiff diff : diffs) { >BlockInfo[] diffBlocks = diff.getBlocks(); >if (diffBlocks != null) { > allBlocks.addAll(Arrays.asList(diffBlocks)); > } {code} > but on updating the reclaim context we subtract these both , so 
wrong quota > value can be updated > {code:java} > QuotaCounts current = file.storagespaceConsumed(bsp); > reclaimContext.quotaDelta().add(oldCounts.subtract(current)); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15362) FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all distinct blocks
[ https://issues.apache.org/jira/browse/HDFS-15362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15362: - Attachment: HDFS-15362.001.patch Status: Patch Available (was: Open) > FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all > distinct blocks > -- > > Key: HDFS-15362 > URL: https://issues.apache.org/jira/browse/HDFS-15362 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15362.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15362) FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all distinct blocks
hemanthboyina created HDFS-15362: Summary: FileWithSnapshotFeature#updateQuotaAndCollectBlocks should collect all distinct blocks Key: HDFS-15362 URL: https://issues.apache.org/jira/browse/HDFS-15362 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina Assignee: hemanthboyina -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14762) "Path(Path/String parent, String child)" will fail when "child" contains ":"
[ https://issues.apache.org/jira/browse/HDFS-14762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109947#comment-17109947 ] hemanthboyina commented on HDFS-14762: -- I think this issue also relates to IPv6, since bracketed IPv6 literals contain ":" as well > "Path(Path/String parent, String child)" will fail when "child" contains ":" > > > Key: HDFS-14762 > URL: https://issues.apache.org/jira/browse/HDFS-14762 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Shixiong Zhu >Priority: Major > Attachments: HDFS-14762.001.patch, HDFS-14762.002.patch, > HDFS-14762.003.patch, HDFS-14762.004.patch > > > When the "child" parameter contains ":", "Path(Path/String parent, String > child)" will throw the following exception: > {code} > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: ... > {code} > Not sure if this is a legit bug. But the following places will hit this error > when seeing a Path with a file name containing ":": > https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L101 > https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java#L270 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
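The root cause can be seen with java.net.URI alone: when the first ':' in a path string precedes any '/', the prefix gets treated as a URI scheme, and java.net.URI rejects a non-null scheme combined with a relative path ("Relative path in absolute URI"). The scheme-split below is a simplified model of that parsing, assumed for illustration rather than taken from the actual Path source:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonPathDemo {
  // Returns true if the child segment, after the assumed scheme split,
  // is rejected by java.net.URI the way Path(parent, child) fails.
  static boolean rejectedBySchemeSplit(String child) {
    int colon = child.indexOf(':');
    int slash = child.indexOf('/');
    if (colon == -1 || (slash != -1 && slash < colon)) {
      return false; // no scheme is inferred from this string
    }
    String scheme = child.substring(0, colon);
    String path = child.substring(colon + 1);
    try {
      // A non-null scheme with a path not starting with '/' is rejected.
      new URI(scheme, null, path, null, null);
      return false;
    } catch (URISyntaxException e) {
      return true; // "Relative path in absolute URI"
    }
  }

  public static void main(String[] args) {
    System.out.println(rejectedBySchemeSplit("plain.txt"));   // false
    System.out.println(rejectedBySchemeSplit("time:10.txt")); // true
  }
}
```

A file name like "time:10.txt" therefore splits into scheme "time" and relative path "10.txt", which is exactly the combination the URI constructor refuses.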
[jira] [Updated] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15351: - Attachment: HDFS-15351.003.patch > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15351.001.patch, HDFS-15351.002.patch, > HDFS-15351.003.patch > > > On truncate and append we remove the blocks from Reconstruction Queue > On removing the blocks from pending reconstruction , we need to decrement > Blocks Scheduled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106503#comment-17106503 ] hemanthboyina commented on HDFS-15351: -- Thanks for the review, [~elgoiri]. I have updated the patch; please review. > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15351.001.patch, HDFS-15351.002.patch, > HDFS-15351.003.patch > > > On truncate and append we remove the blocks from the Reconstruction Queue. > On removing the blocks from pending reconstruction, we need to decrement > Blocks Scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106486#comment-17106486 ] hemanthboyina commented on HDFS-15351: -- Thanks for the review, [~elgoiri]. I have updated the patch; please review. > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15351.001.patch, HDFS-15351.002.patch > > > On truncate and append we remove the blocks from the Reconstruction Queue. > On removing the blocks from pending reconstruction, we need to decrement > Blocks Scheduled. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15351: - Attachment: HDFS-15351.002.patch > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15351.001.patch, HDFS-15351.002.patch > > > On truncate and append we remove the blocks from Reconstruction Queue > On removing the blocks from pending reconstruction , we need to decrement > Blocks Scheduled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15308) TestReconstructStripedFile.testNNSendsErasureCodingTasks is flaky
[ https://issues.apache.org/jira/browse/HDFS-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104747#comment-17104747 ] hemanthboyina edited comment on HDFS-15308 at 5/11/20, 6:47 PM: thanks for the work here i think dfs.namenode.reconstruction.pending.timeout-sec should be lesser {code:java} long period = Math.min(DEFAULT_RECHECK_INTERVAL, timeout); try { pendingReconstructionCheck(); Thread.sleep(period); } {code} to get zero timed out pending reconstructions , the timeout should be more as [~touchida] mentioned was (Author: hemanthboyina): thanks for the work here i think dfs.namenode.reconstruction.pending.timeout-sec should be lesser {code:java} long period = Math.min(DEFAULT_RECHECK_INTERVAL, timeout); try { pendingReconstructionCheck(); Thread.sleep(period); } {code} > TestReconstructStripedFile.testNNSendsErasureCodingTasks is flaky > - > > Key: HDFS-15308 > URL: https://issues.apache.org/jira/browse/HDFS-15308 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Affects Versions: 3.3.0 >Reporter: Toshihiko Uchida >Priority: Minor > Labels: flaky-test > Attachments: HDFS-15308.001.patch > > > In HDFS-14353, TestReconstructStripedFile.testNNSendsErasureCodingTasks > failed once due to pending reconstruction timeout as follows. 
> {code} > java.lang.AssertionError: Found 4 timeout pending reconstruction tasks > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:502) > at > org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:458) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {code} > The error occurred on the following assertion. > {code} > // Make sure that all pending reconstruction tasks can be processed. > while (ns.getPendingReconstructionBlocks() > 0) { > long timeoutPending = ns.getNumTimedOutPendingReconstructions(); > assertTrue(String.format("Found %d timeout pending reconstruction tasks", > timeoutPending), timeoutPending == 0); > Thread.sleep(1000); > } > {code} > The failure could not be reproduced in the reporter's docker environment > (start-build-environment.sh). 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15308) TestReconstructStripedFile.testNNSendsErasureCodingTasks is flaky
[ https://issues.apache.org/jira/browse/HDFS-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104747#comment-17104747 ] hemanthboyina commented on HDFS-15308: -- thanks for the work here i think dfs.namenode.reconstruction.pending.timeout-sec should be lesser {code:java} long period = Math.min(DEFAULT_RECHECK_INTERVAL, timeout); try { pendingReconstructionCheck(); Thread.sleep(period); } {code} > TestReconstructStripedFile.testNNSendsErasureCodingTasks is flaky > - > > Key: HDFS-15308 > URL: https://issues.apache.org/jira/browse/HDFS-15308 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Affects Versions: 3.3.0 >Reporter: Toshihiko Uchida >Priority: Minor > Labels: flaky-test > Attachments: HDFS-15308.001.patch > > > In HDFS-14353, TestReconstructStripedFile.testNNSendsErasureCodingTasks > failed once due to pending reconstruction timeout as follows. > {code} > java.lang.AssertionError: Found 4 timeout pending reconstruction tasks > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:502) > at > org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:458) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {code} > The error occurred on the following assertion. > {code} > // Make sure that all pending reconstruction tasks can be processed. > while (ns.getPendingReconstructionBlocks() > 0) { > long timeoutPending = ns.getNumTimedOutPendingReconstructions(); > assertTrue(String.format("Found %d timeout pending reconstruction tasks", > timeoutPending), timeoutPending == 0); > Thread.sleep(1000); > } > {code} > The failure could not be reproduced in the reporter's docker environment > (start-build-environment.sh). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
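For reference, the Math.min snippet quoted in the comments decides how often the pending-reconstruction monitor wakes up; a short sketch shows that the effective check period is the smaller of the recheck interval and the configured timeout, so shortening the timeout also makes the check run more often while lengthening it beyond the interval changes nothing. The 5-minute DEFAULT_RECHECK_INTERVAL here is an assumed illustrative value, not taken from the source:

```java
public class RecheckPeriodDemo {
  // Assumed value for illustration; the actual constant lives in
  // PendingReconstructionBlocks and may differ.
  static final long DEFAULT_RECHECK_INTERVAL = 5 * 60 * 1000L;

  // Mirrors the quoted pattern: the monitor sleeps for the smaller of the
  // recheck interval and dfs.namenode.reconstruction.pending.timeout-sec.
  static long period(long timeoutMs) {
    return Math.min(DEFAULT_RECHECK_INTERVAL, timeoutMs);
  }

  public static void main(String[] args) {
    System.out.println(period(1_000L));          // a short timeout drives the period
    System.out.println(period(10 * 60 * 1000L)); // capped at the recheck interval
  }
}
```

This is why the debate in the comments cuts both ways: a smaller timeout makes timed-out tasks appear (and be rechecked) sooner, while a larger timeout is what keeps the flaky assertion from ever seeing a non-zero timed-out count.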
[jira] [Updated] (HDFS-15308) TestReconstructStripedFile.testNNSendsErasureCodingTasks is flaky
[ https://issues.apache.org/jira/browse/HDFS-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15308: - Attachment: HDFS-15308.001.patch Status: Patch Available (was: Open) > TestReconstructStripedFile.testNNSendsErasureCodingTasks is flaky > - > > Key: HDFS-15308 > URL: https://issues.apache.org/jira/browse/HDFS-15308 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Affects Versions: 3.3.0 >Reporter: Toshihiko Uchida >Priority: Minor > Labels: flaky-test > Attachments: HDFS-15308.001.patch > > > In HDFS-14353, TestReconstructStripedFile.testNNSendsErasureCodingTasks > failed once due to pending reconstruction timeout as follows. > {code} > java.lang.AssertionError: Found 4 timeout pending reconstruction tasks > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:502) > at > org.apache.hadoop.hdfs.TestReconstructStripedFile.testNNSendsErasureCodingTasks(TestReconstructStripedFile.java:458) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {code} > The error occurred on the following assertion. > {code} > // Make sure that all pending reconstruction tasks can be processed. > while (ns.getPendingReconstructionBlocks() > 0) { > long timeoutPending = ns.getNumTimedOutPendingReconstructions(); > assertTrue(String.format("Found %d timeout pending reconstruction tasks", > timeoutPending), timeoutPending == 0); > Thread.sleep(1000); > } > {code} > The failure could not be reproduced in the reporter's docker environment > (start-build-environment.sh). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15351: - Attachment: HDFS-15351.001.patch Status: Patch Available (was: Open) > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15351.001.patch > > > On truncate and append we remove the blocks from Reconstruction Queue > On removing the blocks from pending reconstruction , we need to decrement > Blocks Scheduled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
[ https://issues.apache.org/jira/browse/HDFS-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15351: - Description: On truncate and append we remove the blocks from Reconstruction Queue On removing the blocks from pending reconstruction , we need to decrement Blocks Scheduled > Blocks Scheduled Count was wrong on Truncate > - > > Key: HDFS-15351 > URL: https://issues.apache.org/jira/browse/HDFS-15351 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > > On truncate and append we remove the blocks from Reconstruction Queue > On removing the blocks from pending reconstruction , we need to decrement > Blocks Scheduled -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
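The bookkeeping described above can be sketched with a toy counter: if removal from pending reconstruction does not also decrement the scheduled count, the gauge drifts upward after every truncate or append. This is a hypothetical stand-in, not the real DatanodeDescriptor or PendingReconstructionBlocks API:

```java
import java.util.HashSet;
import java.util.Set;

public class ScheduledBlocksDemo {
  // Toy model of the per-datanode "blocks scheduled" gauge.
  int blocksScheduled = 0;
  final Set<String> pendingReconstruction = new HashSet<>();

  void schedule(String block) {
    pendingReconstruction.add(block);
    blocksScheduled++;
  }

  // The fix the issue describes: when truncate/append removes a block from
  // pending reconstruction, decrement the scheduled count as well,
  // otherwise the gauge never comes back down.
  void removeFromPending(String block) {
    if (pendingReconstruction.remove(block)) {
      blocksScheduled--;
    }
  }

  public static void main(String[] args) {
    ScheduledBlocksDemo node = new ScheduledBlocksDemo();
    node.schedule("blk_1");
    node.schedule("blk_2");
    node.removeFromPending("blk_1"); // e.g. the block was truncated
    System.out.println(node.blocksScheduled); // back to 1, not stuck at 2
  }
}
```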
[jira] [Created] (HDFS-15351) Blocks Scheduled Count was wrong on Truncate
hemanthboyina created HDFS-15351: Summary: Blocks Scheduled Count was wrong on Truncate Key: HDFS-15351 URL: https://issues.apache.org/jira/browse/HDFS-15351 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina Assignee: hemanthboyina -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15335) Report top N metrics for files in get listing ops
[ https://issues.apache.org/jira/browse/HDFS-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100399#comment-17100399 ] hemanthboyina commented on HDFS-15335: -- Hi [~csun], it's a good improvement. Have you started working on this? If not, can I take it over? > Report top N metrics for files in get listing ops > - > > Key: HDFS-15335 > URL: https://issues.apache.org/jira/browse/HDFS-15335 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, metrics >Reporter: Chao Sun >Priority: Major > > Currently HDFS has a {{filesInGetListingOps}} metric, which reports the total > number of files across all listing ops. However, it would be useful to report the > top N users who contribute most to this. This can help identify > potentially abusive users and stop the abuse against the NameNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15332) Quota Space consumed was wrong in truncate with Snapshots
[ https://issues.apache.org/jira/browse/HDFS-15332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15332: - Description: On calculating space quota usage {code:java} if (file.getBlocks() != null) { allBlocks.addAll(Arrays.asList(file.getBlocks())); } if (removed.getBlocks() != null) { allBlocks.addAll(Arrays.asList(removed.getBlocks())); } for (BlockInfo b: allBlocks) { {code} we missed out the blocks of file snapshot feature's Diffs > Quota Space consumed was wrong in truncate with Snapshots > - > > Key: HDFS-15332 > URL: https://issues.apache.org/jira/browse/HDFS-15332 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15332.001.patch > > > On calculating space quota usage > {code:java} >if (file.getBlocks() != null) { > allBlocks.addAll(Arrays.asList(file.getBlocks())); >} >if (removed.getBlocks() != null) { > allBlocks.addAll(Arrays.asList(removed.getBlocks())); >} >for (BlockInfo b: allBlocks) { {code} > we missed out the blocks of file snapshot feature's Diffs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15332) Quota Space consumed was wrong in truncate with Snapshots
[ https://issues.apache.org/jira/browse/HDFS-15332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15332: - Attachment: HDFS-15332.001.patch Status: Patch Available (was: Open) > Quota Space consumed was wrong in truncate with Snapshots > - > > Key: HDFS-15332 > URL: https://issues.apache.org/jira/browse/HDFS-15332 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15332.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15332) Quota Space consumed was wrong in truncate with Snapshots
hemanthboyina created HDFS-15332: Summary: Quota Space consumed was wrong in truncate with Snapshots Key: HDFS-15332 URL: https://issues.apache.org/jira/browse/HDFS-15332 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina Assignee: hemanthboyina -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15316) Deletion failure should not remove directory from snapshottables
[ https://issues.apache.org/jira/browse/HDFS-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097474#comment-17097474 ] hemanthboyina commented on HDFS-15316: -- Thanks [~ayushtkn] for the review. I have updated the patch; please review. {quote}Secondly, can you help when in actual scenario {quote} It happened in a very rare scenario, though it can serve as a safeguard condition to check. > Deletion failure should not remove directory from snapshottables > > > Key: HDFS-15316 > URL: https://issues.apache.org/jira/browse/HDFS-15316 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15316.001.patch, HDFS-15316.002.patch > > > If deleting a directory doesn't succeed, we still remove the directory > from snapshottables. > This makes the system inconsistent: we will be able to create snapshots, but > snapshot diff throws Directory is not snapshottable -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15316) Deletion failure should not remove directory from snapshottables
[ https://issues.apache.org/jira/browse/HDFS-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15316: - Attachment: HDFS-15316.002.patch > Deletion failure should not remove directory from snapshottables > > > Key: HDFS-15316 > URL: https://issues.apache.org/jira/browse/HDFS-15316 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15316.001.patch, HDFS-15316.002.patch > > > If deleting a directory doesn't succeed, we still remove the directory > from snapshottables. > This makes the system inconsistent: we will be able to create snapshots, but > snapshot diff throws Directory is not snapshottable -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15316) Deletion failure should not remove directory from snapshottables
[ https://issues.apache.org/jira/browse/HDFS-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15316: - Attachment: HDFS-15316.001.patch Status: Patch Available (was: Open) > Deletion failure should not remove directory from snapshottables > > > Key: HDFS-15316 > URL: https://issues.apache.org/jira/browse/HDFS-15316 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15316.001.patch > > > If deleting a directory doesn't succeed, we still remove the directory > from snapshottables. > This makes the system inconsistent: we will be able to create snapshots, but > snapshot diff throws Directory is not snapshottable -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15316) Deletion failure should not remove directory from snapshottables
[ https://issues.apache.org/jira/browse/HDFS-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15316: - Description: If deleting a directory doesn't succeed, we still remove the directory from snapshottables. This makes the system inconsistent: we will be able to create snapshots, but snapshot diff throws Directory is not snapshottable > Deletion failure should not remove directory from snapshottables > > > Key: HDFS-15316 > URL: https://issues.apache.org/jira/browse/HDFS-15316 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > > If deleting a directory doesn't succeed, we still remove the directory > from snapshottables. > This makes the system inconsistent: we will be able to create snapshots, but > snapshot diff throws Directory is not snapshottable -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
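The guard the issue asks for is simple to sketch: drop the directory from the snapshottable set only after the deletion itself succeeded, so the two views cannot diverge. The class below is a hypothetical stand-in for the namesystem delete path, not the actual FSNamesystem code:

```java
import java.util.HashSet;
import java.util.Set;

public class SnapshottableCleanupDemo {
  // Toy model of the set of snapshottable directories.
  final Set<String> snapshottables = new HashSet<>();

  // The deletionSucceeds flag stands in for whatever the real delete
  // operation returns; the point is the ordering of the two steps.
  boolean deleteDirectory(String dir, boolean deletionSucceeds) {
    if (!deletionSucceeds) {
      // Deletion failed: leave snapshottables untouched, so snapshot
      // operations on the still-existing directory stay consistent.
      return false;
    }
    snapshottables.remove(dir);
    return true;
  }

  public static void main(String[] args) {
    SnapshottableCleanupDemo fs = new SnapshottableCleanupDemo();
    fs.snapshottables.add("/dir");
    fs.deleteDirectory("/dir", false);
    System.out.println(fs.snapshottables.contains("/dir")); // still snapshottable
    fs.deleteDirectory("/dir", true);
    System.out.println(fs.snapshottables.contains("/dir")); // removed
  }
}
```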
[jira] [Commented] (HDFS-15302) Backport HDFS-15286 to branch-2.x
[ https://issues.apache.org/jira/browse/HDFS-15302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096878#comment-17096878 ] hemanthboyina commented on HDFS-15302: -- Thanks [~aajisaka] for the review. I have updated the patch, please review. The test failures and findbugs warnings are not related. > Backport HDFS-15286 to branch-2.x > - > > Key: HDFS-15302 > URL: https://issues.apache.org/jira/browse/HDFS-15302 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Akira Ajisaka >Assignee: hemanthboyina >Priority: Blocker > Attachments: HDFS-15302-branch-2.10-01.patch, > HDFS-15302-branch-2.10-02.patch, HDFS-15302-branch.2.10.1.patch > > > Backport HDFS-15286 to branch-2.10 and branch-2.9.
[jira] [Created] (HDFS-15316) Deletion failure should not remove directory from snapshottables
hemanthboyina created HDFS-15316: Summary: Deletion failure should not remove directory from snapshottables Key: HDFS-15316 URL: https://issues.apache.org/jira/browse/HDFS-15316 Project: Hadoop HDFS Issue Type: Bug Reporter: hemanthboyina Assignee: hemanthboyina
[jira] [Updated] (HDFS-15302) Backport HDFS-15286 to branch-2.x
[ https://issues.apache.org/jira/browse/HDFS-15302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15302: - Attachment: HDFS-15302-branch-2.10-02.patch > Backport HDFS-15286 to branch-2.x > - > > Key: HDFS-15302 > URL: https://issues.apache.org/jira/browse/HDFS-15302 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Akira Ajisaka >Assignee: hemanthboyina >Priority: Blocker > Attachments: HDFS-15302-branch-2.10-01.patch, > HDFS-15302-branch-2.10-02.patch, HDFS-15302-branch.2.10.1.patch > > > Backport HDFS-15286 to branch-2.10 and branch-2.9.
[jira] [Commented] (HDFS-15265) HttpFS: validate content-type in HttpFSUtils
[ https://issues.apache.org/jira/browse/HDFS-15265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095734#comment-17095734 ] hemanthboyina commented on HDFS-15265: -- Updated the patch, please review [~elgoiri]. > HttpFS: validate content-type in HttpFSUtils > > > Key: HDFS-15265 > URL: https://issues.apache.org/jira/browse/HDFS-15265 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15265.001.patch, HDFS-15265.002.patch > > > Validate that the content-type in HttpFSUtils is JSON.
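The HDFS-15265 summary asks that HttpFSUtils verify the response content type is JSON before parsing. A minimal sketch of such a check is below; the class and method names are hypothetical (not the actual HttpFSUtils code), and it assumes the usual `type/subtype; charset=...` shape of the `Content-Type` header:

```java
// Hypothetical sketch of a content-type guard along the lines of what
// HttpFSUtils could apply before handing a response body to a JSON parser.
public class ContentTypeCheck {
    static final String APPLICATION_JSON = "application/json";

    // Returns true when the Content-Type header denotes JSON, tolerating
    // parameters such as "application/json; charset=UTF-8" and mixed case.
    public static boolean isJson(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Strip any parameters after ';' and compare the bare media type.
        String mime = contentType.split(";", 2)[0].trim();
        return APPLICATION_JSON.equalsIgnoreCase(mime);
    }

    public static void main(String[] args) {
        System.out.println(isJson("application/json; charset=UTF-8")); // true
        System.out.println(isJson("text/html")); // false
    }
}
```

Rejecting non-JSON responses up front turns an obscure parse failure (for example, when a proxy returns an HTML error page) into a clear error at the point the response is received.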