[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562354#comment-17562354 ] zhengchenyu commented on HDFS-14703:

Is this still ongoing? It is great work indeed. But I wonder why the design document and the fgl branch do not hold read locks on ancestors. When writing /a/b/c, I think we need to hold the read lock on /a and /a/b, then hold the write lock on /a/b/c. If some write operation on /a/b happens at the same time, the results may be inconsistent.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs, namenode
> Reporter: Konstantin Shvachko
> Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz,
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz,
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
> We target to enable fine-grained locking by splitting the in-memory namespace
> into multiple partitions each having a separate lock. Intended to improve
> performance of NameNode write operations.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
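The ancestor-locking scheme the comment above asks about (read locks on /a and /a/b, then a write lock on /a/b/c) can be sketched with plain ReentrantReadWriteLocks. This is an illustrative toy, not code from the fgl branch; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy sketch of ancestor locking: read-lock every ancestor directory and
// write-lock the last path component, so a concurrent write to /a/b blocks
// until the /a/b/c writer releases its locks.
class AncestorLocking {
    private final Map<String, ReentrantReadWriteLock> locks = new HashMap<>();

    private ReentrantReadWriteLock lockFor(String path) {
        return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
    }

    // Acquires read locks on ancestors and a write lock on the full path.
    // Returns the paths locked, in acquisition order, for later release.
    List<String> lockForWrite(String path) {
        List<String> acquired = new ArrayList<>();
        String[] parts = path.substring(1).split("/");
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            cur.append('/').append(parts[i]);
            String p = cur.toString();
            if (i < parts.length - 1) {
                lockFor(p).readLock().lock();   // shared lock on ancestor
            } else {
                lockFor(p).writeLock().lock();  // exclusive lock on target
            }
            acquired.add(p);
        }
        return acquired;
    }

    boolean isWriteLocked(String path) {
        return lockFor(path).isWriteLocked();
    }
}
```

With this kind of lock coupling, independent creates under /a can proceed in parallel, while a rename or delete of /a/b must wait for the read locks on /a/b to drain first.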
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413096#comment-17413096 ] JiangHua Zhu commented on HDFS-14703:

Okay, I will continue to work on it.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413087#comment-17413087 ] Renukaprasad C commented on HDFS-14703:

Thanks [~jianghuazhu] for your interest & attention on this task. Yes, we need to make it configurable; we didn't pay much attention to it in the POC. It would be great if you could trace this issue. Also, I suggest making the partition count, INodeMap#NUM_RANGES_STATIC, configurable along with the DEPTH. "By the way, in our cluster, there are more than 100 million INodes." We have tried up to 10M files/dirs; the larger the data set, the better the results we observed. Please share your reports with us in case you have done benchmarking with the FGL branch.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412908#comment-17412908 ] JiangHua Zhu commented on HDFS-14703:

Thanks [~prasad-acit] for sharing. Yes, I have browsed through the design documents, which are very good. I think INodeMap#NAMESPACE_KEY_DEPTH should be configurable, which would help with managing the cluster. (If necessary, I can create a JIRA.) By the way, in our cluster there are more than 100 million INodes, which is why I put forward this idea.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412801#comment-17412801 ] Renukaprasad C commented on HDFS-14703:

Thanks [~jianghuazhu] for sharing your thoughts. Hope this clarifies your doubts.

INodeMap#NAMESPACE_KEY_DEPTH is designed with flexibility. Yes, by default it is 2, which is the combination of (ParentINodeId, INodeId). When you set it to 3, the GrandParentId is included as well. We have tried up to level 3 with basic functionality, but performance was not measured; we continued to use the default value of 2. I am not sure of any use case for increasing the value to a higher number (at least I haven't done any testing on that part).

By default each partition's capacity is 117965 (65536 * 1.8), and we continued to use the default values in our tests. We also checked the scenarios when dynamic partitions were added: no perf degradation on dynamic partitions; in fact this is expected to give higher throughput. We haven't noticed very high CPU usage up to 1M file write ops (resource usage statistics we still need to capture with the base & FGL patch), so this shouldn't have any impact on other operations (RPC or any other server-side processing tasks).

In case you have missed it, please go through the latest design doc, NameNode Fine-Grained Locking.pdf. [~shv] [~xinglin] Would you like to share your inputs?
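The depth mechanism described above can be sketched as follows: the key of an inode is the tuple of the last NAMESPACE_KEY_DEPTH inode ids on its path (default 2 = (ParentINodeId, INodeId)), and the partition is chosen by hashing that key. This is a hypothetical toy, not the fgl-branch INodeMap code:

```java
import java.util.Arrays;

// Illustrative sketch of a depth-limited namespace key. With DEPTH = 2 the key
// is (parentId, inodeId); with DEPTH = 3 the grandparent id joins the tuple.
// Names and the hashing scheme here are assumptions for illustration only.
class NamespaceKey {
    static final int DEPTH = 2;          // (ParentINodeId, INodeId)
    static final int NUM_RANGES = 256;   // number of static partitions

    // idPath: inode ids from the root down to the target inode.
    static long[] keyOf(long[] idPath, int depth) {
        int n = Math.min(depth, idPath.length);
        // take the last n ids on the path, left-padded with 0 for shallow paths
        long[] key = new long[depth];
        System.arraycopy(idPath, idPath.length - n, key, depth - n, n);
        return key;
    }

    static int partitionOf(long[] key) {
        return Math.floorMod(Arrays.hashCode(key), NUM_RANGES);
    }
}
```

A larger depth means more ancestors participate in the key, so siblings cluster more coarsely; a larger NUM_RANGES means more partitions (and locks) to hash those keys into.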
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412536#comment-17412536 ] JiangHua Zhu commented on HDFS-14703:

[~prasad-acit], I have a couple of questions. The first one: I see that INodeMap#NAMESPACE_KEY_DEPTH is a fixed value, defaulting to 2. What happens if the value is 4 or 5? What I can think of is that this will affect the range of INodes allocated. The second: if the value of INodeMap#NUM_RANGES_STATIC is greater than 256, the parallelism of processing and writing data will increase; will that affect RPC performance?
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410871#comment-17410871 ] JiangHua Zhu commented on HDFS-14703:

Okay, I get it. Thanks [~prasad-acit] for the comment. [~weichiu], please pay attention to this comment, I hope it will help the fgl branch.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410702#comment-17410702 ] Renukaprasad C commented on HDFS-14703:

[~jianghuazhu] There were 2 commits done as part of the POC in the beginning: "INodeMap with PartitionedGSet and per-partition locking" (this maps to HDFS-14734 & HDFS-14732), and "[FGL] Introduce INode key" (this maps to HDFS-14733).
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410606#comment-17410606 ] JiangHua Zhu commented on HDFS-14703:

I noticed that the file PartitionedGSet.java has some modifications, added by the commit "Add namespace key for INode. (shv)". https://github.com/apache/hadoop/commit/455e8c019184d5d3ae7bcff4d29d9baa7aff3663 The commit message does not have a JIRA id, and I cannot find a JIRA with the same summary, so it is a bit difficult to track the current progress. I mention this only as a reminder; if anything I say here is wrong, I will correct it.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390818#comment-17390818 ] Konstantin Shvachko commented on HDFS-14703:

Some thoughts on [~daryn]'s comment:
* For small clusters/namespaces you don't need to do anything at all; performance should be great.
* 1 billion object namespaces can be effectively handled with Observers (HDFS-12943), as described in our [Exabyte Club blog|https://engineering.linkedin.com/blog/2021/the-exabyte-club--linkedin-s-journey-of-scaling-the-hadoop-distr].
* This namespace partitioning idea should help if you want to grow the workloads and cluster size further. And sure, it's a big "if" there.
* There is plenty of benchmark data above. I built the POC exactly with the purpose of obtaining some preliminary synthetic numbers. For me 30% is the threshold separating worthy improvements.
* We won't know the real performance numbers until the feature is done. As with "Consistent Reads from Standby", our initial synthetic benchmarks showed ~50% improvement; the real numbers in production were 3x better in both average throughput and latency.
* You bring up good design concerns. But conceptually multiple partitions cannot be worse than a single one. When an operation spans all partitions, it's like taking a global lock as we do today. So in that case the performance of multiple partitions degenerates to the current level, but in all other cases multiple namespace operations can go in parallel.
* Let us know if you have concrete suggestions: you don't want it to sound like FUD.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388217#comment-17388217 ] Renukaprasad C commented on HDFS-14703:

Thanks [Daryn Sharp|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=daryn] for the review & comments. Thanks [Xing Lin|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=xinglin] for the quick update.
# Was the entry point for the calls via the rpc server, fsn, fsdir, etc? Relevant since end-to-end benchmarking rarely matches microbenchmarks. We have run the benchmarking tool in standalone mode with the file:// scheme. With this we were able to achieve 50k-60k ops/sec throughput.
# What is "30-40%" improvement? How many ops/sec before and after? When we tested in standalone mode, we found an average of 30% improvement with the mkdir op. https://issues.apache.org/jira/browse/HDFS-14703?focusedCommentId=17346002=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346002
# What impact did it have on gc/min and gc time? These are often hidden killers of performance when not taken into consideration. We have noticed that there is no CPU bottleneck with the patch. We have yet to capture these metrics; we shall check further and publish if there is any impact on GC with the patch.
We would like [~shv] to clarify further.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387797#comment-17387797 ] Xing Lin commented on HDFS-14703:

[~daryn] Thanks for your comments. I will address your last question and leave the other questions to [~shv]. :) Regarding the results, we used the standard NNThroughputBenchmark, with commands like the following.

./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512

Here is a result from [~prasad-acit], since his QPS numbers are higher than what I got.

BASE: common/hadoop-hdfs-3
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Ops per sec: 56439.77875606727
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254

PATCH:
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Ops per sec: 66622.25183211193
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254
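The relative improvement for this particular pair of runs is simple arithmetic on the quoted "Ops per sec" rates. A throwaway helper (not part of NNThroughputBenchmark) recomputes it:

```java
// Quick arithmetic on the NNThroughputBenchmark rates quoted above:
// relative speedup of the FGL patch over base for this mkdirs run.
class ThroughputDelta {
    static double improvementPercent(double baseOpsPerSec, double patchOpsPerSec) {
        return (patchOpsPerSec - baseOpsPerSec) / baseOpsPerSec * 100.0;
    }

    public static void main(String[] args) {
        double base = 56439.77875606727;   // "Ops per sec" from the BASE run
        double patch = 66622.25183211193;  // "Ops per sec" from the PATCH run
        System.out.println(improvementPercent(base, patch));
    }
}
```

For this run the helper reports roughly an 18% improvement; the ~30% figures quoted elsewhere in the thread come from runs with different parameters.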
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387540#comment-17387540 ] Daryn Sharp commented on HDFS-14703:

I applaud this effort but I have concerns about real world scalability. This may positively affect a synthetic benchmark, small clusters with a small namespace, or a presumed deterministic data creation rate and locality, but... How will this scale to namespaces of up to almost 1 billion total objects with 250-500 million inodes, performing an average 20-40k ops/sec with occasional spikes exceeding 100k ops/sec, with up to 600 active applications?

{quote}entries of the same directory are partitioned into the same range most of the time. "Most of the time" means that very large directories containing too many children or some of boundary directories can still span across multiple partitions.
{quote}
How many is "too many children"? It's not uncommon for directories to contain thousands or even tens/hundreds of thousands of files. Jobs with large dags that run for minutes, hours, or days seem likely to violate "most of the time" and create high fragmentation across partitions. Task time shift caused by queue resource availability, speculation, preemption, etc will violate the premise of inodes neatly clustering into partitions based on creation time.

How will this handle things like IBR processing, which includes many blocks spread across multiple partitions? Especially during a replication flurry caused by rack loss or multi-rack decommissioning (over a hundred hosts)?

How will live lock conditions be resolved that result from multiple ops needing to lock multiple overlapping partitions? Managing that dependency graph might wipe out real world improvements at scale.

{quote}An analysis shows that the two locks are held simultaneously most of the time, making one of them redundant. We suggest removing the FSDirectoryLock.
{quote}
Please do not remove the fsdir lock. While the fsn and fsd lock are generally redundant, we have internal changes for operations like the horribly expensive content summary to not hold the fsn lock after resolving the path. I've been investigating whether some other operations can safely release the fsn lock or downgrade from a fsn write to read lock after acquiring the fsd lock.

{quote}Particularly, this means that each BlocksMap partition contains all blocks of files in the corresponding INodeMap partition.
{quote}
If the namespace is "unexpectedly" fragmented across multiple partitions per above, what further effect will this have on data skew (blocks per files) in the partition? Users generate an assortment of relatively small files plus multi-GB or TB files. A directory tree may contain combinations of dirs containing a mixture of anything from minutes/hourly/daily/weekly/monthly rollup data. This sort of partitioning seems likely to result in further lopsided partitioning within the blocks map?

{quote}We ran NNThroughputBenchmark for mkdir() operation creating 10 million directories with 200 concurrent threads. The results show 1. 30-40% improvement in throughput compared to current INodeMap implementation
{quote}
Can you please provide more context?
# Was the entry point for the calls via the rpc server, fsn, fsdir, etc? Relevant since end-to-end benchmarking rarely matches microbenchmarks.
# What is "30-40%" improvement? How many ops/sec before and after?
# Did the threads create dirs in a partition-friendly order? As in sequential creation under the same dir trees?
# What impact did it have on gc/min and gc time? These are often hidden killers of performance when not taken into consideration.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380757#comment-17380757 ] Konstantin Shvachko commented on HDFS-14703:

??Shall I raise separate Jira for Create and trace the PR??? Yes please, let's track {{create}} in a new jira. You can make it a subtask of this jira and follow [the standard process|https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute].

??Just provided work-around to continue, we shall work on it and eventually optimize it better.?? It is fine as a work-around, but yes we should, and it would be good to design it early, as it may affect the structure of the entire implementation. A short design doc on the subject would be nice to have if you have any ideas.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380424#comment-17380424 ] Renukaprasad C commented on HDFS-14703:

Thanks [~shv] for the review & feedback. Shall I raise a separate Jira for Create and trace the PR? Or is it ok to go with the current PR?
{noformat} Noticed that you implemented getInode(id) by iterating through all inodes. This is probably the key part of this effort. We should eventually replace getInode(id) with getInode(key) to make the inode lookup efficient.{noformat}
I totally agree with you; this is an overhead in finding the iNodes on a large dataset. It is just a work-around to continue; we shall work on it and eventually optimize it better.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379531#comment-17379531 ] Wei-Chiu Chuang commented on HDFS-14703:

Added a version "Fine-Grained Locking". Please use it for the target and fix versions of anything that lands in the fgl branch.

[~shv], I noticed the file LatchLock.java was added by the commit "INodeMap with PartitionedGSet and per-partition locking." https://github.com/apache/hadoop/commit/1f1a0c45fe44c3da0db9678417c4ff397a93 The commit message does not have a JIRA id and I couldn't find a JIRA with the same summary, which makes it a little hard to keep track of the current progress.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378314#comment-17378314 ] Konstantin Shvachko commented on HDFS-14703:

Great progress [~prasad-acit]. It proves the concept works for creates as well. I liked that your changes are all confined to internal classes like FSDirectory. Noticed that you implemented {{getInode(id)}} by iterating through all inodes. This is probably the key part of this effort: we should eventually replace {{getInode(id)}} with {{getInode(key)}} to make the inode lookup efficient. But hey, you still got a 25% boost.
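The getInode(id) vs getInode(key) point above amounts to a full scan of every partition versus a single-partition lookup. A minimal sketch, assuming a hash-partitioned map with one map per partition (hypothetical types and names, not the actual INodeMap/PartitionedGSet code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of why getInode(key) beats getInode(id): a key that determines its
// partition lets the lookup touch exactly one partition map, while an opaque
// id forces a scan over all partitions.
class PartitionedInodes {
    static final int NUM_RANGES = 256;
    // one map (and, in the real design, one lock) per partition
    private final Map<Long, String>[] partitions;

    @SuppressWarnings("unchecked")
    PartitionedInodes() {
        partitions = new Map[NUM_RANGES];
        for (int i = 0; i < NUM_RANGES; i++) partitions[i] = new HashMap<>();
    }

    private int partitionOf(long key) {
        return (int) Math.floorMod(key, (long) NUM_RANGES);
    }

    void put(long key, String inode) {
        partitions[partitionOf(key)].put(key, inode);
    }

    // O(1): jump straight to the owning partition.
    String getByKey(long key) {
        return partitions[partitionOf(key)].get(key);
    }

    // O(#partitions): what "iterating through all inodes" looks like.
    String getByScan(long key) {
        for (Map<Long, String> p : partitions) {
            String v = p.get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```

Both lookups return the same answer; the scan just pays for every partition it visits, which is the overhead the comment above flags for large datasets.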
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359651#comment-17359651 ] Xing Lin commented on HDFS-14703: - Hi [~prasad-acit], that is awesome! Konstantin is on vacation this week and next week. I am sure he will be very happy to review your pull request for the Create API.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359306#comment-17359306 ] Renukaprasad C commented on HDFS-14703: --- [~shv] / [~xinglin] We have implemented FGL for the Create API, done basic testing, and captured the performance readings. With the Create API we see around a 25% improvement. I have created PR [https://github.com/apache/hadoop/pull/3013] for this. Can you please review and give feedback when you get time?
Command: ./hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op create -threads 200 -files 100 -filesPerDir 40
Result:
||Iteration||Base (ops/sec)||With patch (ops/sec)||
|Itr-1|27124|32712|
|Itr-2|26460|31312|
|Itr-3|24166|32276|
|Avg|25916.66|32100|
|Improvement| |23.86%|
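The averages and the improvement figure in the table can be reproduced directly. This is only a quick arithmetic check of the reported numbers, not part of the benchmark itself:

```java
// Quick arithmetic check of the reported averages and improvement.
// The ops/sec figures are copied from the table above.
class CreateBenchmarkCheck {
    static double average(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    static double improvementPct(double[] base, double[] patched) {
        double b = average(base);
        double p = average(patched);
        return (p - b) / b * 100.0;
    }

    public static void main(String[] args) {
        double[] base = {27124, 26460, 24166};
        double[] patched = {32712, 31312, 32276};
        System.out.printf("base avg=%.2f patched avg=%.2f improvement=%.2f%%%n",
            average(base), average(patched), improvementPct(base, patched));
        // prints roughly: base avg=25916.67 patched avg=32100.00 improvement=23.86%
    }
}
```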
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346379#comment-17346379 ] Konstantin Shvachko commented on HDFS-14703: Thanks [~prasad-acit] and [~xinglin] for benchmarking. Very glad you guys could independently confirm a 30-45% improvement. I think the PartitionedGSet implementation should benefit from both *_more cores_* and a *_faster storage device_* for edits. Among storage devices, NVMe SSDs perform best for journaling-type workloads in our experience. Also please take into account that this is only a POC patch. Theoretically, we should be able to scale performance proportionally to the number of cores and partitions in the GSet, given we are not IO bound.
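The scaling argument above rests on writers to different partitions taking different locks. A minimal, self-contained sketch of that idea (illustrative striping only; the actual branch uses PartitionedGSet with a latch-lock protocol, not this simplified scheme):

```java
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of per-partition locking: a write operation locks only the
// partition owning its inode, so writers on different partitions proceed in
// parallel instead of serializing on one global namespace lock.
class StripedNamespace {
    private final ReentrantLock[] locks;
    private final long[] inodeCounts; // stand-in for per-partition metadata

    StripedNamespace(int numPartitions) {
        locks = new ReentrantLock[numPartitions];
        inodeCounts = new long[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    private int partitionOf(long inodeId) {
        return (int) Math.floorMod(inodeId, (long) locks.length);
    }

    // Write path: only the owning partition's lock is held.
    void addInode(long inodeId) {
        int p = partitionOf(inodeId);
        locks[p].lock();
        try {
            inodeCounts[p]++;
        } finally {
            locks[p].unlock();
        }
    }

    // Aggregation locks partitions one at a time, never all at once.
    long totalInodes() {
        long total = 0;
        for (int i = 0; i < locks.length; i++) {
            locks[i].lock();
            try {
                total += inodeCounts[i];
            } finally {
                locks[i].unlock();
            }
        }
        return total;
    }
}
```

With one lock per partition, contention drops roughly with the partition count until edit-log IO becomes the bottleneck, which is the behavior the benchmarks above are probing.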
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346002#comment-17346002 ] Renukaprasad C commented on HDFS-14703: --- Thanks [~shv] & [~xinglin]. I have tested on a 48-core physical machine and could see significant performance improvement with the patch. On average the improvement is around 30% with the default storage policy.
||Iteration||Base (ops/sec)||Patch (ops/sec)||
|ITR-1|56439|66622|
|ITR-2|58092|65074|
|ITR-3|60132|74354|
|ITR-4|52056|76522|
|ITR-5|55478|65526|
|ITR-6|60664|76881|
|AVG|56066|72976.3|
|Improvement| |30.16%|
Attached a few results:
{code:java}
BASE:
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 1000000
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 1000000
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Ops per sec: 56439.77875606727
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254

PATCH:
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 1000000
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 1000000
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Ops per sec: 66622.25183211193
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254
{code}
Command: ./hadoop jar ../share/hadoop/common/hadoop-hdfs-3.1.1-hw-ei-SNAPSHOT-tests.jar org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 1000000 -dirsPerDir 32
Hw configuration:
{code:java}
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping:            2
CPU MHz:             2600.406
CPU max MHz:         3500.0000
CPU min MHz:         1200.0000
BogoMIPS:            5189.51
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            30720K
NUMA node0 CPU(s):   0-11,24-35
NUMA node1 CPU(s):   12-23,36-47
{code}
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345870#comment-17345870 ] Xing Lin commented on HDFS-14703: - I did some performance benchmarks using a physical server (a d430 server in the [Utah Emulab testbed|http://www.emulab.net]). I used either a RAMDISK or an SSD as the storage for HDFS. By using a RAMDISK, we can remove the time used by the SSD to make each write persistent. For the RAMDISK case, we observed an improvement of 45% from fine-grained locking. For the SSD case, fine-grained locking gives us about a 23% improvement. We used an Intel SSD (model: SSDSC2BX200G4R). We noticed that for trunk, the mkdirs ops/sec is lower with the RAMDISK than with the SSD. We don't know the reason for this yet. We repeated the RAMDISK experiment for trunk twice to confirm the performance number.
h1. tmpfs, hadoop.tmp.dir = /run/hadoop-utos
h1. 45% improvement fgl vs. trunk
h2. trunk
{code:java}
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 663510
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Ops per sec: 15071.362
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 710248
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Ops per sec: 14079.5
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14
2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 10019540
{code}
h2. fgl
{code:java}
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Elapsed Time: 445980
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Ops per sec: 22422.530
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Average Time: 8
{code}
h1. SSD, hadoop.tmp.dir = /dev/sda4
h1. 23% improvement fgl vs. trunk
h2. trunk
{code:java}
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Elapsed Time: 593839
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Ops per sec: 16839.581
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Average Time: 11
{code}
h2. fgl
{code:java}
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Elapsed Time: 481269
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Ops per sec: 20778.400
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Average Time: 9
{code}
{code:java}
/dev/sda: ATA device, with non-removable media
Model Number:      INTEL SSDSC2BX200G4R
Serial Number:     BTHC523202RD200TGN
Firmware Revision: G201DL2D
{code}
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345069#comment-17345069 ] Xing Lin commented on HDFS-14703: - [~prasad-acit] try this command: use {{-fs file:///}} instead of {{hdfs://server:port}}. {{-fs file:///}} bypasses the RPC layer and should give you higher numbers on your VM.
dir: /home/xinglin/projs/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT
$ ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark *-fs file:///* -op mkdirs -threads 200 -dirs 10000000 -dirsPerDir 512
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344975#comment-17344975 ] Renukaprasad C commented on HDFS-14703: --- Thanks [~xinglin], I tried with 8 cores on my laptop as well as in a VM.
{code:java}
Here is my VM configuration:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Stepping:              4
CPU MHz:               3000.079
BogoMIPS:              6000.22
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7

Laptop:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 142
Model name:            Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
Stepping:              11
CPU MHz:               998.040
CPU max MHz:           3900.0000
CPU min MHz:           400.0000
BogoMIPS:              3600.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
{code}
I got better throughput in my VM, but the ops count with & without the patch remains the same.
{code:java}
[root@00956 bin]# ./hadoop jar ./hadoop-hdfs-3.1.1-hw-ei-SNAPSHOT-tests.jar org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://XX.XX.XX.XX:65110 -op mkdirs -threads 1000 -dirs 1000000 -dirsPerDir 128
2021-05-15 14:18:25,641 INFO namenode.NNThroughputBenchmark: Starting benchmark: mkdirs
2021-05-15 14:18:25,682 INFO namenode.NNThroughputBenchmark: Generate 1000000 inputs for mkdirs
2021-05-15 14:18:26,209 FATAL namenode.NNThroughputBenchmark: Log level = ERROR
2021-05-15 14:18:26,298 INFO namenode.NNThroughputBenchmark: Starting 1000000 mkdirs(s) with 1000 threads.
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: nrDirs = 1000000
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: nrThreads = 1000
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: # operations: 1000000
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: Elapsed Time: 118570
2021-05-15 14:20:25,476 INFO namenode.NNThroughputBenchmark: Ops per sec: 8433.836552247618
2021-05-15 14:20:25,476 INFO namenode.NNThroughputBenchmark: Average Time: 116
{code}
I will also try to test on a higher-end environment. Could you share the command you ran and the partition size you set?
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344970#comment-17344970 ] Xing Lin commented on HDFS-14703: - [~prasad-acit] how many CPU cores does your server have? The ops per sec seems rather low compared to what I got on my Mac laptop (with 8 cores). fgl gives us a 10% improvement running on my Mac. We will find some proper hardware to do more serious performance benchmarks.
*Trunk*
{code:java}
2021-05-11 09:52:35,666 INFO namenode.NNThroughputBenchmark:
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: nrDirs = 10000000
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 512
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: Elapsed Time: 542905
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: Ops per sec: 18419.42881351249
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: Average Time: 10
2021-05-11 09:52:35,667 INFO namenode.FSEditLog: Ending log segment 5488830, 10019538
2021-05-11 09:52:35,670 INFO namenode.FSEditLog: Number of transactions: 4530710 Total time for transactions(ms): 14288 Number of transactions batched in Syncs: 4452444 Number of syncs: 78267 SyncTimes(ms): 200575
{code}
*fgl*
{code:java}
2021-05-11 10:58:40,142 INFO namenode.NNThroughputBenchmark:
2021-05-11 10:58:40,142 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: nrDirs = 10000000
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 512
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: Elapsed Time: 505892
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: Ops per sec: 19767.06490713433
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: Average Time: 10
2021-05-11 10:58:40,143 INFO namenode.FSEditLog: Ending log segment 5826307, 10019538
2021-05-11 10:58:40,146 INFO namenode.FSEditLog: Number of transactions: 4193233 Total time for transactions(ms): 13990 Number of transactions batched in Syncs: 4130972 Number of syncs: 62262 SyncTimes(ms): 168203
{code}
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343725#comment-17343725 ] Renukaprasad C commented on HDFS-14703: --- [~shv] Thanks for sharing the patch. I tried to test the patch applied on trunk; the results are similar with & without the patch. I have attached both results below. Did I miss something?
With patch:
{code:java}
~/hadoop-3.4.0-SNAPSHOT/bin$ ./hdfs org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://localhost:9000 -op mkdirs -threads 200 -dirs 2000000 -dirsPerDir 128
2021-05-13 01:57:41,279 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-05-13 01:57:41,976 INFO namenode.NNThroughputBenchmark: Starting benchmark: mkdirs
2021-05-13 01:57:42,065 INFO namenode.NNThroughputBenchmark: Generate 2000000 inputs for mkdirs
2021-05-13 01:57:43,385 INFO namenode.NNThroughputBenchmark: Log level = ERROR
2021-05-13 01:57:44,079 INFO namenode.NNThroughputBenchmark: Starting 2000000 mkdirs(s).
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: nrDirs = 2000000
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: # operations: 2000000
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: Elapsed Time: 1095122
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: Ops per sec: 1826.2805422592187
2021-05-13 02:15:59,959 INFO namenode.NNThroughputBenchmark: Average Time: 108
{code}
Without patch:
{code:java}
~/hadoop-3.4.0-SNAPSHOT/bin$ ./hdfs org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://localhost:9000 -op mkdirs -threads 200 -dirs 2000000 -dirsPerDir 128
2021-05-13 03:25:53,243 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-05-13 03:25:54,046 INFO namenode.NNThroughputBenchmark: Starting benchmark: mkdirs
2021-05-13 03:25:54,117 INFO namenode.NNThroughputBenchmark: Generate 2000000 inputs for mkdirs
2021-05-13 03:25:55,076 INFO namenode.NNThroughputBenchmark: Log level = ERROR
2021-05-13 03:25:55,163 INFO namenode.NNThroughputBenchmark: Starting 2000000 mkdirs(s).
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: nrDirs = 2000000
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: # operations: 2000000
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: Elapsed Time: 1064420
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: Ops per sec: 1878.9575543488472
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: Average Time: 105
{code}
Similar results were achieved when I tried with {{file:///}} as well, but in this case the partitions were empty.
{code:java}
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: nrDirs = 2000000
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: # operations: 2000000
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: Elapsed Time: 845625
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: Ops per sec: 2365.1145602365114
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: Average Time: 84
2021-05-13 09:20:36,922 INFO namenode.FSEditLog: Ending log segment 1465676, 2015633
2021-05-13 09:20:36,987 INFO namenode.FSEditLog: Number of transactions: 549959 Total time for transactions(ms): 2840 Number of transactions batched in Syncs: 545346 Number of syncs: 4614 SyncTimes(ms): 240432
2021-05-13 09:20:36,996 INFO namenode.FileJournalManager: Finalizing edits file /home/renu/hadoop-3.4.0-SNAPSHOT/hdfs/namenode/current/edits_inprogress_1465676 ->
{code}
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341089#comment-17341089 ] Konstantin Shvachko commented on HDFS-14703: Updated the POC patches. There were indeed some missing parts in the first patch. See https://issues.apache.org/jira/secure/attachment/13025177/003-partitioned-inodeMap-POC.tar.gz.
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321979#comment-17321979 ] Renukaprasad C commented on HDFS-14703: --- [~shv] Thanks for sharing the design and the patch. There are some files missing in 002-partitioned-inodeMap-POC.tar.gz. Is this intended, or was your POC testing done on the 001-partitioned-inodeMap-POC.tar.gz patch only?
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279473#comment-17279473 ] Hui Fei commented on HDFS-14703: [~shv] Great feature, looking forward to it. Is it in progress?
[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213932#comment-17213932 ] Hemanth Boyina commented on HDFS-14703: --- Thanks [~shv] for your work. I have gone through the design doc, it was great. I have some questions:
1) It is clear that on startup we decide the number of partitions based on the number of inodes in the image, but how do we decide the ranges of the RangeGSets on a first-time installation of a cluster?
2) {quote}locking schema would be to allow Latch Lock for some operations combined with the Global Lock for the other ones{quote} For operations like mkdir it is sometimes required to acquire the locks of different RangeGSets. In the case of a recursive mkdir we might need to acquire the locks of several RangeGSets; if some of those RangeGSets are locked by other operations, the range-map lock might have to wait a long time.
3) {quote}I attached two remaining patches 003 and 004 that should apply to current trunk.{quote} Should these patches be applied on top of 003 and 004 in [^001-partitioned-inodeMap-POC.tar.gz]? It looks like some files are missing in the patches of [^002-partitioned-inodeMap-POC.tar.gz].
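On question 2: one standard way to handle an operation that spans several RangeGSets (like a recursive mkdir) is to acquire the needed partition locks in a fixed global order, so two multi-partition operations can never deadlock by locking in opposite orders. The sketch below is only an illustration of that general technique under stated assumptions; it is not what the fgl branch actually implements, and all names are hypothetical.

```java
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: acquire partition locks in ascending index order so two
// multi-partition operations cannot deadlock on each other.
class OrderedPartitionLocking {
    private final ReentrantLock[] partitionLocks;

    OrderedPartitionLocking(int numPartitions) {
        partitionLocks = new ReentrantLock[numPartitions];
        for (int i = 0; i < numPartitions; i++) {
            partitionLocks[i] = new ReentrantLock();
        }
    }

    // Runs the action while holding every requested partition lock,
    // acquiring in sorted order and releasing in reverse order.
    void withPartitions(int[] partitionIds, Runnable action) {
        TreeSet<Integer> ordered = new TreeSet<>();
        for (int id : partitionIds) {
            ordered.add(id); // dedup + sort fixes the acquisition order
        }
        Integer[] acquired = ordered.toArray(new Integer[0]);
        for (Integer id : acquired) {
            partitionLocks[id].lock();
        }
        try {
            action.run();
        } finally {
            for (int i = acquired.length - 1; i >= 0; i--) {
                partitionLocks[acquired[i]].unlock();
            }
        }
    }
}
```

The trade-off is exactly the one raised above: while a multi-partition operation holds its locks, single-partition writers to those partitions wait, so such operations should be rare or short.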
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191894#comment-17191894 ] Konstantin Shvachko commented on HDFS-14703: After HDFS-14731 the first two patches are already in the code. I attached two remaining patches 003 and 004 that should apply to current trunk. The intent of the patches is described in the [earlier comment|https://issues.apache.org/jira/browse/HDFS-14703?focusedCommentId=16907662=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16907662].
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188871#comment-17188871 ] Konstantin Shvachko commented on HDFS-14703: Hey guys. Glad to hear of your interest in this issue. The initial set of patches was on top of trunk at some point before HDFS-14731. Let me try to update the remaining patches for current trunk. Will take some time, though.
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187451#comment-17187451 ] Xiaoqiao He commented on HDFS-14703: cc [~shv].
{quote}I want to do some work on this issue; could you tell which version the patch is based on? Thanks{quote}
Thanks for involving me here. As far as I know, only sub-task HDFS-14731 has been merged to trunk; the others are not committed yet.
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187409#comment-17187409 ] junbiao chen commented on HDFS-14703: - Which version is the patch based on? Thanks.
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032095#comment-17032095 ] Ayush Saxena commented on HDFS-14703: - Thanks [~shv] for the design. Overall the design seems interesting. One doubt:
{quote}General case renames, which include moving a large directory under another parent, could require locking multiple partitions. In the worst case it could be equivalent (in performance) to holding a global lock{quote}
As you said, renames may require locking multiple partitions. Under heavy load, is it possible that a rename call gets stuck because it isn't able to grab the locks on multiple partitions at the same time, with one of the partitions alternately always being held?
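One conventional answer to the multi-partition concern above is to acquire per-partition locks in a fixed global order, so two renames that need overlapping partition sets cannot deadlock each other; since each acquisition blocks until granted, the rename also makes progress one partition at a time rather than repeatedly failing to grab the whole set. This is a minimal sketch under that assumption, with hypothetical names; the JIRA itself leaves the multi-partition locking schema to a separate LatchLock design.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Illustrative sketch (not HDFS code): lock several partitions in a fixed
 * global order (ascending index) so overlapping lock sets cannot deadlock.
 */
class OrderedMultiLockSketch {
  /** Acquire every needed partition lock, always in ascending index order. */
  static void lockAll(ReentrantLock[] partitions, int[] needed) {
    int[] order = needed.clone();
    Arrays.sort(order);                 // the fixed global order prevents deadlock
    for (int p : order) {
      partitions[p].lock();             // blocks until granted; progress is guaranteed
    }
  }

  /** Release the same set of partition locks. */
  static void unlockAll(ReentrantLock[] partitions, int[] needed) {
    for (int p : needed) {
      partitions[p].unlock();
    }
  }
}
```

Ordering prevents deadlock but not contention: a rename spanning many partitions still serializes against everything touching those partitions, which matches the "worst case equivalent to a global lock" remark in the quoted text.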
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955302#comment-16955302 ] Konstantin Shvachko commented on HDFS-14703: Updated the design doc. Added a picture and some details about the locking schema and about BlocksMap partitioning, including block report processing, that have already been discussed in the jira. Hope it clarifies some things.
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923782#comment-16923782 ] Konstantin Shvachko commented on HDFS-14703: Good questions guys, thanks.
??how to handle block reports???
Yes, the blocks are partitioned based on the INodeMap partitions. Each range in the INodeMap forms a GSet in the BlocksMap, which contains all blocks belonging to the files in the given range of inodes. A more formal way of defining partitions is to say that {{blockKey = <fileKey, blockId>}} and the partitioning key ranges for blocks are the same as for INodes. Block report processing is per storage. My first thought was to process a storage report under the global lock (RangeMap lock), which is no worse than today. We can further optimize this by splitting the report into INode ranges first and then processing them concurrently. The details may be tricky, as anything concerning block reports.
??if I hold a Range Map lock, does it mean that I can operate safely???
You should be safe. The RangeMap lock is like the global lock, because everybody has to enter it first thing for any operation. One still needs to check the RangeGSet lock in case somebody is still modifying that GSet, but new threads cannot enter since they will be blocked on obtaining the RangeMap lock.
??is it possible that Range Map lock might have to wait a really long time for the Range Set locks to be released???
Not really. You grab the RangeMap lock as soon as you can. Then proceed into the RangeGSet once nobody else has the lock on it. GSet locks should drain pretty fast since nobody new is entering. As I mentioned in the document, the locking schema needs a separate detailed design.
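The enter-through-the-top semantics described above can be sketched roughly as follows. This is an illustrative assumption, not the actual FSNamesystemLock/LatchLock code: the class and method names are hypothetical, and the real implementation must also handle read latching, repartitioning, and fairness.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Sketch of the latch-lock idea: every operation first passes through the
 * top-level RangeMap lock, then latches the RangeGSet lock it needs; once
 * latched, the top lock is released so other ranges can proceed. Global
 * operations take the top lock exclusively, which drains the gate.
 */
class LatchLockSketch {
  private final ReentrantReadWriteLock rangeMapLock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock[] rangeLocks;

  LatchLockSketch(int partitions) {
    rangeLocks = new ReentrantReadWriteLock[partitions];
    for (int i = 0; i < partitions; i++) {
      rangeLocks[i] = new ReentrantReadWriteLock();
    }
  }

  /** Latch into one partition for a write; caller unlocks the returned lock. */
  ReentrantReadWriteLock.WriteLock latchWrite(int partition) {
    rangeMapLock.readLock().lock();       // everybody enters through the top lock
    try {
      ReentrantReadWriteLock.WriteLock child = rangeLocks[partition].writeLock();
      child.lock();                       // may wait for a prior writer to drain
      return child;
    } finally {
      rangeMapLock.readLock().unlock();   // release the gate once latched
    }
  }

  /** Global operations (failover, large deletes) hold the top lock exclusively. */
  void lockGlobally() { rangeMapLock.writeLock().lock(); }
  void unlockGlobally() { rangeMapLock.writeLock().unlock(); }
}
```

Because a global operation blocks on the top-level write lock, new latchers cannot enter and the per-range locks "drain pretty fast", matching the answer above.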
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922937#comment-16922937 ] Anu Engineer commented on HDFS-14703: - [~shv] We ([~arp], [~xyao], [~jojochuang], [~szetszwo]) were looking at the patch as well as the document, and came across some questions that we were not able to answer. I have been tasked with asking these.
# The block partitioning - We understand that you are proposing the block partitions be divided into GSets that match the inode partitions. What we could not puzzle out was how to handle block reports. One of the suggestions we came up with was that, in the initial parts of the work, we leave the block map as a single monolith. It would be interesting to hear how you plan to partition the block map, especially when block reports are involved.
# The Range Map lock and Range Set lock - It is not very clear what the semantics would be. If I hold a Range Map lock, does it mean that I can operate safely? What happens to the Range Set locks? Do I need to make sure that all users of a RangeSet have released their locks? And if I am holding the Range Map lock, will no other thread be able to enter? Is it possible that the Range Map lock might have to wait a really long time for the Range Set locks to be released?
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922886#comment-16922886 ] Konstantin Shvachko commented on HDFS-14703: Hey [~arp], thanks for reviewing.
??Do atomic rename and snapshots still work as before with these changes???
Yes, the intention is to support atomic renames, snapshots, and all features. General case renames, which include moving a large directory under another parent, could require locking multiple partitions. In the worst case it could be equivalent (in performance) to holding a global lock, because all partitions will be locked. But more frequent small operations will be faster. We [did discuss snapshots in this regard|https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit#]. It doesn't look impossible, but needs some thinking around copy-on-write cases.
??Did you measure write throughput improvement with dfs.namenode.edits.asynclogging???
Yes, I used the default value {{dfs.namenode.edits.asynclogging = true}}
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922880#comment-16922880 ] Konstantin Shvachko commented on HDFS-14703: Hey [~hexiaoqiao], clarifying on your questions.
# The POC patches use the latch lock for only one operation - mkdir. All other operations are unchanged and use the global lock. So concurrency in the POC is guaranteed only for concurrent mkdir operations. If you use delete (or any other op) and mkdir concurrently, the results will be unpredictable, exactly as you describe. The POC goal is to demonstrate the idea; it is not the final product. ??`deleting a directory should lock all RangeGSets involved`. Is it one special case about Delete Ops??? Not only directory deletes. Several operations may need to lock multiple RangeGSets, like rename and recursive mkdir.
# The POC patch adds a {{long[] namespaceKey}} field into INode, which would increase the footprint of the namespace, which is bad. {{namespaceKey}} is not really needed, as one can always calculate the key via the {{parent}} reference. It's an optimization. An alternative is to move the {{long[]}} into {{INodesInPath}} so that the keys exist only while the INode is accessed. Again, the POC does not do A LOT of things which the final implementation should. It's a large project, please don't blame me that I didn't do everything already ;).
# Actually there is an unlock for mkdir, otherwise the POC wouldn't work. {{FSNamesystemLock.writeUnlock()}} unlocks all locked children when {{unlockChildren == true}}.
[~hexiaoqiao] looking forward to working with you on this feature. Any and all help is very much welcomed.
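The alternative mentioned in point 2 - deriving the key from {{parent}} references on demand rather than caching {{long[] namespaceKey}} in every INode - might look roughly like this. Names are hypothetical and this is not the POC code; the point is that a derived key is always fresh after a rename and costs no per-inode heap.

```java
/**
 * Sketch: derive the <ppId, pId, selfId>-style key by walking parent
 * references instead of storing it in the inode. A rename re-points
 * {@code parent}, so the next derivation automatically sees the new key.
 */
class DerivedKeySketch {
  static final class Inode {
    final long id;
    Inode parent;            // mutable: rename re-points this reference
    Inode(long id, Inode parent) { this.id = id; this.parent = parent; }
  }

  /** Build a key of {@code levels} ancestor ids plus selfId; missing ancestors are 0. */
  static long[] deriveKey(Inode node, int levels) {
    long[] key = new long[levels + 1];
    Inode cur = node;
    for (int i = levels; i >= 0; i--) {   // fill from the tail: selfId goes last
      key[i] = cur == null ? 0 : cur.id;
      cur = cur == null ? null : cur.parent;
    }
    return key;
  }
}
```

The trade-off stated in the thread holds here too: a pointer walk per access costs CPU, while the cached field costs heap and can go stale on rename.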
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919948#comment-16919948 ] Arpit Agarwal commented on HDFS-14703: -- Interesting proposal [~shv]. Thanks for sharing this and the PoC patch. I went through the doc and the idea seems interesting. I didn't understand how the partitioning scheme works. Do atomic rename and snapshots still work as before with these changes? Did you measure write throughput improvement with {{dfs.namenode.edits.asynclogging}}?
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909681#comment-16909681 ] He Xiaoqiao commented on HDFS-14703: Thanks [~shv] for your POC patches. I have to state that this is a very clever design for fine-grained global locking. There are still a couple of questions that I do not quite understand, and I look forward to your response.
1. Write concurrency control. Consider one case with two threads running mkdir(/a/b/c/d/e) and delete(/a/b/c) ops. I tried to run this case following the design and POC patches, but I usually get unstable results, since the two keys could be located in different RangeGSets using {{INodeMap#latchWriteLock}}; then the two threads can run concurrently and produce unstable results even when issued from one client, one by one. As your last comment explains, `deleting a directory should lock all RangeGSets involved`. Is it one special case about delete ops? Sorry for asking this question again.
{quote}Deleting a directory /a/b/c means deleting the entire sub-tree underneath this directory. We should lock all RangeGSets involved in such deletion, particularly the one containing file f. So f cannot be modified concurrently with the delete.{quote}
2. {{INode}} gains the field {{long[] namespaceKey}} at 0004 in the POC package. I believe this attribute is very useful for partitioning INodes; meanwhile, does it bring some other potential issues?
* Heap footprint overhead. For a long-running NameNode process, the namespaceKey of most INodes (visited at least once) in the directory tree may be non-null. If we consider 500M INodes and a {{level}} of 2, it needs more than 8GB of heap.
* When an INode is renamed, the {{namespaceKey}} has to be updated, right? Since its parent INode has changed. The POC seems not to update it anymore once {{namespaceKey}} is non-null. Is it possible to calculate the namespaceKey for an INode when using it, outside of the lock? Of course, it would bring CPU overhead.
Please correct me if I am wrong. Thanks.
3. There is no LatchLock unlock in the POC for the #mkdir operation; it seems like a bit of an oversight. In my opinion, it has to release the childLock after use, right?
[~shv] Thanks for your POC patches again; I look forward to the next milestone. And I would like to get involved to help push this feature forward if needed.
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907662#comment-16907662 ] Konstantin Shvachko commented on HDFS-14703: Attaching the POC patch. It consists of 4 commits. Apply using the {{git am 001-partitioned-inodeMap-POC/*}} command.
# The 0001 patch is an investigation to verify that the FSN lock is used together with dirLock. I just ran unit tests with this patch. Most of them pass the verification, but some don't.
# The 0002 patch disables dirLock.
# 0003 introduces PartitionedGSet and LatchLock. It implements dynamic partitioning based on the inodeId key (see INodeIdComparator).
# 0004 introduces a two-level key and implements static partitioning based on that key.
With the 0003 and 0004 patches I ran NNThroughputBenchmark creating 2 million directories with 200 concurrent threads and 128 subdirectories per directory. So the POC implements the new locking for one operation only - mkdir. The benchmark command: {{NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 200 -dirsPerDir 128}}
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903374#comment-16903374 ] Konstantin Shvachko commented on HDFS-14703: Hi [~hexiaoqiao], thanks for reviewing the doc. Very good questions:
# "Cousins" means files like {{/a/b/c/d}} and {{/a/b/m/n}}. They will have keys, respectively, {{<b, c, d>}} and {{<b, m, n>}}, which have the common prefix {{<b>}} and therefore are likely to fall into the same RangeGSet. In your example {{idc}} is the parent of {{idd}}, and this key definition does not guarantee them to be in the same range.
# Deleting a directory {{/a/b/c}} means deleting the entire sub-tree underneath this directory. We should lock all RangeGSets involved in such deletion, particularly the one containing file {{f}}. So {{f}} cannot be modified concurrently with the delete.
# Just to clarify, the RangeMap is the upper-level part of PartitionedGSet, which maps key ranges into RangeGSets. So there is only one RangeMap and many RangeGSets. Holding a lock on the RangeMap is akin to holding a global lock. You make a good point that some operations like failover, large deletes, renames, and quota changes will still require a global lock. The lock on the RangeMap could play the role of such a global lock. This should be defined in more detail within the design of LatchLock. Ideally we should retain FSNamesystemLock as a global lock for some operations. This will also help us gradually switch operations from FSNamesystemLock to LatchLock.
# I don't know what the next bottleneck will be, but you are absolutely correct that there will be something. For the edits log, I indeed saw while running my benchmarks that the number of transactions batched together while journaling was increasing. This is expected and desirable behavior, since writing large batches to a disk is more efficient than lots of small writes.
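The cousin-grouping argument in point 1 can be illustrated with a small sketch of the key scheme from the design doc ({{key(f) = <ppId, pId, selfId>}}), using hypothetical classes rather than the real INode code, with plain ids standing in for inodeIds:

```java
/**
 * Sketch of the partitioning key: key(f) = <ppId, pId, selfId>.
 * Ranges compare keys lexicographically, so siblings and cousins
 * (same grandparent, hence same key prefix) usually share a RangeGSet.
 */
class NamespaceKeySketch {
  /** Minimal inode stand-in: an id plus a parent reference. */
  static final class Inode {
    final long id;
    final Inode parent;
    Inode(long id, Inode parent) { this.id = id; this.parent = parent; }
  }

  /** Build the fixed-length key <ppId, pId, selfId>; missing ancestors map to 0. */
  static long[] key(Inode f) {
    long pId = f.parent == null ? 0 : f.parent.id;
    long ppId = (f.parent == null || f.parent.parent == null) ? 0 : f.parent.parent.id;
    return new long[] { ppId, pId, f.id };
  }

  /** Lexicographic comparison used to assign keys to contiguous key ranges. */
  static int compare(long[] a, long[] b) {
    for (int i = 0; i < a.length; i++) {
      if (a[i] != b[i]) return Long.compare(a[i], b[i]);
    }
    return 0;
  }
}
```

For /a/b/c/d and /a/b/m/n the keys share the leading grandparent id, so lexicographic range boundaries rarely separate them; a parent and its child (different prefixes), as noted above, get no such guarantee.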
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901887#comment-16901887 ] He Xiaoqiao commented on HDFS-14703: Thanks [~shv] for filing this JIRA and planning to push this feature forward; it is great work. Really appreciate it. There are some details I am confused about after reading the design document. As the design document says, each inode maps (through its inode key) to one RangeMap, which has a separate lock, so operations can be carried out concurrently.
{quote}The inode key is a fixed-length sequence of parent inodeIds ending with the file inode id itself: key(f) = <ppId, pId, selfId>, where selfId is the inodeId of file f, pId is the id of its parent, and ppId is the id of the parent of the parent. Such a definition of a key guarantees that not only siblings but also cousins (objects having the same grandparent) are partitioned into the same range most of the time{quote}
Consider the following path: /a/b/c/d/e, with corresponding inode ids [ida, idb, idc, idd].
1. How could we guarantee mapping 'cousins' into the same range? In my first opinion, they could map to different RangeMaps, since for idc its inode key = {{<ida, idb, idc>}} and for idd its inode key = {{<idb, idc, idd>}}.
2. Any consideration about operating on a node and its ancestor node concurrently? For instance, with /a/b/c/d/e/f we could delete inode c and modify inode f at the same time if they map to different ranges, since we do not guarantee mapping them to the same one. Maybe it is a problem in this case.
3. Which lock will be held for global requests like HA failover, safemode, etc.? Do we need to obtain all RangeMap locks?
4. Any bottleneck after improving write throughput? I believe EditLog OPS will keep increasing; will it become the new bottleneck?
Please correct me if I do not understand correctly. Thanks.
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900526#comment-16900526 ] Konstantin Shvachko commented on HDFS-14703: Attached the design document for review.