[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2022-07-04 Thread zhengchenyu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562354#comment-17562354
 ] 

zhengchenyu commented on HDFS-14703:


Is this still ongoing? It is great work indeed. But I wonder why the design document and the fgl branch do not hold read locks on the ancestor directories. When writing /a/b/c, I think we need to hold the read locks of /a and /a/b, and then hold the write lock of /a/b/c. If some write operation on /a/b happens at the same time, the result may be inconsistent.
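For illustration only, a minimal sketch of the locking order suggested above, using plain java.util.concurrent locks and a hypothetical per-path lock lookup rather than anything from the fgl branch:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch (not fgl branch code): read-lock every ancestor, then
// write-lock the last path component, always from the root downwards so two
// operations never acquire the same locks in opposite order.
class AncestorLockingSketch {
  private static final ConcurrentHashMap<String, ReentrantReadWriteLock> LOCKS =
      new ConcurrentHashMap<>();

  private static ReentrantReadWriteLock lockFor(String path) {
    return LOCKS.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
  }

  static void writeLastComponent(String[] components) { // e.g. {"/a", "/a/b", "/a/b/c"}
    int last = components.length - 1;
    for (int i = 0; i < last; i++) {
      lockFor(components[i]).readLock().lock();          // shared lock on /a and /a/b
    }
    lockFor(components[last]).writeLock().lock();        // exclusive lock on /a/b/c
    try {
      // ... mutate /a/b/c ...
    } finally {
      lockFor(components[last]).writeLock().unlock();
      for (int i = last - 1; i >= 0; i--) {
        lockFor(components[i]).readLock().unlock();
      }
    }
  }
}
{code}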

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, 
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.






[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-09-10 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413096#comment-17413096
 ] 

JiangHua Zhu commented on HDFS-14703:
-

Okay, I will continue to work.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-09-10 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413087#comment-17413087
 ] 

Renukaprasad C commented on HDFS-14703:
---

Thanks [~jianghuazhu] for your interest in and attention to this task.

Yes, we need to make it configurable. We didn't pay much attention to it in the POC. It would be great if you could track this issue.

Also, I suggest making the partition count (INodeMap#NUM_RANGES_STATIC) configurable along with the DEPTH.

"By the way, in our cluster, there are more than 100 million INodes."

  – We have tried up to 10M files/dirs; the larger the data set, the better the results. Please share your reports with us in case you have done benchmarking with the FGL branch.

 




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-09-09 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412908#comment-17412908
 ] 

JiangHua Zhu commented on HDFS-14703:
-

Thanks [~prasad-acit] for sharing.
Yes, I have read through the design documents, which are very good.
I think INodeMap#NAMESPACE_KEY_DEPTH should be configurable, which would help with managing the cluster. (If necessary, I can create a jira.)
By the way, in our cluster there are more than 100 million INodes.
That is why I put forward this idea.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-09-09 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412801#comment-17412801
 ] 

Renukaprasad C commented on HDFS-14703:
---

Thanks [~jianghuazhu] for sharing your thoughts. Hopefully this clarifies your doubts.

INodeMap#NAMESPACE_KEY_DEPTH is designed for flexibility. Yes, by default it is 2, which is the combination of (ParentINodeId, INodeId). When you set it to 3, the GrandParentId is included as well. We have tried up to level 3 with basic functionality, but the performance was not measured; we continued to use the default value of 2. I am not sure of any use case for increasing the value to a higher number (at least I haven't done any testing on that part).
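For illustration, a rough sketch (not the actual INodeMap code) of how a depth-2 key of (ParentINodeId, INodeId) versus a depth-3 key that also folds in the GrandParentId could be mapped to a partition; the hashing scheme here is purely hypothetical:

{code:java}
// Hypothetical sketch, not the PartitionedGSet/INodeMap implementation.
final class NamespaceKeySketch {
  // ancestorIds[0] = parent id, ancestorIds[1] = grandparent id, ...
  static int partitionFor(long[] ancestorIds, long inodeId, int depth, int numPartitions) {
    long key = inodeId;
    for (int i = 0; i < depth - 1 && i < ancestorIds.length; i++) {
      key = key * 31 + ancestorIds[i];           // fold in (depth - 1) ancestors
    }
    return (int) Math.floorMod(key, (long) numPartitions);
  }

  public static void main(String[] args) {
    long[] ancestors = {1002L /* parent */, 1001L /* grandparent */};
    // depth 2: key from (parent, inode); depth 3: also includes the grandparent
    System.out.println(partitionFor(ancestors, 1003L, 2, 256));
    System.out.println(partitionFor(ancestors, 1003L, 3, 256));
  }
}
{code}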

By default each partition's capacity is 117965 (65536 * 1.8), and we continued to use the default values in our tests. We also checked the scenarios where dynamic partitions were added. There was no performance degradation with dynamic partitions; in fact this is expected to give higher throughput. We haven't noticed very high CPU usage up to 1M file write ops (we still need to capture resource usage statistics with the base and FGL patch), so this shouldn't have any impact on other operations (RPC or any other server-side processing tasks).

In case you missed it, please go through the latest design doc - NameNode Fine-Grained Locking.pdf.

[~shv] [~xinglin] Would you like to share your inputs?




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-09-09 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17412536#comment-17412536
 ] 

JiangHua Zhu commented on HDFS-14703:
-

[~prasad-acit], I have a couple of questions out of curiosity.
The first one:
I see that INodeMap#NAMESPACE_KEY_DEPTH is a fixed value, and the default is 2. What happens if the value is 4 or 5? What I can think of is that this will affect the range of INodes allocated.
The second one:
If the value of INodeMap#NUM_RANGES_STATIC is greater than 256, the parallelism of processing and writing data will increase; will that affect RPC performance?




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-09-06 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410871#comment-17410871
 ] 

JiangHua Zhu commented on HDFS-14703:
-

Okay, I get it. Thanks [~prasad-acit] for the comment.
[~weichiu], please take a look at this comment; I hope it will help the fgl branch.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-09-06 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410702#comment-17410702
 ] 

Renukaprasad C commented on HDFS-14703:
---

[~jianghuazhu] There were 2 commits done as part of the POC in the beginning:

INodeMap with PartitionedGSet and per-partition locking (this maps to HDFS-14734 & HDFS-14732).

[FGL] Introduce INode key. (This maps to HDFS-14733.)




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-09-06 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410606#comment-17410606
 ] 

JiangHua Zhu commented on HDFS-14703:
-

I noticed that the file PartitionedGSet.java has some modifications, which were added by the commit "Add namespace key for INode. (shv)".
https://github.com/apache/hadoop/commit/455e8c019184d5d3ae7bcff4d29d9baa7aff3663
The commit message does not have a jira id, and I cannot find a jira with the same summary. It is a bit difficult to track the current progress.
I am only mentioning this as a reminder; if what I say here is wrong, I will correct it.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-30 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390818#comment-17390818
 ] 

Konstantin Shvachko commented on HDFS-14703:


Some thoughts on [~daryn]'s comment:
* For small clusters/namespaces you don't need to do anything at all; performance should be great.
* 1-billion-object namespaces can be effectively handled with Observers (HDFS-12943), as described in our [Exabyte Club blog|https://engineering.linkedin.com/blog/2021/the-exabyte-club--linkedin-s-journey-of-scaling-the-hadoop-distr].
* This namespace partitioning idea should help if you want to grow the workloads and cluster size further. And sure, it's a big "if" there.
* There is plenty of benchmark data above. I built the POC exactly with the purpose of obtaining some preliminary synthetic numbers. For me, 30% is the threshold separating worthy improvements.
* We won't know the real performance numbers until the feature is done. As with "Consistent Reads from Standby", our initial synthetic benchmarks showed ~50% improvement; the real numbers in production were 3x better in both average throughput and latency.
* You bring up good design concerns. But conceptually multiple partitions cannot be worse than a single one. When an operation spans all partitions, it's like taking a global lock as we do today (see the sketch below). So in that case the performance of multiple partitions degenerates to the current level, but in all other cases multiple namespace operations can go in parallel.
* Let us know if you have concrete suggestions; you don't want it to sound like FUD.
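To illustrate that last point, a minimal sketch (hypothetical, not the fgl branch code) of how a single-partition operation takes only its partition's lock, while an operation spanning every partition effectively takes a global lock:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: one lock per namespace partition. An operation that
// touches a single partition locks only that partition; an operation spanning
// every partition takes all locks in index order, which behaves like today's
// single global lock for that one operation.
class PartitionLockSketch {
  private final ReentrantReadWriteLock[] partitions;

  PartitionLockSketch(int numPartitions) {
    partitions = new ReentrantReadWriteLock[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      partitions[i] = new ReentrantReadWriteLock();
    }
  }

  void withPartitionWriteLock(int p, Runnable op) {
    partitions[p].writeLock().lock();
    try { op.run(); } finally { partitions[p].writeLock().unlock(); }
  }

  void withAllPartitionsWriteLocked(Runnable op) {   // degenerates to a global lock
    for (ReentrantReadWriteLock l : partitions) l.writeLock().lock();
    try { op.run(); } finally {
      for (int i = partitions.length - 1; i >= 0; i--) partitions[i].writeLock().unlock();
    }
  }
}
{code}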




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-27 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388217#comment-17388217
 ] 

Renukaprasad C commented on HDFS-14703:
---

Thanks [Daryn Sharp|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=daryn] for the review & comments. Thanks [Xing Lin|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=xinglin] for the quick update.
 # Was the entry point for the calls via the rpc server, fsn, fsdir, etc? Relevant since end-to-end benchmarking rarely matches microbenchmarks.

We ran the benchmarking tool in standalone mode with the file:// scheme. With this we were able to achieve 50k-60k ops/sec throughput.
 # What is “30-40%” improvement? How many ops/sec before and after?

When we tested in standalone mode, we found an average 30% improvement for the mkdir op.

https://issues.apache.org/jira/browse/HDFS-14703?focusedCommentId=17346002=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346002
 # What impact did it have on gc/min and gc time? These are often hidden killers of performance when not taken into consideration.

We have noticed that there is no CPU bottleneck with the patch. These metrics we still need to capture; we shall check further and publish if there is any impact on GC with the patch.

We would like [~shv] to clarify further.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-27 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387797#comment-17387797
 ] 

Xing Lin commented on HDFS-14703:
-

[~daryn] Thanks for your comments. I will address your last question and leave the other questions to [~shv]. :)

Regarding the results, we used the standard NNThroughputBenchmark, with commands like the following:

./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512

Here is a result from [~prasad-acit], since his QPS numbers are higher than what I got.
BASE:
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 1000000
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 1000000
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark:  Ops per sec: 56439.77875606727
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254

PATCH:
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 1000000
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 1000000
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark:  Ops per sec: 66622.25183211193
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-26 Thread Daryn Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387540#comment-17387540
 ] 

Daryn Sharp commented on HDFS-14703:


I applaud this effort but I have concerns about real world scalability.  This 
may positively affect a synthetic benchmark, small clusters with a small 
namespace, or a presumed deterministic data creation rate and locality, but…. 
How will this scale to namespaces up to almost 1 billion total objects with 
250-500 million inodes performing an avg 20-40k ops/sec with occasional spikes 
exceeding 100k ops/sec, with up to 600 active applications?
{quote}entries of the same directory are partitioned into the same range most 
of the time. “Most of the time” means that very large directories containing 
too many children or some of boundary directories can still span across 
multiple partitions.
{quote}
How many is “too many children”? It’s not uncommon for directories to contain 
thousands or even tens/hundreds of thousands of files.

Jobs with large dags that run for minutes, hours, or days seem likely to violate “most of the time” and create high fragmentation across partitions. 
Task time shift caused by queue resource availability, speculation, preemption, 
etc will violate the premise of inodes neatly clustering into partitions based 
on creation time.

How will this handle things like IBR processing which includes many blocks 
spread across multiple partitions? Especially during a replication flurry 
caused by rack loss or multi-rack decommissioning (over a hundred hosts)?

How will live lock conditions be resolved that result from multiple ops needing 
to lock multiple overlapping partitions? Managing that dependency graph might 
wipe out real world improvements at scale.
{quote}An analysis shows that the two locks are held simultaneously most of the 
time, making one of them redundant. We suggest removing the FSDirectoryLock.
{quote}
Please do not remove the fsdir lock. While the fsn and fsd lock are generally 
redundant, we have internal changes for operations like the horribly expensive 
content summary to not hold the fsn lock after resolving the path. I’ve been 
investigating whether some other operations can safely release the fsn lock or 
downgrade from a fsn write to read lock after acquiring the fsd lock.
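For context, a generic sketch of the downgrade pattern described above, using plain java.util.concurrent locks as a stand-in for the actual FSNamesystem/FSDirectory lock classes:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Generic illustration only: take the fsn write lock, acquire the fsd lock,
// then downgrade the fsn lock to a read lock before the expensive part of the
// operation (e.g. content summary) runs.
class LockDowngradeSketch {
  private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();
  private final ReentrantReadWriteLock fsdLock = new ReentrantReadWriteLock();

  void expensiveReadMostlyOp() {
    fsnLock.writeLock().lock();
    fsdLock.readLock().lock();
    // Downgrade: acquire the fsn read lock *before* releasing the write lock,
    // so no other writer can slip in between.
    fsnLock.readLock().lock();
    fsnLock.writeLock().unlock();
    try {
      // ... long-running work under the fsd lock and the fsn read lock ...
    } finally {
      fsnLock.readLock().unlock();
      fsdLock.readLock().unlock();
    }
  }
}
{code}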
{quote}Particularly, this means that each BlocksMap partition contains all 
blocks of files in the corresponding INodeMap partition.
{quote}
If the namespace is “unexpectedly” fragmented across multiple partitions per 
above, what further effect will this have on data skew (blocks per file) in 
the partition? Users generate an assortment of relatively small files plus 
multi-GB or TB files. A directory tree may contain combinations of dirs 
containing a mixture of anything from minutes/hourly/daily/weekly/monthly 
rollup data. This sort of partitioning seems likely to result in further 
lopsided partitioning within the blocks map?
{quote}We ran NNThroughputBenchmark for mkdir() operation creating 10 million 
directories with 200 concurrent threads. The results show
 1. 30-40% improvement in throughput compared to current INodeMap implementation
{quote}
Can you please provide more context?
 # Was the entry point for the calls via the rpc server, fsn, fsdir, etc? 
Relevant since end-to-end benchmarking rarely matches microbenchmarks.
 # What is “30-40%” improvement? How many ops/sec before and after?
 # Did the threads create dirs in a partition-friendly order? As in sequential 
creation under the same dir trees?
 # What impact did it have on gc/min and gc time? These are often hidden 
killers of performance when not taken into consideration.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-14 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380757#comment-17380757
 ] 

Konstantin Shvachko commented on HDFS-14703:


??Shall I raise separate Jira for Create and trace the PR???
Yes please let's track {{create}} in a new jira. You can make it a subtask of 
this jira and follow [the standard 
process|https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute].

??Just provided work-around to continue, we shall work on it and eventually 
optimize it better.??
It is fine as a workaround, but yes we should optimize it, and it would be good to design it early, as it may affect the structure of the entire implementation. A short design doc on the subject would be nice to have if you have any ideas.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-14 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380424#comment-17380424
 ] 

Renukaprasad C commented on HDFS-14703:
---

Thanks [~shv] for the review & feedback.

Shall I raise a separate Jira for Create and track the PR? Or is it ok to go with the current PR?
{noformat}
Noticed that you implemented getInode(id) by iterating through all inodes. This is probably the key part of this effort. We should eventually replace getInode(id) with getInode(key) to make the inode lookup efficient.{noformat}
I totally agree with you; this is an overhead in finding the iNodes on a large dataset. We just provided a work-around to continue; we shall work on it and eventually optimize it.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-12 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379531#comment-17379531
 ] 

Wei-Chiu Chuang commented on HDFS-14703:


Added a version "Fine-Grained Locking". Please use it for target and fix 
versions for anything that lands in the fgl branch.

[~shv], I noticed the file LatchLock.java was added by the commit "INodeMap with PartitionedGSet and per-partition locking."
https://github.com/apache/hadoop/commit/1f1a0c45fe44c3da0db9678417c4ff397a93
The commit message does not have a jira id and I couldn't find a jira with the same summary. It is a little hard to keep track of the current progress.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-07-09 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17378314#comment-17378314
 ] 

Konstantin Shvachko commented on HDFS-14703:


Great progress [~prasad-acit]. It proves the concept works for creates as well.
I liked that your changes are all confined to internal classes like FSDirectory. I noticed that you implemented {{getInode(id)}} by iterating through all inodes. This is probably the key part of this effort. We should eventually replace {{getInode(id)}} with {{getInode(key)}} to make the inode lookup efficient.
But hey, you still got a 25% boost.
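For illustration, a minimal sketch (hypothetical structures, not the actual PartitionedGSet API) of why the current {{getInode(id)}} work-around is costly and how a key-based lookup avoids the full scan:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the map is keyed by (parentId, inodeId), so resolving
// an inode by id alone has to iterate over every entry, while a lookup by the
// full key is a single probe into the map.
class InodeLookupSketch {
  static final class INode {
    final long id;
    INode(long id) { this.id = id; }
  }

  private final Map<List<Long>, INode> inodes = new ConcurrentHashMap<>();

  void put(long parentId, INode inode) {
    inodes.put(List.of(parentId, inode.id), inode);
  }

  // Work-around similar to the POC: O(total inodes) scan.
  INode getInodeById(long id) {
    for (INode inode : inodes.values()) {
      if (inode.id == id) {
        return inode;
      }
    }
    return null;
  }

  // Key-based lookup: goes straight to the entry.
  INode getInodeByKey(long parentId, long id) {
    return inodes.get(List.of(parentId, id));
  }
}
{code}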




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-06-08 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359651#comment-17359651
 ] 

Xing Lin commented on HDFS-14703:
-

Hi [~prasad-acit], that is awesome! Konstantin is on vacation this week and next week. I am sure he will be very happy to review your pull request for the Create API.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-06-08 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359306#comment-17359306
 ] 

Renukaprasad C commented on HDFS-14703:
---

[~shv] / [~xinglin]

We have implemented FGL for the Create API, done basic testing, and captured the performance readings. With the Create API we could see around a 25% improvement.

I have created PR - [https://github.com/apache/hadoop/pull/3013] for the same. Can you please review and give feedback when you get time?

Command:

./hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op create -threads 200 -files 100 -filesPerDir 40

Result:

||Iteration||Base (ops/sec)||FGL (ops/sec)||
|Itr-1|27124|32712|
|Itr-2|26460|31312|
|Itr-3|24166|32276|
|Avg|25916.66|32100|
|Improvement| |23.86%|

 

 

 




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-17 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346379#comment-17346379
 ] 

Konstantin Shvachko commented on HDFS-14703:


Thanks [~prasad-acit] and [~xinglin] for benchmarking. Very glad you guys could independently confirm the 30-45% improvement.
I think the PartitionedGSet implementation should benefit from both *_more cores_* and a *_faster storage device_* for edits. For the storage device, NVMe SSDs perform best for journaling-type workloads in our experience.
Also please take into account that this is only a POC patch. Theoretically, we should be able to scale performance proportionally to the number of cores and partitions in the GSet, given we are not IO bound.




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-17 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346002#comment-17346002
 ] 

Renukaprasad C commented on HDFS-14703:
---

Thanks [~shv] & [~xinglin].
I have tested on a 48-core physical machine and could see a significant performance improvement with the patch.
On average the improvement is around 30% with the default storage policy.

        Base    Patch
ITR-1   56439   66622
ITR-2   58092   65074
ITR-3   60132   74354
ITR-4   52056   76522
ITR-5   55478   65526
ITR-6   60664   76881
AVG     56066   72976.3

Improvement: 30.16%

Attached a few results:

{code:java}
BASE:
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 1000000
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 1000000
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark:  Ops per sec: 56439.77875606727
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254

PATCH:
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 1000000
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 1000000
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark:  Ops per sec: 66622.25183211193
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254
{code}

Command: ./hadoop jar ../share/hadoop/common/hadoop-hdfs-3.1.1-hw-ei-SNAPSHOT-tests.jar org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 1000000 -dirsPerDir 32

Hw configuration:
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
CPU(s):48
On-line CPU(s) list:   0-47
Thread(s) per core:2
Core(s) per socket:12
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 63
Model name:Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping:  2
CPU MHz:   2600.406
CPU max MHz:   3500.
CPU min MHz:   1200.
BogoMIPS:  5189.51
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47





[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-16 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345870#comment-17345870
 ] 

Xing Lin commented on HDFS-14703:
-

I did some performance benchmarks using a physical server (a d430 server in the Utah Emulab testbed, http://www.emulab.net). I used either a RAMDISK or an SSD as the storage for HDFS. By using a RAMDISK, we can remove the time the SSD takes to make each write persistent. For the RAMDISK case, we observed an improvement of 45% from fine-grained locking. For the SSD case, fine-grained locking gives us a 20% improvement. We used an Intel SSD (model: SSDSC2BX200G4R).

We noticed that for trunk, the mkdirs ops/sec is lower on the RAMDISK than on the SSD. We don't know the reason for this yet. We repeated the experiment on the RAMDISK for trunk twice to confirm the performance number.
h1. tmpfs, hadoop-tmp-dir = /run/hadoop-utos
h1. 45% improvements fgl vs. trunk
h2. trunk

2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 663510
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark:  Ops per sec: 15071.362
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 710248
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark:  Ops per sec: 14079.5
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14
2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 10019540

h2. fgl

2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Elapsed Time: 445980
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark:  Ops per sec: 22422.530
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Average Time: 8

h1. SSD, hadoop.tmp.dir=/dev/sda4
h1. 23% improvement fgl vs. trunk

h2. trunk

2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Elapsed Time: 593839
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark:  Ops per sec: 16839.581
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Average Time: 11

h2. fgl

2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Elapsed Time: 481269
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark:  Ops per sec: 20778.400
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Average Time: 9

 

/dev/sda:

ATA device, with non-removable media

Model Number:       INTEL SSDSC2BX200G4R

Serial Number:      BTHC523202RD200TGN

Firmware Revision:  G201DL2D




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-15 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345069#comment-17345069
 ] 

Xing Lin commented on HDFS-14703:
-

[~prasad-acit] try this command: use -fs file:///, instead of 
hdfs://server:port. "-fs file:///" will bypass the RPC layer and should give 
you higher numbers at your VM. 

dir: /home/xinglin/projs/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT

$ ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark 
*-fs file:///* -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-15 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344975#comment-17344975
 ] 

Renukaprasad C commented on HDFS-14703:
---

Thanks [~xinglin],
I tried with 8 cores on a laptop as well as in a VM.

{code:java}

Here is my VM Configuration:
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):8
On-line CPU(s) list:   0-7
Thread(s) per core:1
Core(s) per socket:8
Socket(s): 1
NUMA node(s):  1
Vendor ID: GenuineIntel
CPU family:6
Model: 62
Model name:Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Stepping:  4
CPU MHz:   3000.079
BogoMIPS:  6000.22
Hypervisor vendor: Xen
Virtualization type:   full
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  25600K
NUMA node0 CPU(s): 0-7


Laptop:
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):   1
NUMA node(s):1
Vendor ID:   GenuineIntel
CPU family:  6
Model:   142
Model name:  Intel(R) Core(TM) i5-8265U CPU @ 1.60GHz
Stepping:11
CPU MHz: 998.040
CPU max MHz: 3900.
CPU min MHz: 400.
BogoMIPS:3600.00
Virtualization:  VT-x
L1d cache:   32K
L1i cache:   32K
L2 cache:256K
L3 cache:6144K
NUMA node0 CPU(s):   0-7

{code}

I got better throughput in my VM, but the ops count with and without the patch remains about the same.
[root@00956 bin]# ./hadoop jar ./hadoop-hdfs-3.1.1-hw-ei-SNAPSHOT-tests.jar org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://XX.XX.XX.XX:65110 -op mkdirs -threads 1000 -dirs 1000000 -dirsPerDir 128
2021-05-15 14:18:25,641 INFO namenode.NNThroughputBenchmark: Starting benchmark: mkdirs
2021-05-15 14:18:25,682 INFO namenode.NNThroughputBenchmark: Generate 1000000 inputs for mkdirs
2021-05-15 14:18:26,209 FATAL namenode.NNThroughputBenchmark: Log level = ERROR
2021-05-15 14:18:26,298 INFO namenode.NNThroughputBenchmark: Starting 1000000 mkdirs(s) with 1000 threads.
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark:
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: nrDirs = 1000000
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: nrThreads = 1000
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: # operations: 1000000
2021-05-15 14:20:25,475 INFO namenode.NNThroughputBenchmark: Elapsed Time: 118570
2021-05-15 14:20:25,476 INFO namenode.NNThroughputBenchmark:  Ops per sec: 8433.836552247618
2021-05-15 14:20:25,476 INFO namenode.NNThroughputBenchmark: Average Time: 116


I will also try to test on a higher-end environment. Could you share the command you ran and the partition size you set?




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-15 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344970#comment-17344970
 ] 

Xing Lin commented on HDFS-14703:
-

[~prasad-acit] how many CPU cores does your server have? The ops per sec seems rather low, lower than what I got from my Mac laptop (with 8 cores). fgl gives us a 10% improvement running on my Mac. We will find some proper hardware to do more serious performance benchmarks.

 
*Trunk*
2021-05-11 09:52:35,666 INFO namenode.NNThroughputBenchmark:
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: nrDirs = 10000000
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 512
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: Elapsed Time: 542905
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark:  Ops per sec: 18419.42881351249
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: Average Time: 10
2021-05-11 09:52:35,667 INFO namenode.FSEditLog: Ending log segment 5488830, 10019538
2021-05-11 09:52:35,670 INFO namenode.FSEditLog: Number of transactions: 4530710 Total time for transactions(ms): 14288 Number of transactions batched in Syncs: 4452444 Number of syncs: 78267 SyncTimes(ms): 200575

*fgl*
2021-05-11 10:58:40,142 INFO namenode.NNThroughputBenchmark:
2021-05-11 10:58:40,142 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: nrDirs = 10000000
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 512
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: # operations: 10000000
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: Elapsed Time: 505892
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark:  Ops per sec: 19767.06490713433
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: Average Time: 10
2021-05-11 10:58:40,143 INFO namenode.FSEditLog: Ending log segment 5826307, 10019538
2021-05-11 10:58:40,146 INFO namenode.FSEditLog: Number of transactions: 4193233 Total time for transactions(ms): 13990 Number of transactions batched in Syncs: 4130972 Number of syncs: 62262 SyncTimes(ms): 168203




[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-13 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343725#comment-17343725
 ] 

Renukaprasad C commented on HDFS-14703:
---

[~shv] Thanks for sharing the patch.
I tried to test the patch applied on trunk; the results were similar with and without 
the patch. I have attached both results below. Did I miss something?

With Patch:
{code:java}
~/hadoop-3.4.0-SNAPSHOT/bin$ ./hdfs org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://localhost:9000 -op mkdirs -threads 200 -dirs 2000000 -dirsPerDir 128
2021-05-13 01:57:41,279 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-05-13 01:57:41,976 INFO namenode.NNThroughputBenchmark: Starting benchmark: mkdirs
2021-05-13 01:57:42,065 INFO namenode.NNThroughputBenchmark: Generate 2000000 inputs for mkdirs
2021-05-13 01:57:43,385 INFO namenode.NNThroughputBenchmark: Log level = ERROR
2021-05-13 01:57:44,079 INFO namenode.NNThroughputBenchmark: Starting 2000000 mkdirs(s).
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: 
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: nrDirs = 2000000
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: # operations: 2000000
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: Elapsed Time: 1095122
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark:  Ops per sec: 1826.2805422592187
2021-05-13 02:15:59,959 INFO namenode.NNThroughputBenchmark: Average Time: 108
{code}

Without Patch:
{code:java}
/hadoop-3.4.0-SNAPSHOT/bin$ ./hdfs org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://localhost:9000 -op mkdirs -threads 200 -dirs 2000000 -dirsPerDir 128
2021-05-13 03:25:53,243 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-05-13 03:25:54,046 INFO namenode.NNThroughputBenchmark: Starting benchmark: mkdirs
2021-05-13 03:25:54,117 INFO namenode.NNThroughputBenchmark: Generate 2000000 inputs for mkdirs
2021-05-13 03:25:55,076 INFO namenode.NNThroughputBenchmark: Log level = ERROR
2021-05-13 03:25:55,163 INFO namenode.NNThroughputBenchmark: Starting 2000000 mkdirs(s).
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: 
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: nrDirs = 2000000
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: # operations: 2000000
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: Elapsed Time: 1064420
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark:  Ops per sec: 1878.9575543488472
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: Average Time: 105
{code}


Similar results were achieved when I tried with "file" as well, but in this case 
the partitions were empty.

{code:java}
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: 
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: nrDirs = 2000000
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: # operations: 2000000
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: Elapsed Time: 845625
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark:  Ops per sec: 2365.1145602365114
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: Average Time: 84
2021-05-13 09:20:36,922 INFO namenode.FSEditLog: Ending log segment 1465676, 2015633
2021-05-13 09:20:36,987 INFO namenode.FSEditLog: Number of transactions: 549959 Total time for transactions(ms): 2840 Number of transactions batched in Syncs: 545346 Number of syncs: 4614 SyncTimes(ms): 240432 
2021-05-13 09:20:36,996 INFO namenode.FileJournalManager: Finalizing edits file /home/renu/hadoop-3.4.0-SNAPSHOT/hdfs/namenode/current/edits_inprogress_1465676 -> 

[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-05-07 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341089#comment-17341089
 ] 

Konstantin Shvachko commented on HDFS-14703:


Updated the POC patches. There were indeed some missing parts in the first 
patch. See 
[https://issues.apache.org/jira/secure/attachment/13025177/003-partitioned-inodeMap-POC.tar.gz|https://issues.apache.org/jira/secure/attachment/13025177/003-partitioned-inodeMap-POC.tar.gz].

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, 
> NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-04-15 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321979#comment-17321979
 ] 

Renukaprasad C commented on HDFS-14703:
---

[~shv] Thanks for sharing the design and the patch.

There are some files missing in 002-partitioned-inodeMap-POC.tar.gz. Are these 
omissions intended, or was your POC test done on the 001-partitioned-inodeMap-POC.tar.gz 
patch only?

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, NameNode Fine-Grained Locking.pdf, 
> NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2021-02-05 Thread Hui Fei (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279473#comment-17279473
 ] 

Hui Fei commented on HDFS-14703:


[~shv] Great feature, looking forward to it. Is it in progress?

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, NameNode Fine-Grained Locking.pdf, 
> NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2020-10-14 Thread Hemanth Boyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213932#comment-17213932
 ] 

Hemanth Boyina commented on HDFS-14703:
---

Thanks [~shv] for your work.

I have gone through the design doc, it is great. I have some questions:

1) It is clear that on startup we decide the number of partitions based on the 
number of inodes in the image, but how do we decide the ranges of the RangeGSets on a 
first-time installation of a cluster?

2)
{quote}locking schema would be to allow Latch Lock for some operations combined 
with the Global Lock for the other ones
{quote}
For operations like mkdir it is sometimes required to acquire locks on different 
RangeGSets. In the case of a recursive mkdir we might need to acquire locks on 
different RangeGSets; if some of the RangeGSets are locked for other 
operations, then the RangeMap lock might have to wait for a long time.

3)
{quote}I attached two remaining patches 003 and 004 that should apply to 
current trunk.
{quote}
Should these patches be applied on top of 003 and 004 in 
[^001-partitioned-inodeMap-POC.tar.gz]? It looks like some files are missing in the 
patches of [^002-partitioned-inodeMap-POC.tar.gz].

 

 

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, NameNode Fine-Grained Locking.pdf, 
> NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2020-09-07 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191894#comment-17191894
 ] 

Konstantin Shvachko commented on HDFS-14703:


After HDFS-14731 the first two patches are already in the code. I attached two 
remaining patches 003 and 004 that should apply to current trunk.
The intent of the patches is described in the [earlier 
comment|https://issues.apache.org/jira/browse/HDFS-14703?focusedCommentId=16907662=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16907662].

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, 
> 002-partitioned-inodeMap-POC.tar.gz, NameNode Fine-Grained Locking.pdf, 
> NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2020-09-01 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188871#comment-17188871
 ] 

Konstantin Shvachko commented on HDFS-14703:


Hey guys. Glad to hear of your interest in this issue.
The initial set of patches was on top of trunk at some point before HDFS-14731.
Let me try to update the remaining patches for current trunk. Will take some 
time, though.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2020-08-31 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187451#comment-17187451
 ] 

Xiaoqiao He commented on HDFS-14703:


cc [~shv].
{quote}I want to do some work on this issue, could you tell which version the 
patch is based on? Thanks{quote}
Thanks for involving me here. As far as I know, only sub-task HDFS-14731 has been 
merged to trunk; the others are not committed yet.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2020-08-30 Thread junbiao chen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187409#comment-17187409
 ] 

junbiao chen commented on HDFS-14703:
-

Which version is the patch based on? Thanks.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2020-02-06 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032095#comment-17032095
 ] 

Ayush Saxena commented on HDFS-14703:
-

Thanx [~shv] for the design. Overall the design seems interesting.

One doubt:
{quote}General case renames, which include moving a large directory under 
another parent, could require locking multiple partitions. In the worst case it 
could be equivalent (in performance) to holding a global lock
{quote}
As you said, in the case of renames it may require locking multiple partitions. 
Under heavy load, is it possible that a rename call gets stuck because it 
isn't able to grab the locks on multiple partitions at one time, with one of the 
partitions alternately always being held?

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-10-19 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955302#comment-16955302
 ] 

Konstantin Shvachko commented on HDFS-14703:


Updated the design doc. Added a picture and some details about the locking 
schema, BlocksMap partitioning including block report processing, that have 
already been discussed in the jira. Hope it clarifies some things.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-09-05 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923782#comment-16923782
 ] 

Konstantin Shvachko commented on HDFS-14703:


Good questions guys, thanks.

??how to handle block reports???
Yes, the blocks are partitioned based on the INodeMap partitions. Each range in 
INodeMap forms a GSet in the BlocksMap, which contains all blocks belonging to 
the files in the given range of inodes. A more formal way of defining 
partitions is to say that the key of a block is derived from the key of the file it 
belongs to, so the partitioning key ranges for blocks are the same as for INodes.
Block report processing is per storage. My first thought was to process a 
storage report under the global lock (RangeMap lock), which is no worse than 
today. We can further optimize this by splitting the report into INode ranges 
first and then processing them concurrently. The details may be tricky, as 
anything concerning block reports.
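To illustrate the splitting idea, here is a minimal hypothetical sketch (the class and member names below are made up for illustration, not the actual BlockManager or POC API): group the replicas of one storage report by the key range of the owning file, then process each group under that range's lock, potentially in parallel.
{code:java}
import java.util.*;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch only: split one storage report into per-range groups,
// then process each group under its own range lock. Illustrative names,
// not the actual BlockManager/PartitionedGSet code.
class BlockReportSplitter {
  static final class ReportedBlock {
    final long fileKey;   // partitioning key of the file owning this replica
    final long blockId;
    ReportedBlock(long fileKey, long blockId) {
      this.fileKey = fileKey;
      this.blockId = blockId;
    }
  }

  // One lock per key range, keyed by the range start key.
  private final NavigableMap<Long, ReentrantReadWriteLock> rangeLocks;

  BlockReportSplitter(NavigableMap<Long, ReentrantReadWriteLock> rangeLocks) {
    this.rangeLocks = rangeLocks;
  }

  void process(List<ReportedBlock> storageReport) {
    // 1. Split the report into groups, one per inode key range.
    Map<Long, List<ReportedBlock>> groups = new TreeMap<>();
    for (ReportedBlock b : storageReport) {
      Long rangeStart = rangeLocks.floorKey(b.fileKey);
      if (rangeStart == null) {
        rangeStart = rangeLocks.firstKey();
      }
      groups.computeIfAbsent(rangeStart, k -> new ArrayList<>()).add(b);
    }
    // 2. Process each group under its range lock; the groups are independent,
    //    so they could also be handed to a thread pool and run concurrently.
    groups.forEach((rangeStart, blocks) -> {
      ReentrantReadWriteLock lock = rangeLocks.get(rangeStart);
      lock.writeLock().lock();
      try {
        blocks.forEach(b -> { /* update replica state for b.blockId here */ });
      } finally {
        lock.writeLock().unlock();
      }
    });
  }
}
{code}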
??if I hold a Range Map lock, does it mean that I can operate safely???
You should be. The RangeMap lock is like the global lock, because everybody has 
to enter it first thing for any operation. One still needs to check the RangeGSet 
lock in case somebody is still modifying that GSet, but new threads cannot 
enter since they will be blocked on obtaining the RangeMap lock.
??is it possible that Range Map lock might have to wait a really long time for 
the Range Set locks to be released???
Not really. You grab the RangeMap lock as soon as you can. Then proceed into 
RangeGSet once nobody else has the lock on it. GSet locks should drain pretty 
fast since nobody new is entering. 

As I mentioned in the document, locking schema needs a separate detailed design.
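Until that detailed design exists, here is a minimal sketch of the acquisition order as described above (hypothetical class and method names, not the actual FSNamesystemLock or LatchLock code):
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the latch-lock ordering: every operation enters the
// top-level RangeMap lock first, then takes the lock of the RangeGSet it
// needs, and only then lets go of the top-level lock so operations on other
// ranges can proceed. Names are illustrative only.
class LatchLockSketch {
  private final ReentrantReadWriteLock rangeMapLock = new ReentrantReadWriteLock();

  void rangeOperation(ReentrantReadWriteLock rangeGSetLock, Runnable update) {
    rangeMapLock.readLock().lock();      // common entry point for every operation
    rangeGSetLock.writeLock().lock();    // latch onto the affected partition
    rangeMapLock.readLock().unlock();    // other ranges may now proceed
    try {
      update.run();                      // mutate inodes of this range only
    } finally {
      rangeGSetLock.writeLock().unlock();
    }
  }

  void globalOperation(Runnable op) {
    // Exclusive top-level lock blocks new entrants; in the real design the
    // holder must also wait for in-flight RangeGSet holders to drain.
    rangeMapLock.writeLock().lock();
    try {
      op.run();
    } finally {
      rangeMapLock.writeLock().unlock();
    }
  }
}
{code}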

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-09-04 Thread Anu Engineer (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922937#comment-16922937
 ] 

Anu Engineer commented on HDFS-14703:
-

[~shv] We ([~arp] , [~xyao] , [~jojochuang] , [~szetszwo] ) were looking at the 
patch as well as the document, and came across some questions that we were not 
able to answer. I have been tasked with asking these.
 # The Block Partition - We understand that you are proposing the block 
partitions be divided into GSets that match the inode partitions. What we could 
not puzzle out was how to handle block reports. One of the suggestions we came 
up with was that in the initial parts of the work, we leave the block map as a 
single monolith. It would be interesting to hear how you plan to partition the 
block map, especially when block reports are involved.
 # The Range Map lock and Range Set lock - It is not very clear 
what the semantics would be. If I hold a Range Map lock, does it mean that I 
can operate safely? What happens to the Range Set locks? Do I need to make sure 
that all users of a RangeSet have released their locks? And if I am holding the 
Range Map lock, no other thread will be able to enter? Is it possible that the 
Range Map lock might have to wait a really long time for the Range Set locks to 
be released?

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-09-04 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922886#comment-16922886
 ] 

Konstantin Shvachko commented on HDFS-14703:


Hey [~arp] , thanks for reviewing.
??Do atomic rename and snapshots still work as before with these changes???
Yes, the intention is to support atomic renames, snapshots, and all features. 
General case renames, which include moving a large directory under another 
parent, could require locking multiple partitions. In the worst case it could 
be equivalent (in performance) to holding a global lock, because all partitions 
will be locked. But more frequent small operations will be faster.
We [did discuss snapshots in this 
regard|https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit#].
 Doesn't look impossible, needs some thinking around copy-on-write cases.
??Did you measure write throughput improvement with 
dfs.namenode.edits.asynclogging???
Yes, I used the default value {{dfs.namenode.edits.asynclogging = true}}

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-09-04 Thread Konstantin Shvachko (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922880#comment-16922880
 ] 

Konstantin Shvachko commented on HDFS-14703:


Hey [~hexiaoqiao], clarifying on your questions.
 # The POC patches use latch lock only for one operation - mkdir. All other 
operations are unchanged and use the global lock. So concurrency in POC is 
guaranteed only for concurrent mkdir operations. If you use delete (or any 
other op) and mkdir concurrently the results will be unpredictable exactly as 
you describe. The POC goal is to demonstrate the idea, it is not the final 
product.
 ??`deleting a directory should lock all RangeGSets involved`. Is it one special 
case about Delete Ops???
 Not only directory deletes. Several operations may need to lock multiple 
RangeGSets, like rename and recursive mkdir.
 # The POC patch adds a {{long[] namespaceKey}} field into INode, which would 
increase the footprint of the namespace, which is bad. {{namespaceKey}} is not 
really needed, as one can always calculate the key via the {{parent}} 
reference; it is only an optimization. An alternative is to move the {{long[]}} into 
{{INodesInPath}} so that the keys exist only while the INode is accessed (see the 
sketch after this list).
 Again, the POC does not do A LOT of things which the final implementation should. 
It's a large project, please don't blame me that I didn't do everything already 
;).
 # Actually there is an unlock for mkdir, otherwise the POC wouldn't work. 
{{FSNamesystemLock.writeUnlock()}} unlocks all locked children when 
{{unlockChildren == true}}.
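
A hypothetical sketch of deriving the key from {{parent}} references on demand rather than caching a {{long[]}} in every INode (illustrative names only, not the POC classes):
{code:java}
import java.util.Arrays;

// Hypothetical sketch: compute an inode's two-level namespace key by walking
// parent references when it is needed, instead of storing long[] in INode.
class NamespaceKeySketch {
  static final class Inode {
    final long id;
    final Inode parent;            // null for the root
    Inode(long id, Inode parent) { this.id = id; this.parent = parent; }
  }

  /** Returns {grandParentId, parentId, selfId}; the root repeats for short paths. */
  static long[] namespaceKey(Inode inode, int depth) {
    long[] key = new long[depth + 1];
    Inode cur = inode;
    for (int i = depth; i >= 0; i--) {
      key[i] = cur.id;
      if (cur.parent != null) {
        cur = cur.parent;          // step up one level per key position
      }
    }
    return key;
  }

  public static void main(String[] args) {
    Inode root = new Inode(1, null);
    Inode a = new Inode(2, root);
    Inode b = new Inode(3, a);
    Inode c = new Inode(4, b);
    // Prints [2, 3, 4]: the <ppId, pId, selfId> key of inode c in /a/b/c.
    System.out.println(Arrays.toString(namespaceKey(c, 2)));
  }
}
{code}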

[~hexiaoqiao] looking forward to working with you on this feature. Any and all 
help is very much welcomed.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-08-30 Thread Arpit Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919948#comment-16919948
 ] 

Arpit Agarwal commented on HDFS-14703:
--

Interesting proposal [~shv] . Thanks for sharing this and the PoC patch. I went 
through the doc and the idea seems interesting. I didn't understand how the 
partitioning scheme works. Do atomic rename and snapshots still work as before with 
these changes?

Did you measure write throughput improvement with 
{{dfs.namenode.edits.asynclogging}}?

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-08-17 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909681#comment-16909681
 ] 

He Xiaoqiao commented on HDFS-14703:


Thanks [~shv] for your POC patches. I have to state that this is a very clever 
design for fine-grained locking. There are still a couple of questions 
which I do not quite understand, and I look forward to your response.
1. Write concurrency control. Consider one case with two threads running mkdir 
(/a/b/c/d/e) and delete(/a/b/c) ops. I tried to run this case following the design 
and POC patches, but I usually get unstable results, since the key for e (from the mkdir) 
and the key for c (from the delete) could be located in different RangeGSets using 
{{INodeMap#latchWriteLock}}, so the two threads could run concurrently and 
produce unstable results even if issued from one client, one after the other. As your last 
reply explains, `deleting a directory should lock all RangeGSets involved`. Is this a 
special case for delete ops? Sorry for asking this question again.
{quote}
Deleting a directory /a/b/c means deleting the entire sub-tree underneath this 
directory. We should lock all RangeGSets involved in such deletion, 
particularly the one containing file f. So f cannot be modified concurrently 
with the delete.
{quote}
2. {{INode}} gains a {{long[] namespaceKey}} field in patch 0004 of the POC 
package. I believe this attribute is very useful for partitioning INodes; 
meanwhile, does it bring some other potential issues?
* Heap footprint overhead. For a long-running NameNode process, the 
namespaceKey of most INodes (visited at least once) in the directory tree may be 
non-null. If we consider 500M INodes and a {{level}} of 2, it needs 
more than 8GB of heap.
* When an INode is renamed, the {{namespaceKey}} has to be updated, right? Since 
its parent INode has changed. The POC seems not to update it anymore once {{namespaceKey}} 
is non-null.
Is it possible to calculate the namespaceKey for an INode when it is used, outside of the lock? 
Of course, that would bring CPU overhead. Please correct me if I am wrong. Thanks.
3. There is no LatchLock unlock in the POC for the #mkdir operation; it seems like a bit of an 
oversight. In my opinion, it has to release the childLock after use, right?
[~shv] Thanks for your POC patches again, and I look forward to the next 
milestone. I would also like to get involved to push this feature forward if needed.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-08-14 Thread Konstantin Shvachko (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907662#comment-16907662
 ] 

Konstantin Shvachko commented on HDFS-14703:


Attaching the POC patch. It consists of 4 commits. Apply using {{git am 
001-partitioned-inodeMap-POC/*}} command.
# 0001 patch is an investigation to verify that FSN lock is used together with 
dirLock. I just ran unit tests with this patch. Most of them pass the 
verification, but some don't.
# 0002 patch disables dirLock.
# 0003 introduces PartitionedGSet, LatchLock. It implements dynamic 
partitioning based on the inodeId key (see INodeIdComparator).
# 0004 introduces a two-level key and implements static partitioning based on 
that key.

With the 0003 and 0004 patches I ran NNThroughputBenchmark creating 2 million 
directories with 200 concurrent threads and 128 subdirectories per directory.
So the POC implements the new locking for one operation only - mkdir.
The benchmark command: {{NNThroughputBenchmark -fs file:/// -op mkdirs -threads 
200 -dirs 2000000 -dirsPerDir 128}}
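
For readers who have not opened the patches, a minimal sketch of the partitioned-map idea follows (hypothetical names only; the real PartitionedGSet, LatchLock and INodeIdComparator in the patches differ in detail):
{code:java}
import java.util.*;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of a map split into key ranges, each guarded by its own
// lock, so updates to different ranges do not contend. Illustrative only.
class PartitionedMapSketch<V> {
  static final class Range<T> {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final Map<Long, T> entries = new HashMap<>();
  }

  // Upper-level "RangeMap": start key of each range -> its partition.
  private final TreeMap<Long, Range<V>> ranges = new TreeMap<>();

  PartitionedMapSketch(long... rangeStartKeys) {
    for (long start : rangeStartKeys) {
      ranges.put(start, new Range<>());
    }
  }

  private Range<V> rangeOf(long key) {
    Map.Entry<Long, Range<V>> e = ranges.floorEntry(key);
    return e != null ? e.getValue() : ranges.firstEntry().getValue();
  }

  V put(long key, V value) {
    Range<V> r = rangeOf(key);
    r.lock.writeLock().lock();      // only this partition is blocked
    try {
      return r.entries.put(key, value);
    } finally {
      r.lock.writeLock().unlock();
    }
  }

  V get(long key) {
    Range<V> r = rangeOf(key);
    r.lock.readLock().lock();
    try {
      return r.entries.get(key);
    } finally {
      r.lock.readLock().unlock();
    }
  }
}
{code}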

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: 001-partitioned-inodeMap-POC.tar.gz, NameNode 
> Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-08-08 Thread Konstantin Shvachko (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903374#comment-16903374
 ] 

Konstantin Shvachko commented on HDFS-14703:


Hi [~hexiaoqiao], thanks for reviewing the doc. Very good questions:
# "Cousins" means files like {{/a/b/c/d}} and {{/a/b/m/n}}. They will have 
keys, respectively, {{}} and {{}}, which have 
common prefix {{}} and therefore are likely to fall into the same 
RangeGSet. In your example {{}} is the parent of {{}} and this key definition does not guarantee them to be in the same range.
# Deleting a directory {{/a/b/c}} means deleting the entire sub-tree underneath 
this directory. We should lock all RangeGSets involved in such deletion, 
particularly the one containing containing file {{f}}. So {{f}} cannot be 
modified concurrently with the delete.
# Just to clarify RangeMap is the upper level part of PartitionedGSet, which 
maps key ranges into RangeGSets. So there is only one RangeMap and many 
RangeGSets. Holding a lock on RangeMap is akin to holding a global lock. You 
make a good point that some operations like failover, large deletes, renames, 
quota changes will still require a global lock. The lock on RangeMap could play 
the role of such a global lock. This should be defined in more detail within the 
design of LatchLock. Ideally we should retain FSNamesystemLock as a global lock 
for some operations. This will also help us gradually switch operations from 
FSNamesystemLock to LatchLock.
# I don't know what the next bottleneck will be, but you are absolutely 
correct that there will be something. For the edits log, I indeed saw while running my 
benchmarks that the number of transactions batched together while journaling 
was increasing. This is expected and desirable behavior, since writing large 
batches to a disk is more efficient than lots of small writes.
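
A small illustration of the cousin argument (hypothetical ids and a lexicographic comparator; the actual ordering in the POC lives in INodeIdComparator and the two-level key of patch 0004): keys sharing the {{ppId}} prefix sort next to each other, so they usually land in the same range.
{code:java}
import java.util.Comparator;

// Hypothetical illustration: lexicographic comparison of <ppId, pId, selfId>
// keys keeps cousins (same grandparent) adjacent in the key space, so they
// usually fall into the same RangeGSet. The ids below are made up.
class CousinKeySketch {
  static final Comparator<long[]> KEY_ORDER = (x, y) -> {
    for (int i = 0; i < Math.min(x.length, y.length); i++) {
      int c = Long.compare(x[i], y[i]);
      if (c != 0) {
        return c;
      }
    }
    return Integer.compare(x.length, y.length);
  };

  public static void main(String[] args) {
    long idb = 11, idc = 12, idm = 13, idd = 21, idn = 22;
    long[] keyD = {idb, idc, idd};   // key of /a/b/c/d
    long[] keyN = {idb, idm, idn};   // key of /a/b/m/n, a cousin of d
    // Both keys share the prefix idb, so they sort into the same neighborhood
    // of the key space and are likely served by the same partition.
    System.out.println(KEY_ORDER.compare(keyD, keyN) < 0);   // true
  }
}
{code}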

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-08-07 Thread He Xiaoqiao (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901887#comment-16901887
 ] 

He Xiaoqiao commented on HDFS-14703:


Thanks [~shv] for filing this JIRA and planning to push this feature forward; it is 
really great work. I really appreciate you doing this.
 There are some details I am confused about after reading the design document.
 As the design document says, each inode maps (through its inode key) to one RangeMap, 
which has a separate lock so operations can be carried out concurrently.
{quote}The inode key is a fixed length sequence of parent inodeids ending with 
the file inode id itself:
    key(f) = <ppId, pId, selfId>
 Where selfId is the inodeId of file f, pId is the id of its parent, and ppId 
is the id of the parent of the parent. Such definition of a key guarantees that 
not only siblings but also cousins (objects having the same grandparent) are 
partitioned into the same range most of the time
{quote}
Consider the following path: /a/b/c/d/e, with corresponding inode ids [ida, idb, 
idc, idd, ide].
 1. How can we guarantee that 'cousins' map into the same range? At first 
glance, they could map to different RangeMaps, since for idc the inode key is 
<ida, idb, idc> and for idd the inode key is <idb, idc, idd>.
 2. Any consideration about operating on a node and its ancestor node 
concurrently? For instance, for /a/b/c/d/e/f, we could delete inode c and modify 
inode f at the same time if they map to different ranges, since we do not 
guarantee mapping them to the same one. Maybe that is a problem in this case.
 3. Which lock will be held for global requests like HA failover, 
safemode, etc.? Do we need to obtain all the RangeMap locks?
 4. Any bottleneck expected after improving the write throughput? I believe that EditLog 
OPS will keep increasing; will it become the new bottleneck?
Please correct me if I do not understand correctly. Thanks.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning

2019-08-05 Thread Konstantin Shvachko (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900526#comment-16900526
 ] 

Konstantin Shvachko commented on HDFS-14703:


Attached the design document for review.

> NameNode Fine-Grained Locking via Metadata Partitioning
> ---
>
> Key: HDFS-14703
> URL: https://issues.apache.org/jira/browse/HDFS-14703
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org