I plan to hold a meeting on 2024-06-06 from 3:00 PM to 4:00 PM to share the FGL's motivation and walk through some concerns in detail. The meeting will be held in Chinese.
The doc is: NameNode Fine-Grained Locking Based On Directory Tree (II) <https://docs.google.com/document/d/1QGLM67u6tWjj00gOWYqgxHqghb43g4dmH8QcUZtSrYE/edit?usp=sharing>
The meeting URL is: https://sea.zoom.us/j/94168001269
You are welcome to join this meeting.

On Mon, 6 May 2024 at 23:57, Hui Fei <feihui.u...@gmail.com> wrote:

> BTW, there is a Slack channel hdfs-fgl for this feature. You can join it and discuss more details there.
>
> Is it necessary to hold a meeting to discuss this, so that we can push it forward quickly? Agreed with ZanderXu, it seems inefficient to discuss details via the mailing list.
>
> Hui Fei <feihui.u...@gmail.com> wrote on Mon, 6 May 2024 at 23:50:
>
>> Thanks all.
>>
>> It seems all the concerns are related to stage 2. We can address them and make things clearer before we start it.
>>
>> From development experience, I think it is reasonable to split a big feature into several stages. Stage 1 is independent, and it can also stand on its own as a minor feature that uses the FS and BM locks instead of the global lock.
>>
>> ZanderXu <zande...@apache.org> wrote on Mon, 29 Apr 2024 at 15:17:
>>
>>> Thanks @Ayush Saxena <ayush...@gmail.com> and @Xiaoqiao He <hexiaoq...@apache.org> for your nice questions.
>>>
>>> Let me summarize your concerns and the corresponding solutions:
>>>
>>> *1. Questions about the Snapshot feature*
>>> It's difficult to apply the FGL to the Snapshot feature, but we can simply use the global FS write lock to make it thread-safe. So if we can identify whether a path involves the snapshot feature, we can just use the global FS write lock to protect it (see the sketch after point 2 below).
>>>
>>> You can refer to HDFS-17479 <https://issues.apache.org/jira/browse/HDFS-17479> to see how to identify it.
>>>
>>> Regarding the performance of operations related to the snapshot feature, we can discuss it in two categories:
>>> Read operations involving snapshots: the FGL branch uses the global write lock to protect them, while the GLOBAL branch uses the global read lock. It's hard to conclude which version performs better; it depends on the contention on the global lock.
>>>
>>> Write operations involving snapshots: both the FGL and GLOBAL branches use the global write lock to protect them. Again, it's hard to conclude which version performs better; it too depends on the contention on the global lock.
>>>
>>> So I think if the NameNode load is low, the GLOBAL branch will perform better than FGL; if the NameNode load is high, the FGL branch may perform better than GLOBAL, which also depends on the ratio of read and write operations involving the snapshot feature.
>>>
>>> We can do some things to let end users choose the better option according to their business: first, make the lock mode selectable, so that end users can choose either FGL or GLOBAL; second, use the global write lock to make operations related to snapshots thread-safe, as I described in HDFS-17479.
>>>
>>> *2. Questions about the Symlinks feature*
>>> If a symlink is related to a snapshot, we can reuse the snapshot solution; if it is not, I think it's easy to make symlinks fit the FGL. Only createSymlink involves two paths; FGL just needs to lock them in a fixed order to make the operation thread-safe. For other operations, a symlink is handled the same way as any other normal iNode, right?
>>>
>>> If I missed any difficult points, please let me know.
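>>>
>>> To make points 1 and 2 concrete, here is a minimal sketch of the fallback and lock-ordering ideas. It is only an illustration under my assumptions; the class and method names are hypothetical, not the actual code on the FGL branch:
>>>
>>>     import java.util.concurrent.ConcurrentHashMap;
>>>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>>>
>>>     /** Hypothetical sketch, not the FGL branch code. */
>>>     class SnapshotAwareLocking {
>>>         private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
>>>         private final ConcurrentHashMap<String, ReentrantReadWriteLock> pathLocks =
>>>             new ConcurrentHashMap<>();
>>>
>>>         // Assumption: snapshot involvement is visible from the resolved path.
>>>         boolean isSnapshotRelated(String path) {
>>>             return path.contains("/.snapshot/") || path.endsWith("/.snapshot");
>>>         }
>>>
>>>         void runWriteOp(String path, Runnable op) {
>>>             if (isSnapshotRelated(path)) {
>>>                 fsLock.writeLock().lock();  // fall back to the global FS write lock
>>>                 try { op.run(); } finally { fsLock.writeLock().unlock(); }
>>>             } else {
>>>                 ReentrantReadWriteLock l =
>>>                     pathLocks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
>>>                 l.writeLock().lock();       // fine-grained path lock
>>>                 try { op.run(); } finally { l.writeLock().unlock(); }
>>>             }
>>>         }
>>>
>>>         // createSymlink touches two paths; taking their locks in a fixed
>>>         // lexicographic order avoids deadlock between concurrent callers.
>>>         void runTwoPathWriteOp(String src, String target, Runnable op) {
>>>             String first = src.compareTo(target) <= 0 ? src : target;
>>>             String second = first.equals(src) ? target : src;
>>>             ReentrantReadWriteLock a =
>>>                 pathLocks.computeIfAbsent(first, p -> new ReentrantReadWriteLock());
>>>             ReentrantReadWriteLock b =
>>>                 pathLocks.computeIfAbsent(second, p -> new ReentrantReadWriteLock());
>>>             a.writeLock().lock();
>>>             try {
>>>                 b.writeLock().lock();
>>>                 try { op.run(); } finally { b.writeLock().unlock(); }
>>>             } finally { a.writeLock().unlock(); }
>>>         }
>>>     }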
>>>
>>> *3. Questions about memory usage of iNode locks*
>>> There are many possible solutions for limiting the memory usage of these iNode locks, such as using a limited-capacity lock pool to cap the maximum memory usage, holding iNode locks only down to a fixed depth of directories, etc.
>>>
>>> We can abstract a LockManager first and then provide implementations based on these different ideas, so that we can bound the maximum memory usage of the iNode locks. FGL can acquire or lease iNode locks through the LockManager; a sketch of one possible implementation follows point 4 below.
>>>
>>> *4. Questions about the performance of acquiring and releasing iNode locks*
>>> We can add some benchmarks for the LockManager to test the performance of acquiring and releasing uncontended locks.
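>>>
>>> As one hedged illustration of the limited-capacity idea from point 3 (the class name and the striping scheme are my own, not code on the branch), a fixed array of lock stripes caps the lock memory no matter how many iNodes are active, at the cost of occasional false contention:
>>>
>>>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>>>
>>>     /** Hypothetical striped lock pool: many iNodes share a fixed
>>>      *  number of locks, so memory is bounded by the pool size. */
>>>     class StripedLockManager {
>>>         private final ReentrantReadWriteLock[] stripes;
>>>
>>>         StripedLockManager(int capacity) {
>>>             stripes = new ReentrantReadWriteLock[capacity];
>>>             for (int i = 0; i < capacity; i++) {
>>>                 stripes[i] = new ReentrantReadWriteLock();
>>>             }
>>>         }
>>>
>>>         /** Two iNode ids may map to the same stripe (false contention),
>>>          *  but the pool never grows beyond `capacity` locks. */
>>>         ReentrantReadWriteLock lockFor(long inodeId) {
>>>             return stripes[Math.floorMod(Long.hashCode(inodeId), stripes.length)];
>>>         }
>>>     }
>>>
>>> A micro-benchmark for point 4 could then simply time lockFor(id).writeLock().lock()/unlock() pairs in a tight loop, first from a single thread (uncontended) and then from many threads (contended).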
>>>
>>> *5. Questions about StoragePolicy, ECPolicy, ACL, Quota, etc.*
>>> These policies may be set on an ancestor node and used by some descendant files. The set operations for these policies will be protected by the directory tree, since they are all file-related operations. Apart from Quota and StoragePolicy, the use of the other policies, such as ECPolicy and ACL, will also be protected by the directory tree.
>>>
>>> Quota is a little special, since its update operations may not be protected by the directory tree; we can assign a lock to each QuotaFeature and use these locks to make the update operations thread-safe. You can refer to HDFS-17473 <https://issues.apache.org/jira/browse/HDFS-17473> for detailed information.
>>>
>>> StoragePolicy is a little special, since it is used not only by file-related operations but also by block-related operations. ProcessExtraRedundancyBlock uses the storage policy to choose redundant replicas, and BlockReconstructionWork uses it to choose target DNs. To maximize the performance improvement, BR and IBR should only involve the iNodeFile to which the block currently being processed belongs. Redundant blocks can be processed by the Redundancy monitor while holding the directory tree locks. You can refer to HDFS-17505 <https://issues.apache.org/jira/browse/HDFS-17505> for more detailed information.
>>>
>>> *6. Performance of phase 1*
>>> HDFS-17506 <https://issues.apache.org/jira/browse/HDFS-17506> is used to do some performance testing for phase 1, and I will complete it later.
>>>
>>> Discussing solutions over email is not efficient; you can create sub-tasks under HDFS-17366 <https://issues.apache.org/jira/browse/HDFS-17366> to describe your concerns, and I will try to give some answers.
>>>
>>> Thanks @Ayush Saxena <ayush...@gmail.com> and @Xiaoqiao He <hexiaoq...@apache.org> again.
>>>
>>> On Mon, 29 Apr 2024 at 02:00, Ayush Saxena <ayush...@gmail.com> wrote:
>>>
>>> > Thanx everyone for chasing this. Great to see some momentum around FGL; that should be a great improvement.
>>> >
>>> > I have two broad categories of comments:
>>> > ** About the process:*
>>> > The mails above mention that phase one is complete in a feature branch & we are gonna merge that to trunk. If I am catching it right, then you can't hit the merge button like that. To merge a feature branch, you need to call for a vote specific to that branch, & it requires 3 binding votes to merge, unlike any other code change which requires 1. It is there in our Bylaws.
>>> >
>>> > So, do follow the process.
>>> >
>>> > ** About the feature itself:* (a very quick look at the doc and the Jira, so please take it with a grain of salt)
>>> > * The Google Drive link that you folks shared as part of the first mail: I don't have access to that. So, please open up the permissions for that doc or share a new link.
>>> > * Chasing the design doc present on the Jira.
>>> > * I think we only have Phase-1 ready, so can you share some metrics just for that? Perf improvements just from splitting the FS & BM locks.
>>> > * The memory implications of Phase-1? I don't think there should be any major impact on memory in the case of just Phase-1.
>>> > * Regarding the snapshot stuff, you mentioned taking the lock on the root itself? Does just taking the lock on the snapshot root rather than the FS root work?
>>> > * Secondly, about the usage of Snapshots or Symlinks: I don't think we should operate under the assumption that they aren't widely used; we might just not know the folks who use them widely, or they are just users, not the ones contributing. We can accept for now that in those cases it isn't optimised and we just lock the entire FS space, which happens even today, so no regression there.
>>> > * Regarding memory usage: do you have some numbers on how much the memory footprint increases?
>>> > * Under the Lock Pool: I think you are assuming there would be very few inodes where a lock would be required at any given time, so there won't be too much heap consumption? I think you are compromising on horizontal scalability here. If your assumption doesn't hold true, then under heavy read load from concurrent clients accessing different inodes, the Namenode will start having memory trouble, which would do more harm than good. Anyway, the Namenode heap is a way bigger problem than anything else, so we should be very careful about increasing the load over there.
>>> > * For the locks on the inodes: do you plan to have locks for each inode? Can we somehow limit that to the depth of the tree? Like currently we take the lock on the root; have a config which makes us take the lock at Level-2 or 3 (configurable). That might fetch some perf benefits and can be used to control the memory usage as well (a rough sketch of what I mean follows this list).
>>> > * What is the cost of creating these inode locks? If a lock isn't already cached, it would incur some cost? Do you have some numbers around that? Say I disable caching altogether & then let a test load run; what do the perf numbers look like in that case?
>>> > * I think we need to limit the size of the INodeLockPool; we can't let it grow infinitely under heavy loads, and we need to have some auto-throttling mechanism for it.
>>> > * I didn't catch your Storage Policy problem. If I decode it right, the problem is that the policy could be set on an ancestor node & the children abide by it, & this is the problem. If that is the case, then isn't that also the case with ErasureCoding policies or even ACLs and so on? Can you elaborate a bit on that?
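>>> >
>>> > To show what I mean by depth-limited locking (purely a hypothetical sketch, not an existing config or API): map every path to the lock of its ancestor at a configurable depth, so the number of locks is bounded by the fan-out of the top levels of the tree.
>>> >
>>> >     import java.util.Arrays;
>>> >
>>> >     /** Hypothetical sketch of depth-limited lock keys. */
>>> >     class DepthLimitedLockKeys {
>>> >         /** Paths deeper than maxDepth share the lock of their
>>> >          *  ancestor at that depth, bounding the number of locks. */
>>> >         static String lockKeyFor(String path, int maxDepth) {
>>> >             String[] parts = path.split("/");  // "/a/b/c" -> ["", "a", "b", "c"]
>>> >             if (parts.length == 0) {           // "/" splits to an empty array
>>> >                 return "/";
>>> >             }
>>> >             int keep = Math.min(parts.length, maxDepth + 1);
>>> >             return String.join("/", Arrays.copyOfRange(parts, 0, keep));
>>> >         }
>>> >
>>> >         public static void main(String[] args) {
>>> >             // With maxDepth = 2, everything under /a/b shares one lock key.
>>> >             System.out.println(lockKeyFor("/a/b/c/d", 2)); // prints /a/b
>>> >             System.out.println(lockKeyFor("/a", 2));       // prints /a
>>> >         }
>>> >     }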
>>> >
>>> > Anyway, regarding Phase-1: if you share the perf numbers with proper details, plus the impact on memory (if any), for just phase 1, and they are good, then if you call for a branch merge vote for Phase-1 FGL, you have my vote; however, you'll need to sway the rest of the folks on your own :-)
>>> >
>>> > Good Luck, Nice Work Guys!!!
>>> >
>>> > -Ayush
>>> >
>>> > On Sun, 28 Apr 2024 at 18:32, Xiaoqiao He <hexiaoq...@apache.org> wrote:
>>> >
>>> >> Thanks ZanderXu and Hui Fei for your work on this feature. It will be a very helpful improvement for the HDFS module in its next stage.
>>> >>
>>> >> 1. If you need any more review bandwidth, I would like to get involved and help review if possible.
>>> >> 2. The design document is still missing some detailed descriptions, such as snapshots, symbolic links, and reserved paths, as mentioned above. I think it will be helpful for newbies who want to get involved if all corner cases are considered and described.
>>> >> 3. From Slack, I see we plan to check in to trunk at this phase. I am not sure if it is the proper time; following the dev plan in the design document, there are two steps left to finish this feature, right? If so, I think we should postpone checking in until all plans are ready. Considering that there have been many unfinished attempts at this feature in history, I think postponing the check-in is the safer way. The alternative involves more rebase cost if you keep a separate dev branch; however, I don't think that is a difficult thing for you.
>>> >>
>>> >> Good luck, and I look forward to seeing that happen soon!
>>> >>
>>> >> Best Regards,
>>> >> - He Xiaoqiao
>>> >>
>>> >> On Fri, Apr 26, 2024 at 3:50 PM Hui Fei <feihui.u...@gmail.com> wrote:
>>> >> >
>>> >> > Thanks for the interest in and advice on this.
>>> >> >
>>> >> > I'd just like to share some info here.
>>> >> >
>>> >> > ZanderXu leads this feature and has spent a lot of time on it. He is the main developer in stage 1. Yuanboliu and Kokonguyen191 also took on some tasks. Other developers (slfan1989, haiyang1987, huangzhaobo99, RocMarshal, kokonguyen191) helped review PRs. (Forgive me if I missed someone.)
>>> >> >
>>> >> > Actually, haiyang1987, Yuanboliu and Kokonguyen191 are also very familiar with this feature. We discussed many details offline.
>>> >> >
>>> >> > Anyone interested is welcome to join the development and review of stages 2 and 3.
>>> >> >
>>> >> > Zengqiang XU <xuzengqiang5...@gmail.com> wrote on Fri, 26 Apr 2024 at 14:56:
>>> >> >>
>>> >> >> Thanks Shilun for your response.
>>> >> >>
>>> >> >> 1. This is a big and very useful feature, so it really needs more developers to get on board.
>>> >> >> 2. This fine-grained lock has been implemented on internal branches and has brought benefits to many companies, such as Meituan, Kuaishou, Bytedance, etc. But it has not been contributed to the community for various reasons: there is a big difference between the internal branches and the community trunk branch, an internal branch may drop some functionality to keep the FGL simple, and the contribution itself needs a lot of work and will take a long time.
>>> >> >> This means that the solution has already been proven in their prod environments. We have also run it in our prod environment and gained benefits, and we are willing to spend a lot of time contributing it to the community.
>>> >> >> 3. Regarding the benchmark testing, we don't need to focus on whether the performance improves by 5x, 10x, or 20x, because too many factors affect it.
>>> >> >> 4. As I described above, this solution is already in use at many companies. Right now, we just need to think about how to implement it with high quality and more comprehensively.
>>> >> >> 5. I firmly believe that all problems can be solved as long as the overall solution is right.
>>> >> >> 6. I can spend a lot of time leading the promotion of this entire feature, and I hope more people will join us in pushing it forward.
>>> >> >> 7. You are always welcome to raise your concerns.
>>> >> >>
>>> >> >> Thanks again, Shilun. I hope you can help review the designs and PRs.
>>> >> >>
>>> >> >> On Fri, 26 Apr 2024 at 08:00, slfan1989 <slfan1...@apache.org> wrote:
>>> >> >>
>>> >> >> > Thank you for your hard work! This is a very meaningful improvement, and from the design document we can see a significant increase in HDFS read/write throughput.
>>> >> >> >
>>> >> >> > I am happy to see the progress made on HDFS-17384.
>>> >> >> >
>>> >> >> > However, I still have some concerns, which roughly involve the following aspects:
>>> >> >> >
>>> >> >> > 1. While ZanderXu and Hui Fei have deep expertise in HDFS and are familiar with the related development details, we still need more community members to review the code to ensure that the relevant upgrades meet expectations.
>>> >> >> >
>>> >> >> > 2. We need more details on the benchmarks to ensure that the test results can be reproduced and to allow more community members to participate in the testing process.
>>> >> >> >
>>> >> >> > Looking forward to everything going smoothly in the future.
>>> >> >> >
>>> >> >> > Best Regards,
>>> >> >> > - Shilun Fan.
>>> >> >> >
>>> >> >> > On Wed, Apr 24, 2024 at 3:51 PM Xiaoqiao He <hexiaoq...@apache.org> wrote:
>>> >> >> >
>>> >> >> >> cc private@h.a.o.
>>> >> >> >>
>>> >> >> >> On Wed, Apr 24, 2024 at 3:35 PM ZanderXu <zande...@apache.org> wrote:
>>> >> >> >> >
>>> >> >> >> > Here is a summary of the first phase:
>>> >> >> >> > 1. There are no big changes in this phase.
>>> >> >> >> > 2. This phase just uses the FS lock and the BM lock to replace the original global lock.
>>> >> >> >> > 3. It helps improve performance, since some operations only need to hold the FS lock or the BM lock instead of the global lock.
>>> >> >> >> > 4. This feature is turned off by default; you can enable it by setting dfs.namenode.lock.model.provider.class to org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock (see the snippet after this list).
>>> >> >> >> > 5. This phase is very important for the ongoing development of the entire FGL.
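>>> >> >> >> >
>>> >> >> >> > For item 4, enabling it would look like the following in hdfs-site.xml. The property name and class come from item 4 above; the snippet itself is just an illustrative sketch:
>>> >> >> >> >
>>> >> >> >> >     <property>
>>> >> >> >> >       <name>dfs.namenode.lock.model.provider.class</name>
>>> >> >> >> >       <value>org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock</value>
>>> >> >> >> >     </property>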
>>> >> >> >> >
>>> >> >> >> > Here I would like to express my special thanks to @kokonguyen191 and @yuanboliu for their contributions. You are also welcome to join us and complete it together.
>>> >> >> >> >
>>> >> >> >> > On Wed, 24 Apr 2024 at 14:54, ZanderXu <zande...@apache.org> wrote:
>>> >> >> >> >
>>> >> >> >> > > Hi everyone,
>>> >> >> >> > >
>>> >> >> >> > > All subtasks of the first phase of the FGL have been completed, and I plan to merge them into trunk and start the second phase based on trunk.
>>> >> >> >> > >
>>> >> >> >> > > Here is the PR used to merge the first phase into trunk: https://github.com/apache/hadoop/pull/6762
>>> >> >> >> > > Here is the ticket: https://issues.apache.org/jira/browse/HDFS-17384
>>> >> >> >> > >
>>> >> >> >> > > I hope you can help review this PR when you are available and share some ideas.
>>> >> >> >> > >
>>> >> >> >> > > HDFS-17385 <https://issues.apache.org/jira/browse/HDFS-17385> is used for the second phase, and I have created some subtasks to describe solutions for specific problems, such as snapshots, getListing, and quota.
>>> >> >> >> > > You are welcome to join us and complete it together.
>>> >> >> >> > >
>>> >> >> >> > > ---------- Forwarded message ---------
>>> >> >> >> > > From: Zengqiang XU <zande...@apache.org>
>>> >> >> >> > > Date: Fri, 2 Feb 2024 at 11:07
>>> >> >> >> > > Subject: Discussion about NameNode Fine-grained locking
>>> >> >> >> > > To: <hdfs-dev@hadoop.apache.org>
>>> >> >> >> > > Cc: Zengqiang XU <xuzengqiang5...@gmail.com>
>>> >> >> >> > >
>>> >> >> >> > > Hi everyone,
>>> >> >> >> > >
>>> >> >> >> > > I have started a discussion about NameNode Fine-grained Locking to improve the performance of write operations in the NameNode.
>>> >> >> >> > >
>>> >> >> >> > > I started this discussion again for several main reasons:
>>> >> >> >> > > 1. We have implemented it and gained a nearly 7x performance improvement in our prod environment.
>>> >> >> >> > > 2. Many other companies have made similar improvements based on their internal branches.
>>> >> >> >> > > 3. This topic has been discussed for a long time, but still without any result.
>>> >> >> >> > >
>>> >> >> >> > > I hope we can push this important improvement forward in the community so that all end users can enjoy it.
>>> >> >> >> > >
>>> >> >> >> > > I'd really appreciate it if you could join in and work with me to push this feature forward.
>>> >> >> >> > >
>>> >> >> >> > > Thanks very much.
>>> >> >> >> > >
>>> >> >> >> > > Ticket: HDFS-17366 <https://issues.apache.org/jira/browse/HDFS-17366>
>>> >> >> >> > > Design: NameNode Fine-grained locking based on directory tree <https://docs.google.com/document/d/1X499gHxT0WSU1fj8uo4RuF3GqKxWkWXznXx4tspTBLY/edit?usp=sharing>