I plan to hold a meeting on 2024-06-06 from 3:00 PM to 4:00 PM to share the FGL's motivation and walk through some concerns in detail. The meeting will be held in Chinese.
The doc is: NameNode Fine-Grained Locking Based On Directory Tree (II) <https://docs.google.com/document/d/1QGLM67u6tWjj00gOWYqgxHqghb43g4dmH8QcUZtSrYE/edit?usp=sharing>
The meeting URL is: https://sea.zoom.us/j/94168001269
You are welcome to join this meeting.

On Mon, 6 May 2024 at 23:57, Hui Fei <feihui.u...@gmail.com> wrote:

> BTW, there is a Slack channel hdfs-fgl for this feature. You can join it and discuss more details there.
>
> Is it necessary to hold a meeting to discuss this, so that we can push it forward quickly? Agreed with ZanderXu, it seems inefficient to discuss details via the mailing list.
>
> Hui Fei <feihui.u...@gmail.com> wrote on Mon, 6 May 2024 at 23:50:
>
>> Thanks all.
>>
>> It seems all the concerns are related to stage 2. We can address them and make things clearer before we start it.
>>
>> From development experience, I think it is reasonable to split a big feature into several stages. Stage 1 is independent, and it can also stand on its own as a minor feature that uses the FS and BM locks instead of the global lock.
>>
>> ZanderXu <zande...@apache.org> wrote on Mon, 29 Apr 2024 at 15:17:
>>
>>> Thanks @Ayush Saxena <ayush...@gmail.com> and @Xiaoqiao He <hexiaoq...@apache.org> for your nice questions.
>>>
>>> Let me summarize your concerns and the corresponding solutions:
>>>
>>> *1. Questions about the Snapshot feature*
>>> It's difficult to apply the FGL to the Snapshot feature, but we can simply use the global FS write lock to make it thread-safe. So if we can identify whether a path involves the snapshot feature, we can just use the global FS write lock to protect it (see the sketch after point 2 below).
>>>
>>> You can refer to HDFS-17479 <https://issues.apache.org/jira/browse/HDFS-17479> to see how to identify it.
>>>
>>> Regarding the performance of operations related to the snapshot feature, we can discuss it in two categories:
>>> Read operations involving snapshots: the FGL branch uses the global write lock to protect them, while the GLOBAL branch uses the global read lock. It's hard to conclude which version performs better; it depends on the contention on the global lock.
>>>
>>> Write operations involving snapshots: both the FGL and GLOBAL branches use the global write lock to protect them. Again, it's hard to conclude which version performs better; it too depends on the contention on the global lock.
>>>
>>> So I think if the NameNode load is low, the GLOBAL branch will perform better than FGL; if the NameNode load is high, the FGL branch may perform better than GLOBAL, which also depends on the ratio of read and write operations involving the snapshot feature.
>>>
>>> We can do some things to let end users choose the better option according to their business: first, make the lock mode selectable, so that end users can choose either FGL or GLOBAL; second, use the global write lock to make operations related to snapshots thread-safe, as I described in HDFS-17479.
>>>
>>> *2. Questions about the Symlinks feature*
>>> If a symlink is related to a snapshot, we can reuse the snapshot solution; if it is not, I think it's easy to make symlinks fit the FGL. Only createSymlink involves two paths; FGL just needs to lock them in a fixed order to make the operation thread-safe. For other operations, a symlink is handled the same way as any other normal iNode, right?
>>>
>>> If I missed any difficult points, please let me know.
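>>>
>>> To make points 1 and 2 concrete, here is a minimal sketch of the fallback and lock-ordering ideas. It is only an illustration under my assumptions; the class and method names are hypothetical, not the actual code on the FGL branch:
>>>
>>>     import java.util.concurrent.ConcurrentHashMap;
>>>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>>>
>>>     /** Hypothetical sketch, not the FGL branch code. */
>>>     class SnapshotAwareLocking {
>>>         private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
>>>         private final ConcurrentHashMap<String, ReentrantReadWriteLock> pathLocks =
>>>             new ConcurrentHashMap<>();
>>>
>>>         // Assumption: snapshot involvement is visible from the resolved path.
>>>         boolean isSnapshotRelated(String path) {
>>>             return path.contains("/.snapshot/") || path.endsWith("/.snapshot");
>>>         }
>>>
>>>         void runWriteOp(String path, Runnable op) {
>>>             if (isSnapshotRelated(path)) {
>>>                 fsLock.writeLock().lock();  // fall back to the global FS write lock
>>>                 try { op.run(); } finally { fsLock.writeLock().unlock(); }
>>>             } else {
>>>                 ReentrantReadWriteLock l =
>>>                     pathLocks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
>>>                 l.writeLock().lock();       // fine-grained path lock
>>>                 try { op.run(); } finally { l.writeLock().unlock(); }
>>>             }
>>>         }
>>>
>>>         // createSymlink touches two paths; taking their locks in a fixed
>>>         // lexicographic order avoids deadlock between concurrent callers.
>>>         void runTwoPathWriteOp(String src, String target, Runnable op) {
>>>             String first = src.compareTo(target) <= 0 ? src : target;
>>>             String second = first.equals(src) ? target : src;
>>>             ReentrantReadWriteLock a =
>>>                 pathLocks.computeIfAbsent(first, p -> new ReentrantReadWriteLock());
>>>             ReentrantReadWriteLock b =
>>>                 pathLocks.computeIfAbsent(second, p -> new ReentrantReadWriteLock());
>>>             a.writeLock().lock();
>>>             try {
>>>                 b.writeLock().lock();
>>>                 try { op.run(); } finally { b.writeLock().unlock(); }
>>>             } finally { a.writeLock().unlock(); }
>>>         }
>>>     }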
>>>
>>> *3. Questions about memory usage of iNode locks*
>>> There are many possible solutions for limiting the memory usage of these iNode locks, such as using a limited-capacity lock pool to cap the maximum memory usage, holding iNode locks only down to a fixed depth of directories, etc.
>>>
>>> We can abstract a LockManager first and then provide implementations based on these different ideas, so that we can bound the maximum memory usage of the iNode locks. FGL can acquire or lease iNode locks through the LockManager; a sketch of one possible implementation follows point 4 below.
>>>
>>> *4. Questions about the performance of acquiring and releasing iNode locks*
>>> We can add some benchmarks for the LockManager to test the performance of acquiring and releasing uncontended locks.
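>>>
>>> As one hedged illustration of the limited-capacity idea from point 3 (the class name and the striping scheme are my own, not code on the branch), a fixed array of lock stripes caps the lock memory no matter how many iNodes are active, at the cost of occasional false contention:
>>>
>>>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>>>
>>>     /** Hypothetical striped lock pool: many iNodes share a fixed
>>>      *  number of locks, so memory is bounded by the pool size. */
>>>     class StripedLockManager {
>>>         private final ReentrantReadWriteLock[] stripes;
>>>
>>>         StripedLockManager(int capacity) {
>>>             stripes = new ReentrantReadWriteLock[capacity];
>>>             for (int i = 0; i < capacity; i++) {
>>>                 stripes[i] = new ReentrantReadWriteLock();
>>>             }
>>>         }
>>>
>>>         /** Two iNode ids may map to the same stripe (false contention),
>>>          *  but the pool never grows beyond `capacity` locks. */
>>>         ReentrantReadWriteLock lockFor(long inodeId) {
>>>             return stripes[Math.floorMod(Long.hashCode(inodeId), stripes.length)];
>>>         }
>>>     }
>>>
>>> A micro-benchmark for point 4 could then simply time lockFor(id).writeLock().lock()/unlock() pairs in a tight loop, first from a single thread (uncontended) and then from many threads (contended).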
>>>
>>> *5. Questions about StoragePolicy, ECPolicy, ACL, Quota, etc.*
>>> These policies may be set on an ancestor node and used by some descendant files. The set operations for these policies will be protected by the directory tree, since they are all file-related operations. Apart from Quota and StoragePolicy, the use of the other policies, such as ECPolicy and ACL, will also be protected by the directory tree.
>>>
>>> Quota is a little special, since its update operations may not be protected by the directory tree; we can assign a lock to each QuotaFeature and use these locks to make the update operations thread-safe. You can refer to HDFS-17473 <https://issues.apache.org/jira/browse/HDFS-17473> for detailed information.
>>>
>>> StoragePolicy is a little special, since it is used not only by file-related operations but also by block-related operations. ProcessExtraRedundancyBlock uses the storage policy to choose redundant replicas, and BlockReconstructionWork uses it to choose target DNs. To maximize the performance improvement, BR and IBR should only involve the iNodeFile to which the block currently being processed belongs. Redundant blocks can be processed by the Redundancy monitor while holding the directory tree locks. You can refer to HDFS-17505 <https://issues.apache.org/jira/browse/HDFS-17505> for more detailed information.
>>>
>>> *6. Performance of phase 1*
>>> HDFS-17506 <https://issues.apache.org/jira/browse/HDFS-17506> is used to do some performance testing for phase 1, and I will complete it later.
>>>
>>> Discussing solutions over email is not efficient; you can create sub-tasks under HDFS-17366 <https://issues.apache.org/jira/browse/HDFS-17366> to describe your concerns, and I will try to give some answers.
>>>
>>> Thanks @Ayush Saxena <ayush...@gmail.com> and @Xiaoqiao He <hexiaoq...@apache.org> again.
>>>
>>> On Mon, 29 Apr 2024 at 02:00, Ayush Saxena <ayush...@gmail.com> wrote:
>>>
>>> > Thanx everyone for chasing this. Great to see some momentum around FGL; that should be a great improvement.
>>> >
>>> > I have two broad categories of comments:
>>> > ** About the process:*
>>> > The mails above mention that phase one is complete in a feature branch & we are gonna merge that to trunk. If I am catching it right, then you can't hit the merge button like that. To merge a feature branch, you need to call for a vote specific to that branch, & it requires 3 binding votes to merge, unlike any other code change which requires 1. It is there in our Bylaws.
>>> >
>>> > So, do follow the process.
>>> >
>>> > ** About the feature itself:* (a very quick look at the doc and the Jira, so please take it with a grain of salt)
>>> > * The Google Drive link that you folks shared as part of the first mail: I don't have access to that. So, please open up the permissions for that doc or share a new link.
>>> > * Chasing the design doc present on the Jira.
>>> > * I think we only have Phase-1 ready, so can you share some metrics just for that? Perf improvements just from splitting the FS & BM locks.
>>> > * The memory implications of Phase-1? I don't think there should be any major impact on memory in the case of just Phase-1.
>>> > * Regarding the snapshot stuff, you mentioned taking the lock on the root itself? Does just taking the lock on the snapshot root rather than the FS root work?
>>> > * Secondly, about the usage of Snapshots or Symlinks: I don't think we should operate under the assumption that they aren't widely used; we might just not know the folks who use them widely, or they are just users, not the ones contributing. We can accept for now that in those cases it isn't optimised and we just lock the entire FS space, which happens even today, so no regression there.
>>> > * Regarding memory usage: do you have some numbers on how much the memory footprint increases?
>>> > * Under the Lock Pool: I think you are assuming there would be very few inodes where a lock would be required at any given time, so there won't be too much heap consumption? I think you are compromising on horizontal scalability here. If your assumption doesn't hold true, then under heavy read load from concurrent clients accessing different inodes, the Namenode will start having memory trouble, which would do more harm than good. Anyway, the Namenode heap is a way bigger problem than anything else, so we should be very careful about increasing the load over there.
>>> > * For the locks on the inodes: do you plan to have locks for each inode? Can we somehow limit that to the depth of the tree? Like currently we take the lock on the root; have a config which makes us take the lock at Level-2 or 3 (configurable). That might fetch some perf benefits and can be used to control the memory usage as well (a rough sketch of what I mean follows this list).
>>> > * What is the cost of creating these inode locks? If a lock isn't already cached, it would incur some cost? Do you have some numbers around that? Say I disable caching altogether & then let a test load run; what do the perf numbers look like in that case?
>>> > * I think we need to limit the size of the INodeLockPool; we can't let it grow infinitely under heavy loads, and we need to have some auto-throttling mechanism for it.
>>> > * I didn't catch your Storage Policy problem. If I decode it right, the problem is that the policy could be set on an ancestor node & the children abide by it, & this is the problem. If that is the case, then isn't that also the case with ErasureCoding policies or even ACLs and so on? Can you elaborate a bit on that?
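>>> >
>>> > To show what I mean by depth-limited locking (purely a hypothetical sketch, not an existing config or API): map every path to the lock of its ancestor at a configurable depth, so the number of locks is bounded by the fan-out of the top levels of the tree.
>>> >
>>> >     import java.util.Arrays;
>>> >
>>> >     /** Hypothetical sketch of depth-limited lock keys. */
>>> >     class DepthLimitedLockKeys {
>>> >         /** Paths deeper than maxDepth share the lock of their
>>> >          *  ancestor at that depth, bounding the number of locks. */
>>> >         static String lockKeyFor(String path, int maxDepth) {
>>> >             String[] parts = path.split("/");  // "/a/b/c" -> ["", "a", "b", "c"]
>>> >             if (parts.length == 0) {           // "/" splits to an empty array
>>> >                 return "/";
>>> >             }
>>> >             int keep = Math.min(parts.length, maxDepth + 1);
>>> >             return String.join("/", Arrays.copyOfRange(parts, 0, keep));
>>> >         }
>>> >
>>> >         public static void main(String[] args) {
>>> >             // With maxDepth = 2, everything under /a/b shares one lock key.
>>> >             System.out.println(lockKeyFor("/a/b/c/d", 2)); // prints /a/b
>>> >             System.out.println(lockKeyFor("/a", 2));       // prints /a
>>> >         }
>>> >     }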
>>> >
>>> > Anyway, regarding Phase-1: if you share the perf numbers with proper details, plus the impact on memory (if any), for just phase 1, and they are good, then if you call for a branch merge vote for Phase-1 FGL, you have my vote; however, you'll need to sway the rest of the folks on your own :-)
>>> >
>>> > Good Luck, Nice Work Guys!!!
>>> >
>>> > -Ayush
>>> >
>>> > On Sun, 28 Apr 2024 at 18:32, Xiaoqiao He <hexiaoq...@apache.org> wrote:
>>> >
>>> >> Thanks ZanderXu and Hui Fei for your work on this feature. It will be a very helpful improvement for the HDFS module in its next stage.
>>> >>
>>> >> 1. If you need any more review bandwidth, I would like to get involved and help review if possible.
>>> >> 2. The design document is still missing some detailed descriptions, such as snapshots, symbolic links, and reserved paths, as mentioned above. I think it will be helpful for newbies who want to get involved if all corner cases are considered and described.
>>> >> 3. From Slack, I see we plan to check in to trunk at this phase. I am not sure if it is the proper time; following the dev plan in the design document, there are two steps left to finish this feature, right? If so, I think we should postpone checking in until all plans are ready. Considering that there have been many unfinished attempts at this feature in history, I think postponing the check-in is the safer way. The alternative involves more rebase cost if you keep a separate dev branch; however, I don't think that is a difficult thing for you.
>>> >>
>>> >> Good luck, and I look forward to seeing that happen soon!
>>> >>
>>> >> Best Regards,
>>> >> - He Xiaoqiao
>>> >>
>>> >> On Fri, Apr 26, 2024 at 3:50 PM Hui Fei <feihui.u...@gmail.com> wrote:
>>> >> >
>>> >> > Thanks for the interest in and advice on this.
>>> >> >
>>> >> > I'd just like to share some info here.
>>> >> >
>>> >> > ZanderXu leads this feature and has spent a lot of time on it. He is the main developer in stage 1. Yuanboliu and Kokonguyen191 also took on some tasks. Other developers (slfan1989, haiyang1987, huangzhaobo99, RocMarshal, kokonguyen191) helped review PRs. (Forgive me if I missed someone.)
>>> >> >
>>> >> > Actually, haiyang1987, Yuanboliu and Kokonguyen191 are also very familiar with this feature. We discussed many details offline.
>>> >> >
>>> >> > Anyone interested is welcome to join the development and review of stages 2 and 3.
>>> >> >
>>> >> > Zengqiang XU <xuzengqiang5...@gmail.com> wrote on Fri, 26 Apr 2024 at 14:56:
>>> >> >>
>>> >> >> Thanks Shilun for your response.
>>> >> >>
>>> >> >> 1. This is a big and very useful feature, so it really needs more developers to get on board.
>>> >> >> 2. This fine-grained lock has been implemented on internal branches and has brought benefits to many companies, such as Meituan, Kuaishou, Bytedance, etc. But it has not been contributed to the community for various reasons: there is a big difference between the internal branches and the community trunk branch, an internal branch may drop some functionality to keep the FGL simple, and the contribution itself needs a lot of work and will take a long time.
>>> >> >> This means that the solution has already been proven in their prod environments. We have also run it in our prod environment and gained benefits, and we are willing to spend a lot of time contributing it to the community.
>>> >> >> 3. Regarding the benchmark testing, we don't need to focus on whether the performance improves by 5x, 10x, or 20x, because too many factors affect it.
>>> >> >> 4. As I described above, this solution is already in use at many companies. Right now, we just need to think about how to implement it with high quality and more comprehensively.
>>> >> >> 5. I firmly believe that all problems can be solved as long as the overall solution is right.
>>> >> >> 6. I can spend a lot of time leading the promotion of this entire feature, and I hope more people will join us in pushing it forward.
>>> >> >> 7. You are always welcome to raise your concerns.
>>> >> >>
>>> >> >> Thanks again, Shilun. I hope you can help review the designs and PRs.
>>> >> >>
>>> >> >> On Fri, 26 Apr 2024 at 08:00, slfan1989 <slfan1...@apache.org> wrote:
>>> >> >>
>>> >> >> > Thank you for your hard work! This is a very meaningful improvement, and from the design document we can see a significant increase in HDFS read/write throughput.
>>> >> >> >
>>> >> >> > I am happy to see the progress made on HDFS-17384.
>>> >> >> >
>>> >> >> > However, I still have some concerns, which roughly involve the following aspects:
>>> >> >> >
>>> >> >> > 1. While ZanderXu and Hui Fei have deep expertise in HDFS and are familiar with the related development details, we still need more community members to review the code to ensure that the relevant upgrades meet expectations.
>>> >> >> >
>>> >> >> > 2. We need more details on the benchmarks to ensure that the test results can be reproduced and to allow more community members to participate in the testing process.
>>> >> >> >
>>> >> >> > Looking forward to everything going smoothly in the future.
>>> >> >> >
>>> >> >> > Best Regards,
>>> >> >> > - Shilun Fan.
>>> >> >> >
>>> >> >> > On Wed, Apr 24, 2024 at 3:51 PM Xiaoqiao He <hexiaoq...@apache.org> wrote:
>>> >> >> >
>>> >> >> >> cc private@h.a.o.
>>> >> >> >>
>>> >> >> >> On Wed, Apr 24, 2024 at 3:35 PM ZanderXu <zande...@apache.org> wrote:
>>> >> >> >> >
>>> >> >> >> > Here is a summary of the first phase:
>>> >> >> >> > 1. There are no big changes in this phase.
>>> >> >> >> > 2. This phase just uses the FS lock and the BM lock to replace the original global lock.
>>> >> >> >> > 3. It helps improve performance, since some operations only need to hold the FS lock or the BM lock instead of the global lock.
>>> >> >> >> > 4. This feature is turned off by default; you can enable it by setting dfs.namenode.lock.model.provider.class to org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock (see the snippet after this list).
>>> >> >> >> > 5. This phase is very important for the ongoing development of the entire FGL.
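>>> >> >> >> >
>>> >> >> >> > For item 4, enabling it would look like the following in hdfs-site.xml. The property name and class come from item 4 above; the snippet itself is just an illustrative sketch:
>>> >> >> >> >
>>> >> >> >> >     <property>
>>> >> >> >> >       <name>dfs.namenode.lock.model.provider.class</name>
>>> >> >> >> >       <value>org.apache.hadoop.hdfs.server.namenode.fgl.FineGrainedFSNamesystemLock</value>
>>> >> >> >> >     </property>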
>>> >> >> >> >
>>> >> >> >> > Here I would like to express my special thanks to @kokonguyen191 and @yuanboliu for their contributions. You are also welcome to join us and complete it together.
>>> >> >> >> >
>>> >> >> >> > On Wed, 24 Apr 2024 at 14:54, ZanderXu <zande...@apache.org> wrote:
>>> >> >> >> >
>>> >> >> >> > > Hi everyone,
>>> >> >> >> > >
>>> >> >> >> > > All subtasks of the first phase of the FGL have been completed, and I plan to merge them into trunk and start the second phase based on trunk.
>>> >> >> >> > >
>>> >> >> >> > > Here is the PR used to merge the first phase into trunk: https://github.com/apache/hadoop/pull/6762
>>> >> >> >> > > Here is the ticket: https://issues.apache.org/jira/browse/HDFS-17384
>>> >> >> >> > >
>>> >> >> >> > > I hope you can help review this PR when you are available and share some ideas.
>>> >> >> >> > >
>>> >> >> >> > > HDFS-17385 <https://issues.apache.org/jira/browse/HDFS-17385> is used for the second phase, and I have created some subtasks to describe solutions for specific problems, such as snapshots, getListing, and quota.
>>> >> >> >> > > You are welcome to join us and complete it together.
>>> >> >> >> > >
>>> >> >> >> > > ---------- Forwarded message ---------
>>> >> >> >> > > From: Zengqiang XU <zande...@apache.org>
>>> >> >> >> > > Date: Fri, 2 Feb 2024 at 11:07
>>> >> >> >> > > Subject: Discussion about NameNode Fine-grained locking
>>> >> >> >> > > To: <hdfs-dev@hadoop.apache.org>
>>> >> >> >> > > Cc: Zengqiang XU <xuzengqiang5...@gmail.com>
>>> >> >> >> > >
>>> >> >> >> > > Hi everyone,
>>> >> >> >> > >
>>> >> >> >> > > I have started a discussion about NameNode Fine-grained Locking to improve the performance of write operations in the NameNode.
>>> >> >> >> > >
>>> >> >> >> > > I started this discussion again for several main reasons:
>>> >> >> >> > > 1. We have implemented it and gained a nearly 7x performance improvement in our prod environment.
>>> >> >> >> > > 2. Many other companies have made similar improvements based on their internal branches.
>>> >> >> >> > > 3. This topic has been discussed for a long time, but still without any result.
>>> >> >> >> > >
>>> >> >> >> > > I hope we can push this important improvement forward in the community so that all end users can enjoy it.
>>> >> >> >> > >
>>> >> >> >> > > I'd really appreciate it if you could join in and work with me to push this feature forward.
>>> >> >> >> > >
>>> >> >> >> > > Thanks very much.
>>> >> >> >> > >
>>> >> >> >> > > Ticket: HDFS-17366 <https://issues.apache.org/jira/browse/HDFS-17366>
>>> >> >> >> > > Design: NameNode Fine-grained locking based on directory tree <https://docs.google.com/document/d/1X499gHxT0WSU1fj8uo4RuF3GqKxWkWXznXx4tspTBLY/edit?usp=sharing>