Hi,

----- Original Message -----
| > This patch adds logic to function gfs2_rgrp_congested to check
| > inter-node and intra-node congestion. It checks for intra-node
| > (local node process) congestion first, since that's faster.
| >
| > If there are other processes actively using this rgrp (i.e. if
| > they have reservations or are holding its glock) it's considered
| > congested in the intra-node sense.
|
| My comments in relation to the previous patch still stand. This is
| still the wrong answer to the problem. If we have lock contention
| here, then we should be looking to fix the locking and not trying to
| avoid the issue by choosing a different rgrp.
|
| If we are deallocating blocks, we have no choice over which rgrp we
| lock, since we must lock the rgrp that contains the blocks we want to
| deallocate. The same problem exists there too, so let's fix this
| properly as per my previous comments,
|
| Steve.
Some of the relevant comments from the previous patch were:

| This is the wrong solution I think. We should not be choosing other
| rgrps for allocation due to local contention. That is just likely to
| land up creating more fragmentation by spreading things out across the
| disk. The issue is that we can't currently separate the locking into
| dlm (exclusive) and local (shared) but important data structures
| spinlock protected, so that we land up with lock contention on the
| rgrp and everything serialized via the glock. At the moment the glock
| state machine only supports everything (dlm & local) shared or
| everything exclusive.
|
| If we want to resolve this issue, then we need to deal with that issue
| first and allow multiple local users of the same rgrp to co-exist with
| minimal lock contention. That requires the glock state machine to
| support that particular lock mode first though,
|
| Steve.

I'd like to hear more about how you envision this working (see below).

If you study the problem with glocktop, you will see that in the
intra-node congestion case (for example, ten processes on one node all
running iozone, dd, or some other heavy user of the block allocator),
most of the processes are stuck waiting for various rgrp glocks. That's
the problem. My proposed solution reduces or eliminates that contention
by letting each process stick to a unique rgrp for block assignments.
This is pretty much what RHEL5 and RHEL6 did when they used "try locks"
for rgrps, and that's exactly why those older releases performed
significantly faster on these tests than RHEL7 and upstream.

It's true that the block deallocator suffers from the same issue, and my
proposed solution alleviates that too by spreading the blocks across
multiple rgrps. Blocks may be allocated and deallocated simultaneously
because the glocks are unique.
This is also not too different from how we gained performance boosts
with the Orlov algorithm, which assigns unique rgrps to groups of files
and likewise spreads the writes apart. This will not increase file
fragmentation, because that is all mitigated by the multi-block
allocation algorithms we have in place. The file system itself may incur
more fragmentation, since files may now span a greater distance with
writes to unique rgrps, but that is no different from RHEL5, RHEL6, and
older releases, which acquired their rgrp glocks with "try locks" and
yielded basically the same results.

I'm not at all concerned about "head bouncing" performance issues
because those are pretty much absorbed by almost all modern storage
arrays. That's what the performance guys, and my own testing, tell me.
Head-bounce problems are also mitigated by many modern hard drives. And
let's face it: very few people are going to use GFS2 with hard drives
that have head-bounce performance problems; GFS2 is mostly used on
storage arrays.

That's not to say there isn't a better solution. We've talked in the
past about possibly allowing multiple processes to share the same rgrp
for block allocations while the rgrp is exclusively locked (held) on
that node. We can, for example, have unique reservations, and as long as
the reservations don't conflict, we can safely assign blocks from them
simultaneously. However, we will still get into sticky situations and
need some kind of process-exclusive lock on the rgrp itself. For
example, only one process at a time may update fields like rg_free,
rg_inodes, and so forth. Maybe we can mitigate that with atomic
variables and cpu boundaries.

For the deallocation path, we can, for example, pre-construct a
"multi-block reservation" structure for an existing file, then
deallocate from that with the knowledge that other processes will avoid
blocks touched by the reservations. But at some point, we will need a
process-exclusive lock (spin_lock, rwsem, mutex, etc.)
and that may prove to have a similar contention problem. This may
require a fair amount of design to achieve both correctness and decent
performance.

I've also thought about assigning unique glocks to each bitmap of the
rgrp rather than to the entire rgrp. We could assign each writer a
unique bitmap, but that's probably not good enough, for two reasons:
(1) writers to non-zero-index bitmaps still need to write to the rgrp
block itself to adjust its used, free, and dinode counts, and (2) this
would only scale to the number of bitmaps, which is typically 5 or so,
so the ten-writer tests would still contend in a lot of cases. Still, it
would help "large (e.g. 2GB) rgrp" performance, where you typically have
13 or so bitmaps.

Since the glocks act as a kind of caching layer for dlm locks, the idea
of possibly having more than one "EX" state for a glock, both mapping to
the same DLM state, is intriguing. After all, rgrp glocks (like all
glocks) have unique "glops" that we might be able to leverage for that
purpose. I'll have to give that some thought, although the "multi-block
reservations" concept still seems like a better fit.

Perhaps we should do this in two phases:

(1) Implement my current design as a short-term solution. It has been
proven to increase performance by 20-50 percent for some use cases, and
it gets the block allocator performance back to where it was in RHEL6.
(And by the same token, it helps the deallocator for the same reasons.)
This is a stop-gap measure while we work on a longer-term solution
that's likely to involve some design work and therefore take more time
to perfect.

(2) Design and implement a longer-term solution that allows multiple
processes to share an rgrp while its glock is held in EXclusive mode.

As always, I'm open to more ideas on how to improve this.

Regards,

Bob Peterson
