On Feb 10, 2014, at 11:26 AM, Prakash Surya <[email protected]> wrote:

> --
> Cheers, Prakash
>
> On Sun, Feb 09, 2014 at 10:28:44PM -0800, Richard Elling wrote:
>> Hi Prakash,
>>
>> On Feb 7, 2014, at 10:41 AM, Prakash Surya <[email protected]> wrote:
>>
>>> Hey guys,
>>>
>>> I've been working on some ARC performance work targeted at the ZFS on
>>> Linux implementation, but I think some of the patches I'm proposing
>>> _might_ be useful in the other implementations as well.
>>>
>>> As far as I know, the ARC code is largely the same between
>>> implementations.
>>
>> NB, there are several different implementations that use different
>> metadata management approaches.
>>
>>> Although, on Linux we try to maintain a hard limit on metadata using
>>> "arc_meta_limit" and "arc_meta_used". Thus, not all of the patches
>>> are relevant outside of ZoL, but my hunch is many definitely are.
>>
>> Can you explain the reasoning here? Historically, we've tried to avoid
>> absolute limits because they must be managed, and increasing
>> management complexity is a bad idea.
>
> Honestly, I don't particularly like the distinction made between "data"
> and "metadata" in the ARC. I haven't seen any reason why it's needed,
> but it was introduced long ago, and I assume there was a reason for it
> (see illumos commit: 0e8c61582669940ab28fea7e6dd2935372681236).
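[For context, the hard cap being discussed works roughly as follows. The sketch below is a simplified illustration with invented names (`meta_cap_t`, `meta_over_limit`), not the actual `arc.c` code in any tree.]

```c
/*
 * Simplified illustration of a hard metadata cap in the spirit of
 * ZoL's arc_meta_limit/arc_meta_used.  The type and function names
 * here are invented for the example; real arc.c differs.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
	uint64_t meta_used;   /* bytes of metadata currently cached */
	uint64_t meta_limit;  /* hard cap on cached metadata bytes */
} meta_cap_t;

/*
 * Returns true when caching `size` more bytes of metadata would
 * exceed the cap, i.e. when metadata should be evicted first.
 */
static bool
meta_over_limit(const meta_cap_t *mc, uint64_t size)
{
	return (mc->meta_used + size > mc->meta_limit);
}
```

The point of contention above is not this check itself, but whether a fixed cap like this is worth the extra tuning burden it creates.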
There is a large amount of experience in the field where some workloads
are data-intensive and others are attribute-intensive. At LLNL you
probably mostly see data-intensive workloads, but attribute-intensive
workloads are quite common in the commercial application space. There
is no doubt that some workloads perform much better when properly
matched to the caching strategy.

>
> So, with that said, I'm not the one adding the "arc_meta_limit". It's
> already been in the code for a while, although the illumos tree and
> the ZoL tree differ in their behavior when that limit is reached. The
> reason _why_ we differ isn't cut and dried, and might simply be due to
> a misunderstanding.
>
> I tend to think that maintaining an absolute limit on the metadata is
> a good thing, but solely because we have this arbitrary notion baked
> into the eviction process that "metadata" is more important than
> "data". I think that given a specific workload (I haven't tested
> this), the ARC could fill up with "not as important" metadata because
> the data is always pitched first (whether or not the data is getting
> more ghost hits).
>
>>
>>> To highlight, I think these might be of particular interest:
>>>
>>> * 22be556 Disable aggressive arc_p growth by default
>>
>> MRU (p) growth is the result of demand, yes?
>
> What do you mean by "result of demand"?

ZFS doesn't decide to grow the MRU on its own; that is driven by the
application's demand.

>
> According to the paper the implementation was based on, "p" should
> increase as the MRU list receives ghost hits. The actual
> implementation diverges from the paper in a number of places.

If you see lots of ghost use, then something is imbalanced. For many
workloads you should see few or no ghost hits.

>
> In this case, "p" is incremented as we add new anonymous data in
> arc_get_data_buf(). It looks like this is an optimization that should
> only be done when the ARC is still "warming up", but that's not how it
> works in practice.
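[For readers without the paper at hand: the adaptation rule in the original ARC paper (Megiddo & Modha) moves the MRU target "p" in response to ghost-list hits only. The sketch below is a rough C rendering of that rule with invented names (`arc_sketch_t`, `ghost_hit_mru`, `ghost_hit_mfu`); it is not the illumos or ZoL code, whose divergence from this rule is exactly what is being discussed.]

```c
/*
 * Rough sketch of the adaptation rule from the ARC paper, NOT the
 * actual arc.c code.  "p" is the target size of the MRU side of a
 * cache of total target size "c"; p stays within [0, c].
 */
#include <assert.h>
#include <stdint.h>

typedef struct {
	uint64_t c;          /* total cache target size */
	uint64_t p;          /* MRU target size, 0 <= p <= c */
	uint64_t mru_ghost;  /* size of the MRU ghost list (B1) */
	uint64_t mfu_ghost;  /* size of the MFU ghost list (B2) */
} arc_sketch_t;

/* A hit on the MRU ghost list argues for a larger MRU: grow p. */
static void
ghost_hit_mru(arc_sketch_t *arc)
{
	uint64_t delta = 1;
	if (arc->mru_ghost > 0 && arc->mfu_ghost > arc->mru_ghost)
		delta = arc->mfu_ghost / arc->mru_ghost;
	arc->p = (arc->p + delta > arc->c) ? arc->c : arc->p + delta;
}

/* A hit on the MFU ghost list argues for a larger MFU: shrink p. */
static void
ghost_hit_mfu(arc_sketch_t *arc)
{
	uint64_t delta = 1;
	if (arc->mfu_ghost > 0 && arc->mru_ghost > arc->mfu_ghost)
		delta = arc->mru_ghost / arc->mfu_ghost;
	arc->p = (arc->p > delta) ? arc->p - delta : 0;
}
```

Note that nothing in this rule grows "p" on the insertion of new anonymous data; that is the extra bump in arc_get_data_buf() being debated above.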
>
> What I've seen is that the new anonymous data pushes "p" up to the
> upper limit (due to a constant stream of new dirty data), and this
> throttles the MFU. So, even though the MFU will get an order of
> magnitude more ghost list hits, "p" won't properly adjust because the
> dirty data is pushing "p" up to its maximum. That's why I added that
> patch.

Again, this is a result of demand.

>
>>
>>> * 5694b53 Allow "arc_p" to drop to zero or grow to "arc_c"
>>
>> Zero p means zero demand?
>> Also, can you explain the reasoning for not wanting anything in the
>> MFU cache? I suppose if you totally disable the MFU cache, then
>> you'll get the behaviour of most other file system caches, but that
>> isn't a good thing.
>
> Again, what do you mean by "demand"?
>
> I definitely **do not** want to disable the MFU. I'm not sure where
> you got the idea that I'm trying to keep data out of the MFU, because
> that's not what I'm doing at all.
>
> This patch simply removes the arbitrary limits on the minimum and
> maximum size of "p". If the workload is driving "p" up or down, I
> don't understand why we need to try to override the adaptive logic by
> placing a min and max value on it.
>
>>
>>> * 517a0bc Disable arc_p adapt dampener by default
>>> * 2d1f779 Remove "arc_meta_used" from arc_adjust calculation
>>> * 32a96d6 Prioritize "metadata" in arc_get_data_buf
>>> * b3b7236 Split "data_size" into "meta" and "data"
>>>
>>> Keep in mind, my expertise with the ARC is still limited, so if
>>> anybody finds any of these patches "wrong" (for a particular
>>> workload, maybe) please let me know. The full patch stack I'm
>>> proposing on Linux is here:
>>>
>>> * https://github.com/zfsonlinux/zfs/pull/2110
>>>
>>> I posted some graphs of useful arcstat parameters vs. time for each
>>> of the 14 unique tests run.
>>> Those are in this comment:
>>>
>>> * https://github.com/zfsonlinux/zfs/pull/2110#issuecomment-34393733
>>>
>>> And here's a snippet from the pull request description with a
>>> summary of the benefits this patch stack has shown in my testing (go
>>> check out the pull request for more info on the tests run and
>>> results gathered):
>>>
>>> Improve ARC hit rate with metadata heavy workloads
>>>
>>> This stack of patches has been empirically shown to drastically
>>> improve the hit rate of the ARC for certain workloads. As a result,
>>> fewer reads to disk are required, which is generally a good thing
>>> and can drastically improve performance if the workload is disk
>>> limited.
>>>
>>> For the impatient, I'll summarize the results of the tests
>>> performed:
>>>
>>> * Test 1 - Creating many empty directories. This test saw 99.9%
>>>            fewer reads and 12.8% more inodes created when running
>>>            *with* these changes.
>>>
>>> * Test 2 - Creating many empty files. This test saw 4% fewer reads
>>>            and 0% more inodes created when running *with* these
>>>            changes.
>>>
>>> * Test 3 - Creating many 4 KiB files. This test saw 96.7% fewer
>>>            reads and 4.9% more inodes created when running *with*
>>>            these changes.
>>>
>>> * Test 4 - Creating many 4096 KiB files. This test saw 99.4% fewer
>>>            reads and 0% more inodes created (but took 6.9% fewer
>>>            seconds to complete) when running *with* these changes.
>>>
>>> * Test 5 - Rsync'ing a dataset with many empty directories. This
>>>            test saw 36.2% fewer reads and 66.2% more inodes created
>>>            when running *with* these changes.
>>>
>>> * Test 6 - Rsync'ing a dataset with many empty files. This test saw
>>>            30.9% fewer reads and 0% more inodes created (but took
>>>            24.3% fewer seconds to complete) when running *with*
>>>            these changes.
>>>
>>> * Test 7 - Rsync'ing a dataset with many 4 KiB files. This test saw
>>>            30.8% fewer reads and 173.3% more inodes created when
>>>            running *with* these changes.
>>
>> AIUI, the tests will work better with a large, MFU metadata cache.
>> Yet the proposed changes can also result in small, MRU-only metadata
>> caches -- which would be disastrous to most (all?) applications.
>> I'd love to learn more about where you want to go with this.
>
> Hmm, I don't quite understand your comment. I'm not trying to disable
> the MFU at all. If anything, I'm trying to make sure it works on ZoL
> (which I don't think it does at the moment). Depending on the
> workload, the MFU is *very* useful, and it's especially useful in the
> tests I looked at.

Perhaps the way you are describing your changes is causing confusion.
There are some things I do like, such as separating the accounting for
data vs. metadata. But I'm not convinced the balancing changes are
generally applicable, and your tests are very specific to one type of
workload. Let's see how it works given wider exposure before pushing
upstream.
 -- richard

>
>> -- richard
>>
>>>
>>> So, in the interest of collaboration (and potentially getting much
>>> needed input from people with more ARC expertise than I have), I
>>> wanted to give this work a broader audience.
>>>
>>> --
>>> Cheers, Prakash
>>>
>>> _______________________________________________
>>> developer mailing list
>>> [email protected]
>>> http://lists.open-zfs.org/mailman/listinfo/developer

--
[email protected]
+1-760-896-4422
