Re: [zfs-discuss] Interesting question about L2ARC
On Sep 26, 2012, at 4:28 AM, Sašo Kiselkov wrote:

> On 09/26/2012 01:14 PM, Edward Ned Harvey
> (opensolarisisdeadlongliveopensolaris) wrote:
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Jim Klimov
>>>
>>> Got me wondering: how many reads of a block from spinning rust
>>> suffice for it to ultimately get into L2ARC? Just one so it
>>> gets into a recent-read list of the ARC and then expires into
>>> L2ARC when ARC RAM is more needed for something else,
>>
>> Correct, but not always sufficient. I forget the name of the parameter,
>> but there's some rate limiting thing that limits how fast you can fill
>> the L2ARC. This means sometimes, things will expire from ARC, and simply
>> get discarded.
>
> The parameters are:
>
> *) l2arc_write_max (default 8MB): max number of bytes written per
>    fill cycle

It should be noted that this level was perhaps appropriate 6 years ago, when L2ARC was integrated, given the SSDs available at the time, but it is well below reasonable settings for high-speed systems or modern SSDs. It is probably not a bad idea to change the default to reflect more modern systems, thus avoiding surprises.
 -- richard

> *) l2arc_headroom (default 2x): multiplies the above parameter and
>    determines how far into the ARC lists we will search for buffers
>    eligible for writing to L2ARC.
> *) l2arc_feed_secs (default 1s): regular interval between fill cycles
> *) l2arc_feed_min_ms (default 200ms): minimum interval between fill
>    cycles
>
> Cheers,
> --
> Saso

--
illumos Day & ZFS Day, Oct 1-2, 2012, San Francisco -- www.zfsday.com
richard.ell...@richardelling.com +1-760-896-4422

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
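[For anyone who wants to raise the default today rather than wait for a new one: a hedged sketch of bumping the tunable on illumos-derived systems. The variable name is from the post above; the 64MB value is an arbitrary example, not a recommendation -- size it to your cache device's sustained write speed and test before relying on it.]

```
* /etc/system fragment: raise the per-cycle L2ARC fill budget to 64MB
* (0x4000000 = 67108864 bytes; example value only)
set zfs:l2arc_write_max = 0x4000000
```

If your platform ships the kernel debugger, the same variable can be changed on a live system (with the usual caveats about poking kernel memory) via something like `echo l2arc_write_max/Z 0x4000000 | mdb -kw`.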
Re: [zfs-discuss] Interesting question about L2ARC
On 09/26/2012 01:14 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Jim Klimov
>>
>> Got me wondering: how many reads of a block from spinning rust
>> suffice for it to ultimately get into L2ARC? Just one so it
>> gets into a recent-read list of the ARC and then expires into
>> L2ARC when ARC RAM is more needed for something else,
>
> Correct, but not always sufficient. I forget the name of the parameter,
> but there's some rate limiting thing that limits how fast you can fill
> the L2ARC. This means sometimes, things will expire from ARC, and simply
> get discarded.

The parameters are:

*) l2arc_write_max (default 8MB): max number of bytes written per
   fill cycle
*) l2arc_headroom (default 2x): multiplies the above parameter and
   determines how far into the ARC lists we will search for buffers
   eligible for writing to L2ARC.
*) l2arc_feed_secs (default 1s): regular interval between fill cycles
*) l2arc_feed_min_ms (default 200ms): minimum interval between fill
   cycles

Cheers,
--
Saso
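[A back-of-the-envelope sketch of what those defaults imply for cache warm-up, assuming -- optimistically -- that every fill cycle writes its full l2arc_write_max budget; real feeds are usually smaller, so actual warm-up is slower still.]

```python
# Lower bound on L2ARC warm-up time under the default feed tunables:
# at most l2arc_write_max bytes are written per l2arc_feed_secs interval.

L2ARC_WRITE_MAX = 8 * 1024 * 1024   # bytes per fill cycle (default 8MB)
L2ARC_FEED_SECS = 1                 # seconds between fill cycles (default)

def min_warmup_seconds(device_bytes: int) -> float:
    """Best-case seconds needed to fill `device_bytes` of L2ARC."""
    cycles = device_bytes / L2ARC_WRITE_MAX
    return cycles * L2ARC_FEED_SECS

ssd = 128 * 1024**3                 # a 128GB cache device
print(f"{min_warmup_seconds(ssd) / 3600:.1f} hours")  # ~4.6 hours
```

This is why a week-old L2ARC can still be cold, and why raising the default matters on modern SSDs.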
Re: [zfs-discuss] Interesting question about L2ARC
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> Got me wondering: how many reads of a block from spinning rust
> suffice for it to ultimately get into L2ARC? Just one so it
> gets into a recent-read list of the ARC and then expires into
> L2ARC when ARC RAM is more needed for something else,

Correct, but not always sufficient. I forget the name of the parameter, but there is a rate-limiting mechanism that caps how fast the L2ARC can fill. This means that sometimes things will expire from the ARC and simply be discarded.
Re: [zfs-discuss] Interesting question about L2ARC
On 09/25/2012 09:38 PM, Jim Klimov wrote:
> 2012-09-11 16:29, Edward Ned Harvey
> (opensolarisisdeadlongliveopensolaris) wrote:
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Dan Swartzendruber
>>>
>>> My first thought was everything is hitting in ARC, but that is
>>> clearly not the case, since it WAS gradually filling up the cache
>>> device.
>>
>> When things become colder in the ARC, they expire to the L2ARC (or
>> simply expire, bypassing the L2ARC). So it's normal to start filling
>> the L2ARC, even if you never hit anything in the L2ARC.
>
> Got me wondering: how many reads of a block from spinning rust
> suffice for it to ultimately get into L2ARC? Just one so it
> gets into a recent-read list of the ARC and then expires into
> L2ARC when ARC RAM is more needed for something else, and only
> when that L2ARC fills up does the block expire from these caches
> completely?
>
> Thanks, and sorry for a lame question ;)

Correct. See
https://github.com/illumos/illumos-gate/blob/14d44f2248cc2a54490db7f7caa4da5968f90837/usr/src/uts/common/fs/zfs/arc.c#L3685
for an exact description of the ARC<->L2ARC interaction mechanism.

Cheers,
--
Saso
Re: [zfs-discuss] Interesting question about L2ARC
On 9/25/2012 3:38 PM, Jim Klimov wrote:
> 2012-09-11 16:29, Edward Ned Harvey
> (opensolarisisdeadlongliveopensolaris) wrote:
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Dan Swartzendruber
>>>
>>> My first thought was everything is hitting in ARC, but that is
>>> clearly not the case, since it WAS gradually filling up the cache
>>> device.
>>
>> When things become colder in the ARC, they expire to the L2ARC (or
>> simply expire, bypassing the L2ARC). So it's normal to start filling
>> the L2ARC, even if you never hit anything in the L2ARC.
>
> Got me wondering: how many reads of a block from spinning rust
> suffice for it to ultimately get into L2ARC? Just one so it
> gets into a recent-read list of the ARC and then expires into
> L2ARC when ARC RAM is more needed for something else, and only
> when that L2ARC fills up does the block expire from these caches
> completely?

Good question. I don't remember if I posted my final status, but I put in two 128GB SSDs and they're getting hit just fine. The working set seems to be right around 110GB.
Re: [zfs-discuss] Interesting question about L2ARC
2012-09-11 16:29, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Dan Swartzendruber
>>
>> My first thought was everything is hitting in ARC, but that is
>> clearly not the case, since it WAS gradually filling up the cache
>> device.
>
> When things become colder in the ARC, they expire to the L2ARC (or
> simply expire, bypassing the L2ARC). So it's normal to start filling
> the L2ARC, even if you never hit anything in the L2ARC.

Got me wondering: how many reads of a block from spinning rust suffice for it to ultimately get into L2ARC? Just one, so it gets into a recent-read list of the ARC and then expires into L2ARC when ARC RAM is more needed for something else, and only when that L2ARC fills up does the block expire from these caches completely?

Thanks, and sorry for a lame question ;)
//Jim
Re: [zfs-discuss] Interesting question about L2ARC
Interesting. They are in and running, and I'm hitting them pretty well. One thing I did change: I've seen recommendations (one in particular from Nexenta) of an 8KB recordsize for virtualization workloads (which is 100% of my workload). I Storage-vMotioned all the VMs off the datastore and back (to rewrite them at the new, smaller recordsize). I wonder if that has an effect?

-----Original Message-----
From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
Sent: Tuesday, September 11, 2012 10:12 AM
To: Dan Swartzendruber
Cc: 'James H'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

On 09/11/2012 04:06 PM, Dan Swartzendruber wrote:
> Thanks a lot for clarifying how this works.

You're very welcome.

> Since I'm quite happy having an SSD in my workstation, I will need to
> purchase another SSD :) I'm wondering if it makes more sense to buy two
> SSDs of half the size (e.g. 128GB), since the total price is about the
> same?

If you have the space/ports and it costs the same, two SSDs will definitely give you better IOPS and throughput than a single SSD twice the size.

Cheers,
--
Saso
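[For reference, the recordsize change itself is a one-line property set; the catch is that it only applies to blocks written after the change, which is exactly why the Storage vMotion round-trip was needed to rewrite the existing data. The pool/dataset name below is a hypothetical placeholder.]

```
# recordsize applies only to newly written blocks -- existing data keeps
# its old block size until rewritten (hence the svMotion off and back).
zfs set recordsize=8k tank/vmstore   # "tank/vmstore" is a placeholder
zfs get recordsize tank/vmstore     # confirm the property took effect
```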
Re: [zfs-discuss] Interesting question about L2ARC
-----Original Message-----
From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
Sent: Tuesday, September 11, 2012 10:12 AM
To: Dan Swartzendruber
Cc: 'James H'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

On 09/11/2012 04:06 PM, Dan Swartzendruber wrote:
> Thanks a lot for clarifying how this works.

You're very welcome.

> Since I'm quite happy having an SSD in my workstation, I will need to
> purchase another SSD :) I'm wondering if it makes more sense to buy two
> SSDs of half the size (e.g. 128GB), since the total price is about the
> same?

If you have the space/ports and it costs the same, two SSDs will definitely give you better IOPS and throughput than a single SSD twice the size.

*** I have plenty of ports: 8 ports on an LSI HBA, one of which goes to the JBOD expander/chassis, so connecting two SSDs is no issue. Thanks again...
Re: [zfs-discuss] Interesting question about L2ARC
On 09/11/2012 04:06 PM, Dan Swartzendruber wrote:
> Thanks a lot for clarifying how this works.

You're very welcome.

> Since I'm quite happy having an SSD in my workstation, I will need to
> purchase another SSD :) I'm wondering if it makes more sense to buy two
> SSDs of half the size (e.g. 128GB), since the total price is about the
> same?

If you have the space/ports and it costs the same, two SSDs will definitely give you better IOPS and throughput than a single SSD twice the size.

Cheers,
--
Saso
Re: [zfs-discuss] Interesting question about L2ARC
-----Original Message-----
From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
Sent: Tuesday, September 11, 2012 9:52 AM
To: Dan Swartzendruber
Cc: 'James H'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

On 09/11/2012 03:41 PM, Dan Swartzendruber wrote:
> LOL, I actually was unclear, not you. I understood what you were
> saying, sorry for being unclear. I have 4 disks in raid10, so my max
> random read throughput is theoretically somewhat faster than the L2ARC
> device, but I never really do that intensive of reads.

But here's the kicker: prefetch is never random, it's always linear, so you need to measure prefetch throughput against the near-linear throughput of your disks. Your average 7k2 disk is capable of ~100MB/s in linear reads, so in a pair-of-mirrors scenario (raid10) you effectively get in excess of 400MB/s in prefetch throughput.

** True, and badly worded on my part. In theory, the 4 nearline SAS drives could deliver 600MB/sec, but my path to the guests is maybe 3GB (this is all running virtualized, so I can exceed gig/e speed). The bottom line is that no amount of read effort by guests (via the hypervisor) is going to come anywhere near the pool's capabilities.

> My point was: if a guest does read a bunch of data sequentially, that
> will trigger the prefetch L2ARC code path, correct?

No. When a client does a linear read, the initial buffer is random, so it makes sense to serve that from the L2ARC - thus it is cached there. Subsequently, ZFS detects that the client is likely to want more buffers, so it starts prefetching the following blocks in the background. Then, when the client returns, it receives those blocks from the ARC. The result is that the client wasn't latency-constrained to process them, so there's no need to cache the subsequently prefetched blocks in the L2ARC.

*** Sorry, that's what I meant by "the prefetch L2ARC code path", i.e. the heuristics you referred to. It seems to me that if the client never wants the prefetched block, it was a waste to cache it; if he does, at worst he'll miss once, and then it will be cached, since it will have been a demand read?

> If so, I *do* want that cache in L2ARC, so that a return visit from
> that guest will hit as much as possible in the cache.

It will be in the normal ARC cache; however, the L2ARC is meant primarily to accelerate the initial block hit (as noted above), not the subsequently prefetched ones (which we have time to refetch from the main pool). This covers most generic filesystem use cases as well as random-read-heavy workloads (such as databases, which rarely, if ever, do linear reads).

> One other thing (I don't think I mentioned this): my entire ESXi
> dataset is only like 160GB (thin provisioning in action), so it seems
> to me, I should be able to fit the entire thing in L2ARC?

Please try to post the output of this after you let it run on your dataset for a few minutes:

$ arcstat.pl -f \
  arcsz,read,dread,pread,hit%,miss%,l2size,l2read,l2hit%,l2miss% 60

It should give us a good idea of the kind of workload we're dealing with and why your L2 hits are so low.

*** Thanks a lot for clarifying how this works. Since I'm quite happy having an SSD in my workstation, I will need to purchase another SSD :) I'm wondering if it makes more sense to buy two SSDs of half the size (e.g. 128GB), since the total price is about the same?
Re: [zfs-discuss] Interesting question about L2ARC
On 09/11/2012 03:41 PM, Dan Swartzendruber wrote:
> LOL, I actually was unclear, not you. I understood what you were
> saying, sorry for being unclear. I have 4 disks in raid10, so my max
> random read throughput is theoretically somewhat faster than the L2ARC
> device, but I never really do that intensive of reads.

But here's the kicker: prefetch is never random, it's always linear, so you need to measure prefetch throughput against the near-linear throughput of your disks. Your average 7k2 disk is capable of ~100MB/s in linear reads, so in a pair-of-mirrors scenario (raid10) you effectively get in excess of 400MB/s in prefetch throughput.

> My point was: if a guest does read a bunch of data sequentially, that
> will trigger the prefetch L2ARC code path, correct?

No. When a client does a linear read, the initial buffer is random, so it makes sense to serve that from the L2ARC - thus it is cached there. Subsequently, ZFS detects that the client is likely to want more buffers, so it starts prefetching the following blocks in the background. Then, when the client returns, it receives those blocks from the ARC. The result is that the client wasn't latency-constrained to process them, so there's no need to cache the subsequently prefetched blocks in the L2ARC.

> If so, I *do* want that cache in L2ARC, so that a return visit from
> that guest will hit as much as possible in the cache.

It will be in the normal ARC cache; however, the L2ARC is meant primarily to accelerate the initial block hit (as noted above), not the subsequently prefetched ones (which we have time to refetch from the main pool). This covers most generic filesystem use cases as well as random-read-heavy workloads (such as databases, which rarely, if ever, do linear reads).

> One other thing (I don't think I mentioned this): my entire ESXi
> dataset is only like 160GB (thin provisioning in action), so it seems
> to me, I should be able to fit the entire thing in L2ARC?

Please try to post the output of this after you let it run on your dataset for a few minutes:

$ arcstat.pl -f \
  arcsz,read,dread,pread,hit%,miss%,l2size,l2read,l2hit%,l2miss% 60

It should give us a good idea of the kind of workload we're dealing with and why your L2 hits are so low.

Cheers,
--
Saso
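[The arithmetic behind that 400MB/s figure, as a sketch. The ~100MB/s per-disk number is the rough estimate from the post, not a measurement; the point is that linear reads can stream from both sides of each mirror at once.]

```python
# Optimistic upper bound on linear-read (prefetch) throughput for a
# striped-mirrors ("raid10") pool: with sequential reads, all spindles
# can contribute, so the bound is simply disks x per-disk streaming rate.
per_disk_mb_s = 100          # rough figure for a 7200rpm disk
mirror_pairs = 2             # 4 disks = 2 mirror vdevs
disks_per_pair = 2

pool_linear_mb_s = per_disk_mb_s * mirror_pairs * disks_per_pair
print(pool_linear_mb_s)      # 400
```

A typical SATA SSD of the era sustains well under that, which is why caching prefetched (linear) data in L2ARC can actually slow things down.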
Re: [zfs-discuss] Interesting question about L2ARC
LOL, I actually was unclear, not you. I understood what you were saying, sorry for being unclear. I have 4 disks in raid10, so my max random read throughput is theoretically somewhat faster than the L2ARC device, but I never really do that intensive of reads. My point was: if a guest does read a bunch of data sequentially, that will trigger the prefetch L2ARC code path, correct? If so, I *do* want that cache in L2ARC, so that a return visit from that guest will hit as much as possible in the cache. One other thing (I don't think I mentioned this): my entire ESXi dataset is only like 160GB (thin provisioning in action), so it seems to me, I should be able to fit the entire thing in L2ARC?

-----Original Message-----
From: Sašo Kiselkov [mailto:skiselkov...@gmail.com]
Sent: Tuesday, September 11, 2012 9:35 AM
To: Dan Swartzendruber
Cc: 'James H'; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

On 09/11/2012 03:32 PM, Dan Swartzendruber wrote:
> I think you may have a point. I'm also inclined to enable prefetch
> caching per Saso's comment, since I don't have massive throughput -
> latency is more important to me.

I meant to say the exact opposite: enable prefetch caching only if your L2ARC is faster (in terms of bulk throughput) than your disks. Prefetch isn't latency-bound by its very definition, so there is generally little reason to cache it in the L2ARC.

Cheers,
--
Saso
Re: [zfs-discuss] Interesting question about L2ARC
On 09/11/2012 03:32 PM, Dan Swartzendruber wrote:
> I think you may have a point. I'm also inclined to enable prefetch
> caching per Saso's comment, since I don't have massive throughput -
> latency is more important to me.

I meant to say the exact opposite: enable prefetch caching only if your L2ARC is faster (in terms of bulk throughput) than your disks. Prefetch isn't latency-bound by its very definition, so there is generally little reason to cache it in the L2ARC.

Cheers,
--
Saso
Re: [zfs-discuss] Interesting question about L2ARC
I think you may have a point. I'm also inclined to enable prefetch caching per Saso's comment, since I don't have massive throughput - latency is more important to me.

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of James H
Sent: Tuesday, September 11, 2012 5:09 AM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Interesting question about L2ARC

Dan,

If you're not already familiar with it, I find the following command useful. It shows the realtime total read commands, the number hitting/missing the ARC, the number hitting/missing the L2ARC, a breakdown of MRU/MFU, etc.

arcstat_v2.pl -f read,hits,mru,mfu,miss,hit%,l2read,l2hits,l2miss,l2hit%,arcsz,l2size,mrug,mfug 1

That version of arcstat is from http://github.com/mharsch/arcstat

For comparison, I've got about 650GB of VMs on each of my two Nexenta VSAs (16GB/240GB L2ARC). When it's just ticking over at 50-1000 r/s, then 99% of that is going to the ARC, but I'm also seeing patches where it goes to 2-5k reads and I'm seeing 20-80% L2ARC hits. These have been running for about a week and, given my understanding of how the L2ARC fills, I'd suggest maybe leaving it to warm up longer (e.g. 1-2 weeks?).

caveat: I'm a complete newbie to zfs so I could be completely wrong ;)

Cheers,
James
Re: [zfs-discuss] Interesting question about L2ARC
Hmmm, but the "real hit ratio" was 68%?

-----Original Message-----
From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) [mailto:opensolarisisdeadlongliveopensola...@nedharvey.com]
Sent: Tuesday, September 11, 2012 8:30 AM
To: Dan Swartzendruber; zfs-discuss@opensolaris.org
Subject: RE: [zfs-discuss] Interesting question about L2ARC

> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Dan Swartzendruber
>
> My first thought was everything is hitting in ARC, but that is clearly
> not the case, since it WAS gradually filling up the cache device.

When things become colder in the ARC, they expire to the L2ARC (or simply expire, bypassing the L2ARC). So it's normal to start filling the L2ARC, even if you never hit anything in the L2ARC.

> ARC Efficency:
>     Cache Access Total:     12324974
>     Cache Hit Ratio:   87%  10826363  [Defined State

That is a REALLY high hit ratio for ARC. It sounds to me like you probably have enough RAM in there that nearly everything is being served from the ARC.
Re: [zfs-discuss] Interesting question about L2ARC
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Dan Swartzendruber
>
> My first thought was everything is hitting in ARC, but that is clearly
> not the case, since it WAS gradually filling up the cache device.

When things become colder in the ARC, they expire to the L2ARC (or simply expire, bypassing the L2ARC). So it's normal to start filling the L2ARC, even if you never hit anything in the L2ARC.

> ARC Efficency:
>     Cache Access Total:     12324974
>     Cache Hit Ratio:   87%  10826363  [Defined State

That is a REALLY high hit ratio for ARC. It sounds to me like you probably have enough RAM in there that nearly everything is being served from the ARC.
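[For the record, the quoted counters do bear out the 87% figure -- a trivial recomputation from the two numbers in the arc_summary excerpt above:]

```python
# Recompute the ARC hit ratio from the quoted arc_summary counters.
cache_access_total = 12324974
cache_hits = 10826363

hit_ratio = cache_hits / cache_access_total
print(f"{hit_ratio:.1%}")   # 87.8%
```

(The "real hit ratio" of 68% mentioned elsewhere in the thread presumably excludes some access classes; the raw counters alone don't distinguish them.)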
Re: [zfs-discuss] Interesting question about L2ARC
Dan,

If you're not already familiar with it, I find the following command useful. It shows the realtime total read commands, the number hitting/missing the ARC, the number hitting/missing the L2ARC, a breakdown of MRU/MFU, etc.

arcstat_v2.pl -f read,hits,mru,mfu,miss,hit%,l2read,l2hits,l2miss,l2hit%,arcsz,l2size,mrug,mfug 1

That version of arcstat is from http://github.com/mharsch/arcstat

For comparison, I've got about 650GB of VMs on each of my two Nexenta VSAs (16GB/240GB L2ARC). When it's just ticking over at 50-1000 r/s, then 99% of that is going to the ARC, but I'm also seeing patches where it goes to 2-5k reads and I'm seeing 20-80% L2ARC hits. These have been running for about a week and, given my understanding of how the L2ARC fills, I'd suggest maybe leaving it to warm up longer (e.g. 1-2 weeks?).

caveat: I'm a complete newbie to zfs so I could be completely wrong ;)

Cheers,
James