On Sat, 05 Mar 2016 00:52:09 +0100, lee <[email protected]> wrote:

> >> > It uses some very clever ideas to place files into groups and
> >> > into proper order - rather than using file mod and access times
> >> > like other defrag tools do (which even make the problem worse by
> >> > doing so because this destroys locality of data even more).
> >>
> >> I've never heard of MyDefrag, I might try it out. Does it make
> >> updating any faster?
> >
> > Ah well, difficult question... Short answer: it uses countermeasures
> > against performance degrading too fast after updates. It does this
> > by using a "gapped" on-disk file layout - leaving some gaps for
> > Windows to put temporary files into. This way, files don't become as
> > far spread out as they usually do during updates. But yes, it
> > improves installation time.
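Just to make the idea concrete, a little toy sketch in Python (this is
NOT MyDefrag's actual algorithm - the group names and numbers are
invented; it only shows why leaving a gap behind each group of files
keeps updates local):

# Toy sketch of a "gapped" layout: every group of related files gets
# some slack right behind it, so blocks written later (updates, temp
# files) land close to their group instead of far away on the disk.

GROUPS = {"boot": 4000, "apps": 3000, "data": 2000}  # sizes in blocks
GAP_RATIO = 0.2                                      # 20% slack per group

def lay_out(groups, gap_ratio):
    layout, offset = {}, 0
    for name, size in groups.items():
        gap = int(size * gap_ratio)
        layout[name] = {"start": offset, "used": size, "gap": gap}
        offset += size + gap                         # gap follows the group
    return layout

def place_new_blocks(layout, group, blocks):
    """Try to place new blocks inside the group's own gap."""
    g = layout[group]
    if blocks <= g["gap"]:
        g["used"] += blocks
        g["gap"] -= blocks
        return "near its group (locality preserved)"
    return "somewhere far away (locality lost)"

layout = lay_out(GROUPS, GAP_RATIO)
print(place_new_blocks(layout, "apps", 200))   # fits into the gap
print(place_new_blocks(layout, "apps", 5000))  # gap too small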
> What difference would that make with an SSD?

Well, those gaps are, with a bit of luck, a trimmed erase block, so
they can be served fast by the SSD firmware. Of course, the same
applies if your OS is using discard commands to mark free blocks and
you still have enough free space in the FS. So, actually, for SSDs it
probably makes no difference.

> > Apparently it's been unmaintained for a few years but it still does
> > a good job. It was built upon a theory by a student about how to
> > properly reorganize the file layout on a spinning disk to keep
> > performance as high as possible.
>
> For spinning disks, I can see how it can be beneficial.

My comment was targeted at this.

> >> > But even SSDs can use _proper_ defragmentation from time to time
> >> > for increased lifetime and performance (this is due to how the
> >> > FTL works and because erase blocks are huge, I won't get into
> >> > detail unless someone asks). This is why mydefrag also supports
> >> > flash optimization. It works by moving as few files as possible
> >> > while coalescing free space into big chunks, which in turn
> >> > relaxes pressure on the FTL and allows for more free and
> >> > contiguous erase blocks, which reduces early flash chip wear. A
> >> > filled SSD with a long usage history can certainly gain back some
> >> > performance from this.
> >>
> >> How does it improve performance? It seems to me that, for
> >> practical use, almost all of the better performance with SSDs is
> >> due to reduced latency. And IIUC, it doesn't matter for the
> >> latency where data is stored on an SSD. If its performance
> >> degrades over time when data is written to it, the SSD sucks, and
> >> the manufacturer should have done a better job. Why else would I
> >> buy an SSD? If it needs to reorganise the data stored on it, the
> >> firmware should do that.
> >
> > There are different factors which have an impact on performance,
> > not just seek times (which, as you write, is the worst performance
> > breaker):
> >
> > * management overhead: the OS has to do more housekeeping, which
> >   (a) introduces more IOPS (which is the only relevant limiting
> >   factor for SSDs) and (b) introduces more CPU cycles and data
> >   structure locking within the OS routines while performing IO,
> >   which comes down to more CPU cycles spent during IO
>
> How would that be reduced by defragmenting an SSD?

Defragmenting coalesces FS structures back into simpler ones: btrfs,
for example, creates huge overhead by splitting extents due to its COW
nature, and a defrag combines these back into fewer extents. It's
reported on the btrfs list that this CAN make a big difference even
for SSDs, though usually you only see the performance loss with
heavily fragmented files like VM images - so the recommendation there
is to set those files nocow.
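Here is a little toy model in Python of that extent splitting (it has
nothing to do with btrfs' real extent tree or allocator - the numbers
are invented; it only shows how random COW rewrites multiply the
extent count and what a defrag buys back):

import random

# Toy model: the file starts as one extent; every small random rewrite
# is written elsewhere (copy-on-write) and splits the logical range.
# A defrag writes the file out contiguously again.

FILE_BLOCKS = 25_000            # ~100 MiB at 4 KiB blocks
extents = [(0, FILE_BLOCKS)]    # list of (logical start, length)

def cow_rewrite(extents, pos, length):
    out = []
    for start, size in extents:
        end = start + size
        if end <= pos or start >= pos + length:
            out.append((start, size))       # extent not touched
            continue
        lo, hi = max(start, pos), min(end, pos + length)
        if start < lo:
            out.append((start, lo - start)) # untouched left part
        out.append((lo, hi - lo))           # rewritten part = new extent
        if end > hi:
            out.append((hi, end - hi))      # untouched right part
    return out

random.seed(1)
for _ in range(2_000):                      # VM-image-like random writes
    extents = cow_rewrite(extents, random.randrange(FILE_BLOCKS - 8), 8)

print("extents after random rewrites:", len(extents))
extents = [(0, FILE_BLOCKS)]                # after a defrag: one extent again
print("extents after defrag:         ", len(extents))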
> > * erasing a block is where SSDs really suck performance-wise,
> >   plus blocks are essentially read-only once written - that's how
> >   flash works, a flash data block needs to be erased prior to being
> >   rewritten - and that is (compared to the rest of its performance)
> >   a really REALLY HUGE time factor
>
> So let the SSD do it when it's idle. For applications in which it
> isn't idle enough, an SSD won't be the best solution.

That's probably true - I hadn't thought of this.

> > * erase blocks are huge compared to common filesystem block sizes
> >   (erase block = 1 or 2 MB vs. a file system block of usually
> >   4-64k), which happens to result in this effect:
> >
> >   - the OS replaces a file by writing a new one and deleting the
> >     old one (common during updates), or the user deletes files
> >   - the OS marks some blocks as free in its FS structures; it
> >     depends on the file size and its fragmentation whether this
> >     gives you a contiguous area of free blocks or many small blocks
> >     scattered across the disk: it results in free space
> >     fragmentation
> >   - free space fragments happen to become small over time, much
> >     smaller than the erase block size
> >   - if your system has TRIM/discard support, it will tell the SSD
> >     firmware: here, I no longer use those 4k blocks
> >   - as you already figured out: those small blocks marked as free
> >     do not properly align with the erase block size - so you may
> >     end up with a lot of free space but essentially no complete
> >     erase block is marked as free
>
> Use smaller erase blocks.

It's a hardware limitation - and it's probably not going to change. I
think erase blocks will become even bigger as capacities increase.

> >   - this situation means: the SSD firmware cannot reclaim this free
> >     space to do "free block erasure" in advance, so if you write
> >     another small block of data you may end up with the SSD going
> >     into a direct "read/modify/erase/write" cycle instead of just
> >     "read/modify/write" and deferring the erasing until later - ah
> >     yes, that's probably becoming slow then
> >   - what do we learn: (a) defragment free space from time to time,
> >     (b) enable TRIM/discard to reclaim blocks in advance, (c) you
> >     may want to over-provision your SSD: just don't ever use 10-15%
> >     of your SSD, trim that space, and leave it there for the
> >     firmware to shuffle erase blocks around
>
> Use better firmware for SSDs.

This is a technical limitation. I don't think there's anything the
firmware could improve here - except by using internal overprovisioning
and bigger caches to defer this into the idle background - but see
your comment above regarding idle time. A problem that goes hand in
hand with this: if your SSD firmware falls back to the
"read/erase/modify/write" cycle, this wears the flash cells much
faster. Thus, I'd recommend using bigger overprovisioning, depending on
application and usage pattern. (A small simulation of the alignment
problem follows further below.)

> >   - the latter point also increases lifetime for obvious reasons,
> >     as SSDs only support a limited count of write cycles per block
> >   - this "shuffling around" of blocks is called wear-levelling: the
> >     firmware chooses the block candidate with the least write
> >     cycles for doing "read/modify/write"
> >
> > So, SSDs actually do this "reorganization" as you call it - but
> > they are limited to doing it within the bounds of erase block
> > sizes - and the firmware knows nothing about the on-disk format and
> > its smaller blocks, so it can do nothing to go down to a finer
> > grained reorganization.
>
> Well, I can't help it. I'm going to need to use 2 SSDs on a hardware
> RAID controller in a RAID-1. I expect the SSDs to just work fine. If
> they don't, then there isn't much point in spending the extra money
> on them.
>
> The system needs to boot from them. So what choice do I have to make
> these SSDs happy?

Well, from the OS point of view they should work just the same with
hardware and software RAID. Your RAID controller should support
passing discard commands down to the SSDs - or you use bigger
overprovisioning by not assigning all of the space to the array
configuration.
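To put some toy numbers on the erase block mismatch from the quoted
list above (purely illustrative - a real FTL also garbage-collects
partially free erase blocks by copying the still-valid pages elsewhere
first, but only a completely free erase block can be reclaimed without
any copying):

import random

# Toy numbers, not from any real device: 1 GiB of flash, 1 MiB erase
# blocks, 4 KiB filesystem blocks, 20% of the filesystem is free.
ERASE_BLOCKS = 1024
PAGES_PER_EB = 256              # 1 MiB / 4 KiB
TOTAL_PAGES  = ERASE_BLOCKS * PAGES_PER_EB
FREE_PAGES   = TOTAL_PAGES // 5

def fully_free_erase_blocks(free_pages):
    """Count erase blocks the firmware could erase in advance."""
    free = set(free_pages)
    full = 0
    for eb in range(ERASE_BLOCKS):
        first = eb * PAGES_PER_EB
        if all(p in free for p in range(first, first + PAGES_PER_EB)):
            full += 1
    return full

random.seed(1)
# Case 1: the free space is fragmented into scattered 4 KiB blocks.
scattered = random.sample(range(TOTAL_PAGES), FREE_PAGES)
# Case 2: the same amount of free space, coalesced into one region.
coalesced = range(TOTAL_PAGES - FREE_PAGES, TOTAL_PAGES)

print("reclaimable erase blocks, scattered free space:",
      fully_free_erase_blocks(scattered))   # -> 0
print("reclaimable erase blocks, coalesced free space:",
      fully_free_erase_blocks(coalesced))   # -> ~200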
But by all means: it is worth spending the money. We are using
mirrored SSDs in an LSI CacheCade configuration - the result is
lightning-fast systems. The SSD mirror just acts as a huge write-back
and random access cache for the bigger spinning RAID sets - like l2arc
does for ZFS, just at the RAID controller level. This way, you can
have your cake and eat it, too: best of both worlds - big storage +
high IOPS.

> > These facts are apparently unknown to most people, that's why they
> > deny an SSD could become slow or need some specialized form of
> > "defragmentation". The usual recommendation is to do a "secure
> > erase" of the disk if it becomes slow - which I consider pretty
> > harmful as it rewrites ALL blocks (reducing their write-cycle
> > counter/lifetime), plus it's time consuming and could be avoided.
>
> That isn't an option because it would be way too much hassle.

You mean secure erase: yes, not an option - for different reasons.

> > BTW: OS makers (and FS designers) actually optimize their systems
> > for that kind of reorganization by the SSD firmware. NTFS may use
> > different allocation strategies on SSDs (just a guess), and in
> > Linux there is F2FS, which actually exploits this reorganization
> > for increased performance and lifetime. Ext4 and Btrfs use
> > different allocation strategies and prefer spreading file data
> > instead of free space (which is just the opposite of what's done
> > for HDDs). So, with a modern OS you are much less prone to the
> > effects described above.
>
> Does F2FS come with some sort of redundancy? Reliability and booting
> from these SSDs are requirements, so I can't really use btrfs because
> it's troublesome to boot from, and the reliability is questionable.
> Ext4 doesn't have raid. Using ext4 on mdadm probably won't be any
> better than using the hardware RAID, so there's no point in doing
> that, and I rather spare me the overhead.

Well, you can use F2FS with mdadm. And btrfs boots just fine if you
are not using multi-device btrfs - so you'd have to fall back to
hardware RAID or mdadm instead of using btrfs' native RAID pooling.

> After your explanation, I have to wonder even more than before what
> the point in using SSDs is, considering current hard- and software
> which doesn't properly use them. OTOH, so far they do seem to
> provide better performance than hard disks even when not used with
> all the special precautions I don't want to have to think about.

Yes, they do. But I think there's still a lot that can be done.
Developing file systems is a multi-year, if not multi-decade, process.
Historically, everything was designed around spinning disk
characteristics. Of course, much has been done already to make these
FS work better with SSDs: Ext4 has optimizations, btrfs was designed
with SSDs in mind, F2FS is a completely new filesystem specifically
targeted at simple flash storage (devices without an FTL, read:
embedded devices) but also works great for SSDs (which use an FTL),
and most other setups added some sort of caching layer to make use of
SSDs while still providing big storage.

> BTW, why would anyone use SSDs for ZFS's zil or l2arc? Does ZFS treat
> SSDs properly in this application?

ZFS' caches are properly designed around this, I think. Linux adds its
own l2arc/zil-like caches (usable with every FS), namely bcache,
flashcache, mdcache, maybe more... I'm very happy with bcache in
writeback mode on my home system. [1] Hardware solutions like LSI
CacheCade also work very well. So, if you're using a RAID controller
anyway, consider that.
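For anyone who hasn't met these caches: a rough Python sketch of the
write-back idea behind bcache/CacheCade/l2arc (a toy model only, not
how any of them is actually implemented - the class and its methods
are invented for illustration):

from collections import OrderedDict

# Writes are acknowledged once they sit on the (small, fast) cache
# device and are flushed to the (big, slow) backing device later;
# reads are served from the cache whenever possible.

class WriteBackCache:
    def __init__(self, backing, cache_blocks):
        self.backing = backing          # dict: block -> data (slow array)
        self.cache = OrderedDict()      # block -> (data, dirty) on the SSD
        self.cache_blocks = cache_blocks

    def write(self, block, data):
        self.cache[block] = (data, True)    # fast path: SSD only
        self.cache.move_to_end(block)
        self._evict_if_needed()

    def read(self, block):
        if block in self.cache:             # cache hit: SSD latency
            self.cache.move_to_end(block)
            return self.cache[block][0]
        data = self.backing.get(block)      # cache miss: disk latency
        self.cache[block] = (data, False)
        self._evict_if_needed()
        return data

    def _evict_if_needed(self):
        while len(self.cache) > self.cache_blocks:
            block, (data, dirty) = self.cache.popitem(last=False)  # LRU
            if dirty:
                self.backing[block] = data  # flush before dropping

    def flush(self):                        # e.g. before detaching the cache
        for block, (data, dirty) in self.cache.items():
            if dirty:
                self.backing[block] = data
                self.cache[block] = (data, False)

hdd = {}
cache = WriteBackCache(hdd, cache_blocks=1000)
cache.write(42, b"hot data")    # acknowledged at SSD speed
print(cache.read(42))           # served from the cache
cache.flush()
print(hdd[42])                  # now also on the backing array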
But I think all of those caches just work around the design patterns
of today's common filesystems - those can still use improvements and
optimizations. But in itself, I already see it as a huge improvement.

[1]: Though I must say that you can wear out your SSD with bcache in
     around 2 years, at least the cheaper ones. But my Win7 VM can
     boot in 7 seconds at best with it (btrfs-raid/bcache), though
     usually it's around 15-20 seconds - and its image is bigger than
     my SSD. And working with it feels no different from using Win7
     natively on an SSD (read: no VM, drive C and everything on the
     SSD). But actually, I feel it's simpler to replace the caching
     SSD when it wears out than to reinstall the system on a new SSD
     (as you'd have to when using it natively and its space becomes
     too small).

--
Regards,
Kai

Replies to list-only preferred.

