On 2017-04-17 14:34, Chris Murphy wrote:
On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn
<ahferro...@gmail.com> wrote:

What is a high-end SSD these days? Built-in NVMe?

One with a good FTL in the firmware.  At minimum, the good Samsung EVO
drives, the high-quality Intel ones, and the Crucial MX series, but probably
some others.  My choice of words here probably wasn't the best, though.

It's a confusing market that sorta defies figuring out what we've got.

I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
EVO+ SD Card in an Intel NUC. They use that same EVO branding on an
$11 SD Card.

And then there's the Samsung Electronics Co Ltd NVMe SSD Controller
SM951/PM951 in another laptop.
What makes it even more confusing is that, other than Samsung (who _only_ use their own flash and controllers), manufacturer does not map consistently to controller choice, and even two drives with the same controller may ship with different firmware (and thus different degrees of reliability; the OCZ drives that were such crap at data retention, for example, were the result of a firmware option that the controller manufacturer pretty much told them not to use on production devices).


So long as this file is not reflinked or snapshotted, filefrag shows a
pile of mostly 4096-byte blocks, thousands of them. But as they're pretty
much all contiguous, the file's fragmentation (extent count) is rarely
higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using the ssd_spread mount option. That one has a
journal file that is +C and is not being snapshotted, but it has over 3000
extents according to both filefrag and btrfs-progs/debugfs. Really weird.

Given how the 'ssd' mount option behaves and the frequency with which most
systemd instances write to their journals, that's actually reasonably
expected.  We look for big chunks of free space to write into and then
align to 2M regardless of the actual size of the write, which in turn means
that files like the systemd journal, which see lots of (relatively
speaking) small writes, will have way more extents than they should until
you defragment them.
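(For completeness: once a file is in that state, the fix is an in-place
defragment.  This is only a sketch; the path is a placeholder for wherever
your journal actually lives, and -t 32M is just an illustrative target
extent size:

  # rewrite the file into fewer, larger extents; -v prints what it touches
  btrfs filesystem defragment -v -t 32M /var/log/journal/<machine-id>/system.journal

Keep in mind that defragmenting a file that _is_ shared via reflinks or
snapshots will unshare those extents.)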

Nope. The first paragraph applies to the NVMe machine with the ssd mount
option. Few fragments.

The second paragraph applies to the SD card machine with the ssd_spread
mount option. Many fragments.
Ah, apologies for my misunderstanding.
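
If it helps to compare the two machines with the same tooling, filefrag
shows both the extent count and whether the extents are contiguous.  The
path below assumes a typical persistent-journal layout, with <machine-id>
as an obvious placeholder:

  # summary extent count
  filefrag /var/log/journal/<machine-id>/system.journal
  # -v lists every extent, so you can see whether they're actually contiguous
  filefrag -v /var/log/journal/<machine-id>/system.journal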

These are different versions of systemd-journald, so I can't completely
rule out a difference in write behavior.
There have only been a couple of changes in the write patterns that I know of, but I would double-check that the values for Seal and Compress in journald.conf are the same on both machines, as I know for a fact that changing those does change the write patterns (not much, but they do change).
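
For reference, both settings live in the [Journal] section of
/etc/systemd/journald.conf; the values below are just the usual defaults,
shown for illustration:

  [Journal]
  Seal=yes
  Compress=yes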


Now, systemd aside, there are databases that behave this same way, where
there's a small section constantly being overwritten and one or more
sections that grow the database file from within and at the end. If such a
file is COW, it will absolutely fragment a ton, especially if the changes
are mostly 4KiB-block-sized writes that are then fsync'd.
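
(The stock answer for that case on btrfs is to mark the database files
NOCOW before they're written.  A minimal sketch, with a hypothetical path;
note that +C only takes effect on empty files, so you set it on the
directory and let new files inherit it:

  # new files created under this directory inherit nodatacow
  chattr +C /var/lib/mydb
  # verify: 'C' should appear in the attribute listing
  lsattr -d /var/lib/mydb

The trade-off is that nodatacow also disables checksumming and compression
for those files.)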

It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...

Essentially yes, but that causes all kinds of other problems.

Drat.

Admittedly, most of the problems are use-case specific (you can't afford to lose transactions in a financial database, for example, so it functionally has to call fsync after each transaction), but most of it stems from the fact that BTRFS is internally doing a lot of the same stuff that much of the 'problem' software is doing itself.
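
For what it's worth, the commit interval itself is tunable on btrfs; the
default is 30 seconds, and the example below (the mountpoint is a
placeholder) just illustrates stretching it:

  # widen the periodic-commit window from the default 30s to 120s;
  # anything not fsync'd within the window can be lost on a crash
  mount -o remount,commit=120 /mnt/data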
