On 2017-04-17 14:34, Chris Murphy wrote:
On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn
<ahferro...@gmail.com> wrote:

What is a high-end SSD these days? Built-in NVMe?

One with a good FTL in the firmware.  At minimum, the good Samsung EVO
drives, the high-quality Intel ones, and the Crucial MX series, but probably
some others.  My choice of words here probably wasn't the best, though.

It's a confusing market that sorta defies figuring out what we've got.

I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
EVO+ SD Card in an Intel NUC. They use that same EVO branding on an
$11 SD Card.

And then there's the Samsung Electronics Co Ltd NVMe SSD Controller
SM951/PM951 in another laptop.
What makes it even more confusing is that, other than Samsung (who _only_ use their own flash and controllers), manufacturer does not map consistently to controller choice, and even two drives with the same controller may ship with different firmware (and thus different degrees of reliability; the OCZ drives that were such crap at data retention, for example, were the result of a firmware option that the controller manufacturer pretty much told them not to use on production devices).


So long as this file is not reflinked or snapshotted, filefrag shows a
pile of mostly 4096-byte blocks, thousands of them. But as they're pretty
much all contiguous, the file's fragmentation (extent count) is rarely
higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using the ssd_spread mount option. That one has a
journal file that is +C and is not being snapshotted, but it has over 3000
extents according to both filefrag and btrfs-progs/debugfs. Really weird.

Given how the 'ssd' mount option behaves and the frequency with which most
systemd instances write to their journals, that's actually reasonably
expected.  We look for big chunks of free space to write into and then
align to 2M regardless of the actual size of the write, which in turn means
that files like the systemd journal, which see lots of (relatively
speaking) small writes, will have way more extents than they should until
you defragment them.
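(For completeness: once a file is in that state, the fix is an in-place
defragment.  This is only a sketch; the path is a placeholder for wherever
your journal actually lives, and -t 32M is just an illustrative target
extent size:

  # rewrite the file into fewer, larger extents; -v prints what it touches
  btrfs filesystem defragment -v -t 32M /var/log/journal/<machine-id>/system.journal

Keep in mind that defragmenting a file that _is_ shared via reflinks or
snapshots will unshare those extents.)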

Nope. The first paragraph applies to the NVMe machine with the ssd mount
option. Few fragments.

The second paragraph applies to the SD card machine with the ssd_spread
mount option. Many fragments.
Ah, apologies for my misunderstanding.
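
If it helps to compare the two machines with the same tooling, filefrag
shows both the extent count and whether the extents are contiguous.  The
path below assumes a typical persistent-journal layout, with <machine-id>
as an obvious placeholder:

  # summary extent count
  filefrag /var/log/journal/<machine-id>/system.journal
  # -v lists every extent, so you can see whether they're actually contiguous
  filefrag -v /var/log/journal/<machine-id>/system.journal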

These are different versions of systemd-journald, so I can't completely
rule out a difference in write behavior.
There have only been a couple of changes in the write patterns that I know of, but I would double-check that the values for Seal and Compress in journald.conf are the same on both machines, as I know for a fact that changing those does change the write patterns (not much, but they do change).
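
For reference, both settings live in the [Journal] section of
/etc/systemd/journald.conf; the values below are just the usual defaults,
shown for illustration:

  [Journal]
  Seal=yes
  Compress=yes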


Now, systemd aside, there are databases that behave this same way, where
there's a small section constantly being overwritten and one or more
sections that grow the database file from within and at the end. If such a
file is COW, it will absolutely fragment a ton, especially if the changes
are mostly 4KiB-block-sized writes that are then fsync'd.
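
(The stock answer for that case on btrfs is to mark the database files
NOCOW before they're written.  A minimal sketch, with a hypothetical path;
note that +C only takes effect on empty files, so you set it on the
directory and let new files inherit it:

  # new files created under this directory inherit nodatacow
  chattr +C /var/lib/mydb
  # verify: 'C' should appear in the attribute listing
  lsattr -d /var/lib/mydb

The trade-off is that nodatacow also disables checksumming and compression
for those files.)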

It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...

Essentially yes, but that causes all kinds of other problems.

Drat.

Admittedly, most of the problems are use-case specific (you can't afford to lose transactions in a financial database, for example, so it functionally has to call fsync after each transaction), but most of it stems from the fact that BTRFS is internally doing a lot of the same stuff that much of the 'problem' software is doing itself.
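
For what it's worth, the commit interval itself is tunable on btrfs; the
default is 30 seconds, and the example below (the mountpoint is a
placeholder) just illustrates stretching it:

  # widen the periodic-commit window from the default 30s to 120s;
  # anything not fsync'd within the window can be lost on a crash
  mount -o remount,commit=120 /mnt/data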
