Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-27 Thread Rich Freeman
On Mon, Apr 27, 2020 at 3:07 PM antlists  wrote:
>
> On 27/04/2020 17:59, Rich Freeman wrote:
> > Really, though, a better solution than any of this is for the
> > filesystem to be more SSD-aware and only perform writes on entire
> > erase regions at one time.  If the drive is told to write blocks 1-32
> > then it can just blindly erase their contents first, because it knows
> > everything there is getting overwritten anyway.  Likewise a filesystem
> > could do its own wear-leveling, especially on something like flash
> > where the cost of fragmentation is not high.  I'm not sure how well
> > either zfs or ext4 performs in these roles.  Obviously a solution
> > like f2fs, designed for flash storage, is going to excel here.
>
> The problem here is "how big is an erase region". I've heard comments
> that it is several megs.

I imagine erase regions on most SSDs aren't that big, though SMR drive
zones probably are that and more.

But I agree - for anything like this to work it ideally needs to be a
host-managed solution, or at least one where the vendor has published
specs on how to align writes/etc.

-- 
Rich



Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-27 Thread antlists

On 27/04/2020 17:59, Rich Freeman wrote:

> Really, though, a better solution than any of this is for the
> filesystem to be more SSD-aware and only perform writes on entire
> erase regions at one time.  If the drive is told to write blocks 1-32
> then it can just blindly erase their contents first, because it knows
> everything there is getting overwritten anyway.  Likewise a filesystem
> could do its own wear-leveling, especially on something like flash
> where the cost of fragmentation is not high.  I'm not sure how well
> either zfs or ext4 performs in these roles.  Obviously a solution
> like f2fs, designed for flash storage, is going to excel here.


The problem here is "how big is an erase region". I've heard comments 
that it is several megs. Trying to consolidate writes into megabyte 
blocks is going to be tricky, to say the least, unless you're dealing 
with video files or hi-res photos - I think the files my camera chucks 
out are in the 10MB region ... (24MP raw...)


Cheers,
Wol



Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-27 Thread Rich Freeman
On Mon, Apr 27, 2020 at 12:20 PM  wrote:
>
> The kernel keeps track of what has already been fstrimmed and avoids
> re-trimming the same data.
> This knowledge gets lost when the PC is power-cycled or rebooted.
>

I imagine this is filesystem-specific.  When I checked the ext4 source
I didn't think to actually check whether those flags are stored on
disk vs in some kind of cache.

I wouldn't be surprised if this data is also lost by simply unmounting
the filesystem.

> I think the value of the amount of fstrimmed data does not reflect
> the amount of data which gets physically fstrimmed by the SSD
> controller.

Yup.  Though I'd take issue with the term "physically fstrimmed" - I
don't think that a concept like this really exists.  The only physical
operations are reading, writing, and erasing.  TRIM is really a
logical operation at its heart.

It wouldn't make sense for a TRIM to automatically trigger some kind
of erase operation all the time.  Suppose blocks 1-32 are in a single
erase group.  You send a TRIM command for block 1 only.  It makes no
sense to have the device read blocks 2-32, erase blocks 1-32, and then
write blocks 2-32 back.  That does erase block 1, but it costs a bunch
of IO and it only replicates the worst case scenario of what would
happen if you overwrote block 1 in place without trimming it first.
You might argue that now block 1 can be written later without having
to do another erase, but this is only true if the drive can remember
that it was already erased - otherwise all writes have to be preceded
by reads just to see if the block is already empty.

Maybe that is how they actually do it, but it seems like it would make
more sense for a drive to look for opportunities to erase entire
blocks that don't require a read first, or to keep these unused areas
in some kind of extents that are less expensive to track.  The drive
already has to do a lot of mapping for the sake of wear leveling.
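
To make that concrete, here's a toy model (purely illustrative -
invented names, no real FTL works exactly like this) where TRIM is
nothing but a metadata flip, and an erase only becomes cheap once
nothing live remains in the group:

#include <stdbool.h>
#include <stdio.h>

#define GROUP_BLOCKS 32          /* assumed erase-group size */

struct erase_group {
    bool valid[GROUP_BLOCKS];    /* which blocks still hold live data */
};

/* TRIM: pure metadata, no flash IO at all. */
static void trim_block(struct erase_group *g, int blk)
{
    g->valid[blk] = false;
}

/* An erase is only cheap when no live data would need copying first. */
static bool can_erase_cheaply(const struct erase_group *g)
{
    for (int i = 0; i < GROUP_BLOCKS; i++)
        if (g->valid[i])
            return false;        /* would force a read-erase-rewrite */
    return true;
}

int main(void)
{
    struct erase_group g;
    for (int i = 0; i < GROUP_BLOCKS; i++)
        g.valid[i] = true;       /* blocks 1-32 all live */

    trim_block(&g, 0);           /* TRIM block 1 only, as above */
    printf("after one TRIM: cheap erase? %s\n",
           can_erase_cheaply(&g) ? "yes" : "no");

    for (int i = 1; i < GROUP_BLOCKS; i++)
        trim_block(&g, i);       /* TRIM the rest of the group */
    printf("after all 32:   cheap erase? %s\n",
           can_erase_cheaply(&g) ? "yes" : "no");
    return 0;
}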

Really, though, a better solution than any of this is for the
filesystem to be more SSD-aware and only perform writes on entire
erase regions at one time.  If the drive is told to write blocks 1-32
then it can just blindly erase their contents first, because it knows
everything there is getting overwritten anyway.  Likewise a filesystem
could do its own wear-leveling, especially on something like flash
where the cost of fragmentation is not high.  I'm not sure how well
either zfs or ext4 performs in these roles.  Obviously a solution
like f2fs, designed for flash storage, is going to excel here.
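
As a sketch of the "entire erase regions at one time" idea (assuming a
made-up 4 MiB region size - f2fs's real segment handling is far more
involved), a log head that only ever issues aligned, region-sized
writes, so the drive can erase blindly before programming:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ERASE_REGION (4u << 20)      /* assumed: 4 MiB erase region */

struct log_head {
    uint8_t  buf[ERASE_REGION];      /* staging buffer for dirty data */
    uint32_t used;                   /* bytes buffered so far */
    uint64_t next_off;               /* device offset, region-aligned */
};

/* Stand-in for the real block-device write path. */
static void device_write_region(uint64_t off, const void *data)
{
    (void)data;
    printf("write %u bytes at region-aligned offset %llu\n",
           (unsigned)ERASE_REGION, (unsigned long long)off);
}

/* Append data; the device only ever sees full-region writes. */
static void log_append(struct log_head *h, const void *data, uint32_t len)
{
    const uint8_t *p = data;
    while (len > 0) {
        uint32_t n = ERASE_REGION - h->used;
        if (n > len)
            n = len;
        memcpy(h->buf + h->used, p, n);
        h->used += n;
        p += n;
        len -= n;
        if (h->used == ERASE_REGION) {
            device_write_region(h->next_off, h->buf);
            h->next_off += ERASE_REGION;
            h->used = 0;
        }
    }
}

int main(void)
{
    static struct log_head head;     /* static: the buffer is 4 MiB */
    uint8_t chunk[1024] = {0};
    for (int i = 0; i < 8192; i++)   /* 8 MiB in -> two region writes */
        log_append(&head, chunk, sizeof(chunk));
    return 0;
}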

-- 
Rich



Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-27 Thread tuxic
On 04/28 03:12, Kent Fredric wrote:
> On Sun, 26 Apr 2020 18:15:51 +0200
> tu...@posteo.de wrote:
> 
> > Filesystem  Size  Used Avail Use% Mounted on
> > /dev/root   246G   45G  189G  20% /
> 
> Given that (Size - Used) is roughly 200G, it suggests to me that
> perhaps some process somewhere is creating and deleting a lot of
> temporary files on this device (or maybe simply re-writing the same
> file multiple times).
> 
> From userspace, this would be invisible, as the "new" file would be
> in a new location on the disk, and the "old" file's space would be
> freed and marked "can be overwritten".
> 
> So if you did:
> 
> for i in {0..200}; do
>cp a b
>rm a
>mv b a 
> done
> 
> Where "a" is a 1G file, I'd expect this to have a *ceiling* of 200G
> that would turn up in fstrim output, as once you reached iteration 201,
> where "can overwrite" would allow the SSD to go back and rewrite over
> the space used in iteration 1.
> 
> While the whole time, the visible disk usage in df -h would never
> exceed 46G .
> 
> I don't know if this is what is happening, I don't have an SSD and
> don't get to use fstrim.
> 
> But based on what you've said, the results aren't *too* surprising.
> 
> Though it's possible the hardware has some internal magic to elide
> some writes, potentially making the "cp" action incur very few
> writes; that would show up in the smartctl data, but ext4 might not
> know anything about it, so perhaps fstrim only indicates what ext4
> *tracked* as being cleaned, while much less cleanup may actually have
> been required on the hardware.
> 
> That would explain the difference between smartctl and fstrim results.
> 
> Maybe compare smartctl output over time with
> /sys/fs/ext4/<device>/session_write_kbytes and see if one grows
> faster than the other? :)
> 
> My local session_write_kbytes is currently at 709G, the partition
> it's for is only 552G with 49G free, and it's been booted 33 days, so
> "21G of writes a day".
> 
> And uh, lifetime_write_kbytes is about 18TB. Yikes.
> 
> ( compiling things involves a *LOT* of ephemeral data )
> 
> Also, probably don't assume the amount of free space on your
> partition is all the physical device has at its disposal to use.  It
> seems possible that on the hardware, the total pool of "free blocks"
> is arbitrarily usable by the device for wear levelling, and a TRIM
> command to that device could plausibly report more blocks trimmed
> than your current partition size, depending on how it's implemented.
> 
> But indeed, lots of speculation here on my part :)
> 

Hi Kent,

Thank you very much for your research and your explanations! :)

Due to some statements I found online I did an interesting little
experiment:

fstrim
fstrim
reboot
fstrim 

Reported amount of data for each fstrim:
200.2GiB
  0.0GiB
--
200.2GiB

The reboot seems to be worth the same amount of fstrimmed
data as one week of daily updates and recompilations. ;)

(By the way: this all happens on an ext4 filesystem.)

Background according to the reports I found online:

The kernel keeps track of what has already been fstrimmed and avoids
re-trimming the same data.
This knowledge gets lost when the PC is power-cycled or rebooted.

I think the value of the amount of fstrimmed data does not reflect
the amount of data which gets physically fstrimmed by the SSD
controller.

The kernel only throws the information about "possible candidates for
being fstrimmed" towards the SSD controller, which is the real master
behind all this.

And as you wrote: the maximum amount of "possible data for being
fstrimmed" is all the free space of the filesystem.

Slightly related: do you know the purpose of these values (smartctl -a)?

Data Units Read:     656,599 [336 GB]
Data Units Written:  702,251 [359 GB]
Host Read Commands:  4,316,042
Host Write Commands: 3,080,180

Are these the raw amounts of data I have sent to the SSD?
Looks like a lot...
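
If I assume the NVMe convention that one "data unit" is 1000 512-byte
sectors, the bracketed GB figures check out - a quick sanity check:

#include <stdio.h>

int main(void)
{
    /* values from smartctl -a above */
    unsigned long long units_read    = 656599ULL;
    unsigned long long units_written = 702251ULL;

    /* assumed: NVMe data units are thousands of 512-byte sectors */
    unsigned long long unit_bytes = 1000ULL * 512ULL;

    printf("read:    %.1f GB\n", units_read    * unit_bytes / 1e9);
    printf("written: %.1f GB\n", units_written * unit_bytes / 1e9);
    return 0;                    /* prints ~336.2 GB and ~359.6 GB */
}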

Cheers!
Meino


Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-27 Thread Kent Fredric
On Sun, 26 Apr 2020 18:15:51 +0200
tu...@posteo.de wrote:

> Filesystem  Size  Used Avail Use% Mounted on
> /dev/root   246G   45G  189G  20% /

Given that (Size - Used) is roughly 200G, it suggests to me that
perhaps some process somewhere is creating and deleting a lot of
temporary files on this device (or maybe simply re-writing the same
file multiple times).

From userspace, this would be invisible, as the "new" file would be
in a new location on the disk, and the "old" file's space would be
freed and marked "can be overwritten".

So if you did:

for i in {0..200}; do
    cp a b
    rm a
    mv b a
done

Where "a" is a 1G file, I'd expect this to have a *ceiling* of 200G
that would turn up in fstrim output, as once you reached iteration 201,
where "can overwrite" would allow the SSD to go back and rewrite over
the space used in iteration 1.

While the whole time, the visible disk usage in df -h would never
exceed 46G .

I don't know if this is what is happening, I don't have an SSD and
don't get to use fstrim.

But based on what you've said, the results aren't *too* surprising.

Though it's possible the hardware has some internal magic to elide
some writes, potentially making the "cp" action incur very few
writes; that would show up in the smartctl data, but ext4 might not
know anything about it, so perhaps fstrim only indicates what ext4
*tracked* as being cleaned, while much less cleanup may actually have
been required on the hardware.

That would explain the difference between smartctl and fstrim results.

Maybe compare smartctl output over time with
/sys/fs/ext4/<device>/session_write_kbytes and see if one grows
faster than the other? :)

My local session_write_kbytes is currently at 709G, the partition
it's for is only 552G with 49G free, and it's been booted 33 days, so
"21G of writes a day".

And uh, lifetime_write_kbytes is about 18TB. Yikes.

( compiling things involves a *LOT* of ephemeral data )

Also, probably don't assume the amount of free space on your
partition is all the physical device has at its disposal to use.  It
seems possible that on the hardware, the total pool of "free blocks"
is arbitrarily usable by the device for wear levelling, and a TRIM
command to that device could plausibly report more blocks trimmed
than your current partition size, depending on how it's implemented.

But indeed, lots of speculation here on my part :)



Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-27 Thread Alan Mackenzie
Hello, Rich.

On Sun, Apr 26, 2020 at 15:29:40 -0400, Rich Freeman wrote:

[ ... ]

> Incidentally, in the other thread the reason that dry-run didn't
> report anything to be trimmed is that this is hard-coded:
> printf(_("%s: 0 B (dry run) trimmed on %s\n"), path, devname);
> https://github.com/karelzak/util-linux/blob/master/sys-utils/fstrim.c#L109

Thanks for looking that up!  There doesn't appear to be much point to
this misleading --dry-run option.  It seems like a good idea which never
got implemented (except in the manual).

> Otherwise the ioctl returns how much space was trimmed, and fstrim outputs 
> this.

> -- 
> Rich

-- 
Alan Mackenzie (Nuremberg, Germany).



Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-27 Thread William Kenworthy



On 27/4/20 11:14 am, tu...@posteo.de wrote:

On 04/26 09:58, Rich Freeman wrote:



/ on a btrfs raid10 (1x500G and 3x120G SSD)

"fstrim -v /" about 2 hrs apart:

rattus ~ # fstrim -v /
/: 680.6 GiB (730744291328 bytes) trimmed
rattus ~ # fstrim -v /
/: 17.8 GiB (19087859712 bytes) trimmed
rattus ~ # fstrim -v /
/: 17.8 GiB (19074703360 bytes) trimmed
rattus ~ #

The last two runs took about the same time, and the first was only
slightly longer - I should have timed it properly!



BillK






Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-26 Thread tuxic
On 04/26 09:58, Rich Freeman wrote:
> On Sun, Apr 26, 2020 at 9:43 PM  wrote:
> >
> > To implement a dry run with a printf() is new to me... ;)
> >
> 
> That is all the fstrim authors could do, since there is no dry-run
> option for the actual ioctl, and fstrim itself has no idea how the
> filesystem will implement it (short of re-implementing numerous
> filesystems in the program and running it on unmounted devices).  It
> seems like an fstrim dry-run is only minimally functional, though I
> guess it will test if you made any gross errors in syntax and so on.
> I don't see any reason why they couldn't have a dry-run option for the
> ioctl, but it would have to be implemented in the various filesystems.
> Really it seems like ioctl in general in the kernel isn't
> super-well-documented.  It isn't like the system call interface.  That
> is, unless I just missed some ioctl document floating around.  The
> actual list of ioctls is in the kernel includes, but this does not
> define the syntax of the 3rd parameter of the ioctl system call which
> is function-specific. The structure used by the FITRIM ioctl is in the
> includes, but not with any kind of documentation or even a
> cross-reference to associate the structure with the ioctl itself.
> 
> -- 
> Rich
> 

Hi Rich,

thanks for the explanations again.
But I think it is better not to implement a feature at all than to
fake it via printf.

For a dry run I had expected that some checks would be performed as to
whether a non-dry run would be successful.

For example:
When submitting 

fstrim -n /

as a normal user I get:
/: 0 B (dry run) trimmed

Doing the same without dry run set I get:
fstrim: /: FITRIM ioctl failed: Operation not permitted

When doing a 

fstrim -n /home/user

as a normal user, I get the same behaviour as above --
even though /home/user is on a hard disk with no
trim functionality at all.

If fstrim cannot implement the above correctly, it would be
better not to implement it at all... I think.

Cheers!
Meino







Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-26 Thread Rich Freeman
On Sun, Apr 26, 2020 at 9:43 PM  wrote:
>
> To implement a dry run with a printf() is new to me... ;)
>

That is all the fstrim authors could do, since there is no dry-run
option for the actual ioctl, and fstrim itself has no idea how the
filesystem will implement it (short of re-implementing numerous
filesystems in the program and running it on unmounted devices).  It
seems like an fstrim dry-run is only minimally functional, though I
guess it will test if you made any gross errors in syntax and so on.
I don't see any reason why they couldn't have a dry-run option for the
ioctl, but it would have to be implemented in the various filesystems.
Really it seems like ioctl in general in the kernel isn't
super-well-documented.  It isn't like the system call interface.  That
is, unless I just missed some ioctl document floating around.  The
actual list of ioctls is in the kernel includes, but this does not
define the syntax of the 3rd parameter of the ioctl system call which
is function-specific. The structure used by the FITRIM ioctl is in the
includes, but not with any kind of documentation or even a
cross-reference to associate the structure with the ioctl itself.
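
For reference, the useful part of fstrim boils down to something like
this (a minimal sketch based on the util-linux source linked above;
the kernel writes the bytes-trimmed count back into range.len):

#include <fcntl.h>
#include <limits.h>
#include <linux/fs.h>            /* FITRIM, struct fstrim_range */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/", O_RDONLY);    /* needs root, like fstrim */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct fstrim_range range;
    memset(&range, 0, sizeof(range));
    range.start = 0;
    range.len = ULLONG_MAX;          /* whole filesystem */
    range.minlen = 0;

    if (ioctl(fd, FITRIM, &range) < 0) {
        perror("FITRIM");            /* e.g. EPERM for non-root */
        close(fd);
        return 1;
    }

    printf("%llu bytes trimmed\n", (unsigned long long)range.len);
    close(fd);
    return 0;
}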

-- 
Rich



Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-26 Thread tuxic
On 04/26 03:29, Rich Freeman wrote:
> On Sun, Apr 26, 2020 at 12:15 PM  wrote:
> >
> > On 04/26 11:20, Rich Freeman wrote:
> > > On Sun, Apr 26, 2020 at 10:52 AM  wrote:
> > > >
> > > > Fstrim reports about 200 GiB of trimmed data.
> > > >
> > >
> > > My suggestion would be to run fstrim twice in a row and see how fast
> > > it operates and what the results are.  If the second one completes
> > > very quickly that suggests that the drive is sane.  I'd probably just
> > > run it daily in that case, but weekly is probably fine especially if
> > > the drive isn't very full.
> > >
> >
> > host:/root>fstrim -v /
> > /: 3.3 GiB (3578650624 bytes) trimmed
> > host:/root>fstrim -v /
> > /: 0 B (0 bytes) trimmed
> >
> > This time the first fstrim reports a small amount of trimmed
> > data and the second one no fstrimmed data at all.
> >
> 
> Ok, I became a bit less lazy and started looking at the source.
> 
> All fstrim does is send an FITRIM ioctl to the kernel for the device.
> This is implemented in a filesystem-dependent manner, and I couldn't
> actually find any documentation on it (actual documentation on the
> ioctl - not the fstrim manpage/etc).  A quick glimpse at the ext4
> source suggests that ext4 has a flag that can track whether a group of
> blocks has been trimmed yet or not since it was last deallocated.  So
> ext4 will make repeated fstrim runs a no-op and the drive won't see
> these.
> 
> At least, that was what I got after about 5-10min of browsing.  I
> didn't take the time to grok how ext4 tracks free space and so on.
> 
> Incidentally, in the other thread the reason that dry-run didn't
> report anything to be trimmed is that this is hard-coded:
> printf(_("%s: 0 B (dry run) trimmed on %s\n"), path, devname);
> https://github.com/karelzak/util-linux/blob/master/sys-utils/fstrim.c#L109
> 
> Otherwise the ioctl returns how much space was trimmed, and fstrim outputs 
> this.
> 
> -- 
> Rich
> 

Hi Rich,

thank you very much for digging into the depths of the sources and for
explaining it!!! Very much appreciated!!! :) :)

To implement a dry run with a printf() is new to me... ;)

Cheers!
Meino






Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-26 Thread Rich Freeman
On Sun, Apr 26, 2020 at 12:15 PM  wrote:
>
> On 04/26 11:20, Rich Freeman wrote:
> > On Sun, Apr 26, 2020 at 10:52 AM  wrote:
> > >
> > > Fstrim reports about 200 GiB of trimmed data.
> > >
> >
> > My suggestion would be to run fstrim twice in a row and see how fast
> > it operates and what the results are.  If the second one completes
> > very quickly that suggests that the drive is sane.  I'd probably just
> > run it daily in that case, but weekly is probably fine especially if
> > the drive isn't very full.
> >
>
> host:/root>fstrim -v /
> /: 3.3 GiB (3578650624 bytes) trimmed
> host:/root>fstrim -v /
> /: 0 B (0 bytes) trimmed
>
> This time the first fstrim reports a small amount of trimmed
> data and the second one no fstrimmed data at all.
>

Ok, I became a bit less lazy and started looking at the source.

All fstrim does is send an FITRIM ioctl to the kernel for the device.
This is implemented in a filesystem-dependent manner, and I couldn't
actually find any documentation on it (actual documentation on the
ioctl - not the fstrim manpage/etc).  A quick glimpse at the ext4
source suggests that ext4 has a flag that can track whether a group of
blocks has been trimmed yet or not since it was last deallocated.  So
ext4 will make repeated fstrim runs a no-op and the drive won't see
these.

At least, that was what I got after about 5-10min of browsing.  I
didn't take the time to grok how ext4 tracks free space and so on.
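
In other words, something like this sketch (invented names, not the
actual ext4 code - and the flag appears to live in memory only, which
would also fit the reboot observation elsewhere in this thread):

struct group_info {
    unsigned long free_blocks;
    int was_trimmed;             /* in-memory bit, lost on unmount/reboot */
};

/* FITRIM path, called once per block group. */
static unsigned long trim_group(struct group_info *grp)
{
    if (grp->was_trimmed)
        return 0;                /* repeated fstrim runs become a no-op */
    /* ... issue discards for every free extent in this group ... */
    grp->was_trimmed = 1;
    return grp->free_blocks;     /* counted into fstrim's byte total */
}

/* Block-freeing path: the group becomes trimmable again. */
static void blocks_freed(struct group_info *grp, unsigned long n)
{
    grp->free_blocks += n;
    grp->was_trimmed = 0;
}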

Incidentally, in the other thread the reason that dry-run didn't
report anything to be trimmed is that this is hard-coded:
printf(_("%s: 0 B (dry run) trimmed on %s\n"), path, devname);
https://github.com/karelzak/util-linux/blob/master/sys-utils/fstrim.c#L109

Otherwise the ioctl returns how much space was trimmed, and fstrim outputs this.

-- 
Rich



Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-26 Thread tuxic
On 04/26 11:20, Rich Freeman wrote:
> On Sun, Apr 26, 2020 at 10:52 AM  wrote:
> >
> > Fstrim reports about 200 GiB of trimmed data.
> >
> > From the gut this looks like quite a lot -- the whole
> > partition is 256 GB in size.
> >
> > Smartctl report for the drive:
> > Data Units Written: 700,841 [358 GB]
> >
> > Each week 200 GiB of fstrimmed data for a partition
> > 256 GB in size, and since the beginning I have written
> > only 358 GB to it.
> >
> > How does this all fit together?
> 
> It doesn't fit together, because the amount of space trimmed has
> nothing to do with the amount of data written.
> 
> How much free space is there?  I would think that fstrim would just
> trim all unused blocks on the filesystem.  Unless it maintained state
> it would have no idea what has changed since the last time it was run,
> so if you ran it 10 times in a row it would trim 200GiB each time.
> 
> Unless your NVMe is brain-dead the only real downside to running it
> more often is the IO.  If you trim 200GiB of data 100x in a row the
> 99x after the first one should all be no-ops if the drive is
> well-designed.  An fstrim should just be a metadata operation.
> 
> Now, not all flash storage is equally well-implemented, and I suspect
> the guidelines to avoid running it often or using discard settings are
> from those who either have really cheap drives, or ones from a long
> time ago.  A lot of linux advice tends to be based on what people did
> 10+years ago, and a lot of linux design decisions get made to
> accommodate the guy who wants everything to work fine on his 386+ISA
> and SGI Indigo in his basement.
> 
> My suggestion would be to run fstrim twice in a row and see how fast
> it operates and what the results are.  If the second one completes
> very quickly that suggests that the drive is sane.  I'd probably just
> run it daily in that case, but weekly is probably fine especially if
> the drive isn't very full.
> 
> -- 
> Rich
> 

Hi Rich, 

thanks for the explanation.

My observations do not fit with your explanation, though.

Early in the morning I did an fstrim, which resulted in the
200 GiB of trimmed data.

Based on your posting I now ran fstrim twice with no
wait in between:

host:/root>fstrim -v /
/: 3.3 GiB (3578650624 bytes) trimmed
host:/root>fstrim -v /
/: 0 B (0 bytes) trimmed

This time the first fstrim reports a small amount of trimmed
data and the second one no fstrimmed data at all.

The SSD is an ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4
M.2 2280 Solid State Drive (rev 03)
(cut'n'paste from `lspci`)

host:/root>df -h /
Filesystem  Size  Used Avail Use% Mounted on
/dev/root   246G   45G  189G  20% /


Cheers!
Meino






Re: [gentoo-user] "Amount" of fstrim? (curiosity driven, no paranoia :)

2020-04-26 Thread Rich Freeman
On Sun, Apr 26, 2020 at 10:52 AM  wrote:
>
> Fstrim reports about 200 GiB of trimmed data.
>
> From the gut this looks like quite a lot -- the whole
> partition is 256 GB in size.
>
> Smartctl report for the drive:
> Data Units Written: 700,841 [358 GB]
>
> Each week 200 GiB of fstrimmed data for a partition
> 256 GB in size, and since the beginning I have written
> only 358 GB to it.
>
> How does this all fit together?

It doesn't fit together, because the amount of space trimmed has
nothing to do with the amount of data written.

How much free space is there?  I would think that fstrim would just
trim all unused blocks on the filesystem.  Unless it maintained state
it would have no idea what has changed since the last time it was run,
so if you ran it 10 times in a row it would trim 200GiB each time.

Unless your NVMe is brain-dead the only real downside to running it
more often is the IO.  If you trim 200GiB of data 100x in a row the
99x after the first one should all be no-ops if the drive is
well-designed.  An fstrim should just be a metadata operation.

Now, not all flash storage is equally well-implemented, and I suspect
the guidelines to avoid running it often or using discard settings are
from those who either have really cheap drives, or ones from a long
time ago.  A lot of Linux advice tends to be based on what people did
10+ years ago, and a lot of Linux design decisions get made to
accommodate the guy who wants everything to work fine on his 386+ISA
and SGI Indigo in his basement.

My suggestion would be to run fstrim twice in a row and see how fast
it operates and what the results are.  If the second one completes
very quickly that suggests that the drive is sane.  I'd probably just
run it daily in that case, but weekly is probably fine especially if
the drive isn't very full.

-- 
Rich