On Friday, 9 January 2015, 16:52:59, David Sterba wrote:
> On Thu, Jan 08, 2015 at 02:30:36PM +0100, Lennart Poettering wrote:
> > On Wed, 07.01.15 15:10, Josef Bacik (jba...@fb.com) wrote:
> > > On 01/07/2015 12:43 PM, Lennart Poettering wrote:
> > > > Currently, systemd-journald's disk access patterns (appending to the
> > > > end of files, then updating a few pointers in the front) result in
> > > > awfully fragmented journal files on btrfs, which has a pretty
> > > > negative effect on performance when accessing them.
> > > 
> > > I've been wondering if mount -o autodefrag would deal with this problem
> > > but I haven't had the chance to look into it.
> > 
> > Hmm, I am kinda interested in a solution that I can just implement in
> > systemd/journald now and that will then just make things work for
> > people suffering from the problem. I mean, I can hardly make systemd
> > patch the mount options of btrfs just because I place a journal file
> > on some fs...
> > 
> > Is "autodefrag" supposed to become a default one day?
> 
> Maybe. The option brings a performance hit because reading a block
> that is out of sequential order with its neighbors will also require
> reading the neighbors. Then the group (like 8 blocks) will be written
> sequentially to a new location.
> 
> It's increased read latency in the fragmented case and more stress on
> the block allocator. Practically it's not that bad for general use,
> e.g. a root partition, but for now it's still the user's decision
> whether to use it or not.
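For illustration, the access pattern Lennart describes boils down to
roughly the following. This is only a minimal sketch with a made-up file
name, record size and loop count, not journald's actual code: every
iteration appends a record at the end of the file and then rewrites a
small header block at offset 0, and on a copy-on-write filesystem such
in-place overwrites tend to land in a new location each time (easy to
check afterwards with filefrag on the test file).

/* journal_pattern.c – hypothetical name, minimal sketch only */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        char header[512];
        char record[4096];
        off_t end;
        int fd, i;

        memset(header, 0, sizeof(header));
        memset(record, 'x', sizeof(record));

        fd = open("testfile", O_CREAT | O_RDWR, 0644);
        if (fd < 0)
                return 1;

        /* reserve the header area at the front of the file */
        if (pwrite(fd, header, sizeof(header), 0) != (ssize_t)sizeof(header))
                return 1;
        end = sizeof(header);

        for (i = 0; i < 1000; i++) {
                /* 1. append a record at the current end of the file */
                if (pwrite(fd, record, sizeof(record), end) != (ssize_t)sizeof(record))
                        return 1;
                end += sizeof(record);

                /* 2. update the "pointers" in the header at offset 0 –
                 *    a small in-place overwrite that a CoW filesystem
                 *    writes out to a new location */
                memcpy(header, &end, sizeof(end));
                if (pwrite(fd, header, sizeof(header), 0) != (ssize_t)sizeof(header))
                        return 1;

                /* flush now and then so the copies actually reach disk */
                if (i % 16 == 0)
                        fsync(fd);
        }

        close(fd);
        return 0;
}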
I am concerned about flash-based storage, which probably does not need
autodefrag, and about the additional writes autodefrag causes. I am also
concerned about free space fragmentation due to regular defragmenting. I
read on the XFS mailing list more than once not to run xfs_fsr, the XFS
online defrag tool, regularly from a cron job, as it can make free space
fragmentation worse. And given the issues BTRFS still has with free space
handling (see the thread I started about it and kernel bug report 90401),
I am wary of anything that could add more free space fragmentation by
default, especially when it is not needed, as on an SSD.

I have:

merkaba:/home/martin/.local/share/akonadi/db_data/akonadi> filefrag parttable.ibd
parttable.ibd: 8039 extents found

And I had this file at up to 40000 extents already. I tried defragmenting
it manually with various options to see whether it makes any difference:
none. Same with KDE's desktop search database. On my dual SSD BTRFS RAID 1
setup the number of extents simply does not seem to matter at all, except
for journalctl, where I saw some noticeable delays when first calling it.
But right now even there it is on the one hand just about one second –
which on the other hand I still consider a lot, given that this is an SSD
RAID 1.

But heck, the fragmentation of some of the files in there is abysmal
considering their small size:

merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> filefrag *
system@00050bbcaeb23ff2-c7230ef5d29df634.journal~: 2030 extents found
system@00050be4b7106b25-a4ab21cd18c0424c.journal~: 1859 extents found
system@00050bf84d2efb2c-1e4e85dacaf1252c.journal~: 1803 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-0000000000000001-00050bf84d2ae7be.journal: 1076 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b22f7-00050bfb82b379f8.journal: 84 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b22fb-00050bfb8657c8b0.journal: 1036 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b2693-00050c0d8075ea4b.journal: 1478 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b4136-00050c3782b1c527.journal: 2 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b4137-00050c378666837a.journal: 142 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b414c-00050c37c7883228.journal: 574 extents found
system@5ee315765b1a4c6d9ed2fe833dec7094-0000000000010fdd-00050b56fa20f846.journal: 2309 extents found
system.journal: 783 extents found
user-1000@cc345f87cb404df6a9588b0b1c707007-0000000000011061-00050b56fa223006.journal: 340 extents found
user-1000@cc345f87cb404df6a9588b0b1c707007-00000000001ad624-00050ba77c734a3b.journal: 564 extents found
user-1000@cc345f87cb404df6a9588b0b1c707007-00000000001b297c-00050c0d8077447c.journal: 105 extents found
user-1000.journal: 133 extents found
user-120.journal: 5 extents found
user-2012.journal: 2 extents found
user-65534.journal: 222 extents found

merkaba:/var/log/journal/1354039e4d4bb8de4f97ac8400000004> du -sh * | cut -c1-72
16M   system@00050bbcaeb23ff2-c7230ef5d29df634.journal~
16M   system@00050be4b7106b25-a4ab21cd18c0424c.journal~
16M   system@00050bf84d2efb2c-1e4e85dacaf1252c.journal~
8,0M  system@2f7df24c6b70488fa9724b00ab6e6043-0000000000000001-00050bf84d
8,0M  system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b22f7-00050bfb82
8,0M  system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b22fb-00050bfb86
8,0M  system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b2693-00050c0d80
8,0M  system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b4136-00050c3782
8,0M  system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b4137-00050c3786
8,0M  system@2f7df24c6b70488fa9724b00ab6e6043-00000000001b414c-00050c37c7
16M   system@5ee315765b1a4c6d9ed2fe833dec7094-0000000000010fdd-00050b56fa2
8,0M  system.journal
8,0M  user-1000@cc345f87cb404df6a9588b0b1c707007-0000000000011061-00050b5
8,0M  user-1000@cc345f87cb404df6a9588b0b1c707007-00000000001ad624-00050ba
8,0M  user-1000@cc345f87cb404df6a9588b0b1c707007-00000000001b297c-00050c0
8,0M  user-1000.journal
3,6M  user-120.journal
8,0M  user-2012.journal
8,0M  user-65534.journal

Especially when I compare that to rsyslog:

merkaba:/var/log> filefrag messages syslog kern.log
messages: 24 extents found
syslog: 3 extents found
kern.log: 31 extents found

merkaba:/var/log> filefrag messages.1 syslog.1 kern.log.1
messages.1: 67 extents found
syslog.1: 20 extents found
kern.log.1: 78 extents found

When I see this, I wonder whether it would make sense to use two files:

1. One for the sequential appending case.
2. Another one for the pointers, which could even be rewritten from
   scratch each time.
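Roughly what I have in mind – again only a minimal sketch to illustrate
the idea, with made-up file names (journal.data, journal.ptrs) and record
sizes, not a claim about how journald should actually implement it: the
data file only ever gets sequential appends, and the small pointer file
is rewritten from scratch into a temporary file and renamed into place,
instead of being overwritten in place inside one big file.

/* twofile_sketch.c – hypothetical name, minimal sketch only */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* append one record to the data file and return the offset it went to */
static off_t append_record(int data_fd, const void *buf, size_t len)
{
        off_t off = lseek(data_fd, 0, SEEK_END);

        if (off < 0 || write(data_fd, buf, len) != (ssize_t)len)
                return -1;
        return off;
}

/* rewrite the small pointer file from scratch and rename it into place */
static int rewrite_pointers(const off_t *offsets, size_t n)
{
        size_t len = n * sizeof(*offsets);
        int fd = open("journal.ptrs.tmp", O_CREAT | O_TRUNC | O_WRONLY, 0644);

        if (fd < 0)
                return -1;
        if (write(fd, offsets, len) != (ssize_t)len || fsync(fd) < 0) {
                close(fd);
                return -1;
        }
        close(fd);
        return rename("journal.ptrs.tmp", "journal.ptrs");
}

int main(void)
{
        off_t offsets[100];
        char record[256];
        int data_fd;
        size_t i;

        memset(record, 'x', sizeof(record));

        data_fd = open("journal.data", O_CREAT | O_WRONLY | O_APPEND, 0644);
        if (data_fd < 0)
                return 1;

        for (i = 0; i < 100; i++) {
                /* sequential appends only – CoW-friendly */
                offsets[i] = append_record(data_fd, record, sizeof(record));
                if (offsets[i] < 0)
                        return 1;
                /* replace the pointer file wholesale instead of
                 * overwriting a few blocks inside a big file */
                if (rewrite_pointers(offsets, i + 1) < 0)
                        return 1;
        }

        close(data_fd);
        return 0;
}

Whether that would actually help much on btrfs I do not know – the
pointer file still gets new extents on every rewrite, but it stays small,
while the big data file stays sequential.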
On the one hand, one can claim: filesystems that are not copy-on-write
cope well with that kind of random I/O workload inside a file, so BTRFS
will have to cope well with it too. On the other hand, the way systemd
writes its log files obviously did not take the copy-on-write nature of
BTRFS into account. But MySQL, PostgreSQL and others do not do so either.
So, to never break userspace, BTRFS would have to adapt. Then again, I
think it may be easier to adapt the applications, and I wonder how a
database that is specifically designed for copy-on-write semantics would
perform.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7