>Hi, 
>
>At 2024-09-22 00:12:01, "Kent Overstreet" <[email protected]> wrote:
>>On Sun, Sep 22, 2024 at 12:02:07AM GMT, David Wang wrote:
>>> Hi, 
>>> 
>>> At 2024-09-09 21:37:35, "Kent Overstreet" <[email protected]> wrote:
>>> >On Sat, Sep 07, 2024 at 06:34:37PM GMT, David Wang wrote:
>>> 
>>> >
>>> >Big standard deviation (high tail latency?) is something we'd want to
>>> >track down. There's a bunch of time_stats in sysfs, but they're mostly
>>> >for the write paths. If you're trying to identify where the latencies
>>> >are coming from, we can look at adding some new time stats to isolate.
>>> 
>>> About performance, I have a theory based on some observation I made 
>>> recently:
>>> When user space app make a 4k(8 sectors) direct write, 
>>> bcachefs would initiate a write request of ~11 sectors, including the 
>>> checksum data, right?
>>> This may not be a good offset+size pattern of block layer for performance.  
>>> (I did get a very-very bad performance on ext4 if write with 5K size.)
>>
>>The checksum isn't inline with the data, it's stored with the pointer -
>>so if you're seeing 11 sector writes, something really odd is going
>>on...
>>
>
>.... This really contradicts my observations:
>1. fio stats yield an average of 50K IOPS for a 400-second random direct write
>test.
>2. From /proc/diskstats, the average "Field 5 -- # of writes completed" per
>second is also 50K.
>(Here I conclude the performance issue is not caused by extra IOPS for
>checksums.)
>3. From "Field 10 -- # of milliseconds spent doing I/Os", the average disk
>"busy" time per second is ~0.9 seconds, similar to the result of the ext4 test.
>(Here I conclude the performance issue is not caused by failing to keep the
>disk busy.)
>4. delta(Field 7 -- # of sectors written) / delta(Field 5 -- # of writes
>completed) over a 5-minute interval is 11 sectors/write.
>(This is why I drew the theory that the checksum is stored with the raw
>data... I thought it was reasonable...)
>
>I will make some debug code to collect sector number patterns.
>
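For reference, the per-interval averages in point 4 above can be reproduced from two /proc/diskstats snapshots. A minimal sketch (the snapshot strings and device name in the test are made up; field numbering follows the kernel's iostats documentation):

```python
# Sketch: compute the average sectors per completed write from two
# /proc/diskstats snapshots taken some interval apart.
# After the major/minor/name columns, Field 5 is "writes completed"
# and Field 7 is "sectors written" (per Documentation/admin-guide/iostats.rst).

def disk_fields(snapshot: str, dev: str):
    """Return the numeric stat fields for one device from a diskstats dump."""
    for line in snapshot.splitlines():
        cols = line.split()
        if len(cols) > 2 and cols[2] == dev:
            return [int(x) for x in cols[3:]]
    raise KeyError(dev)

def avg_sectors_per_write(snap_before: str, snap_after: str, dev: str) -> float:
    before = disk_fields(snap_before, dev)
    after = disk_fields(snap_after, dev)
    writes = after[4] - before[4]    # Field 5: writes completed (0-based index 4)
    sectors = after[6] - before[6]   # Field 7: sectors written (0-based index 6)
    return sectors / writes
```

With 50K writes/s and ~11 sectors/write, this is the same ratio observed above; it measures the fleet average over all bios, not the size of any individual write.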

I collected sector numbers at the beginning of submit_bio in block/blk-core.c.
It turns out my guess was totally wrong: the user data is a clean 8 sectors,
and the ~11 sectors I observed was just the average sectors per write. Sorry, I
assumed too much; I thought each user write would be accompanied by a
checksum write...
During a stress direct-4K-write test, the top-20 write sector-count pattern is:
        +---------+------------+
        | sectors | percentage |
        +---------+------------+
        |    8    |  97.637%   |
        |    1    |   0.813%   |   
        |   510   |   0.315%   |  <== large <--journal_write_submit
        |    4    |   0.123%   |
        |    3    |   0.118%   |
        |    2    |   0.117%   |
        |   508   |   0.113%   |  <==
        |   509   |   0.094%   |  <==
        |    5    |   0.075%   |
        |    6    |   0.037%   |
        |   507   |   0.032%   |  <==
        |    14   |   0.024%   |
        |    13   |   0.020%   |
        |    11   |   0.020%   |
        |    15   |   0.020%   |
        |    10   |   0.020%   |
        |    16   |   0.018%   |
        |    12   |   0.018%   |
        |    7    |   0.017%   |
        |    20   |   0.017%   |
        +---------+------------+
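A tiny sketch of how a raw list of per-bio sector counts, as collected in submit_bio, could be reduced to a table like the one above (the sample data below is made up, not the measured distribution):

```python
from collections import Counter

def sector_histogram(sector_counts, top_n=20):
    """Reduce raw per-bio sector counts to (sectors, percentage) rows,
    sorted by frequency, like the table above."""
    counts = Counter(sector_counts)
    total = len(sector_counts)
    return [(sectors, 100.0 * n / total)
            for sectors, n in counts.most_common(top_n)]

# Example with made-up data: mostly clean 8-sector user writes plus
# a couple of large journal writes and one tiny write.
sample = [8] * 97 + [510] * 2 + [1]
for sectors, pct in sector_histogram(sample):
    print(f"{sectors:>7} | {pct:6.3f}%")
```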

The btree_io write pattern, collected from btree_node_write_endio,
is roughly uniformly/flatly distributed, not on block-friendly size
boundaries (I think):
        +---------+------------+
        | sectors | percentage |
        +---------+------------+
        |    1    |   9.021%   |
        |    3    |   1.440%   |
        |    4    |   1.249%   |
        |    2    |   1.157%   |
        |    5    |   0.804%   |
        |    6    |   0.409%   |
        |    14   |   0.259%   |
        |    15   |   0.253%   |
        |    16   |   0.228%   |
        |    7    |   0.226%   |
        |    11   |   0.223%   |
        |    10   |   0.223%   |
        |    13   |   0.222%   |
        |    9    |   0.213%   |
        |    12   |   0.202%   |
        |    41   |   0.194%   |
        |    17   |   0.183%   |
        |    8    |   0.182%   |
        |    18   |   0.167%   |
        |    20   |   0.167%   |
        |    19   |   0.163%   |
        |    21   |   0.160%   |
        |   205   |   0.158%   |
        |    22   |   0.145%   |
        |    23   |   0.117%   |
        |    24   |   0.093%   |
        |    51   |   0.089%   |
        |    25   |   0.080%   |
        |   204   |   0.079%   |
        +---------+------------+


Now it seems that journal_io's big chunks of IO and btree_io's irregular IO
sizes would be the main factors halving direct-4K-write user-IO bandwidth
compared with ext4.


Maybe btree_io's irregular IO sizes could be regularized?

> 
>
>
>>I would suggest doing some testing with data checksums off first, to
>>isolate the issue; then it sounds like that IO pattern needs to be
>>looked at.
>
>I will try it. 

I formatted the partition with
`sudo bcachefs format --metadata_checksum=none --data_checksum=none
/dev/nvme0n1p1`

It doesn't help write performance significantly:
"IOPS=53.3k, BW=208MiB/s" --> "IOPS=55.3k, BW=216MiB/s",
and the btree writes' irregular IO size pattern still shows up.

But it does help improve direct-4K-read performance significantly; I guess
that would be expected, considering no extra data needs to be fetched for
each read.

> 
>>
>>Check the extents btree in debugfs as well, to make sure the extents are
>>getting written out as you think they are.


Thanks
David
