Re: Large block sizes support in Linux

Tomas Vondra Fri, 22 Mar 2024 14:31:28 -0700

On 3/22/24 19:46, Bruce Momjian wrote:
> On Thu, Mar 21, 2024 at 06:46:19PM +0100, Pankaj Raghav (Samsung) wrote:
>> Hello, 
>>
>> My team and I have been working on adding Large block size(LBS)
>> support to XFS in Linux[1]. Once this feature lands upstream, we will be
>> able to create XFS with FS block size > page size of the system on Linux.
>> We also gave a talk about it in Linux Plumbers conference recently[2]
>> for more context. The initial support is only for XFS but more FSs will
>> follow later.
>>
>> On an x86_64 system, fs block size was limited to 4k, but traditionally
>> Postgres uses 8k as its default internal page size. With LBS support,
>> fs block size can be set to 8K, thereby matching the Postgres page size.
>>
>> If the file system block size == DB page size, then Postgres can have
>> guarantees that a single DB page will be written as a single unit during
>> kernel write back and not split.
>>
>> My knowledge of Postgres internals is limited, so I'm wondering if there
>> are any optimizations or potential optimizations that Postgres could
>> leverage once we have LBS support on Linux?
> 
> We have discussed this in the past, and in fact in the early years we
> thought we didn't need fsync since the BSD file system was 8k at the
> time.
> 
> What we later realized is that we have no guarantee that the file system
> will write to the device in the specified block size, and even if it
> does, the I/O layers between the OS and the device might not, since many
> devices use 512 byte blocks or other sizes.
> 

Right, but things change over time - current storage devices support
much larger sectors (LBA format), usually 4K. And if you do I/O with
this size, it's usually atomic.

AFAIK if you built Postgres with 4K pages, on a device with 4K LBA
format, that would not need full-page writes - we always do I/O in 4k
pages, and block layer does I/O (during writeback from page cache) with
minimum guaranteed size = logical block size. 4K pages are great for OLTP
systems in general, and it'd be even better if we didn't need to worry
about torn pages (but the tricky part is to be confident it's safe to
disable full-page writes on a particular system).
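
As a rough illustration of the first half of that check, here is a minimal
Python sketch that reads the logical block size Linux reports for a device
in sysfs and compares it with the database page size. The function names
and the sysfs_root parameter are made up for this example, and the device
name is whatever the caller passes in. A match only says the sizes line
up; it does not by itself prove that writes are atomic.

```python
from pathlib import Path


def read_queue_limit(device, name, sysfs_root="/sys/block"):
    """Read an integer queue attribute (e.g. logical_block_size) for a device.

    `sysfs_root` is a hypothetical parameter added so the function can be
    pointed at a test directory instead of the real sysfs tree.
    """
    path = Path(sysfs_root) / device / "queue" / name
    return int(path.read_text().strip())


def page_matches_logical_block(device, page_size=4096, sysfs_root="/sys/block"):
    """Return True if the DB page size equals the device's logical block size.

    This is a necessary-looking condition only, not a guarantee of atomicity.
    """
    logical = read_queue_limit(device, "logical_block_size", sysfs_root)
    return page_size == logical
```

For example, page_matches_logical_block("nvme0n1", 4096) would read
/sys/block/nvme0n1/queue/logical_block_size, assuming such a device exists
on the machine.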

I did watch the talk linked by Pankaj, and IIUC the promise of the LBS
patches is that this benefit would extend even to larger page sizes
(= fs page size). Which right now you can't even mount, but
the patches allow that. So for example it would be possible to create an
XFS filesystem with 8kB pages, and then we'd read/write 8kB pages as
usual, and we'd know that the page cache always writes out either the
whole page or none of it. Which right now is not guaranteed to happen,
it's possible to e.g. write the page as two 4K requests, even if all
other things are set properly (drive has 4K logical/physical sectors).

At least that's my understanding ...

Pankaj, could you clarify what the guarantees provided by LBS are going
to be? The talk uses wording like "should be" and "hint" in a couple of
places, and there's also stuff I'm not 100% familiar with.

If we create a filesystem with 8K blocks, and we only ever do writes
(and reads) in 8K chunks (our default page size), what guarantees does
that give us? And if the underlying device has an LBA format with only 4K
(or perhaps even just 512B) sectors, how would that affect the guarantees?

The other thing is - is there a reliable way to say when the guarantees
actually apply? I mean, how would the administrator *know* it's safe to
set full_page_writes=off, or even better how could we verify this when
the database starts (and complain if it's not safe to disable FPW)?

It's easy to e.g. take a backup on one filesystem and restore it on
another one, and forget those may have different block sizes etc. I'm
not sure such a check is possible in a 100% reliable way (tablespaces?).
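
To make the startup-check idea a bit more concrete, here is a small Python
sketch that compares the filesystem block size of several directories
(say, the data directory plus each tablespace location) against the
database page size. The function name and the idea of passing a list of
directories are assumptions for illustration only, and the use of f_bsize
from statvfs() as "the filesystem block size" is also an assumption that
may not hold on every filesystem. A clean result would be a precondition
at best, not proof that disabling FPW is safe.

```python
import os


def mismatched_directories(directories, page_size=8192):
    """Return (directory, fs_block_size) pairs whose block size != page_size.

    Each directory is checked separately because tablespaces may live on
    different filesystems than the main data directory.
    """
    mismatches = []
    for directory in directories:
        # Assumption: f_bsize reflects the filesystem block size here.
        fs_block_size = os.statvfs(directory).f_bsize
        if fs_block_size != page_size:
            mismatches.append((directory, fs_block_size))
    return mismatches
```

A caller could refuse to run with full_page_writes=off (or at least warn)
whenever the returned list is not empty.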


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company