Re: [HACKERS] Bumping block size to 16K on FreeBSD...

David Schultz Fri, 29 Aug 2003 01:04:00 +0000

On Thu, Aug 28, 2003, Tom Lane wrote:
> Sean Chittenden <[EMAIL PROTECTED]> writes:
> > Are there any objections
> > to me increasing the block size for FreeBSD installations to 16K for
> > the upcoming 7.4 release?
> 
> I'm a little uncomfortable with introducing a cross-platform variation
> in the standard block size.  That would have implications for things
> like whether a table definition that works on FreeBSD could be expected
> to work elsewhere; to say nothing of recommendations for shared_buffer
> settings and suchlike.
> 
> Also, there is no infrastructure for adjusting BLCKSZ automatically at
> configure time, and I don't much want to add it.


On recent versions of FreeBSD (and Solaris too, I think), the
default UFS block size is 16K, and file fragments are 2K.  This
works great for many workloads, but it kills pgsql's random write
performance unless pgsql uses 16K blocks as well: each 8K write
covers only part of a 16K filesystem block, which forces a
read-modify-write.  Either the filesystem or the database
needs to be changed in order to get decent performance.  I have
not compared 16K UFS/16K pgsql to 8K UFS/8K pgsql, so I can't say
which option makes more sense, though.  There probably isn't
anything wrong with the pgsql default, except that it's set in
stone.
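
To make the read-modify-write point concrete, here is a rough
sketch in C that counts how many filesystem blocks a single write
covers only partially.  The function name is hypothetical, and it
ignores caching and writes past end-of-file, both of which can
spare the filesystem the extra read in practice.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: count the filesystem blocks that the write
 * [offset, offset + len) touches without covering completely.  Each
 * such block may have to be read before it can be written back.
 */
unsigned
partial_fs_blocks(uint64_t offset, uint64_t len, uint64_t fs_blksz)
{
    uint64_t end, first, last;
    unsigned partial = 0;

    assert(fs_blksz > 0);
    if (len == 0)
        return 0;

    end = offset + len;
    first = offset / fs_blksz;          /* first block touched */
    last = (end - 1) / fs_blksz;        /* last block touched */

    /* The write starts in the middle of a filesystem block. */
    if (offset % fs_blksz != 0)
        partial++;

    /*
     * The write ends in the middle of a filesystem block.  Do not
     * count it twice if it is the same block already counted above.
     */
    if (end % fs_blksz != 0 && !(first == last && partial == 1))
        partial++;

    return partial;
}
```

With 16K filesystem blocks, an 8K write at offset 0 or 8192 yields
1.  With 8K filesystem blocks, or with aligned 16K database blocks
on a 16K filesystem, it yields 0.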

It's entirely feasible for administrators to create 8K/1K UFS
filesystems specifically for pgsql, but they need to be aware of
the issue.  On the other hand, I don't see how it would be a bad
thing if pgsql were able to adapt at runtime either.  Thus, I've
come up with two possible fixes:

(1) Document the problem with having a filesystem block size
    larger than the database block size.  With a simple call to
    statvfs(2), the postmaster could warn about this on startup, too.

(2) Make BLCKSZ a runtime constant, stored as part of the database.
    Grepping through the source, I didn't see any places
    using BLCKSZ where efficiency appeared to be so critical that
    you had to have constant folding.  Of course, one could introduce
    a 'lg2blksz' constant to avoid divides and multiplies.

    This would NOT introduce cross-platform incompatibilities; at
    worst, a database moved to a different filesystem would run less
    efficiently.  The ability to adapt at database
    creation time is also useful in that it allows the database to
    be tuned to the characteristics of the particular device on
    which it resides.[1]
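
Both options can be sketched roughly in C.  This is not pgsql
code: BlockGeom and every function name below are hypothetical,
the 1K..32K bounds are only illustrative, and treating f_bsize as
the filesystem block size is an assumption (the exact meaning of
f_bsize versus f_frsize varies between systems).

```c
#include <assert.h>
#include <stdio.h>
#include <sys/statvfs.h>

/* Hypothetical per-database block geometry, loaded once at startup. */
typedef struct BlockGeom
{
    unsigned lg2blksz;      /* log2(block size): 13 for 8K, 14 for 16K */
} BlockGeom;

/*
 * Option (2): derive the shift count from a block size.  Returns 0 on
 * success, or -1 if blksz is not a power of two within the
 * illustrative range 1K..32K.
 */
int
blockgeom_init(BlockGeom *g, unsigned long blksz)
{
    unsigned lg2 = 0;

    if (blksz < 1024 || blksz > 32768 || (blksz & (blksz - 1)) != 0)
        return -1;
    while ((1UL << lg2) < blksz)
        lg2++;
    g->lg2blksz = lg2;
    return 0;
}

/* Block size in bytes, recovered from the shift count. */
unsigned long
blockgeom_size(const BlockGeom *g)
{
    return 1UL << g->lg2blksz;
}

/* Byte offset of block blkno: a shift instead of a multiply. */
unsigned long long
block_to_offset(const BlockGeom *g, unsigned long blkno)
{
    return (unsigned long long) blkno << g->lg2blksz;
}

/* Block number containing byte offset off: a shift instead of a divide. */
unsigned long
offset_to_block(const BlockGeom *g, unsigned long long off)
{
    return (unsigned long) (off >> g->lg2blksz);
}

/*
 * Option (1): warn if the filesystem holding datadir uses blocks
 * larger than the database block.  Returns 1 if a warning was
 * printed, 0 if the sizes are fine, -1 if statvfs() failed.
 */
int
check_fs_block_size(const char *datadir, const BlockGeom *g)
{
    struct statvfs sv;

    assert(datadir != NULL && g != NULL);
    if (statvfs(datadir, &sv) != 0)
        return -1;

    if ((unsigned long) sv.f_bsize > blockgeom_size(g))
    {
        fprintf(stderr,
                "WARNING: filesystem block size (%lu) exceeds "
                "database block size (%lu)\n",
                (unsigned long) sv.f_bsize, blockgeom_size(g));
        return 1;
    }
    return 0;
}
```

A startup path would call blockgeom_init() with the block size
recorded for the database, then check_fs_block_size() on the data
directory, treating the result as advisory only.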

I don't know very much about pgsql, so corrections and feedback
regarding these ideas would be appreciated.


[1] Right now, the seek time to transfer time ratio of the drive
    is mostly hidden by the operating system's clustering and
    read-ahead.  I tried modifying pgsql to use direct I/O, but
    it seems that pgsql doesn't do its own clustering or read-ahead,
    so that was a lose...

