Re: [HACKERS] [PERFORM] A Better External Sort?

Ron Peacetree Fri, 30 Sep 2005 16:44:52 -0700

25MBps should not be a CPU-bound limit for IO, nor should it be
an OS limit.  The limit should be roughly 100x (single-channel RAM)
to 200x (dual-channel RAM) higher than that.
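
As a rough worked version of that arithmetic (a sketch only; the helper
name is made up for illustration, and the multipliers are just the ~100x
and ~200x figures quoted above, not measurements):

```python
def scaled_rate_mbps(base_mbps: float, factor: float) -> float:
    """Return base_mbps multiplied by factor, still in MBps."""
    return base_mbps * factor


BASE_MBPS = 25  # the CPU-bound IO rate under discussion

# Multipliers are the rough estimates from the text, not benchmarks.
for label, factor in (("single channel RAM", 100), ("dual channel RAM", 200)):
    rate = scaled_rate_mbps(BASE_MBPS, factor)
    print(f"{label}: ~{rate:.0f} MBps (~{rate / 1000:.1f} GBps)")
```

In other words, the ceiling being argued for is on the order of a few
GBps rather than tens of MBps.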


For an IO rate of 25MBps to be pegging the CPU at 100%, the CPU
must be suffering some combination of
A= lots of cache misses ("cache thrash"),
B= lots of random rather than sequential IO (like pointer chasing)
C= lots of wasteful copying
D= lots of wasteful calculations

In fact, this is crappy enough performance that the whole IO layer
should be rethought and perhaps reimplemented from scratch.
Optimization of the present code is unlikely to yield a 100-200x
improvement.

On the HD side, the first thing that comes to mind is that DBs are
-NOT- like ordinary filesystems in a few ways:
1= the minimum HD IO is a record that is likely to be larger than
a HD sector.  Therefore, the FS we use should be laid out with
physical segments of max(HD sector size, record size)

2= DB files (tables) are usually considerably larger than any other
kind of files stored.  Therefore the FS we use should be laid out
using LARGE physical pages.  64KB-256KB at a _minimum_.

3= The whole "2GB striping" of files idea needs to be rethought.
Our tables are significantly different in internal structure from the
usual FS entity.

4= I'm sure we are paying all sorts of nasty overhead for essentially
emulating the pg "filesystem" inside another filesystem.  That means
~2x as much overhead to access a particular piece of data.   
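
To make points 1 and 2 concrete, here is a minimal sketch. The function
names and every size used (512-byte sector, 8KB record, 10GB table) are
assumptions chosen for illustration, not values taken from any real
filesystem or from PostgreSQL itself:

```python
KB = 1024
GB = 1024 ** 3


def segment_size(sector_bytes: int, record_bytes: int) -> int:
    """Point 1: physical segment = max(HD sector size, record size)."""
    return max(sector_bytes, record_bytes)


def pages_needed(table_bytes: int, page_bytes: int) -> int:
    """Point 2: number of physical pages needed to hold a table."""
    return -(-table_bytes // page_bytes)  # ceiling division


# Hypothetical sizes: 512-byte sector, 8KB record.
print("segment size:", segment_size(512, 8 * KB), "bytes")

# Hypothetical 10GB table at a small page size vs. the suggested range.
for page_kb in (8, 64, 256):
    print(f"{page_kb}KB pages:", pages_needed(10 * GB, page_kb * KB))
```

The larger the physical page, the fewer pages the FS has to track for
the same table, which is the motivation behind the 64KB-256KB figure.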

The simplest solution is for us to implement a new VFS-compatible
filesystem tuned to exactly our needs: pgfs.

We may be able to avoid that by some amount of hacking or
modifying of the current FSs we use, but I suspect it would be more
work for less ROI.

Ron 


-----Original Message-----
From: Josh Berkus <josh@agliodbs.com>
Sent: Sep 30, 2005 4:41 PM
To: Ron Peacetree <[EMAIL PROTECTED]>
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Ron,

> That 11MBps was your =bulk load= speed.  If just loading a table
> is this slow, then there are issues with basic physical IO, not just
> IO during sort operations.

Oh, yeah.  Well, that's separate from sort.  See multiple posts on this 
list from the GreenPlum team, the COPY patch for 8.1, etc.  We've been 
concerned about I/O for a while.  

Realistically, you can't do better than about 25MB/s with single-threaded 
I/O on current Linux machines, because your bottleneck isn't the actual 
disk I/O.   It's CPU.   Databases which "go faster" than this are all, to 
my knowledge, using multi-threaded disk I/O.

(and I'd be thrilled to get a consistent 25MB/s on PostgreSQL, but that's 
another thread ... )

> As I said, the obvious candidates are inefficient physical layout
> and/or flawed IO code.

Yeah, that's what I thought too.   But try sorting a 10GB table, and 
you'll see: disk I/O is practically idle, while CPU averages 90%+.   We're 
CPU-bound, because sort is being really inefficient about something. I 
just don't know what yet.
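
One rough way to check a "CPU-bound vs. I/O-bound" claim for any piece
of work is to compare the CPU time it consumed with the wall-clock time
it took. This is only a sketch: it assumes nothing about PostgreSQL
internals, and cpu_fraction, busy, and waiting are hypothetical names
used for illustration.

```python
import time


def cpu_fraction(work) -> float:
    """Run work() and return process CPU time divided by wall-clock time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def busy():
    """Toy CPU-bound workload: pure arithmetic, no waiting."""
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def waiting():
    """Toy stand-in for time spent blocked on a device."""
    time.sleep(0.2)


print(f"busy:    {cpu_fraction(busy):.2f}")
print(f"waiting: {cpu_fraction(waiting):.2f}")
```

A fraction near 1 means the work is limited by the CPU; a fraction near
0 means it spends most of its time waiting on something else.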

If we move that CPU-binding to a higher level of performance, then we can 
start looking at things like async I/O, O_DIRECT, pre-allocation etc. that 
will give us incremental improvements.   But what we need now is a 5-10x 
improvement and that's somewhere in the algorithms or the code.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


