Mark Mielke wrote:
> PostgreSQL or the kernel should already have the hottest pages in memory, so the value of doing async I/O is very likely the cooler pages that are unique to the query. We don't know what the cooler pages are until we follow the tree down.

I'm assuming that at the time we start to search the index, we have some idea of the value or values we are looking for. Or, as you say, we are applying a function to 'all of it'.

Think of a 'between' query. The subset of the index that can match is bounded by the leaf pages that contain the end point(s). Similarly, if we are merging with a sorted intermediate set from a prior step, then we also have bounds on the values.

I'm not convinced that your assertion that the index leaf pages must necessarily be processed in order is true - it depends on what sort of operation is under way. I am assuming that we try hard to keep interior index nodes and data in memory and that, having identified the subset of these that we want, we can immediately infer the set of leaves that are potentially of interest - see the sketch below.
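Something along these lines is what I have in mind - purely a hypothetical sketch, not PostgreSQL's actual nbtree code; descend_to_leaf() and leaf_next() are made-up helpers standing in for a descent through cached interior pages and the leaf right-sibling link:

#include <stddef.h>

typedef unsigned BlockNumber;

/* Assumed helpers: descend_to_leaf() walks cached interior pages only;
 * leaf_next() follows the right-sibling pointer stored in each leaf. */
BlockNumber descend_to_leaf(int key);
BlockNumber leaf_next(BlockNumber blk);

/* Collect every leaf block that can hold keys in [lo_key, hi_key].
 * The caller is then free to request all of them from the I/O layer
 * at once, in whatever order the hardware prefers. */
size_t collect_leaf_range(int lo_key, int hi_key,
                          BlockNumber *out, size_t max)
{
    BlockNumber first = descend_to_leaf(lo_key);
    BlockNumber last  = descend_to_leaf(hi_key);
    size_t n = 0;

    for (BlockNumber b = first; n < max; b = leaf_next(b)) {
        out[n++] = b;              /* fetch order is unconstrained */
        if (b == last)
            break;
    }
    return n;
}
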
> The difference between preload and handling async I/O in terms of performance is debatable. Greg reports that async I/O on Linux sucks, but posix_fadvise*() has substantial benefits. posix_fadvise*() is preload, not async I/O (he also reported that async I/O on Solaris seems to work well). Being able to do work as the first page is available is a micro-optimization as far as I am concerned at this point (one that may not yet work on Linux), as the real benefit comes from utilizing all 12 disks in Matthew's case, not from guaranteeing that data is processed as soon as possible.

I see it as part of the same problem. You can partition the data across all the disks and run queries in parallel against the partitions, or you can lay the data out across the RAID array, in which case the optimiser has very little idea how the data will map to the physical layout - so its best bet is to let the systems that DO know decide the access strategy. And those systems can only do that if you give them a lot of requests that CAN be reordered, so they can choose a good ordering - see the sketch below.
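posix_fadvise() is exactly that kind of interface: you can hand the kernel a whole batch of hints before reading anything, and it is free to schedule the fetches however it likes. A minimal sketch, assuming an 8192-byte page (PostgreSQL's default BLCKSZ) and block numbers produced by some prior index walk:

#define _XOPEN_SOURCE 600   /* for posix_fadvise() */
#include <fcntl.h>
#include <stddef.h>

#define BLCKSZ 8192

/* Issue WILLNEED hints for a batch of block numbers before reading
 * any of them; the kernel and the disk elevator can then reorder the
 * fetches to suit the physical layout. */
static void prefetch_blocks(int fd, const unsigned *blocks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        (void) posix_fadvise(fd,
                             (off_t) blocks[i] * BLCKSZ,
                             BLCKSZ,
                             POSIX_FADV_WILLNEED);
    /* Subsequent reads should mostly find the pages already in
     * flight or in the page cache. */
}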

> Micro-optimization.

Well, you like to assert this - but why? If the concern is latency (and my experience suggests that latency, not throughput per se, is the biggest issue in practice) then overlapping processing with the wait for 'distant' data is important - and we have no information about the physical layout of the data that lets us assert that forward, in-order pre-read of a file is the fastest way to access it. We have to allow the OS (and, to an increasing extent, the disks themselves) to manage the elevator I/O to best effect.

It's clear that the streaming read and write speeds of modern disks are very high compared to random access, so anything we can do to help the disks run in streaming mode is worthwhile, even if the physical streaming doesn't match any obvious logical ordering of the OS files or of the logical data pages within them. If you have a function to apply to a set of independent data elements, then requiring that the function be applied in a fixed order, rather than conceptually in parallel, puts a lot of constraint on how the hardware can optimise it.

Clearly a hint to preload is better than nothing. But it seems to me that in the worst case with preload we wait for the slowest page to load and only then start processing, hoping that the rest of the data has stayed in the buffer cache and is 'instant', whereas with AIO and evaluate-when-ready the process is still bound by the slowest data to arrive, but at that point there is little processing left to do, and the already-processed buffers can be reused earlier. Where there is significant pressure on the buffer cache, that matters - see the sketch below.
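To make the evaluate-when-ready idea concrete, here is a rough illustration using POSIX AIO (aio_read/aio_suspend; link with -lrt on Linux - and note Greg's caveat that the Linux implementation may perform poorly). The blocks[] array and process() callback are placeholders, and error handling is elided:

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define BLCKSZ 8192
#define NREQS  16

/* Issue all reads up front, then process buffers in completion order
 * rather than request order, so computation overlaps the remaining I/O. */
static void scan_blocks_aio(int fd, const unsigned blocks[NREQS],
                            void (*process)(const char *page))
{
    static char bufs[NREQS][BLCKSZ];
    struct aiocb cbs[NREQS];
    const struct aiocb *wait[NREQS];
    int done[NREQS] = {0};

    memset(cbs, 0, sizeof cbs);
    for (int i = 0; i < NREQS; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = BLCKSZ;
        cbs[i].aio_offset = (off_t) blocks[i] * BLCKSZ;
        (void) aio_read(&cbs[i]);       /* all requests in flight at once */
        wait[i] = &cbs[i];
    }

    for (int remaining = NREQS; remaining > 0; ) {
        (void) aio_suspend(wait, NREQS, NULL);  /* sleep until one finishes */
        for (int i = 0; i < NREQS; i++) {
            if (!done[i] && aio_error(&cbs[i]) != EINPROGRESS) {
                if (aio_return(&cbs[i]) == BLCKSZ)
                    process(bufs[i]);   /* work proceeds while I/O continues */
                done[i] = 1;
                wait[i] = NULL;         /* aio_suspend ignores NULL entries */
                remaining--;
            }
        }
    }
}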

Of course, a couple of decades bullying Sybase systems on Sun Enterprise boxes may have left me somewhat jaundiced - but Sybase can at least parallelise things. Sometimes. When it does, it's quite a big win.

> In your hand waving, you have assumed that the PostgreSQL B-Tree index might need to be replaced? :-)

Sure, if the intent is to make the system thread-hot or AIO-hot, then the change is potentially very invasive. A strategy for evaluating queries based on parallel execution and async I/O is not necessarily much like one that delegates everything to the OS buffer cache.

I'm not too bothered, for the purpose of this discussion, whether the way that postgres currently navigates indexes is amenable to this. This is bikeshed land, right?

I think it is foolish to disregard strategies that allow overlapping I/O and processing - we want to keep disks reading and writing rather than seeking. To me that suggests AIO and disk-native queuing are quite a big deal. And parallel evaluation will be too, as the number of cores goes up and there is an expectation that this should reduce the latency of an individual query, not just sustain throughput under lots of concurrent demand.

