Mark Mielke wrote:
> PostgreSQL or the kernel should already have the hottest pages in memory, so the value of doing async I/O is very likely the cooler pages that are unique to the query. We don't know what the cooler pages are until we follow the tree down.

I'm assuming that at the time we start to search the index, we have some idea of the value or values we are looking for. Or, as you say, we are applying a function to 'all of it'.

Think of a 'between' query. The subset of the index that can match is bounded by the leaf pages that contain the end point(s). Similarly, if we are merging with a sorted intermediate set from a prior step, then we also have bounds on the values.

I'm not convinced that your assertion that the index leaf pages must necessarily be processed in order is true - it depends on what sort of operation is under way. I am assuming that we try hard to keep interior index nodes and data in memory and that, having identified the subset of these that we want, we can immediately infer the set of leaves that are potentially of interest - see the sketch below.
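Something along these lines is what I have in mind - purely a hypothetical sketch, not PostgreSQL's actual nbtree code; descend_to_leaf() and leaf_next() are made-up helpers standing in for a descent through cached interior pages and the leaf right-sibling link:

#include <stddef.h>

typedef unsigned BlockNumber;

/* Assumed helpers: descend_to_leaf() walks cached interior pages only;
 * leaf_next() follows the right-sibling pointer stored in each leaf. */
BlockNumber descend_to_leaf(int key);
BlockNumber leaf_next(BlockNumber blk);

/* Collect every leaf block that can hold keys in [lo_key, hi_key].
 * The caller is then free to request all of them from the I/O layer
 * at once, in whatever order the hardware prefers. */
size_t collect_leaf_range(int lo_key, int hi_key,
                          BlockNumber *out, size_t max)
{
    BlockNumber first = descend_to_leaf(lo_key);
    BlockNumber last  = descend_to_leaf(hi_key);
    size_t n = 0;

    for (BlockNumber b = first; n < max; b = leaf_next(b)) {
        out[n++] = b;              /* fetch order is unconstrained */
        if (b == last)
            break;
    }
    return n;
}
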
> The difference between preload and handling async I/O in terms of performance is debatable. Greg reports that async I/O on Linux sucks, but posix_fadvise*() has substantial benefits. posix_fadvise*() is preload, not async I/O (he also reported that async I/O on Solaris seems to work well). Being able to do work as the first page is available is a micro-optimization as far as I am concerned at this point (one that may not yet work on Linux), as the real benefit comes from utilizing all 12 disks in Matthew's case, not from guaranteeing that data is processed as soon as possible.

I see it as part of the same problem. You can partition the data across all the disks and run queries in parallel against the partitions, or you can lay the data out across the RAID array, in which case the optimiser has very little idea how the data will map to the physical layout - so its best bet is to let the systems that DO know decide the access strategy. And those systems can only do that if you give them a lot of requests that CAN be reordered, so they can choose a good ordering - see the sketch below.
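posix_fadvise() is exactly that kind of interface: you can hand the kernel a whole batch of hints before reading anything, and it is free to schedule the fetches however it likes. A minimal sketch, assuming an 8192-byte page (PostgreSQL's default BLCKSZ) and block numbers produced by some prior index walk:

#define _XOPEN_SOURCE 600   /* for posix_fadvise() */
#include <fcntl.h>
#include <stddef.h>

#define BLCKSZ 8192

/* Issue WILLNEED hints for a batch of block numbers before reading
 * any of them; the kernel and the disk elevator can then reorder the
 * fetches to suit the physical layout. */
static void prefetch_blocks(int fd, const unsigned *blocks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        (void) posix_fadvise(fd,
                             (off_t) blocks[i] * BLCKSZ,
                             BLCKSZ,
                             POSIX_FADV_WILLNEED);
    /* Subsequent reads should mostly find the pages already in
     * flight or in the page cache. */
}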

> Micro-optimization.

Well, you like to assert this - but why? If the concern is latency (and my experience suggests that latency, not throughput per se, is the biggest issue in practice) then overlapping processing with the wait for 'distant' data is important - and we have no information about the physical layout of the data that lets us assert that forward, in-order pre-read of a file is the fastest way to access it. We have to allow the OS (and, to an increasing extent, the disks themselves) to manage the elevator I/O to best effect.

It's clear that the streaming read and write speeds of modern disks are very high compared to random access, so anything we can do to help the disks run in streaming mode is worthwhile, even if the physical streaming doesn't match any obvious logical ordering of the OS files or of the logical data pages within them. If you have a function to apply to a set of independent data elements, then requiring that the function be applied in a fixed order, rather than conceptually in parallel, puts a lot of constraint on how the hardware can optimise it.

Clearly a hint to preload is better than nothing. But it seems to me that in the worst case with preload we wait for the slowest page to load and only then start processing, hoping that the rest of the data has stayed in the buffer cache and is 'instant', whereas with AIO and evaluate-when-ready the process is still bound by the slowest data to arrive, but at that point there is little processing left to do, and the already-processed buffers can be reused earlier. Where there is significant pressure on the buffer cache, that matters - see the sketch below.
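To make the evaluate-when-ready idea concrete, here is a rough illustration using POSIX AIO (aio_read/aio_suspend; link with -lrt on Linux - and note Greg's caveat that the Linux implementation may perform poorly). The blocks[] array and process() callback are placeholders, and error handling is elided:

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define BLCKSZ 8192
#define NREQS  16

/* Issue all reads up front, then process buffers in completion order
 * rather than request order, so computation overlaps the remaining I/O. */
static void scan_blocks_aio(int fd, const unsigned blocks[NREQS],
                            void (*process)(const char *page))
{
    static char bufs[NREQS][BLCKSZ];
    struct aiocb cbs[NREQS];
    const struct aiocb *wait[NREQS];
    int done[NREQS] = {0};

    memset(cbs, 0, sizeof cbs);
    for (int i = 0; i < NREQS; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = BLCKSZ;
        cbs[i].aio_offset = (off_t) blocks[i] * BLCKSZ;
        (void) aio_read(&cbs[i]);       /* all requests in flight at once */
        wait[i] = &cbs[i];
    }

    for (int remaining = NREQS; remaining > 0; ) {
        (void) aio_suspend(wait, NREQS, NULL);  /* sleep until one finishes */
        for (int i = 0; i < NREQS; i++) {
            if (!done[i] && aio_error(&cbs[i]) != EINPROGRESS) {
                if (aio_return(&cbs[i]) == BLCKSZ)
                    process(bufs[i]);   /* work proceeds while I/O continues */
                done[i] = 1;
                wait[i] = NULL;         /* aio_suspend ignores NULL entries */
                remaining--;
            }
        }
    }
}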

Of course, a couple of decades bullying Sybase systems on Sun Enterprise boxes may have left me somewhat jaundiced - but Sybase can at least parallelise things. Sometimes. When it does, it's quite a big win.

> In your hand waving, you have assumed that the PostgreSQL B-Tree index might need to be replaced? :-)

Sure, if the intent is to make the system thread-hot or AIO-hot, then the change is potentially very invasive. A strategy for evaluating queries based on parallel execution and async I/O is not necessarily much like one that delegates everything to the OS buffer cache.

I'm not too bothered, for the purpose of this discussion, whether the way that postgres currently navigates indexes is amenable to this. This is bikeshed land, right?

I think it is foolish to disregard strategies that allow overlapping I/O and processing - we want to keep disks reading and writing rather than seeking. To me that suggests AIO and disk-native queuing are quite a big deal. And parallel evaluation will be too, as the number of cores goes up and there is an expectation that this should reduce the latency of an individual query, not just sustain throughput under lots of concurrent demand.

