On Tue, Jan 27, 2015 at 11:08 PM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 01/28/2015 04:16 AM, Robert Haas wrote:
>> On Tue, Jan 27, 2015 at 6:00 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> Now, when you did what I understand to be the same test on the same
>>> machine, you got times ranging from 9.1 seconds to 35.4 seconds.
>>> Clearly, there is some difference between our test setups. Moreover,
>>> I'm kind of suspicious about whether your results are actually
>>> physically possible. Even in the best case where you somehow had the
>>> maximum possible amount of data - 64 GB on a 64 GB machine - cached,
>>> leaving no space for cache duplication between PG and the OS and no
>>> space for the operating system or postgres itself - the table is 120
>>> GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB
>>> from disk in 9 seconds represents an I/O rate of >6 GB/s. I grant that
>>> there could be some speedup from issuing I/O requests in parallel
>>> instead of serially, but that is a 15x speedup over dd, so I am a
>>> little suspicious that there is some problem with the test setup,
>>> especially because I cannot reproduce the results.
>>
>> So I thought about this a little more, and I realized after some
>> poking around that hydra's disk subsystem is actually six disks
>> configured in a software RAID5[1]. So one advantage of the
>> chunk-by-chunk approach you are proposing is that you might be able to
>> get all of the disks chugging away at once, because the data is
>> presumably striped across all of them. Reading one block at a time,
>> you'll never have more than 1 or 2 disks going, but if you do
>> sequential reads from a bunch of different places in the relation, you
>> might manage to get all 6. So that's something to think about.
>>
>> One could imagine an algorithm like this: as long as there are more
>> 1GB segments remaining than there are workers, each worker tries to
>> chug through a separate 1GB segment. When there are not enough 1GB
>> segments remaining for that to work, then they start ganging up on the
>> same segments. That way, you get the benefit of spreading out the I/O
>> across multiple files (and thus hopefully multiple members of the RAID
>> group) when the data is coming from disk, but you can still keep
>> everyone busy until the end, which will be important when the data is
>> all in-memory and you're just limited by CPU bandwidth.
>
> OTOH, spreading the I/O across multiple files is not a good thing, if you
> don't have a RAID setup like that. With a single spindle, you'll just
> induce more seeks.
>
> Perhaps the OS is smart enough to read in large-enough chunks that the
> occasional seek doesn't hurt much. But then again, why isn't the OS smart
> enough to read in large-enough chunks to take advantage of the RAID even
> when you read just a single file?

In my experience with RAID, it is smart enough to take advantage of that.
If the RAID controller detects a sequential read access pattern, it
initiates a read-ahead on each disk to pre-position the data it will need
(or at least, the behavior I observe is as if it did that). But maybe if
the sequential read is really a bunch of "random" reads from different
processes which just happen to add up to sequential, that confuses the
algorithm?

Cheers,

Jeff
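P.S. In case it helps to see the idea written out, here is a toy C sketch of
the segment-claiming scheme Robert describes above. It is not code from any
patch; the names, counts, and single-threaded driver loop are all made up for
illustration, and a real implementation would of course do the claiming under
a lock or with atomics in shared memory.

#include <stdio.h>

#define NUM_SEGMENTS        8   /* pretend: an 8 GB relation in 1GB segments */
#define NUM_WORKERS         3
#define BLOCKS_PER_SEGMENT  4   /* tiny, just to make the output readable */

typedef struct
{
    int     blocks_left;    /* blocks not yet read in this segment */
    int     nworkers;       /* workers currently assigned to this segment */
} SegmentState;

static SegmentState segments[NUM_SEGMENTS];

/*
 * Pick the next segment for a worker that needs one.  Prefer a segment
 * nobody is working on yet (spreads the I/O across files and, hopefully,
 * across RAID members); once no unclaimed segments remain, gang up on the
 * least-contended segment that still has blocks.  Returns -1 when
 * everything is exhausted.
 */
static int
choose_segment(void)
{
    int     best = -1;

    for (int i = 0; i < NUM_SEGMENTS; i++)
    {
        if (segments[i].blocks_left == 0)
            continue;
        if (segments[i].nworkers == 0)
            return i;
        if (best == -1 || segments[i].nworkers < segments[best].nworkers)
            best = i;
    }
    return best;
}

int
main(void)
{
    int     cur[NUM_WORKERS];   /* segment each worker is currently reading */

    for (int i = 0; i < NUM_SEGMENTS; i++)
        segments[i].blocks_left = BLOCKS_PER_SEGMENT;
    for (int w = 0; w < NUM_WORKERS; w++)
        cur[w] = -1;

    for (;;)
    {
        int     progressed = 0;

        for (int w = 0; w < NUM_WORKERS; w++)
        {
            /* Finished (or never had) a segment?  Claim another one. */
            if (cur[w] == -1 || segments[cur[w]].blocks_left == 0)
            {
                if (cur[w] != -1)
                    segments[cur[w]].nworkers--;
                cur[w] = choose_segment();
                if (cur[w] == -1)
                    continue;   /* nothing left for this worker to do */
                segments[cur[w]].nworkers++;
            }
            segments[cur[w]].blocks_left--;
            progressed = 1;
            printf("worker %d reads a block from segment %d\n", w, cur[w]);
        }
        if (!progressed)
            break;              /* all segments fully read */
    }
    return 0;
}

Run as-is it shows the workers each chugging through their own segment until
the unclaimed ones run out, after which they start sharing the segments that
still have blocks left, which is the behavior the paragraph above describes.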