On 14.8.2014 21:47, Tomas Vondra wrote:
> On 14.8.2014 18:54, Jeff Davis wrote:
>> On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote:
>>> Either it belongs to the current batch (and either it's in the hash
>>> table, or you add it there), or it's not - in that case write it to
>>> a temp file.
>>
>> I think the part you left out is that you need two files per batch:
>> one for the dumped-out partially-computed state values, and one for
>> the tuples.
>>
>> In other words, you haven't really discussed the step where you
>> reunite the tuples with that partially-computed state.
>
> No, that's not how the serialize/deserialize should work. The
> aggregate needs to store the state as-is, so that after deserializing
> it gets back pretty much the same thing.
>
> For example, for 'median' the state is the list of all the values
> received so far, and when serializing it you have to write all those
> values out. After deserializing, you get the same list of values
> back.
>
> Some aggregates may use complex data structures that need a more
> elaborate serialize step.
>
>>> For sure, it's not for free - it may write to quite a few files. Is
>>> it more expensive than what you propose? I'm not sure about that.
>>> With your batching scheme, you'll end up with a lower number of
>>> large batches, and you'll need to read and split them, possibly
>>> repeatedly. The batching scheme from hashjoin minimizes this.
>>
>> My approach only has fewer batches if it elects to have fewer
>> batches, which might happen for two reasons:
>> 1. A cardinality misestimate. This certainly could happen, but we do
>> have useful numbers to work from (we know the number of tuples and
>> distincts that we've read so far), so it's far from a blind guess.
>> 2. We're concerned about the random I/O from way too many partitions.
>
> OK. We can't really do much about the cardinality estimate.
>
> As for the random I/O concerns, I did a quick test to see how this
> behaves. I used an HP ProLiant DL380 G5 (i.e. a quite old machine,
> from 2006-09 if I'm not mistaken) with 16GB RAM, RAID10 on 6 x 10k
> SAS drives, and a 512MB write cache. So quite a lousy machine by
> today's standards.
>
> I used a simple C program (attached) that creates N files and writes
> into them in a round-robin fashion until a particular file size is
> reached. I opted for 64GB total size and 1kB writes. It is invoked as
>
>     ./iotest filecount filesize writesize
>
> where the file size is in MB and the write size is in bytes. So, for
> example, this writes 64 files, each 1GB, using 512B writes:
>
>     ./iotest 64 1024 512
>
> Measured are the durations before/after fsync, in seconds:
>
>  files | file size (MB) | before fsync | after fsync
> ------------------------------------------------------
>     32 |           2048 |       290.16 |      294.33
>     64 |           1024 |       264.68 |      267.60
>    128 |            512 |       278.68 |      283.44
>    256 |            256 |       332.11 |      338.45
>   1024 |             64 |       419.91 |      425.48
>   2048 |             32 |       450.37 |      455.20
>
> So while there is a difference, I don't think it's the 'random I/O
> wall' usually observed on rotational drives. Also, this is a 2.6.32
> kernel, and my suspicion is that with a newer one the behaviour would
> be better.
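To make the serialize/deserialize point quoted above concrete: for a
median-like aggregate the state is just the array of values accumulated
so far, so serializing means writing the whole array out and
deserializing means reading it back. A minimal sketch follows - purely
illustrative, using plain stdio rather than the actual aggregate
interface, with all names made up:

#include <stdio.h>
#include <stdlib.h>

/* hypothetical state of a median-like aggregate */
typedef struct
{
    size_t  nvalues;    /* number of values accumulated so far */
    double *values;     /* the values themselves */
} median_state;

/*
 * Serialize simply writes the complete state out, so that nothing
 * is lost. (Error handling omitted for brevity.)
 */
static void
serialize_state(const median_state *state, FILE *f)
{
    fwrite(&state->nvalues, sizeof(state->nvalues), 1, f);
    fwrite(state->values, sizeof(double), state->nvalues, f);
}

/*
 * Deserialize reads it back - the result is equivalent to the
 * state that was written.
 */
static median_state *
deserialize_state(FILE *f)
{
    median_state *state = malloc(sizeof(median_state));

    fread(&state->nvalues, sizeof(state->nvalues), 1, f);
    state->values = malloc(state->nvalues * sizeof(double));
    fread(state->values, sizeof(double), state->nvalues, f);

    return state;
}

The point is simply that the deserialized state is equivalent to the
original one, so there is no separate step of reuniting tuples with a
partially-computed state.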
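For reference, the attached iotest program boils down to roughly the
following (a simplified sketch reconstructed from the description
above - the attachment is the authoritative version; timing and error
handling are elided here):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int     filecount, writesize, i;
    long    filesize, written;
    int    *fds;
    char   *buffer;

    if (argc != 4)
    {
        fprintf(stderr, "usage: iotest filecount filesize writesize\n");
        return 1;
    }

    filecount = atoi(argv[1]);
    filesize = atol(argv[2]) * 1024L * 1024L;   /* MB -> bytes */
    writesize = atoi(argv[3]);

    fds = malloc(filecount * sizeof(int));
    buffer = malloc(writesize);
    memset(buffer, 'x', writesize);

    /* create the files */
    for (i = 0; i < filecount; i++)
    {
        char    fname[32];

        snprintf(fname, sizeof(fname), "iotest.%d", i);
        fds[i] = open(fname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    }

    /* write to the files in round-robin fashion until each is full */
    for (written = 0; written < filesize; written += writesize)
        for (i = 0; i < filecount; i++)
            write(fds[i], buffer, writesize);

    /* ("before fsync" duration would be measured here) */

    for (i = 0; i < filecount; i++)
    {
        fsync(fds[i]);
        close(fds[i]);
    }

    /* ("after fsync" duration would be measured here) */

    return 0;
}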
> I also have an SSD in that machine (Intel S3700), so I did the same
> test with these results:
>
>  files | file size (MB) | before fsync | after fsync
> ------------------------------------------------------
>     32 |           2048 |       445.05 |      464.73
>     64 |           1024 |       447.32 |      466.56
>    128 |            512 |       446.63 |      465.90
>    256 |            256 |       446.64 |      466.19
>   1024 |             64 |       511.85 |      523.24
>   2048 |             32 |       579.92 |      590.76
>
> So yes, the number of files matters, but I don't think the effect is
> strong enough to draw a clear line on how many batches to allow.
> Especially considering how old this machine is (on 3.x kernels, we
> usually see much better performance in I/O-intensive conditions).
And just for fun, I did the same test on a workstation with 8GB of RAM,
an S3700 SSD, an i5-2500 CPU and kernel 3.12 - that is, more modern
hardware / kernel / ... compared to the machine above. For a test
writing 32GB of data (4x the RAM), I got these results:

 files | file size (MB) | before fsync | after fsync
------------------------------------------------------
    32 |           1024 |       171.27 |      175.96
    64 |            512 |       165.57 |      170.12
   128 |            256 |       165.29 |      169.95
   256 |            128 |       164.69 |      169.62
   512 |             64 |       163.98 |      168.90
  1024 |             32 |       165.44 |      170.50
  2048 |             16 |       165.97 |      171.35
  4096 |              8 |       166.55 |      173.26

So, no sign of a slowdown at all in this case. I don't have a
rotational disk in this machine at the moment, so I can't repeat the
test there, but I don't expect the impact to be much worse than on the
old machine.

I'm not sure whether this proves we should not worry about the number
of batches at all - the old kernels / machines will be with us for
some time. However, I'm not a fan of artificially limiting the
implementation because of decade-old machines either.

Tomas