On 14.8.2014 18:54, Jeff Davis wrote:
> On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote:
>> Either it belongs to the current batch (and either it's in the hash
>> table, or you add it there), or it's not - in that case write it to a
>> temp file.
> I think the part you left out is that you need two files per batch: one
> for the dumped-out partially-computed state values, and one for the
> tuples.
> In other words, you haven't really discussed the step where you reunite
> the tuples with that partially-computed state.

No, that's not how the serialize/deserialize should work. The aggregate
needs to store the state as-is, so that after deserializing it gets
pretty much the same thing.

For example, for 'median' the state is the list of all the values
received so far, and when serializing it you have to write all the
values out. After deserializing it, you will get the same list of values.

Some aggregates may use complex data structures that may need more
elaborate serialize.

>> For sure, it's not for free - it may write to quite a few files. Is it
>> more expensive than what you propose? I'm not sure about that. With
>> your batching scheme, you'll end up with lower number of large batches,
>> and you'll need to read and split them, possibly repeatedly. The
>> batching scheme from hashjoin minimizes this.
> My approach only has fewer batches if it elects to have fewer batches,
> which might happen for two reasons:
>  1. A cardinality misestimate. This certainly could happen, but we do
> have useful numbers to work from (we know the number of tuples and
> distincts that we've read so far), so it's far from a blind guess. 
>  2. We're concerned about the random I/O from way too many partitions.

OK. We can't really do much with the cardinality estimate.

As for the random IO concerns, I did a quick test to see how this
behaves. I used a HP ProLiant DL380 G5 (i.e. a quite old machine, from
2006-09 if I'm not mistaken). 16GB RAM, RAID10 on 6 x 10k SAS drives,
512MB write cache. So a quite lousy machine, considering today's standards.

I used a simple C program (attached) that creates N files, and writes
into them in a round-robin fashion until a particular file size is
reached. I opted for 64GB total size, 1kB writes.

    ./iotest filecount filesize writesize

File size is in MB, writesize is in bytes. So for example this writes 64
files, each 1GB, using 512B writes.

    ./iotest 64 1024 512

Measured is duration before/after fsync (in seconds):

    files   |    file size  |  before  fsync |  after fsync
    32      |      2048     |        290.16  |      294.33
    64      |      1024     |        264.68  |      267.60
    128     |       512     |        278.68  |      283.44
    256     |       256     |        332.11  |      338.45
    1024    |        64     |        419.91  |      425.48
    2048    |        32     |        450.37  |      455.20

So while there is a difference, I don't think it's the 'random I/O wall'
as usually observed on rotational drives. Also, this is 2.6.32 kernel,
and my suspicion is that with a newer one the behaviour would be better.

I also have an SSD in that machine (Intel S3700), so I did the same test
with these results:

    files   |    file size  |  before  fsync |  after fsync
    32      |      2048     |        445.05  |      464.73
    64      |      1024     |        447.32  |      466.56
    128     |       512     |        446.63  |      465.90
    256     |       256     |        446.64  |      466.19
    1024    |        64     |        511.85  |      523.24
    2048    |        32     |        579.92  |      590.76

So yes, the number of files matter, but I don't think it's strong enough
to draw a clear line on how many batches we allow. Especially
considering how old this machine is (on 3.x kernels, we usually see much
better performance in I/O intensive conditions).

>> I fail to see how this is different from your approach? How can you
>> output any tuples before processing the whole inner relation?
> Right, the only thing I avoid is scanning the hash table and dumping out
> the groups.
> This isn't a major distinction, more like "my approach does a little
> less work before returning tuples", and I'm not even sure I can defend
> that, so I'll retract this point.
>> Your approach is to do multi-level batching, and I was thinking whether
>> it'd be possible to use the same approach (single level). But in
>> retrospect it probably does not make much sense, because the multi-level
>> batching is one of the points of the proposed approach.
> Now that I think about it, many of the points we discussed could
> actually work with either approach:
>   * In my approach, if I need more partitions, I could create more in
> much the same way as HashJoin to keep it single-level (as you suggest
> above).
>   * In your approach, if there are too many partitions, you could avoid
> random I/O by intentionally putting tuples from multiple partitions in a
> single file and moving them while reading.
>   * If given a way to write out the partially-computed states, I could
> evict some groups from the hash table to keep an array_agg() bounded.
> Our approaches only differ on one fundamental trade-off that I see:
>   (A) My approach requires a hash lookup of an already-computed hash for
> every incoming tuple, not only the ones going into the hash table.
>   (B) Your approach requires scanning the hash table and dumping out the
> states every time the hash table fills up, which therefore requires a
> way to dump out the partial states.
> You could probably win the argument by pointing out that (A) is O(N) and
> (B) is O(log2(N)). But I suspect that cost (A) is very low.
> Unfortunately, it would take some effort to test your approach because
> we'd actually need a way to write out the partially-computed state, and
> the algorithm itself seems a little more complex. So I'm not really sure
> how to proceed.

I plan to work on this a bit over the next week or two. In any case,
it'll be a limited implementation, but hopefully it will be usable for
some initial testing.


#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/time.h>

int main(int argc, char * argv[]) {

    int i, j;

    int filecount;
    int filesize;
    int writesize;

    int files[2048];

    int  writecount;
    struct timeval start_time, end_time;

    if (argc < 4) {
        printf("invalid number of arguments: %d\n", argc);
        return 1;

    filecount = atoi(argv[1]);
    filesize  = atoi(argv[2]);
    writesize = atoi(argv[3]);

    printf("files=%d size=%d write=%d\n", filecount, filesize, writesize);

    if (filecount < 1) {
        printf("invalid file count (%d)\n", filecount);
        return 2;

    if (filesize < 1) {
        printf("invalid file size (%d)\n", filesize);
        return 3;

    if ((writesize < 1) || (writesize > 1024)) {
        printf("invalid write size (%d)\n", writesize);
        return 4;

    writecount = (filesize * 1024L * 1024L) / writesize;

    for (i = 0; i < filecount; i++) {

        char filename[1024];
        sprintf(filename, "iotest.%d", i);

        files[i] = creat(filename, S_IRWXU);

        if (files[i] == -1) {
            filecount = (i+1);
            goto cleanup;


    gettimeofday(&start_time, NULL);

    int x = writecount / 10;

    for (j = 0; j < writecount; j++) {

        char data[writesize];

        for (i = 0; i < writesize; i++) {
            data[i] = rand() % 256;

        for (i = 0; i < filecount; i++) {
            ssize_t l = write(files[i], data, writesize);
            if (l != writesize) {
                printf("ERROR writing data (%ld) : %d (file %d) : %s\n", l, errno, files[i], strerror(errno));
                goto cleanup;

        if (j == x) {
                printf("written %.0f %%\n", (100.0 * j) / writecount);
                x += writecount / 10;


    gettimeofday(&end_time, NULL);

    double duration = (end_time.tv_sec - start_time.tv_sec) + (end_time.tv_usec - start_time.tv_usec) / 1000000.0;

    printf("duration = %.2f\n", duration);

    for (i = 0; i < filecount; i++) {

    gettimeofday(&end_time, NULL);

    duration = (end_time.tv_sec - start_time.tv_sec) + (end_time.tv_usec - start_time.tv_usec) / 1000000.0;

    printf("duration (with fsync) = %.2f\n", duration);


    for (i = 0; i < filecount; i++) {

        char filename[1024];
        sprintf(filename, "iotest.%d", i);



return 0;

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to