Hello Heikki,

Thanks for having a look at the patch.

* I think we should drop the "flush" part of this for now. It's not as clearly beneficial as the sorting part, and adds a great deal more code complexity. And it's orthogonal to the sorting patch, so we can deal with it separately.

I agree that it is orthogonal and that the two features could be submitted as distinct patches. The flush part is actually the first patch I submitted, because it has a significant effect on latency, and I was told to merge it with the sorting...

The flushing part really helps to keep "write stalls" under control in many cases, for instance:

- 400 tps, 1 client (4 clients for the medium case), 100 ms max latency

     options   | percent of late transactions
  flush | sort | tiny | small | medium
    off |  off | 12.0 | 64.28 | 68.6
    off |   on | 11.3 | 22.05 | 22.6
     on |  off |  1.1 | 67.93 | 67.9
     on |   on |  0.6 |  3.24 |  3.1

The "percent of late transactions" is really the fraction of time the database is unreachable because of write stalls... So sort without flush is cleary not enough.

Another thing Andres suggested is to fsync as early as possible, but that is not a simple patch because it intermixes things which are currently in distinct parts of the checkpoint processing, so I had already decided to leave it for another submission.

* Is it really necessary to parallelize the I/O among tablespaces? I can see the point, but I wonder if it makes any difference in practice.

I think that if someone bothers to set up tablespaces, there is no reason to kill their performance behind their back. Without sorting you may hope that tablespaces are touched randomly enough, but once buffers are sorted you can easily construct cases where all writes go to one tablespace first and then to the other, leaving half the disks idle.

So I think that it really should be kept.

* Is there ever any harm in sorting the buffers? The GUC is useful for benchmarking, but could we leave it out of the final patch?

I think the performance figures show that it is basically always beneficial, so the GUC may be left out. However, on SSDs it is unclear to me whether sorting is just a waste of time or whether it still helps, say with wear-leveling. Maybe it is best to keep it? In any case, it is definitely needed for testing.

* Do we need to worry about exceeding the 1 GB allocation limit in AllocateCheckpointBufferIds? It's enough to have 2 TB of shared_buffers. That's a lot, but it's not totally crazy these days that someone might do that. At the very least, we need to lower the maximum of shared_buffers so that you can't hit that limit.

Yep.

I ripped out the "flushing" part, keeping only the sorting. I refactored the logic in BufferSync() a bit. There's now a separate function, nextCheckpointBuffer(), that returns the next buffer ID from the sorted list. The tablespace-parallelization behaviour in encapsulated there,

I do not understand the new tablespace-parallelization logic: there is no test on the tablespace of the buffer in the selection process... Note that I did write a proof for the version I submitted, and also did some detailed testing on the side, because I'm always wary of proofs, especially mine :-)

I notice that you assume that tablespace numbers are always small and contiguous. Is that a fact? I felt more at ease relying on a hash table, which avoids that assumption.

* [...] keeping the code in BufferSync() much simpler. See attached. Needs some minor cleanup and commenting still before committing, and I haven't done any testing besides a simple "make check".

Hmmm... one more detail: the patch does not actually sort:

  + if (checkpoint_sort && num_to_write > 1 && false)


I'll resubmit a patch with only the sorting part, and do the kind of restructuring you suggest, which is a good thing.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
