Re: [HACKERS] checkpointer continuous flushing - V18

Fabien COELHO Sun, 21 Feb 2016 01:54:07 -0800


Hello Andres,


Here is a review for the second patch.

> For 0002 I've recently changed:
> * Removed the sort timing information, we've proven sufficiently that
>   it doesn't take a lot of time.

I put it there initially to demonstrate that there was no cache performance
issue when sorting on just buffer indexes. As it is always small, I agree
that it is not needed. Well, it could still be in seconds on a very
large shared buffers setting with a very large checkpoint, but then the
checkpoint would be tremendously huge...

> * Minor comment polishing.


Patch applies and checks on Linux.

* CkptSortItem:

I think that allocating 20 bytes per buffer in shared memory is a little
on the heavy side. Some compression can be achieved: sizeof(ForkNumber) is 4
bytes to hold 4 values, which could be one byte or even 2 bits somewhere. Also,
there are very few tablespaces, so they could be given a small number and
this number could be used instead of the Oid. The space requirement
could thus be reduced to, say, 16 bytes per buffer by combining space & fork in 2
shorts, keeping 4-byte alignment and also getting 8-byte
alignment... If this is too much, I have shown that it can work with only
4 bytes per buffer, as the sorting is really just a performance
optimisation and is not broken if some stuff changes between sorting &
writeback, but you did not like the idea. If the amount of shared memory
required is a significant concern, it could be resurrected, though.
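
To make the 20 versus 16 bytes arithmetic concrete, here is a rough sketch,
not the patch's actual definition. The field names, the use of fixed-width
stdint types in place of the PostgreSQL typedefs, and the idea of a small
per-checkpoint tablespace number (ts_index) are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wide layout: five 4-byte fields, as assumed for the current struct. */
typedef struct WideSortItem
{
    uint32_t ts_oid;     /* tablespace Oid */
    uint32_t rel_node;   /* relation file node */
    uint32_t fork_num;   /* fork number stored in a full 4 bytes */
    uint32_t block_num;  /* block number within the fork */
    int32_t  buf_id;     /* index into shared buffers */
} WideSortItem;

/* Compact layout: tablespace and fork share a single 4-byte slot. */
typedef struct CompactSortItem
{
    uint16_t ts_index;   /* hypothetical small tablespace number */
    uint16_t fork_num;   /* only a handful of fork values exist */
    uint32_t rel_node;
    uint32_t block_num;
    int32_t  buf_id;
} CompactSortItem;

/* Shared memory saved, in bytes, for a given number of buffers. */
size_t
sort_item_savings(size_t nbuffers)
{
    assert(sizeof(WideSortItem) > sizeof(CompactSortItem));
    return nbuffers * (sizeof(WideSortItem) - sizeof(CompactSortItem));
}
```

With these layouts each item shrinks by 4 bytes, and since the compact size
is a multiple of 8, an array of items that starts 8-byte aligned keeps every
second boundary aligned the same way as the first.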


* CkptTsStatus:

As I suggested in the other mail, I think that this structure should also keep
a per tablespace WritebackContext so that coalescing is done per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.

If these fields are kept, I think that a comment should justify why float8
precision is okay for the purpose. I think it is quite certainly fine in
the worst case with 32-bit buffer_ids, but it would not be if this size
is changed someday.
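
As an illustration of both points, here is a small sketch. The formula is an
assumption about how the patch defines progress (each written buffer advances
a tablespace by total_to_scan / ts_to_scan), and the function names are
hypothetical. On precision: float8 is a C double, whose 53-bit significand
represents every integer up to 2^53 exactly, so any 32-bit count converts
without loss.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Recompute a tablespace's progress from the counters alone.
 * Assumed formula: progress runs from 0 to total_to_scan, advancing by
 * total_to_scan / ts_to_scan for each buffer scanned in this tablespace.
 */
double
ckpt_ts_progress(uint32_t num_scanned, uint32_t ts_to_scan,
                 uint32_t total_to_scan)
{
    assert(ts_to_scan > 0);
    assert(num_scanned <= ts_to_scan);
    return (double) num_scanned *
        ((double) total_to_scan / (double) ts_to_scan);
}

/* Nonzero if a 32-bit count survives a round trip through double. */
int
count_is_exact_in_double(uint32_t count)
{
    return (uint32_t) (double) count == count;
}
```

The round-trip check would start failing for some values if the counters
were widened to 64 bits, which is the case a comment in the code should warn
about.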


* BufferSync

After a first sweep to collect buffers to write, they are sorted, and then
those buffers are swept again to compute some per tablespace data
and organise a heap.

ISTM that nearly all of the data collected on the second sweep could be
collected on the first sweep, so that this second sweep could be avoided
altogether. The only missing data is the index of the first buffer in the
array, which can be computed by considering tablespaces only: sweeping
over buffers is not needed. That would suggest creating the heap or using
a hash in the initial buffer sweep to keep this information. This would
also provide a point at which to number tablespaces for compressing the
CkptSortItem struct.
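
A minimal sketch of that idea, under stated assumptions: the sorted array is
ordered by tablespace Oid first, there are few enough tablespaces that a
linear search is acceptable, and all names (TsTable, MAX_TS, and so on) are
hypothetical. Counts are gathered during the first sweep; the first index of
each tablespace then falls out of a prefix sum over tablespaces only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_TS 64               /* hypothetical cap; tablespaces are few */

typedef struct TsEntry
{
    uint32_t oid;               /* tablespace Oid */
    int      num_to_scan;       /* buffers counted during the first sweep */
    int      first_index;       /* start of this tablespace once sorted */
} TsEntry;

typedef struct TsTable
{
    TsEntry entries[MAX_TS];
    int     count;
} TsTable;

/* First sweep: count one buffer for its tablespace, adding it if new. */
void
ts_count_buffer(TsTable *t, uint32_t oid)
{
    for (int i = 0; i < t->count; i++)
    {
        if (t->entries[i].oid == oid)
        {
            t->entries[i].num_to_scan++;
            return;
        }
    }
    assert(t->count < MAX_TS);
    t->entries[t->count].oid = oid;
    t->entries[t->count].num_to_scan = 1;
    t->entries[t->count].first_index = 0;
    t->count++;
}

/* Order entries by ascending tablespace Oid. */
static int
ts_cmp(const void *a, const void *b)
{
    uint32_t x = ((const TsEntry *) a)->oid;
    uint32_t y = ((const TsEntry *) b)->oid;

    return (x > y) - (x < y);
}

/* After the sweep: order tablespaces as the buffer sort does, prefix-sum. */
void
ts_compute_first_indexes(TsTable *t)
{
    int next = 0;

    qsort(t->entries, (size_t) t->count, sizeof(TsEntry), ts_cmp);
    for (int i = 0; i < t->count; i++)
    {
        t->entries[i].first_index = next;
        next += t->entries[i].num_to_scan;
    }
}
```

The position of an entry after the qsort could also serve as the small
tablespace number used to shrink the sort item.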


I'm wondering about calling CheckpointWriteDelay on each round, maybe
a minimum amount of write would make sense. This remark is independent of
this patch. Probably it works fine because after a sleep the checkpointer
is behind enough so that it will write a bunch of buffers before sleeping
again.

I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?

There is some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations about
sorting in the checkpointer.

> I think this patch primarily needs:
> * Benchmarking on FreeBSD/OSX to see whether we should enable the
>   mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
>   inclined to leave it off till then.

I do not have access to such systems. As "msync" seems available on Linux,
it is possible to force using it with a "#if 0" to skip sync_file_range and
check whether it does some good there. Likewise for the "posix_fadvise"
stuff. I can try to do that, but it takes time to do so; if someone can test
on other OSes it would be much better. I think that if it works it should be
kept in, so it is just a matter of testing it.


--
Fabien.

