> Thanks for the review, Nikhil and Michael.
> I don’t follow here. We move data from WAL to files at checkpoint
> because after a checkpoint there is no guarantee that the WAL segment
> holding our prepared tx will still be available.

We are talking about the recovery/promote code path. Specifically, the
call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().

We write the files to disk and they are immediately read back by the
code that follows. Instead, we could skip writing the files and read
KnownPreparedList directly, both in the code path that follows and
elsewhere.
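
To illustrate the redundancy I mean, here is a toy model in Python. It
is not PostgreSQL code: the dictionary standing in for KnownPreparedList,
the function names and the file naming are all hypothetical, and it says
nothing about why the files are needed after a checkpoint. It only shows
that writing the entries out and reading them straight back yields the
same state as consuming the in-memory list.

```python
import os
import tempfile

# Toy stand-in for KnownPreparedList: xid -> serialized 2PC state.
known_prepared = {1001: b"state-of-1001", 1002: b"state-of-1002"}


def read_via_files(prepared, directory):
    """Current flow: write every entry to a file, then read the files back."""
    for xid, data in prepared.items():
        with open(os.path.join(directory, f"{xid:08X}"), "wb") as f:
            f.write(data)
    result = {}
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), "rb") as f:
            result[int(name, 16)] = f.read()
    return result


def read_from_memory(prepared):
    """Suggested flow: consume the in-memory list directly, no file I/O."""
    return dict(prepared)


with tempfile.TemporaryDirectory() as tmp:
    via_files = read_via_files(known_prepared, tmp)
via_memory = read_from_memory(known_prepared)
print(via_files == via_memory)  # both paths yield the same state
```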


>> The difference between those two is likely noise.
>> By the way, in those measurements, the OS cache is still filled with
>> the past WAL segments, which is a rather best case, no? What happens
>> if you do the same kind of tests on a box where memory is busy doing
>> something else and replayed WAL segments get evicted from the OS cache
>> more aggressively once the startup process switches to a new segment?
>> This could be tested, for example, on a VM with little memory (say
>> 386MB or less), so that when the startup process needs to access past
>> WAL segments again to recover the 2PC information, it has to get them
>> back directly from disk... One trick that you could use here would be to
>> tweak the startup process so that it drops the OS cache once a segment
>> is finished replaying, and see the effects of an aggressive OS cache
>> eviction. This patch is showing really nice improvements with the OS
>> cache backing up the data, still it would make sense to test things
>> with a worse test case and see if things could be done better. The
>> startup process currently only reads records sequentially; random
>> reads are a concept that this patch introduces.
>> Anyway, perhaps this does not matter much, the non-recovery code path
>> does the same thing as this patch, and the improvement is too much to
>> be ignored. So for consistency's sake we could go with the approach
>> proposed, which has the advantage of not putting any restriction on
>> the size of the 2PC file, contrary to what an implementation saving
>> the contents of the 2PC files into memory would need to do.
> Maybe I’m missing something, but I don’t see how the OS cache can affect
> anything here. Total WAL size was 0x44 * 16 = 1088 MB, and recovery time
> is about 20s. Sequentially reading 1GB of data is an order of magnitude
> faster even on an old HDD, not to mention an SSD. Also, you can take a
> look at the flame graphs attached to the previous message: the majority
> of time during recovery is spent in pg_qsort while replaying
> PageRepairFragmentation, while the whole of xact_redo_commit() takes
> about 1% of the time. That amount can grow in the case of uncached disk
> reads, but given the total recovery time this should not matter much.
> If you are talking about uncached access only during checkpoint, then
> here we are limited by max_prepared_transactions, so at most we will read
> about a hundred small files (each usually fitting into one filesystem
> page), which will also be barely noticeable compared to the recovery time
> between checkpoints. Also, WAL segment cache eviction during replay
> doesn’t seem to me like a standard scenario.
> Anyway, I took a machine with an HDD to slow down read speed and ran the
> tests again. During one of the runs I launched, in parallel, a bash loop
> that dropped the OS cache every second (replaying one WAL segment also
> takes about one second).
> 1.5M transactions
>   start segment: 0x06
>   last segment: 0x47
> patched, with constant cache_drop:
>   total recovery time: 86s
> patched, without constant cache_drop:
>   total recovery time: 68s
> (while the difference is significant, I bet it happens mostly because
> database file segments have to be re-read after a cache drop)
> master, without constant cache_drop:
>   time to recover 35 segments: 2h 25m (after that I got tired of waiting)
>   expected total recovery time: 4.5 hours
> --
> Stas Kelvich
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
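
For anyone wanting to repeat the cache-drop run described above, the
following is a rough Python equivalent of such a loop. The original was a
bash loop whose exact text was not posted, so this is only a guess at its
shape. It assumes Linux, where writing 3 to /proc/sys/vm/drop_caches
drops the page cache plus dentries and inodes, and it needs root. The
function names and the path parameter are mine.

```python
import os
import time


def drop_os_cache(path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the Linux kernel to drop clean caches."""
    os.sync()  # write dirty pages back first so they become droppable
    with open(path, "w") as f:
        f.write("3\n")  # 3 = page cache plus dentries and inodes


def drop_cache_loop(iterations, interval_s=1.0, path="/proc/sys/vm/drop_caches"):
    """Drop the cache `iterations` times, sleeping `interval_s` after each drop."""
    for _ in range(iterations):
        drop_os_cache(path)
        time.sleep(interval_s)
    return iterations
```

Calling drop_cache_loop(120) as root while recovery is running would
approximate two minutes of once-per-second cache drops.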

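The figures quoted above can be cross-checked with some simple
arithmetic. This is only a back-of-the-envelope sketch: it assumes the
default 16 MB WAL segment size, counts both the start and the last
segment (my assumption), and extrapolates the master run linearly from
the 35 segments that were actually replayed.

```python
# Back-of-the-envelope checks on the figures quoted above.
WAL_SEGMENT_MB = 16  # default WAL segment size

# Earlier test: 0x44 segments of 16 MB each.
earlier_wal_mb = 0x44 * WAL_SEGMENT_MB

# HDD test: segments 0x06 through 0x47; counting both ends is an assumption.
total_segments = 0x47 - 0x06 + 1

# master replayed 35 segments in 2h 25m before the run was stopped.
master_minutes_for_35 = 2 * 60 + 25
master_minutes_per_segment = master_minutes_for_35 / 35
master_expected_hours = master_minutes_per_segment * total_segments / 60

# Patched runs: 86 s with constant cache drops, 68 s without.
cache_drop_slowdown = 86 / 68

print(earlier_wal_mb)                   # 1088
print(total_segments)                   # 66
print(round(master_expected_hours, 2))  # 4.56
print(round(cache_drop_slowdown, 2))    # 1.26
```

So the "4.5 hours" estimate for master is consistent with linear
extrapolation, and constant cache dropping slows the patched run by
roughly a quarter.
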
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)