On Tue, Jan 12, 2016 at 5:52 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-01-12 17:50:36 +0530, Amit Kapila wrote:
> > On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <and...@anarazel.de>
> > >
> > > My theory is that this happens due to the sorting: pgbench is an
> > > heavy workload, the first few pages are always going to be used if
> > > there's free space as freespacemap.c essentially prefers those. Due to
> > > the sorting all a relation's early pages are going to be in "in a
> > >
> > Not sure, what is best way to tackle this problem, but I think one way
> > be to perform sorting at flush requests level rather than before writing
> > to OS buffers.
> I'm not following. If you just sort a couple hundred more or less random
> buffers - which is what you get if you look in buf_id order through
> shared_buffers - the likelihood of actually finding neighbouring writes
> is pretty low.
Why can't we do it at larger intervals (relative to total amount of writes)?
To explain, what I have in mind, let us assume that checkpoint interval
is longer (10 mins) and in the mean time all the writes are being done
by bgwriter which it registers in shared memory so that later checkpoint
can perform corresponding fsync's, now when the request queue
becomes threshhold size (let us say 1/3rd) full, then we can perform
sorting and merging and issue flush hints. Checkpointer task can
also follow somewhat similar technique which means that once it
has written 1/3rd or so of buffers (which we need to track), it can
perform flush hints after sort+merge. Now, I think we can also
do it in checkpointer alone rather than in bgwriter and checkpointer.
Basically, I think this can lead to lesser merging of neighbouring
writes, but might not hurt if sync_file_range() API is cheap.