On Thu, May 16, 2013 at 11:55 PM, Stephen Frost <sfr...@snowman.net> wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> I think it's pretty unrealistic to suppose that this can be made to
>> work.  The most obvious problem is that a sequential scan is coded to
>> assume that every block between 0 and the last block in the relation
>> is worth reading,
>
> You don't change that.  However, when a seq scan asks the storage layer
> for blocks that it knows don't actually exist, it can simply skip over
> them or return "empty" records or something equivalent...  Yes, that's
> hand-wavy, but I also think it's doable.

And slow.  And it will involve locking and shared memory data
structures of its own to keep track of which blocks actually exist at
the storage layer.  I suspect the result would be more kinds of locks
than we have at present, not fewer.
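
Just to make concrete what I mean by extra shared state (the names here
are invented, nothing like this exists today), you'd need something
along these lines, plus code in the smgr layer to consult and update it
on every access:

/* Hypothetical -- not proposing this, just illustrating the cost. */
typedef struct SparseRelExtent
{
    LWLockId    map_lock;           /* yet another lock to contend on */
    BlockNumber highest_allocated;  /* last block that physically exists */
    /*
     * ... plus some per-relation map of which block ranges are backed
     * by real storage and which are "holes" to skip or zero-fill.
     */
} SparseRelExtent;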

>> Also, I think that's really a red herring anyway.  Relation extension
>> per se is not slow - we can grow a file by adding zero bytes at a
>> pretty good clip, and don't really gain anything at the database level
>> by spreading the growth across multiple files.
>
> That's true when the file is on a single filesystem and a single set of
> drives.  Make them be split across multiple filesystems/volumes where
> you get more drives involved...

I'd be interested to hear how fast dd if=/dev/zero of=somefile is on
your machine compared to a single-threaded COPY into a relation.
Dividing those two numbers gives us the level of concurrency at which
relation extension itself becomes the bottleneck.
On the system I tested, I think it was in the multiple tens until the
kernel cache filled up ... and then it dropped way off.  But I don't
have access to a high-end storage system.
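
To put made-up numbers on that, just to show the division: if dd can
append zeroes at 800 MB/s and a single COPY writes heap at 80 MB/s,
then 800 / 80 = 10, i.e. extension speed doesn't become the limiting
factor until roughly ten writers are going at once.  The real numbers
are whatever your hardware says they are, of course.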

>> If I took 30 seconds to pre-extend the relation before writing any
>> data into it, then writing the data went pretty much exactly 10 times
>> faster with 10 writers than with 1.
>
> That's rather fantastic..

One sadly relevant detail is that the relation was unlogged.  Even so,
yes, it's fantastic.

>> But small on-the-fly
>> pre-extensions during the write didn't work as well.  I don't remember
>> exactly what formulas I tried, but I do remember that the few I tried
>> were not really any better than "always pre-extend by 1 extra block";
>> and that alone eliminated about half the contention, but then I
>> couldn't do better.
>
> That seems quite odd to me- I would have thought extending by more than
> 2 blocks would have helped with the contention.  Still, it sounds like
> extending requires a fair bit of writing, and that sucks in its own
> right because we're just going to rewrite that- is that correct?  If so,
> I like the proposal even more...
>
>> I wonder if I need to use LWLockAcquireOrWait().
>
> I'm not seeing how/why that might help?

Thinking about it more, my guess is that backend A grabs the relation
extension lock.  Before it actually extends the relation, backends B,
C, D, and E all notice that no free pages are available and queue for
the lock.  Backend A pre-extends the relation by some number of pages
and then extends it by one additional page for its own use.  It then
releases the relation extension lock.  At this point, however,
backends B, C, D, and E are already committed to extending the
relation, even though some or all of them could now satisfy their need
for free pages from the fsm.  If they used LWLockAcquireOrWait(), then
they'd all wake up when A released the lock.  One of them would have
the lock, and the rest could go retry the fsm and requeue on the lock
if that failed.
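
In other words, something along these lines in the extension path.
This is only a sketch: today the extension lock is a heavyweight lock
taken via LockRelationForExtension(), so pretend rel_extension_lock is
an LWLock-ified stand-in for it, and buffer locking and FSM bookkeeping
are omitted.

BlockNumber targetBlock;

for (;;)
{
    targetBlock = GetPageWithFreeSpace(relation, len);
    if (targetBlock != InvalidBlockNumber)
        break;      /* someone else's pre-extension had what we need */

    if (LWLockAcquireOrWait(rel_extension_lock, LW_EXCLUSIVE))
    {
        /* We actually got the lock: extend, and maybe pre-extend too. */
        Buffer  buf = ReadBuffer(relation, P_NEW);

        targetBlock = BufferGetBlockNumber(buf);
        LWLockRelease(rel_extension_lock);
        break;
    }

    /*
     * We only waited for the previous holder to finish; it may have
     * pre-extended the relation, so retry the FSM before queueing on
     * the lock again.
     */
}

That way B, C, D, and E get a chance to notice A's pre-extension
instead of each extending the relation anyway.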

But as it is, what I bet is happening is that they each take the lock
in turn and each extend the relation in turn.  Then, for the next block
they write, they all find free pages in the fsm, because they all
pre-extended the relation; but when those free pages are used up, they
all queue up on the lock again, practically at the same instant,
because the fsm becomes empty for all of them at the same time.

I should play around with this a bit more...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

