Re: [HACKERS] Parallel bitmap heap scan

Amit Khandekar Thu, 05 Jan 2017 21:19:05 -0800

Sorry for the delay in my next response. I still haven't exhaustively
gone through all changes, but meanwhile, below are some more points.

On 26 November 2016 at 18:18, Dilip Kumar <[email protected]> wrote:
> On Tue, Nov 22, 2016 at 9:05 AM, Dilip Kumar <[email protected]> wrote:
>> On Fri, Nov 18, 2016 at 9:59 AM, Amit Khandekar <[email protected]> 
>> wrote:
>>
>> Thanks for the review..
>
> I have worked on these comments..
>>
>>> In pbms_is_leader() , I didn't clearly understand the significance of
>>> the for-loop. If it is a worker, it can call
>>> ConditionVariablePrepareToSleep() followed by
>>> ConditionVariableSleep(). Once it comes out of
>>> ConditionVariableSleep(), isn't it guaranteed that the leader has
>>> finished the bitmap ? If yes, then it looks like it is not necessary
>>> to again iterate and go back through the pbminfo->state checking.
>>> Also, with this, variable queuedSelf also might not be needed. But I
>>> might be missing something here.
>>
>> I have taken this logic from example posted on conditional variable thread[1]
>>
>> for (;;)
>> {
>>     ConditionVariablePrepareToSleep(cv);
>>     if (condition for which we are waiting is satisfied)
>>         break;
>>     ConditionVariableSleep();
>> }
>> ConditionVariableCancelSleep();
>>
>> [1] 
>> https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com

With the latest patches, this looks fine to me.

>>> tbm_iterate() call under SpinLock :
>>> For parallel tbm iteration, tbm_iterate() is called while SpinLock is
>>> held. Generally we try to keep code inside Spinlock call limited to a
>>> few lines, and that too without occurrence of a function call.
>>> Although tbm_iterate() code itself looks safe under a spinlock, I was
>>> checking if we can squeeze SpinlockAcquire() and SpinLockRelease()
>>> closer to each other. One thought is :  in tbm_iterate(), acquire the
>>> SpinLock before the while loop that iterates over lossy chunks. Then,
>>> if both chunk and per-page data remain, release spinlock just before
>>> returning (the first return stmt). And then just before scanning
>>> bitmap of an exact page, i.e. just after "if (iterator->spageptr <
>>> tbm->npages)", save the page handle, increment iterator->spageptr,
>>> release Spinlock, and then use the saved page handle to iterate over
>>> the page bitmap.
>>
>> Main reason to keep Spin lock out of this function to avoid changes
>> inside this function, and also this function takes local iterator as
>> input which don't have spin lock reference to it. But that can be
>> changed, we can pass shared iterator to it.
>>
>> I will think about this logic and try to update in next version.
> Still this issue is not addressed.
> Logic inside tbm_iterate is using same variable, like spageptr,
> multiple places. IMHO this complete logic needs to be done under one
> spin lock.

I think we both agree on the part that the mutex handle can be passed
to tbm_iterate() to do this. I am yet to give more thought on how
clumsy it will be if we add SpinlockRelease() calls in intermediate
places in the function rather than in the end.

>
>>
>>>
>>> prefetch_pages() call under Spinlock :
>>> Here again, prefetch_pages() is called while pbminfo->prefetch_mutex
>>> Spinlock is held. Effectively, heavy functions like PrefetchBuffer()
>>> would get called while under the Spinlock. These can even ereport().
>>> One option is to use mutex lock for this purpose. But I think that
>>> would slow things down. Moreover, the complete set of prefetch pages
>>> would be scanned by a single worker, and others might wait for this
>>> one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
>>> Spinlock only while incrementing pbminfo->prefetch_pages. The rest
>>> part viz : iterating over the prefetch pages, and doing the
>>> PrefetchBuffer() need not be synchronised using this
>>> pgbinfo->prefetch_mutex Spinlock. pbms_parallel_iterate() already has
>>> its own iterator spinlock. Only thing is, workers may not do the
>>> actual PrefetchBuffer() sequentially. One of them might shoot ahead
>>> and prefetch 3-4 pages while the other is lagging with the
>>> sequentially lesser page number; but I believe this is fine, as long
>>> as they all prefetch all the required blocks.
>>
>> I agree with your point, will try to fix this as well.
>
> I have fixed this part.

This looks good now.

Further points below ....

===== nodeBItmapHeapscan.c ======

In BitmapHeapNext(), the following code is relevant only for tbm
returned from MultiExecProcNode().
if (!tbm || !IsA(tbm, TIDBitmap))
elog(ERROR, "unrecognized result from subplan");

So it should be moved just below MultiExecProcNode(), so that it does
not get called when it is created from tbm_create().

--------------

BitmapHeapScanState->is_leader field looks unnecessary. Instead, a
local variable is_leader in BitmapHeapNext() will solve the purpose.
This is because is_leader is used only in BitmapHeapNext().

--------------

In BitmapHeapNext(), just before tbm_restore_shared_members() is
called, we create tbm using tbm_create(). I think tbm_create() does
not make sense for shared tbm. Whatever fields are required, will be
restored in tbm_restore_shared_members(). The other fields which do
not make sense in a restored tbm are not required to be initialized
using tbm_create(). So I think tbm_restore_shared_members() itself can
call makeNode(TIDBitmap). (Also it is not required to initialize
tbm->allocator; see note below in tidbitmap.c section). So
tbm_restore_shared_members() itself can call tbm_set_parallel().
Looking at all this, it looks better to have the function name changed
to tbm_attach(parallel_tbm) or tbm_restore(parallel_tbm) rather than
tbm_restore_shared_members(). The function header anyways (rightly)
says  : Attach worker to shared TID bitmap.

-------------

>From what I understand, the leader worker does not have to create its
iterators before waking up the other workers, as long as it makes sure
it copies tbm fields into shared memory before waking workers. But in
the patch, tbm_begin_iterate() is called *before* the
ConditionVariableBroadcast() is called. So I feel, we can shift the
code inside the "if (node->is_leader)" to a place inside the "if
(pbminfo == NULL || pbms_is_leader(pbminfo))" condition block, just
after MultiExecProcNode() call. (And we won't even need is_leader
local variable as well). This way now the other workers will start
working sooner.

====== tidbitmap.c =======

The new #include's are not in alphabetical order.

--------------

ParallelTIDBitmap.inited is unused, and I believe, not required.

--------------

For leader worker, the new TIDBitmap fields added for parallel bitmap
*are* valid while the tid is being built. So the below comment should
be shifted accordingly :
/* these are valid when iterating is true: */
Better still, the shared tbm-related fields can be kept in the end,
and a comment should be added that these are for shared tbm.

--------------

It seems, the comment below the last ParallelTIDBitmap field is not relevant :
/* table to dsa_pointer's array. */

--------------

tbm->allocator field does not seem to be required. A new allocator can
be just palloc'ed in tbm_create_pagetable(), and passed to
pagetable_create(). SH_CREATE() stores this allocator in the
SH_TYPE->alloc field, and fetches the same whenever it needs it for
calling any of the allocator functions. So we can remove the
tbm->allocator field and shift "palloc(sizeof(pagetable_alloc))" call
from tbm_create() to tbm_create_pagetable().

--------------

In tbm_free() :
I think we should call pagetable_destroy() even when tbm is shared, so
that the hash implementation gets a chance to free the hash table
structure. I understand that the hash table structure itself is not
big, so it will only be a small memory leak, but it's better to not
assume that. Instead, let SH_DESTROY() call HashFree(). Then, in
tbm_free_shared(), we can skip the dsa_free() call if tbm->iterating
is false. Basically, tbm bitmap implementation should deduce from the
bitmap state whether it should free the shared data, rather than
preventing a call to SH_DESTROY().

-----------

In tbm_begin_iterate(), for shared tbm, internal structures from
simplehash.h are assumed to be known. For e.g., the hash entries will
always be present in one single array, and also the entry status is
evaluated using pagetable_IN_USE. Is simplehash.h designed keeping in
mind that these things are suppose to be exposed ?

I understand that the hash table handle is local to the leader worker,
and so it is not accessible to other workers. And so, we cannot use
pagetable_iterate() to scan the hash table. So, how about copying the
SH_TYPE structure and making it accessible to other workers ? If we
have something like SH_ATTACH() or SH_COPY(), this will copy the
relevant fields that are sufficient to restore the SH_TYPE structure,
and other workers can start using this hash table after assigning dsa
array back to tb->data. Something like HASH_ATTACH used in dynahash.c.

--------------

In the tbm_update_shared_members() header comments :
 * Restore leaders private tbm state to shared location. This must
 * be called before waking up the other worker.

Above can be changed to:
 * Store leader's private tbm state to shared location. This must
 * be called before waking up other workers.

--------------

To be consistent with other function header comments in this file, a
blank line is required between these two lines :

 * tbm_estimate_parallel_tidbitmap
 * Estimate size of shared TIDBitmap related info.

-------------

tbm->is_shared is set in both tbm_set_parallel() and
tbm_restore_shared_members(). I think it is needed only in
tbm_set_parallel().

--------------

tbm_alloc_shared() Function header comments need some typo correction :
 * It allocated memory from DSA and also stores dsa_pointer in the memory
allocated => allocates

--------------

We prepend the dsa_pointer value into the shared memory data allocated
for the pagetable entries; and we save the address of the first  page
table entry in tbm->data. But the same dsa_pointer is also stored in
tbm->dsa_data. And tbm is accessible in tbm_free_shared(). So it does
not look necessary to prepend dsa_pointer before the page table
entries. In tbm_free_shared(), we can do dsa_free(tbm->area,
tbm->dsa_data).

--------------

Below are my suggested changes in the algorithm comments :

@@ -103,27 +103,27 @@ BitmapHeapNext(BitmapHeapScanState *node)
        /* --------------------------------------------------------------------
         *      Parallel Bitmap Heap Scan Algorithm
         *
-        *      First worker to see the state as parallel bitmap info
as PBM_INITIAL
-        *      become leader and set the state to PBM_INPROGRESS All
other workers
-        *      see the state as PBM_INPROGRESS will wait for leader
to complete the
-        *      TIDBitmap.
+        *      The first worker to see the state of
ParallelBitmapInfo as PBM_INITIAL
+        *      becomes the leader and sets the state to
PBM_INPROGRESS. All other
+        *      workers see the state as PBM_INPROGRESS, and will wait
for leader to
+        *      finish building the TIDBitmap.
         *
         *      Leader Processing:
         *        Create TIDBitmap using DSA memory.
         *        Restore local TIDBitmap variable information into
-        *        ParallelBitmapInfo so that other worker can see those.
-        *        set state to PBM_FINISHED.
+        *           ParallelBitmapInfo so that other worker can see those.
+        *        Set state to PBM_FINISHED.
         *        Wake up other workers.
         *
         *      Other Worker Processing:
-        *        Wait until leader create shared TIDBitmap.
-        *        copy TIDBitmap from info from ParallelBitmapInfo to
local TIDBitmap.
+        *        Wait until leader creates shared TIDBitmap.
+        *        Copy TIDBitmap from ParallelBitmapInfo to local TIDBitmap.
         *
         *      Iterate and process the pages.
         *        a) In this phase each worker will iterate over page
and chunk array
-        *               and select heap pages one by one. If prefetch
is enable then
-        *               there will be two iterator.
-        *        b) Since multiple worker are iterating over same
page and chunk
+        *               and select heap pages one by one. If prefetch
is enabled then
+        *               there will be two iterators.
+        *        b) Since multiple workers are iterating over same
page and chunk
         *               array we need to have a shared iterator, so
we grab a spin
         *               lock and iterate within a lock.

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel bitmap heap scan

Reply via email to