On Thu, Feb 16, 2017 at 10:54 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> interface_dsa_allocate0.patch : Provides new interface dsa_allocate0,
> which is used by 0001
Committed.  I moved the function prototype for dsa_allocate0() next to
the existing prototype for dsa_allocate().  Please try to think about
things like this when you add functions or prototypes or structure
members: put them in a natural place where they are close to related
things, not just wherever happens to be convenient.

> gather_shutdown_childeren_first.patch : Processing the child node

This same problem came up on the Parallel Hash thread.  I mentioned
this proposed fix over there; let's see what he (or anyone else) has
to say about it.

> 0001: tidbimap change to provide the interfaces for shared iterator.

+ * in order to share relptrs of the chunk and the pages arrays and other
+ * TBM related information with other workers.

No relptrs any more.

+	dsa_pointer dsapagetable;	/* dsa_pointer to the element array */
+	dsa_pointer dsapagetableold;	/* dsa_pointer to the old element array */
+	dsa_area   *dsa;		/* reference to per-query dsa area */
+	dsa_pointer ptpages;		/* dsa_pointer to the page array */
+	dsa_pointer ptchunks;		/* dsa_pointer to the chunk array */

Let's put the DSA pointer first and then the other stuff after it.
That seems more logical.

+typedef struct TBMSharedIteratorState
+{
+	int		spageptr;	/* next spages index */
+	int		schunkptr;	/* next schunks index */
+	int		schunkbit;	/* next bit to check in current schunk */
+	LWLock		lock;		/* lock to protect above members */
+	dsa_pointer pagetable;		/* dsa pointers to head of pagetable data */
+	dsa_pointer spages;		/* dsa pointer to page array */
+	dsa_pointer schunks;		/* dsa pointer to chunk array */
+	int		nentries;	/* number of entries in pagetable */
+	int		maxentries;	/* limit on same to meet maxbytes */
+	int		npages;		/* number of exact entries in pagetable */
+	int		nchunks;	/* number of lossy entries in pagetable */
+} TBMSharedIteratorState;

I think you've got this largely backwards from the most logical
order.  Let's order it like this: nentries, maxentries, npages,
nchunks, pagetable, spages, schunks, lock (to protect BELOW members),
spageptr, schunkptr, schunkbit.
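That is, roughly this -- same fields and comments as in your patch,
just reordered, with the lock comment flipped from "above" to
"below":

	typedef struct TBMSharedIteratorState
	{
		int		nentries;	/* number of entries in pagetable */
		int		maxentries;	/* limit on same to meet maxbytes */
		int		npages;		/* number of exact entries in pagetable */
		int		nchunks;	/* number of lossy entries in pagetable */
		dsa_pointer pagetable;		/* dsa pointers to head of pagetable data */
		dsa_pointer spages;		/* dsa pointer to page array */
		dsa_pointer schunks;		/* dsa pointer to chunk array */
		LWLock		lock;		/* lock to protect below members */
		int		spageptr;	/* next spages index */
		int		schunkptr;	/* next schunks index */
		int		schunkbit;	/* next bit to check in current schunk */
	} TBMSharedIteratorState;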
+struct TBMSharedIterator
+{
+	PagetableEntry *ptbase;		/* pointer to the pagetable element array */
+	dsa_area   *dsa;		/* reference to per-query dsa area */
+	int	   *ptpages;		/* sorted exact page index list */
+	int	   *ptchunks;		/* sorted lossy page index list */
+	TBMSharedIteratorState *state;	/* shared members */
+	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
+};

Do we really need "dsa" here when it's already present in the shared
state?  It doesn't seem like we even use it for anything.  It's
needed to create each backend's TBMSharedIterator, but not
afterwards, I think.  In terms of ordering, I'd move "state" to the
top of the structure, just like "tbm" comes first in a TBMIterator.

+ * total memory consumption.  If dsa passed to this function is not NULL
+ * then the memory for storing elements of the underlying page table will
+ * be allocated from the DSA.

Notice that you capitalized "DSA" in two different ways in the same
sentence; I'd go for the all-caps version.  Also, you need the word
"the" before the first one.

+	if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))

Excess parentheses.

+ * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
+ * by multiple workers using shared iterator.

Prepare to iterate, not prepare to iterator.  I suggest rephrasing
this as "prepare shared iteration state for a TIDBitmap".

+ * The TBMSharedIteratorState will be allocated from DSA so that multiple
+ * worker can attach to it and iterate jointly.

Maybe change to "The necessary shared state will be allocated from
the DSA passed to tbm_create, so that multiple workers can attach to
it and iterate jointly".

+ * This will convert the pagetable hash into page and chunk array of the index
+ * into pagetable array so that multiple workers can use it to get the actual
+ * page entry.

I think you can leave off everything starting from "so that".  It's
basically redundant with what you already said.

+	dsa_pointer iterator;
+	TBMSharedIteratorState *iterator_state;

These aren't really great variable names, because the latter isn't
the state associated with the former.  They're both the same object.
Maybe just "dp" and "istate".

I think this function should Assert(tbm->dsa != NULL) and
Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
tbm_begin_iterate should Assert(tbm->iterating !=
TBM_ITERATING_SHARED).

+	/*
+	 * If we have a hashtable, create and fill the sorted page lists, unless
+	 * we already did that for a previous iterator.  Note that the lists are
+	 * attached to the bitmap not the iterator, so they can be used by more
+	 * than one iterator.  However, we keep dsa_pointer to these in the shared
+	 * iterator so that other workers can access them directly.
+	 */

This is mostly cut-and-pasted from tbm_begin_iterate() but it's not
really correct here now because (1) we're no longer trying to fake up
a TIDBitmap proper in every backend and (2) this code runs even if we
don't have a hashtable.  I think the comment should be something like
"If we're not already iterating, create and fill the sorted page
lists.  (If we are, the sorted page lists are already stored in the
TIDBitmap, and we can just reuse them.)"

+	 * Create page list and chunk list using relptr so that we can share
+	 * this information across multiple workers.

No relptrs any more.

+		tbm->ptpages = dsa_allocate(tbm->dsa, tbm->npages * (sizeof(int)));

Extra parentheses.

+								   tbm->nchunks * (sizeof(int)));

Extra parentheses.

+	 * If TBM status is TBM_HASH then iterate over the pagetable and

"If the TBM status is"...

+			 * directly store it's index i.e. 0 in page array

s/it's/its/

Don't you also need some code here to handle the TBM_EMPTY case?

+	/*
+	 * Store the TBM member in the shared state so that we can share them
+	 * across multiple workers.
+	 */
+	iterator_state->maxentries = tbm->maxentries;
+	iterator_state->nchunks = tbm->nchunks;
+	iterator_state->nentries = tbm->nentries;
+	iterator_state->npages = tbm->npages;
+	iterator_state->pagetable = tbm->dsapagetable;
+	iterator_state->spages = tbm->ptpages;
+	iterator_state->schunks = tbm->ptchunks;
+
+	/* Initialize the shared iterator state */
+	iterator_state->schunkbit = 0;
+	iterator_state->schunkptr = 0;
+	iterator_state->spageptr = 0;
+
+	/* Initialize the iterator lock */
+	LWLockInitialize(&iterator_state->lock, LWTRANCHE_PARALLEL_TBM_ITERATOR);

Set all of the structure members in the same order that you declare
them.

+ * tbm_extract_page_tuple - extract the tuple offsets from the page

s/the page/a page/

+ * Process the page bits to extract the tuple offsets and store them into
+ * TBMIterateResult.

This duplicates the preceding, a bit.  Maybe just "The extracted
offsets are stored into the TBMIterateResult".

+/*
+ * tbm_advance_schunkbit
+ *
+ * Advance the chunkbit to get the next page entry from the chunk
+ */

The formatting of this header comment is randomly different from the
preceding and following header comments.

I would change the argument to schunkbitp, declare a local variable
named schunkbit, and do int schunkbit = *schunkbitp at the top and
*schunkbitp = schunkbit at the bottom.
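Shaped roughly like this, keeping the existing bit-advancing loop as
the body (the static inline signature here is a guess; use whatever
the function already has):

	static inline void
	tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
	{
		int		schunkbit = *schunkbitp;

		/* ... advance schunkbit past zero bits, as the existing loop does ... */

		*schunkbitp = schunkbit;
	}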
+ * As above, but this will iterate using shared iterator which is shared
+ * across multiple workers.  We need to acquire the iterator LWLock, before
+ * accessing the shared members.

"using shared iterator which is shared" -> "using an iterator which
is shared"

"multiple workers" -> "multiple processes"

+		PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];

Maybe &ptbase[ptchunks[state->schunkptr]] would look a little nicer.

+		PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
+		PagetableEntry *page = ptbase + ptpages[state->spageptr];

+		PagetableEntry *page = ptbase + ptpages[state->spageptr];

Similarly.

+ * tbm_end_shared_iterate - finish an iteration over a TIDBitmap

Maybe s/an iteration/a shared iteration/

+ * As above, but it frees the memory of TBMSharedIterator.

Instead of this, I'd write "This doesn't free any of the shared
state associated with the iterator, just our backend-private state."

+	PagetableEntry *lpage = ((PagetableEntry *) arg) + *((int *) left);
+	PagetableEntry *rpage = ((PagetableEntry *) arg) + *((int *) right);

Again, try to use array dereference notation rather than pointer
arithmetic where you can.  Maybe:

PagetableEntry *base = (PagetableEntry *) arg;
PagetableEntry *lpage = &base[*(int *) left];

etc.
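Spelled out, the whole comparator would come out looking something
like this -- the function name here is made up, and I'm assuming the
comparison itself mirrors the existing private-iterator comparator,
i.e. ordering by block number:

	static int
	tbm_shared_comparator(const void *left, const void *right, void *arg)
	{
		PagetableEntry *base = (PagetableEntry *) arg;
		PagetableEntry *lpage = &base[*(int *) left];
		PagetableEntry *rpage = &base[*(int *) right];

		/* order by block number, as the private comparator does */
		if (lpage->blockno < rpage->blockno)
			return -1;
		else if (lpage->blockno > rpage->blockno)
			return 1;
		return 0;
	}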
+tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer iterator)
+{
+	TBMSharedIterator *shared_iterator;
+	TBMSharedIteratorState *shared_state;

Again, variable naming issues.  Let's go with dsa_pointer dp,
TBMSharedIterator *iterator, TBMSharedIteratorState *istate.  Please
try to go through and always use the same variable name for the same
kind of object.  It's confusing when "iterator" sometimes means a
TBMSharedIterator * and sometimes a TBMSharedIteratorState * and
sometimes a dsa_pointer.  Make it all consistent (and hopefully
logical).

+	shared_iterator = (TBMSharedIterator *) palloc(sizeof(TBMIterator) +
+					MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));

Wrong sizeof -- that should be sizeof(TBMSharedIterator), not
sizeof(TBMIterator).  (This isn't just a style issue - it's a bug!)

+ * Callback function for allocating the memory for hashtable elements.
+ * It allocates memory from DSA if tbm holds a reference to a dsa.

Inconsistent capitalization of "DSA" again.  Maybe change the whole
comment to "Allocate memory for hashtable elements, using DSA if
available."

+	if (tbm->dsapagetable)

Test against InvalidDsaPointer.  Or maybe just get rid of the "if"
and do it unconditionally; if dsapagetable is InvalidDsaPointer then
dsapagetableold will be too, so no harm done.

+	LWTRANCHE_PARALLEL_TBM_ITERATOR,

Change to LWTRANCHE_TBM, add a call to LWLockRegisterTranche in
lwlock.c, and patch the docs with the new tranche name.

Despite this fairly extensive list of mostly-trivial cosmetic
problems, I think this is in fairly good shape now.  I think the
basic design is right, so once we get these loose ends tied up this
should be good to go in.

> 0002: actual parallel bitmap heap scan by using interfaces of 0001.

This needs a lot more extensive review than I quite have the time and
energy for right now, but here are a few comments.

+	  <entry><literal>ParallelBitmapScan</></entry>

ParallelBitmapPopulate, maybe?  You're not waiting for the scan;
you're waiting for the bitmap to be populated.

+	  <entry>Waiting for leader backend to complete the TidBitmap.</entry>

Likewise, complete -> populate?

+	/* If any limit was set to zero, the user doesn't want a parallel scan. */

If we got zero?  There aren't multiple limits mentioned anywhere
nearby, so this reads oddly.

+	 * It may be possible to amortize some of the I/O cost, but probably
+	 * not very much, because most operating systems already do aggressive
+	 * prefetching.  For now, we assume that the disk run cost can't be
+	 * amortized at all.
+	 */

This looks like it was cut-and-pasted from the sequential scan case,
but it doesn't really apply here.  This presupposes we're relying on
OS prefetching, but in the case of a bitmap heap scan we do our own
prefetching.  Maybe change to "Since we do prefetching in both the
parallel and non-parallel cases, we don't expect to get substantially
more I/O parallelism by spreading the scan across multiple backends.
So, assume the disk run cost can't be amortized at all."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company