On 20 September 2017 at 00:06, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Sep 15, 2017 at 7:25 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> [ new patch ]
>
> This already fails to apply again. In general, I think it would be a
> good idea to break this up into a patch series rather than have it as
> a single patch. That would allow some bits to be applied earlier.
> The main patch will probably still be pretty big, but at least we can
> make things a little easier by getting some of the cleanup out of the
> way first. Specific suggestions on what to break out below.
>
> If the changes to rewriteManip.c are a marginal efficiency hack and
> nothing more, then let's commit this part separately before the main
> patch. If they're necessary for correctness, then please add a
> comment explaining why they are necessary.

Ok. Yes, just wanted to avoid two ConvertRowtypeExpr nodes one over the
other. But that was not causing any correctness issue. Will extract these
changes into a separate patch.
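Just to illustrate what I mean by that (this is only a rough sketch with
made-up variable names, not the actual hunk from the patch): before wrapping
an expression in a new ConvertRowtypeExpr to get it into the parent's
rowtype, we can first peel off a ConvertRowtypeExpr that is already there,
something like :

    /*
     * Illustrative sketch only.  'expr' is the Node * expression being
     * converted and 'parent_rowtype' is the target composite type OID;
     * both names are made up for this example.
     */
    ConvertRowtypeExpr *cre;

    if (IsA(expr, ConvertRowtypeExpr))
        expr = (Node *) ((ConvertRowtypeExpr *) expr)->arg;

    cre = makeNode(ConvertRowtypeExpr);
    cre->arg = (Expr *) expr;
    cre->resulttype = parent_rowtype;
    cre->convertformat = COERCE_IMPLICIT_CAST;
    cre->location = -1;

This way the result has a single conversion node rather than one
ConvertRowtypeExpr sitting on top of another.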
>
> There appears to be no reason why the definitions of
> GetInsertedColumns() and GetUpdatedColumns() need to be moved to a
> header file as a result of this patch. GetUpdatedColumns() was
> previously defined in trigger.c and execMain.c and, post-patch, is
> still called from only those files. GetInsertedColumns() was, and
> remains, called only from execMain.c. If this were needed I'd suggest
> doing it as a preparatory patch before the main patch, but it seems we
> don't need it at all.

In earlier versions of the patch, these functions were used in
nodeModifyTable.c as well. Now that those calls are no longer there, I
will revert the changes that moved the definitions into a header file.

>
> If I understand correctly, the reason for changing mt_partitions from
> ResultRelInfo * to ResultRelInfo ** is that, currently, all of the
> ResultRelInfos for a partitioning hierarchy are allocated as a single
> chunk, but we can't do that and also reuse the ResultRelInfos created
> during InitPlan. I suggest that we do this as a preparatory patch.

Ok, will prepare a separate patch. Do you mean to include in that patch
the changes I did in ExecSetupPartitionTupleRouting() that re-use the
ResultRelInfo structures of the per-subplan update result rels ?

> Someone could argue that this is going the wrong way and that we ought
> to instead make InitPlan() create all of the necessary
> ResultRelInfos, but it seems to me that eventually we probably want to
> allow setting up ResultRelInfos on the fly for only those partitions
> for which we end up needing them. The code already has some provision
> for creating ResultRelInfos on the fly - see ExecGetTriggerResultRel.
> I don't think it's this patch's job to try to apply that kind of thing
> to tuple routing, but it seems like in the long run if we're inserting
> 1 tuple into a table with 1000 partitions, or performing 1 update that
> touches the partition key, it would be best not to create
> ResultRelInfos for all 1000 partitions just for fun.

Yes, that makes sense.

> But this sort of
> thing seems much easier if mt_partitions is ResultRelInfo ** rather
> than ResultRelInfo *, so I think what you have is going in the right
> direction.

Ok.

>
> + * mtstate->resultRelInfo[]. Note: We assume that if the resultRelInfo
> + * does not belong to subplans, then it already matches the root tuple
> + * descriptor; although there is no such known scenario where this
> + * could happen.
> + */
> + if (rootResultRelInfo != resultRelInfo &&
> +     mtstate->mt_persubplan_childparent_maps != NULL &&
> +     resultRelInfo >= mtstate->resultRelInfo &&
> +     resultRelInfo <= mtstate->resultRelInfo + mtstate->mt_nplans - 1)
> + {
> +     int map_index = resultRelInfo - mtstate->resultRelInfo;
>
> I think you should Assert() that it doesn't happen instead of assuming
> that it doesn't happen. IOW, remove the last two branches of the
> if-condition, and then add an Assert() that map_index is sane.

Ok.

>
> It is not clear to me why we need both mt_perleaf_childparent_maps and
> mt_persubplan_childparent_maps.

mt_perleaf_childparent_maps : This is used for converting
transition-captured inserted/modified/deleted tuples from a leaf
partition to the root partition, because we need all the captured rows
in the root partition's attribute order. This map is used only for
tuples that are routed from the root to a leaf partition during INSERT,
or from one leaf partition to another during update row movement. For
both of these operations we need per-leaf maps, because during tuple
conversion the source relation is among mtstate->mt_partitions.
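To make that concrete, the per-leaf array essentially has one
child-to-root conversion map per leaf partition, built more or less like
below (just a rough fragment with made-up variable names, not the exact
code in the patch) :

    /*
     * Rough sketch with made-up names.  leaf_part_rels[] are the leaf
     * partition relations obtained from tuple routing, rootRel is the
     * root partitioned table.  A NULL map means the tuple descriptors
     * already match.
     */
    TupleDesc   rootdesc = RelationGetDescr(rootRel);
    TupleConversionMap **leaf_childparent_maps;
    int         i;

    leaf_childparent_maps = (TupleConversionMap **)
        palloc0(num_leaf_parts * sizeof(TupleConversionMap *));

    for (i = 0; i < num_leaf_parts; i++)
    {
        leaf_childparent_maps[i] =
            convert_tuples_by_name(RelationGetDescr(leaf_part_rels[i]),
                                   rootdesc,
                                   gettext_noop("could not convert row type"));
    }

The idea is that whenever a tuple to be captured comes from leaf partition
i, it gets run through do_convert_tuple() with leaf_childparent_maps[i]
(if that map is non-NULL) before being appended to the tuplestore, so that
it is stored in the root's attribute order.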
mt_persubplan_childparent_maps : This is used at two places :

1. After ExecUpdate() updates a row of a per-subplan update result rel,
we need to capture the tuple, so again we need to convert it to the root
partition. Here, the source table is a per-subplan update result rel, so
we need a per-subplan conversion map array. After UPDATE finishes with
one update result rel, node->mt_transition_capture->tcs_map shifts to the
next element in the mt_persubplan_childparent_maps array :

ExecModifyTable()
{
    ....
    node->mt_transition_capture->tcs_map =
        node->mt_persubplan_childparent_maps[node->mt_whichplan];
    ....
}

2. In ExecInsert(), if the insert is part of update tuple routing, we
need to convert the tuple from the update result rel to the root
partition, so it re-uses this same conversion map.

Now, instead of these two maps having separate allocations, I have
arranged for the per-leaf map array to re-use the mapping allocations
made by the per-subplan array elements, similar to how we re-use the
ResultRelInfos. But the arrays themselves still need to be separate.

>
> + * Note: if the UPDATE is converted into a DELETE+INSERT as part of
> + * update-partition-key operation, then this function is also called
> + * separately for DELETE and INSERT to capture transition table rows.
> + * In such case, either old tuple or new tuple can be NULL.
>
> That seems pretty strange. I don't quite see how that's going to work
> correctly. I'm skeptical about the idea that the old tuple capture
> and new tuple capture can safely happen at different times.

Actually the tuple capture just involves adding the tuple into the
correct tuplestore for a particular event; no trigger event is added for
tuple capture. Calling ExecARUpdateTriggers() with either newtuple NULL
or tupleid invalid makes sure that it does not do anything other than
transition capture :

@@ -5306,7 +5322,8 @@ AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 	/* If transition tables are the only reason we're here, return. */
 	if (trigdesc == NULL ||
 		(event == TRIGGER_EVENT_DELETE && !trigdesc->trig_delete_after_row) ||
 		(event == TRIGGER_EVENT_INSERT && !trigdesc->trig_insert_after_row) ||
-		(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row))
+		(event == TRIGGER_EVENT_UPDATE && !trigdesc->trig_update_after_row) ||
+		(event == TRIGGER_EVENT_UPDATE && (oldtup == NULL || newtup == NULL)))
 		return;

Even if we imagine a single place or a single function that we could call
to do the OLD and NEW row capture, the end result is going to be the
same : the OLD row would go into
mtstate->mt_transition_capture->tcs_old_tuplestore, and the NEW row would
end up in mtstate->mt_transition_capture->tcs_update_tuplestore. Note
that these tuplestores are common to all the partitions of the partition
tree. (Actually I am still rebasing my patch over the recent changes
where tcs_update_tuplestore no longer exists; instead we need to use
transition_capture->tcs_private->new_tuplestore.) When we access the OLD
and NEW tables in an UPDATE trigger, there is no longer any correlation
as to which row of the OLD TABLE corresponds to which row of the NEW
TABLE for a given updated row.

So, at exactly which point the OLD row and the NEW row get captured into
their respective tuplestores, and in which order, is not important.
Whereas for the usual per-ROW triggers, it is critical that the trigger
event has both the OLD and NEW rows together, since both need to be
accessible in the same trigger function.

Doing the OLD and NEW table row capture separately is essential, because
the DELETE and INSERT happen on different tables, so we are not even sure
the insert is going to happen at all (thanks to triggers on partitions,
if any). If the insert is skipped, we should not capture that tuple.
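So, roughly speaking, the two halves of a row movement do something like
the following (a simplified sketch, not the exact calls in the patch;
the surrounding variables are assumed from the ExecDelete()/ExecInsert()
context) :

    /*
     * Simplified sketch.  In the DELETE half of a row movement, capture
     * only the OLD row: tupleid identifies the old row, and newtuple is
     * passed as NULL.
     */
    ExecARUpdateTriggers(estate, resultRelInfo,
                         tupleid, NULL,
                         NULL,          /* no new tuple */
                         NIL,
                         mtstate->mt_transition_capture);

    /*
     * In the INSERT half (in the destination partition), capture only
     * the NEW row: tupleid is invalid (NULL) and only newtuple is passed.
     */
    ExecARUpdateTriggers(estate, resultRelInfo,
                         NULL,          /* no old tuple id */
                         NULL,
                         newtuple,
                         NIL,
                         mtstate->mt_transition_capture);

With the extra condition shown in the diff above, each such call does the
tuplestore capture and then returns without queuing any AFTER ROW UPDATE
event.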
>
> I wonder if we should have a reloption controlling whether
> update-tuple routing is enabled. I wonder how much more expensive it
> is to execute UPDATE root SET a = a + 1 WHERE a = 1 on a table with
> 1000 subpartitions with this patch than without, assuming the update
> succeeds in both cases.

You mean to check how much the patch slows down the existing updates that
involve no row movement ? And then have a reloption so that there is a
way to disable the logic that causes the slowdown ?

>
> I also wonder how efficient this implementation is in general. For
> example, suppose you make a table with 1000 partitions each containing
> 10,000 tuples and update them all, and consider three scenarios: (1)
> partition key not updated but all tuples subject to non-HOT updates
> because the updated column is indexed, (2) partition key updated but
> no tuple movement required as a result, (3) partition key updated and
> all tuples move to a different partition. It would be useful to
> compare the times, and also to look at perf profiles and see if there
> are any obvious sources of inefficiency that can be squeezed out. It
> wouldn't surprise me if tuple movement is a bit slower than the other
> scenarios, but it would be nice to know how much slower and whether
> the bottlenecks are anything that we can easily fix.

Ok, yes, that would be helpful for removing any unnecessary slowness that
the patch may have introduced; will do.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company