Re: [HACKERS] Partitioned tables and relfilenode

Amit Langote Thu, 16 Feb 2017 18:35:48 -0800

On 2017/02/15 16:14, Michael Paquier wrote:
> On Fri, Feb 10, 2017 at 3:19 PM, Amit Langote
> <[email protected]> wrote:
>> The new partitioned tables do not contain any data by themselves.  Any
>> data inserted into a partitioned table is routed to and stored in one of
>> its partitions.  In fact, it is impossible to insert *any* data before a
>> partition (to be precise, a leaf partition) is created.  It seems wasteful
>> then to allocate physical storage (files) for partitioned tables.  If we
>> do not allocate the storage, then we must make sure that the right thing
>> happens when a command that is intended to manipulate a table's storage
>> encounters a partitioned table, the "right thing" here being that the
>> command's code either throws an error or warning (in some cases) if the
>> specified table is a partitioned table or ignores any partitioned tables
>> when it reads the list of relations to process from pg_class.  Commands
>> that need to be taught about this are vacuum, analyze, truncate, and alter
>> table.  Specifically:
> 
> Thanks. I have been looking a bit at this set of patches.


Thanks for reviewing!

>> - In case of vacuum, specifying a partitioned table causes a warning
>> - In case of analyze, we do not throw an error or warning but simply
>>   avoid calling do_analyze_rel() *non-recursively*.  Further in
>>   acquire_inherited_sample_rows(), any partitioned tables in the list
>>   returned by find_all_inheritors() are skipped.
>> - In case of truncate, only the part which manipulates table's physical
>>   storage is skipped for partitioned tables.
> 
> I am wondering if instead those should not just be errors saying that
> such operations are just not support. This could be relaxed in the
> future to mean that a vacuum, truncate or analyze on the partitioned
> table triggers an operator in cascade.

As the discussion down-thread suggests, that might be a path worth
considering.  That is, vacuum or analyze on a partitioned table causes all
the (leaf) partitions to be vacuumed or analyzed.

Truncate already does that (that is, recurse to leaf partitions).  This
patch simply prevents ExecuteTruncate() from trying to manipulate the
partitioned table's relfilenode.

>> - Since we cannot create indexes on partitioned tables anyway, there is
>>   no need to handle cluster and reindex (they throw a meaningful error
>>   already due to the lack of indexes.)
> 
> Yep.

This one is still under discussion.

>> Patches 0001 and 0002 of the attached implement the above part.  0001
>> teaches the above mentioned commands to do the "right thing" as described
>> above and 0002 teaches heap_create() and heap_create_with_catalog() to not
>> create any physical storage (none of the forks) for partitioned tables.
> 
> -   else
> +   /*
> +    * Although we cannot analyze partitioned tables themselves, we are
> +    * still be able to do the recursive ANALYZE.
> +    */
> +   else if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
>     {
>         /* No need for a WARNING if we already complained during VACUUM */
>         if (!(options & VACOPT_VACUUM))
> It seems to me that it is better style to use an empty else if with
> only the comment. Comments related to a condition that are outside
> this condition should be conditional in their formulations. Comments
> within a condition can be affirmations when they refer to a decided
> state of things.

Not sure if this will survive based on the decisions on what to do about
vacuum and analyze, but how about the following rewrite of the above:

else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
    /*
     * Although we cannot analyze partitioned tables themselves, we are
     * still be able to do the recursive ANALYZE, which we do below.
     */
}
else
{

>>From patch 2:
> @@ -1351,7 +1352,12 @@ heap_create_with_catalog(const char *relname,
>                relkind == RELKIND_TOASTVALUE ||
>                relkind == RELKIND_PARTITIONED_TABLE);
> 
> -       heap_create_init_fork(new_rel_desc);
> +       /*
> +        * We do not want to create any storage objects for a partitioned
> +        * table.
> +        */
> +       if (relkind != RELKIND_PARTITIONED_TABLE)
> +           heap_create_init_fork(new_rel_desc);
>     }
> This does not make much sense with an assertion telling exactly the
> contrary a couple of lines before. I think that it would make more
> sense to move the assertion on relkind directly in
> heap_create_init_fork().

Hmm, perhaps the below is saner, with the Assert moved to the start of
heap_create_init_fork() as shown below:

/*
 * We do not want to create any storage objects for a partitioned
 * table, including the init fork.
 */
if (relpersistence == RELPERSISTENCE_UNLOGGED &&
    relkind != RELKIND_PARTITIONED_TABLE)
    heap_create_init_fork(new_rel_desc);


@@ -1382,6 +1376,8 @@ heap_create_with_catalog(const char *relname,
 void
 heap_create_init_fork(Relation rel)
 {
+    Assert(relkind == RELKIND_RELATION || relkind == RELKIND_MATVIEW ||
+           relkind == RELKIND_TOASTVALUE);

>> Then comes 0003, which concerns inheritance planning.  In a regular
>> inheritance set (ie, the inheritance set corresponding to an inheritance
>> hierarchy whose root is a regular table), the inheritance parents are
>> included in their role as child members, because they might contribute
>> rows to the final result.  So AppendRelInfo's are created for each such
>> table by the planner prep phase, which the later planning steps use to
>> create a scan plan for those tables as the Append's child plans.
>> Currently, the partitioned tables are also processed by the optimizer as
>> inheritance sets.  Partitioned table inheritance parents do not own any
>> storage, so we *must* not create scan plans for them.  So we do not need
>> to process them as child members of the inheritance set.  0003 teaches
>> expand_inherited_rtentry() to not add partitioned tables as child members.
>>  Also, since the root partitioned table RTE is no longer added to the
>> Append list as the 1st child member, inheritance_planner() cannot assume
>> that it can install the 1st child RTE as the nominalRelation of a given
>> ModifyTable node, instead the original root parent table RTE is installed
>> as the nominalRelation.
> 
> This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE
> to avoid interactions with partition tables. Did you consider doing
> something in the executor instead?

Getting rid of the parent relation (of which there could be many if a
given partition tree is multi-level) well within the optimizer means that
the optimizer makes a plan that does not have redundant Scan nodes for
partitioned tables (and if the 2nd patch is worthy of consideration, we
absolutely cannot create Scan nodes for them at all).  And Robert's point
about optimizer's cost and selectivity estimations.

I will post the updated patches after addressing other comments in the thread.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partitioned tables and relfilenode

Reply via email to