RE: pgsql: Add a new GUC and a reloption to enable inserts in parallel-mode

[email protected] Fri, 26 Mar 2021 02:29:10 -0700

From: Robert Haas <[email protected]>
> On Wed, Mar 24, 2021 at 12:48 AM Andres Freund <[email protected]>
> wrote:
> > Although this specific hack doesn't seem too terrible to me. If you
> > execute a parallel insert the likelihood to end up not needing an xid is
> > pretty low. Implementing it concurrently does seem like it'd end up
> > needing another lwlock nested around xid assignment, or some more
> > complicated scheme with already holding XidGenLock or retries. But maybe
> > I'm missing an easy solution here.
> 
> I don't think you need to do anything that is known outside the group
> of processes involved in the parallel query. I think you just need to
> make sure that only one of them is trying to acquire an XID at a time,
> and that all the others find out about it. I haven't thought too hard
> about the timing: if one process acquires an XID for the transaction,
> is it OK if the others do an arbitrary amount of work before they
> realize that this has happened? Also, there's the problem that the
> leader has the whole transaction stack and the workers don't, so the
> recursive nature of XID acquisition is a problem. I suspect these are
> all pretty solvable problems; I just haven't put in the energy. But,
> it could also be that I'm missing something.

It doesn't seem easy to make parallel workers allocate an XID and share it
among the parallel processes. When the DML is run inside a deeply nested
subtransaction and the parent transactions have not allocated their XIDs yet,
the worker needs to allocate the XIDs for its parents. That indeterminate
number of XIDs must be stored in shared memory. The stack of TransactionState
structures must also be passed. Also, TransactionIdIsCurrentTransactionId()
uses an array ParallelCurrentXids where parallel workers receive sub-committed
XIDs from the leader. This needs to be reconsidered.

Before that, I don't see the need for parallel workers to allocate the XID. As
the following Oracle manual says, parallel DML will be used in data analytics
and OLTP batch jobs. There should be plenty of source data in those scenarios.

When to Use Parallel DML
https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-18B2AF09-C548-48DE-A794-86224111549F
--------------------------------------------------
Several scenarios where parallel DML is used include:

Refreshing Tables in a Data Warehouse System

Creating Intermediate Summary Tables

Using Scoring Tables

Updating Historical Tables

Running Batch Jobs
--------------------------------------------------

I don't mean to say we want to use the easy hack as we want to be lazy. I'd
like to know whether we *really* need the effort. And I want PostgreSQL to
provide great competitive features as early as possible without messing up the
design and code.

For what kind of realistic conceivable scenarios do we need the sophisticated
XID assignment mechanism in parallel workers?

Regards
Takayuki Tsunakawa

RE: pgsql: Add a new GUC and a reloption to enable inserts in parallel-mode

Reply via email to