Hello,

I have recently been working on a ParallelAppend feature
for the next commit fest. Below is my design proposal.

1. Concept
----------
The concept is quite simple; anybody might have considered it more
than once. A ParallelAppend node kicks off background worker
processes to execute its child nodes in parallel / asynchronously.
It intends to improve the performance of scanning large partitioned
tables from the standpoint of overall throughput; latency of the
first few hundred rows is out of scope for this project.
From the standpoint of technology trends, it primarily tries to
utilize the multi-core capability within a system, but it also
enables expansion to a distributed database environment using the
foreign-table inheritance feature.
Its behavior is very similar to the Funnel node except for several
points; thus, we can reuse the infrastructure for which we have had
a long-standing discussion throughout the v9.5 development cycle.

2. Problems to be solved
-------------------------
Typical OLAP workloads involve joins of many tables and scans on
large tables which are often partitioned; their KPI is query
response time, while only a small number of sessions are active
simultaneously. So we are required to run a single query as fast as
possible, even if it consumes more computing resources than typical
OLTP workloads.

The current implementation of heap scanning is painful when we look
at its behavior from that standpoint (how many rows we can read
within a certain time) because of its synchronous manner.
In the worst case, when a SeqScan node tries to fetch the next
tuple, heap_getnext() looks up a block in shared buffers, then
ReadBuffer() calls the storage manager to read the target block from
the filesystem if it is not already in the buffer. Next, the
operating system puts the calling process to sleep until the
required I/O completes.
Most cases are resolved at an earlier stage than this worst case;
however, the best scenario we can expect is that the next tuple
already appears at the top of the message queue (with visibility
checks already done as well), with no fall-through to the buffer
manager or deeper layers.
If we can run multiple scans in parallel / asynchronously, the CPU
core shall be assigned to another process by the operating system;
thus, it eventually improves the I/O density and enables higher
processing throughput.
The Append node is an ideal point to parallelize because
- child nodes can have physically different locations via
  tablespaces, so further tuning is possible according to the system
  landscape.
- it can control whether a subplan is actually executed on a
  background worker, on a per-subplan basis. If the subplans contain
  both large and small tables, ParallelAppend may kick off
  background workers to scan only the large tables, while scanning
  the small tables itself.
- like the Funnel node, we don't need to care about enhancements to
  individual node types. SeqScan, IndexScan, ForeignScan, and others
  can perform as usual, but actually in parallel.


3. Implementation
------------------
* Plan & Cost

ParallelAppend shall appear wherever Append can appear, except for
its usage as a dummy. So I'll enhance set_append_rel_pathlist() to
add both an AppendPath and a ParallelAppendPath, with a cost for
each. The cost estimation logic shall take further discussion;
however, I expect the logic below to estimate the cost of a
ParallelAppend.
  1. Sum startup_cost and run_cost for each child path node, but
     distinguish synchronous ones from asynchronous ones.
     Probably, a path node whose total cost is less than:
      (parallel_setup_cost + its total cost / parallel_append_degree
                           + number of rows * cpu_tuple_comm_cost)
     is nonsense to run on a background worker, so it should stay
     synchronous.
  2. parallel_setup_cost * (# of asynchronous nodes) is added to the
     sum of the startup_cost of the asynchronous nodes.
  3. The sum of the run_cost of the asynchronous nodes is divided by
     parallel_append_degree, then cpu_tuple_comm_cost * (total # of
     rows from asynchronous nodes) is added.
  4. The synchronous and asynchronous costs are added together; the
     result becomes the cost of the ParallelAppend.
Obviously, this stands on the viewpoint that cost reflects the
response time of the underlying plan. So the cost of a
ParallelAppend can be smaller than the sum of its underlying child
nodes.

* Execution

Like the Funnel node, it kicks off background workers in the
ExecProcNode handler; thus, its startup may be later than in
Fujita-san's approach if the first call of ParallelAppend comes
late, for example when the ParallelAppend is located under a
HashJoin whose inner Hash loads billions of rows.
Although I expect ExecParallelAppend to use, at least, simple
round-robin scheduling like funnel_getnext(), we may prefer to serve
the synchronous nodes rather than the asynchronous ones just after
the background worker startup.

4. Further challenges
----------------------
* Serialization of CustomScan via outfuncs.c/readfuncs.c
  Because the methods field is basically a set of per-process
  pointers, we need an infrastructure to reproduce the same method
  table, identified by name, on the background worker process.
  (I am also trying to design this.)

* Duplication of parallelism
  If a Funnel+PartialSeqScan is located under a ParallelAppend,
  directly or indirectly, it eventually leads a background worker
  process to launch further background workers. Is that an expected
  usage of the current background worker infrastructure?

* Join pushdown
  Distribution of nested-loop and hash joins may gain an advantage
  from parallel processing, and from reduction of the hash size if
  the CHECK() constraints of the individual partitioned tables tell
  us which rows obviously cannot be joined.
  Also see the thread:
    [idea] table partition + hash join: http://bit.ly/1S2xpHT
  A colleague of mine has already started to investigate / develop
  this feature based on the existing Append, to reduce num_batches.

  As an aside, my GpuJoin feature works most effectively if the
  entire inner relation can be loaded into a hash table in GPU RAM,
  so such features are very welcome.

* Sort break-down
  If a merge join tries to have a ParallelAppend node on its left or
  right input, we may be able to compare its cost with
  MergeParallelAppend + Sort on the partial relations.

* Aggregate Push Down
  It is exactly what I want to do.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
