Thanks Tom. Here is a example. Just a background of things . I have made
changes in postgress execution and storage engine to make it a MPP style
engine - keeping all optimizer intact. Basically take pgress serial plan and
construct a parallel plan. The query I am running is below.

*Query*

# explain select ooj.local_time,ooj.is_timeout,ooj.capsulename,
      regexp_split_to_table(
           case
               when ooj.isubstr is null then 'none'
               when ooj.isubstr='' then 'none'
               else ooj.isubstr
           end ,';') as interest,
      sum(ooj.impression) count_impressions,
      sum(ooj.click) count_clicks
from (
   select (impression.server_utc_time/3600000)*3600 as local_time,
           impression.is_timeout_default_impression as is_timeout,
           impression.capsule_name as capsulename,
           substring(impression.website_variables from 'ybt=([0-9;]*)') as
isubstr,
           1 as impression,
           case click.impression_id when null then 0 else 1 end as click
  from
           impression_ytw2_row impression
  left outer join  clicks_row click
  on impression.impression_id = click.impression_id) ooj
group by local_time, is_timeout, capsulename, interest;

*Now if you kindly bear with me and ignore the Parallel nodes in the plan
{which are just parallel distributors} . If you compare the two plans below
and check the output tupledesc of the Hash join you will see that some
columns are not gettign pruned out - In my case this is a big column.
*
*Case 1 Plan {inline view gets merged} *

  Now when the inline view gets merged via pullup_subqueries I can see that
we have columns {impression.website_variables} which should get pruned out
but it does not after the JOIN { it gets pruned out after hash aggregate).

 Parallel reciever RANDOM queue  (nodenum=0 cost=132.79..133.12 rows=10
width=104)
   Output: (sum(1)), impression.is_timeout_default_impression,
impression.capsule_name, (sum(1)), sum(1), sum(1)
   ->  Parallel sender RANDOM queue  (nodenum=1 cost=132.79..133.12 rows=10
width=104)
         Output: (sum(1)), impression.is_timeout_default_impression,
impression.capsule_name, (sum(1)), sum(1), sum(1)
         ->  HashAggregate  (nodenum=2 cost=132.79..133.12 rows=10
width=104)
               Output: (((impression.server_utc_time / 3600000) * 3600)),
impression.is_timeout_default_impression, impression.capsule_name,
(regexp_split_to_table(CASE WHEN
("substring"((impression.website_variables)::text, 'ybt=([0-9;]*)'::text) IS
NULL
 THEN 'none'::text WHEN ("substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text) = ''::text) THEN 'none'::text ELSE
"substring"((impression.website_variables)::text, 'ybt=([0-9;]*)'::text)
END, ';'::text)), sum(1), sum(1)
               ->  Parallel reciever HASH-AGG queue  (nodenum=3
cost=117.45..130.08 rows=181 width=104)
                     Output: impression.server_utc_time,
impression.is_timeout_default_impression, impression.capsule_name, *
impression.website_variables*, ((impression.server_utc_time / 3600000) *
3600), regexp_split_to_table(CASE WHEN ("substring"((impres
ion.website_variables)::text, 'ybt=([0-9;]*)'::text) IS NULL) THEN
'none'::text WHEN ("substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text) = ''::text) THEN 'none'::text ELSE
"substring"((impression.website_variables)::text, 'ybt=([0-9;
*)'::text) END, ';'::text)
                     ->  Parallel sender HASH-AGG queue  (nodenum=4
cost=117.45..130.08 rows=181 width=104)
                           Output: impression.server_utc_time,
impression.is_timeout_default_impression, impression.capsule_name, *
impression.website_variables*, ((impression.server_utc_time / 3600000) *
3600), regexp_split_to_table(CASE WHEN ("substring"((
mpression.website_variables)::text, 'ybt=([0-9;]*)'::text) IS NULL) THEN
'none'::text WHEN ("substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text) = ''::text) THEN 'none'::text ELSE
"substring"((impression.website_variables)::text, 'ybt=
[0-9;]*)'::text) END, ';'::text)
                          * ->  Hash Left Join  (nodenum=5
cost=117.45..130.08 rows=181 width=104)
                                 Output: impression.server_utc_time,
impression.is_timeout_default_impression, impression.capsule_name,
impression.website_variables, ((impression.server_utc_time / 3600000) *
3600), regexp_split_to_table(CASE WHEN ("substr
ng"((impression.website_variables)::text, 'ybt=([0-9;]*)'::text) IS NULL)
THEN 'none'::text WHEN ("substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text) = ''::text) THEN 'none'::text ELSE
"substring"((impression.website_variables)::text,*
'ybt=([0-9;]*)'::text) END, ';'::text)
                                 ->  Parallel reciever HASH-LEFT queue
(nodenum=6 cost=0.00..3.10 rows=10 width=136)
                                       Output: impression.server_utc_time,
impression.is_timeout_default_impression, impression.capsule_name,
impression.website_variables, impression.impression_id
                                       ->  Parallel sender HASH-LEFT queue
(nodenum=7 cost=0.00..3.10 rows=10 width=136)
                                             Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
                                             ->  Seq Scan on
impression_ytw2_row impression  (nodenum=8 cost=0.00..3.10 rows=10
width=136)
                                                   Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
                                 ->  Hash  (nodenum=9 cost=72.20..72.20
rows=3620 width=32)
                                       Output: click.impression_id
                                       ->  Parallel reciever HASH-RIGHT
queue  (nodenum=10 cost=0.00..72.20 rows=3620 width=32)
                                             Output: click.impression_id
                                             ->  Parallel sender HASH-RIGHT
queue  (nodenum=11 cost=0.00..72.20 rows=3620 width=32)
                                                   Output:
click.impression_id
                                                   ->  Seq Scan on
clicks_row click  (nodenum=12 cost=0.00..72.20 rows=3620 width=32)
                                                         Output:
click.impression_id

*Case 2 { PLan when I disable code to merge view - basically do not call
pullup_subqueries - it does the right thing }

* Parallel reciever RANDOM queue  (nodenum=0 cost=133.70..137.32 rows=181
width=112)
   Output: ooj.local_time, ooj.is_timeout, ooj.capsulename, (),
sum((sum((sum(ooj.impression))))), sum((sum((sum(ooj.click)))))
   ->  Parallel sender RANDOM queue  (nodenum=1 cost=133.70..137.32 rows=181
width=112)
         Output: ooj.local_time, ooj.is_timeout, ooj.capsulename, (),
sum((sum(ooj.impression))), sum((sum(ooj.click)))
         ->  HashAggregate  (nodenum=2 cost=133.70..137.32 rows=181
width=112)
               Output: ooj.local_time, ooj.is_timeout, ooj.capsulename,
(regexp_split_to_table(CASE WHEN (ooj.isubstr IS NULL) THEN 'none'::text
WHEN (ooj.isubstr = ''::text) THEN 'none'::text ELSE ooj.isubstr END,
';'::text)), sum(ooj.impression), sum(oo
.click)
               ->  Parallel reciever HASH-AGG queue  (nodenum=3
cost=117.45..130.98 rows=181 width=112)
                     Output: ooj.local_time, ooj.is_timeout,
ooj.capsulename, ooj.isubstr, ooj.impression, ooj.click,
regexp_split_to_table(CASE WHEN (ooj.isubstr IS NULL) THEN 'none'::text WHEN
(ooj.isubstr = ''::text) THEN 'none'::text ELSE ooj.isubstr
ND, ';'::text)
                     ->  Parallel sender HASH-AGG queue  (nodenum=4
cost=117.45..130.98 rows=181 width=112)
                           Output: ooj.local_time, ooj.is_timeout,
ooj.capsulename, ooj.isubstr, ooj.impression, ooj.click,
regexp_split_to_table(CASE WHEN (ooj.isubstr IS NULL) THEN 'none'::text WHEN
(ooj.isubstr = ''::text) THEN 'none'::text ELSE ooj.is
bstr END, ';'::text)
                           ->  Subquery Scan ooj  (nodenum=5
cost=117.45..130.98 rows=181 width=112)
                                 Output: ooj.local_time, ooj.is_timeout,
ooj.capsulename, ooj.isubstr, ooj.impression, ooj.click,
regexp_split_to_table(CASE WHEN (ooj.isubstr IS NULL) THEN 'none'::text WHEN
(ooj.isubstr = ''::text) THEN 'none'::text ELSE
oj.isubstr END, ';'::text)
                                 *->  Hash Left Join  (nodenum=6
cost=117.45..128.27 rows=181 width=104)
                                       Output: ((impression.server_utc_time
/ 3600000) * 3600), impression.is_timeout_default_impression,
impression.capsule_name, "substring"((impression.website_variables)::text,
'ybt=([0-9;]*)'::text), 1, 1*
                                       ->  Parallel reciever HASH-LEFT
queue  (nodenum=7 cost=0.00..3.10 rows=10 width=136)
                                             Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
                                             ->  Parallel sender HASH-LEFT
queue  (nodenum=8 cost=0.00..3.10 rows=10 width=136)
                                                   Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
                                                   ->  Seq Scan on
impression_ytw2_row impression  (nodenum=9 cost=0.00..3.10 rows=10
width=136)
                                                         Output:
impression.server_utc_time, impression.is_timeout_default_impression,
impression.capsule_name, impression.website_variables,
impression.impression_id
                                       ->  Hash  (nodenum=10
cost=72.20..72.20 rows=3620 width=32)
                                             Output: click.impression_id
                                             ->  Parallel reciever
HASH-RIGHT queue  (nodenum=11 cost=0.00..72.20 rows=3620 width=32)
                                                   Output:
click.impression_id
                                                   ->  Parallel sender
HASH-RIGHT queue  (nodenum=12 cost=0.00..72.20 rows=3620 width=32)
                                                         Output:
click.impression_id
                                                         ->  Seq Scan on
clicks_row click  (nodenum=13 cost=0.00..72.20 rows=3620 width=32)
                                                               Output:
click.impression_id

*More analysis

*

   1. Looked at code in make_join_rel and build_joinrel_tlist I can see in
   one case the bms_nonempty_difference code kicks in and finds that the column
   can be pruned out and other case it does not do the right thing.
   2. I am still trying to work out why pullup_subquery should make a
   difference.

Any pointers appreciated .

Regards
Harmeek

2011 at 7:24 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> HarmeekSingh Bedi <harmeeksi...@gmail.com> writes:
>>    In a plan where Node 1 is parent {say join) and Node 2 is child
>> (say scan)  . If node 1 has a expression say foo(column) then scan
>> will project 'column'  for sure and join will
>>       evaluate foo(column).  Now if the node above join does not need
>> column ? how does postgress remove the column from join's projection
>> list .
>
> See build_joinrel_tlist() in relnode.c.
>
>> I am seeing that it does not in many
>>       cases specially when sort appears above.
>
> Please show a concrete example.
>
>                        regards, tom lane
>

Reply via email to