Dear PostgreSQL hackers,

[ please CC to me as I'm not subscribed to the list ]

I think there is a missing optimization when a filter is
applied after a window function, where the filtered field
is also used for partitioning.

Here a simplified example: Suppose we have a table
that stores 100.000 ids, and 10 random values per id:

    create table test (
        id integer,
        value double precision,
        primary key (id, value)
    );

    insert into test (id, value)
        select
            generate_series(1, 100000) id,
            random()
        from
            generate_series(1, 10);

We are now asking for the rank of each value per id
(it doesn't matter if it's rank() or any other window
function):

    select
        id,
        value,
        rank() over (partition by id order by value)
    from
        test;

Now suppose we're interested only in the result for
id 23000. If the query above was defined in a view,
this would lead to the following query:

    select
        *
    from
        (
            select
                id,
                value,
                rank() over (partition by id order by value)
            from
                test
        ) subquery
    where
        id = 23000;

This is very slow! The "explain" command shows the reason:

    Subquery Scan on subquery  (cost=0.00..86556.80 rows=5000 width=20)
       Filter: (subquery.id = 23000)
        ->  WindowAgg  (cost=0.00..74056.80 rows=1000000 width=12)
             ->  Index Scan using test_pkey on test  (cost=0.00..56556.80 
rows=1000000 width=12)

The Filter (id = 23000) is applied after WindowAgg,
although it could certainly be applied before!

Of course we could rewrite the above query like this:

    select
        id,
        value,
        rank() over (partition by id order by value)
    from
        test
    where
        id = 23000;

This would lead to a more sane plan where the filter is
applied before WindowAgg, making use of the primary key:

    WindowAgg  (cost=43.53..43.70 rows=10 width=12)
      ->  Sort  (cost=43.53..43.55 rows=10 width=12)
            Sort Key: value
            ->  Bitmap Heap Scan on test  (cost=4.53..43.36 rows=10 width=12)
                  Recheck Cond: (id = 23000)
                  ->  Bitmap Index Scan on test_pkey  (cost=0.00..4.53 rows=10 
width=0)
                        Index Cond: (id = 23000)

However, the second query cannot be cleanly split into
a generic view and a specialized part anymore.

In other words, it is currently impossible to use Window
functions in a View without risking serious, unexpected
performance issues, even in simple cases!

I propose the following general optimization: If all window
functions are partitioned by the same first field (here: id),
then any filter on that field should be executed before
WindowAgg. So a query like this:

    select
        ...
    from
        (select
            ... over (partition by FIELD, other2, ...)
            ... over (partition by FIELD, other3, ...)
            ... over (partition by FIELD, ...)
            ...
        )
    where
        SOME_FILTER(FIELD)

Shoule be transformed into that:

    select
        ...
    from
        (select
            ... over (partition by FIELD, other2, ...)
            ... over (partition by FIELD, other3, ...)
            ... over (partition by FIELD, ...)
            ...
         where
            SOME_FILTER(FIELD)
        )

I verfied that this issue exists with PostgreSQL 9.2 beta 1
as well as the PostgreSQL 9.1 release.

Is there any work on this issue? Or am I wrong and this
optimization should not be implemented for some reason?


Regards,
Volker

-- 
Volker Grabsch
---<<(())>>---

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to