[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #299: Support window functions with PARTITION BY clause

GitBox Mon, 21 Jun 2021 20:45:53 -0700


Jimexist edited a comment on issue #299:
URL: 
https://github.com/apache/arrow-datafusion/issues/299#issuecomment-859529899



   > > > @Jimexist
   > > > No, but AFAIK you can pre-partition based on the partition expression, 
like for example we do for hash joins.
   > > > You have to execute the partition too in the implementation of the 
window functions, but each partition has all of the equal partition values 
after doing a hash repartition.
   > > > So a `HashPartition(partition_by_expr) -> Window(partition_by_expr, 
order_by)` (per partition), should be the same as `Window(partition_by_expr, 
order_by)` (on 1 partition)
   > > 
   > > 
   > > i wonder why [postgres decided to use sort instead of hash partition 
regardless](https://github.com/postgres/postgres/blob/c30f54ad732ca5c8762bb68bbe0f51de9137dd72/src/backend/executor/nodeWindowAgg.c#L10-L12)
   > > ```
   > > # explain select max(c2) over (partition by c3) from test;
   > >                             QUERY PLAN
   > > -------------------------------------------------------------------
   > >  WindowAgg  (cost=44.96..55.81 rows=620 width=6)
   > >    ->  Sort  (cost=44.96..46.51 rows=620 width=6)
   > >          Sort Key: c3
   > >          ->  Seq Scan on test  (cost=0.00..16.20 rows=620 width=6)
   > > (4 rows)
   > > ```
   > > 
   > > 
   > >     
   > >       
   > >     
   > > 
   > >       
   > >     
   > > 
   > >     
   > >   
   > > maybe most likely due to code reuse?
   > 
   > PostgreSQL uses a minimal amount of multithreading, as it is designed 
mostly for transactional processing (OLTP) on smaller datasets. For execution 
on one thread, doing extra work would slow it down a bit, so it would be better 
to not use that at all. For hash join we do the same too, the partitioning is 
only applied when `concurrency>1`.
   > Only for really big tables / costly queries PostgreSQL will opt to use 
multiple workers (which will be visible in the query plan), but not sure 
whether it will even use hash repartitioning in that case.
   > 
   > I believe e.g. Spark always does a partitioning based on partition by, 
which makes it execute much faster / scalable in the presence of a partition by 
clause as each worker/thread can execute each part individually.
   
   that's a valid point. i guess here my plan is / continues to be:
   1. implement a correct version using global sort that covers partition and 
order by and window frame
   2. setup integration tests that compare results
   3. [come up with a more realistic benchmark dataset that's much larger than 
100 rows](https://github.com/apache/arrow-datafusion/issues/565)
   4. [migrate to use repartition for inter-partition 
parallism](https://github.com/apache/arrow-datafusion/pull/569)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-datafusion] Jimexist edited a comment on issue #299: Support window functions with PARTITION BY clause

Reply via email to