[GitHub] [arrow-ballista] mingmwang commented on issue #284: Partitioning reasoning in DataFusion and Ballista

GitBox Wed, 28 Sep 2022 03:37:44 -0700


mingmwang commented on issue #284:
URL: https://github.com/apache/arrow-ballista/issues/284#issuecomment-1260716387


   > I'll add the addition of range partitioning as well to this list - 
currently normal sorts are not running in parallel / distributed fashion.
   
   Yes, there are currently a couple of gaps in the physical plan phase, and 
the ExecutionPlan trait also needs to be enhanced.
   Consider the SQL below: ideally it should need only 1 or 2 remote shuffle 
exchanges. 
   Today SparkSQL uses 4 shuffle exchanges for it; I'm not sure how many remote 
shuffles PrestoSQL uses.
   
   ````
   select cntry_id,
          count(*),
          count(*) over (partition by cntry_id) as w,
          count(*) over (partition by curncy_id) as c,
          count(cntry_desc) over (partition by curncy_id, cntry_desc) as d,
          count(cntry_desc) over (partition by cntry_id, curncy_id, cntry_desc) as e
   from dw_countries
   group by cre_date, cntry_id, curncy_id, cntry_desc
   having cntry_id = 3;
   ````
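
To make the "1 or 2 exchanges" estimate concrete, here is a rough sketch of the reasoning as a toy model. It is not DataFusion or Ballista code: `Keys`, `Op`, `satisfies`, `count_exchanges`, `naive_plan` and `reasoned_plan` are all hypothetical names. It assumes that hash partitioning on a non-empty subset of an operator's required keys already co-locates the rows that operator needs, that the window operators may be reordered freely, and it ignores partial aggregation, data skew and the HAVING filter.

```rust
use std::collections::BTreeSet;

// Hypothetical toy model; none of these names are DataFusion or Ballista APIs.
type Keys = BTreeSet<&'static str>;

fn keys(cols: &[&'static str]) -> Keys {
    cols.iter().copied().collect()
}

/// Rows that agree on every `required` column also agree on every `existing`
/// column when `existing` is a non-empty subset of `required`, so they are
/// already in the same hash partition and no new shuffle is needed.
fn satisfies(existing: &Keys, required: &Keys) -> bool {
    !existing.is_empty() && existing.is_subset(required)
}

/// One operator: the keys it needs co-located, and the keys we would hash on
/// if an exchange has to be inserted in front of it.
struct Op {
    required: Keys,
    shuffle_on: Keys,
}

fn op(required: &[&'static str], shuffle_on: &[&'static str]) -> Op {
    Op { required: keys(required), shuffle_on: keys(shuffle_on) }
}

/// Walk the operators bottom-up and count the exchanges that must be inserted.
/// An empty key set stands for "unknown partitioning" at the scan.
fn count_exchanges(ops: &[Op]) -> usize {
    let mut current = Keys::new();
    let mut exchanges = 0;
    for o in ops {
        // The chosen shuffle keys must themselves satisfy the requirement.
        assert!(satisfies(&o.shuffle_on, &o.required));
        if !satisfies(&current, &o.required) {
            current = o.shuffle_on.clone();
            exchanges += 1;
        }
    }
    exchanges
}

/// Operators in SQL order, each shuffling on its full key list when needed.
fn naive_plan() -> Vec<Op> {
    vec![
        op(&["cre_date", "cntry_id", "curncy_id", "cntry_desc"],
           &["cre_date", "cntry_id", "curncy_id", "cntry_desc"]), // group by
        op(&["cntry_id"], &["cntry_id"]),                         // w
        op(&["curncy_id"], &["curncy_id"]),                       // c
        op(&["curncy_id", "cntry_desc"], &["curncy_id", "cntry_desc"]), // d
        op(&["cntry_id", "curncy_id", "cntry_desc"],
           &["cntry_id", "curncy_id", "cntry_desc"]),             // e
    ]
}

/// Same operators, but shuffling on the narrowest shared key and grouping the
/// windows so that w and e reuse the aggregate's partitioning, and d reuses c's.
fn reasoned_plan() -> Vec<Op> {
    vec![
        op(&["cre_date", "cntry_id", "curncy_id", "cntry_desc"], &["cntry_id"]), // group by
        op(&["cntry_id"], &["cntry_id"]),                             // w
        op(&["cntry_id", "curncy_id", "cntry_desc"], &["cntry_id"]),  // e
        op(&["curncy_id"], &["curncy_id"]),                           // c
        op(&["curncy_id", "cntry_desc"], &["curncy_id"]),             // d
    ]
}
```

Under these assumptions, `count_exchanges(&naive_plan())` evaluates to 3 and `count_exchanges(&reasoned_plan())` to 2. The toy model is not meant to reproduce SparkSQL's actual plan; it only shows how propagating partitioning through the plan can remove exchanges.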
   
   
   I'm working on an experimental rule for this now, and I'm also trying to 
verify a new optimization process. If it proves out, it will be much easier to 
write new optimization rules, and the same method can be applied to logical 
optimization rules as well. 
   
   
   


