mingmwang commented on issue #284: URL: https://github.com/apache/arrow-ballista/issues/284#issuecomment-1260716387
> I'll add the addition of range partitioning as well to this list - currently normal sorts are not running in parallel / distributed fashion.

Yes, there are currently a couple of gaps in the physical planning phase. The `ExecutionPlan` trait needs to be enhanced as well.

Consider the SQL below. Ideally it should need only 1 or 2 remote shuffle exchanges. Today SparkSQL uses 4 shuffle exchanges for it; I'm not sure how many remote shuffles PrestoSQL uses.

````
select cntry_id,
       count(*),
       count(*) over (partition by cntry_id) as w,
       count(*) over (partition by curncy_id) as c,
       count(cntry_desc) over (partition by curncy_id, cntry_desc) as d,
       count(cntry_desc) over (partition by cntry_id, curncy_id, cntry_desc) as e
from dw_countries
group by cre_date, cntry_id, curncy_id, cntry_desc
having cntry_id = 3;
````

I'm working on an experimental rule for this now, and I'm also trying to verify a new optimization process. If it proves out, writing new optimization rules will become much easier, and the same method can be applied to logical optimization rules as well.
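To illustrate why 1 or 2 exchanges could be enough for the query above, here is a minimal Python sketch. It is not Ballista or DataFusion code: `satisfies`, `plan_exchanges`, and the operator labels are hypothetical names made up for this example. It assumes that data hash-partitioned on a non-empty subset of an operator's keys already co-locates every group that operator needs, and it ignores operator ordering, data skew, and the `having` filter.

```python
from typing import Dict, FrozenSet, List, Tuple

Keys = FrozenSet[str]


def satisfies(partitioned_on: Keys, required: Keys) -> bool:
    # Rows that agree on all required keys also agree on any subset of them,
    # so hash partitioning on a non-empty subset keeps each group together.
    return bool(partitioned_on) and partitioned_on <= required


def plan_exchanges(required: Dict[str, Keys]) -> List[Tuple[Keys, List[str]]]:
    """Greedily pick partitionings so that each exchange serves many operators."""
    if any(not keys for keys in required.values()):
        raise ValueError("every operator needs at least one partition key")
    remaining = dict(required)
    plan: List[Tuple[Keys, List[str]]] = []
    while remaining:
        # Candidates are the key sets of the remaining operators; each one
        # satisfies at least its own operator, so the loop always terminates.
        candidates = sorted(set(remaining.values()),
                            key=lambda k: (len(k), sorted(k)))
        best = max(candidates,
                   key=lambda k: sum(satisfies(k, r) for r in remaining.values()))
        served = [name for name, r in remaining.items() if satisfies(best, r)]
        plan.append((best, served))
        for name in served:
            del remaining[name]
    return plan


# Partitioning requirements taken from the query above (labels are illustrative).
operators = {
    "group_by": frozenset({"cre_date", "cntry_id", "curncy_id", "cntry_desc"}),
    "w": frozenset({"cntry_id"}),
    "c": frozenset({"curncy_id"}),
    "d": frozenset({"curncy_id", "cntry_desc"}),
    "e": frozenset({"cntry_id", "curncy_id", "cntry_desc"}),
}

plan = plan_exchanges(operators)
for keys, served in plan:
    print(sorted(keys), "->", served)
```

Under these assumptions the sketch ends up with two exchanges: one on `curncy_id` that serves the aggregate and windows `c`, `d`, and `e`, and one on `cntry_id` for window `w`. A real rule would also have to respect that the aggregate runs before the window functions and weigh the skew risk of partitioning on a single column.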
