Fernando Pereira created SPARK-22051:
----------------------------------------

             Summary: Explicit control of number of partitions after dataframe 
operations (join, order...)
                 Key: SPARK-22051
                 URL: https://issues.apache.org/jira/browse/SPARK-22051
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 2.0.0
            Reporter: Fernando Pereira
            Priority: Minor


At the moment, at least from PySpark, there is no obvious way to control the 
number of partitions resulting from a join, an order by, etc. (but also from 
spark.read and similar), which ends up in the (in)famous 200 partitions.
Of course one can call df.repartition() afterwards, but most of the time that 
ends up reshuffling the data again.
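
For illustration, a minimal PySpark sketch of the current behaviour (the data, 
column names and partition counts are made up for the example):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(1000).withColumnRenamed("id", "key")
df2 = spark.range(1000).withColumnRenamed("id", "key")

joined = df1.join(df2, "key")
print(joined.rdd.getNumPartitions())  # 200 by default (spark.sql.shuffle.partitions)

# Workaround: repartition afterwards, but this triggers another full shuffle
joined = joined.repartition(16, "key")
print(joined.rdd.getNumPartitions())  # 16
{code}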

One workaround seems to be changing the config variable 
spark.sql.shuffle.partitions at runtime. However, when tuning an application's 
performance, we might want different values / fields depending on the 
sizes/structures of the DataFrames, and changing a global config variable 
several times simply doesn't feel right. Moreover, it doesn't apply to all 
operations (e.g. spark.read).
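
Roughly, that workaround looks like this (the value 64 is only an example):

{code:python}
# Global setting: affects every subsequent shuffle, not just this one join
spark.conf.set("spark.sql.shuffle.partitions", "64")
joined = df1.join(df2, "key")  # now produces 64 partitions

# ...and it has to be changed again before the next, differently-sized operation
spark.conf.set("spark.sql.shuffle.partitions", "200")
{code}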

Therefore I believe it would be a really nice feature to either:
 - Allow the user to specify the partitioning options in those operations, e.g. 
df.join(df2, partitions=N, partition_cols=[col1]) (see the sketch after this 
list), or
 - Optimize subsequent calls to repartition() so that they change the 
parameters of the latest partitioner in the execution plan, instead of 
instantiating and executing a new partitioner.
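
A purely hypothetical sketch of what the first option could look like from 
PySpark (partitions and partition_cols are NOT existing parameters today):

{code:python}
# Hypothetical API, for discussion only -- these parameters do not exist yet
joined = df1.join(df2, "key", partitions=32, partition_cols=["key"])
ordered = df1.orderBy("key", partitions=32)
{code}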

My apologies if there is a better way of doing this, or if work in that 
direction is already in progress. I couldn't find anything satisfactory.

If the community finds any of these ideas useful, I can try to help implement 
them.


