Fernando Pereira created SPARK-22051:
----------------------------------------
             Summary: Explicit control of number of partitions after dataframe operations (join, order...)
                 Key: SPARK-22051
                 URL: https://issues.apache.org/jira/browse/SPARK-22051
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 2.0.0
            Reporter: Fernando Pereira
            Priority: Minor

At the moment, at least from PySpark, it is not obvious how to control the number of partitions that result from a join, an order by, but also spark.read, etc., which end up with the (in)famous 200 partitions. Of course one can call df.repartition(), but most of the time that just reshuffles the data again. One workaround is to change the config variable spark.sql.shuffle.partitions at runtime (see the sketch below). However, when tuning an application's performance we may want different values/fields depending on the sizes/structures of the DataFrames, and changing a global config variable several times simply doesn't feel right. Moreover, it doesn't apply to all operations (e.g. spark.read).

Therefore I believe it would be a really nice feature to either:
- Allow the user to specify the partitioning options in those operations, e.g. df.join(df2, partitions=N, partition_cols=[col1])
- Optimize subsequent calls to repartition() to change the parameters of the latest partitioner in the execution plan, instead of instantiating and executing a new partitioner.

My apologies if there is a better way of doing this, or if work in that direction is already in progress; I couldn't find anything satisfactory. If the community finds any of these ideas useful, I can try to help implement them.
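For reference, here is a minimal PySpark sketch of the workaround described above: setting spark.sql.shuffle.partitions around a given operation and falling back to repartition()/coalesce() elsewhere. The paths, table and column names are hypothetical and only illustrate the pattern.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-control-workaround").getOrCreate()

# Hypothetical input datasets.
df1 = spark.read.parquet("/data/events")
df2 = spark.read.parquet("/data/users")

# Without tuning, the join below produces spark.sql.shuffle.partitions
# output partitions (200 by default).
spark.conf.set("spark.sql.shuffle.partitions", "64")
joined = df1.join(df2, on="user_id")  # shuffle now targets 64 partitions

# The setting is global, so it has to be reset afterwards or it affects
# every subsequent shuffle in the application.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# For operations the setting does not cover (e.g. the partitioning coming
# out of spark.read), an explicit repartition()/coalesce() is the only
# option today, at the cost of an extra shuffle or a narrow merge.
df1_small = df1.coalesce(16)
{code}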