sam created SPARK-24425:
---------------------------

             Summary: Regression from 1.6 to 2.x - Spark no longer respects input partitions, unnecessary shuffle required
                 Key: SPARK-24425
                 URL: https://issues.apache.org/jira/browse/SPARK-24425
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.2
            Reporter: sam
I think this is a regression. In 1.6 we could easily control the number of output files / tasks via the number of input files plus `coalesce`. Now I have to use `repartition` to get the desired number of files / partitions, which is unnecessarily expensive because it forces a full shuffle. I've tried playing with `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` to see if I can force the 1.6 behaviour:

{code:scala}
import org.apache.spark.sql.SparkSession

// Attempt to force one partition per input file by making it look
// prohibitively expensive to pack multiple files into one partition.
// (conf here is our own CLI config object.)
val ss = SparkSession.builder().appName("uber-cp").master(conf.master())
  .config("spark.sql.files.maxPartitionBytes", 1L)           // cap the bytes packed into a single partition
  .config("spark.sql.files.openCostInBytes", Long.MaxValue)  // make each file's estimated open cost huge
  .getOrCreate()
{code}

This didn't work. Spark still squashes all my parquet files into fewer partitions. Suggest a simple `option` on DataFrameReader that can disable this (or enable it; the default behaviour should be the same as 1.6). This relates to https://issues.apache.org/jira/browse/SPARK-5997, in that if SPARK-5997 were implemented this ticket wouldn't really be necessary.
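To make the regression concrete, here is a minimal sketch (the paths and file counts are hypothetical, and `ss` is the session built above):

{code:scala}
// Hypothetical directory containing e.g. 1000 small parquet files.
val df = ss.read.parquet("/tmp/many-small-files")

// On 1.6 this roughly matched the number of input files; on 2.x the
// files are packed together, so this prints a much smaller number.
println(df.rdd.getNumPartitions)

// 1.6-style control of the output file count: coalesce is a narrow
// dependency, so no shuffle is required.
df.coalesce(100).write.parquet("/tmp/out-cheap")

// 2.x workaround: repartition gives the desired count but forces a
// full shuffle of all the data, which is unnecessarily expensive.
df.repartition(1000).write.parquet("/tmp/out-expensive")
{code}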