sam created SPARK-24425:
---------------------------
Summary: Regression from 1.6 to 2.x - Spark no longer respects
input partitions, unnecessary shuffle required
Key: SPARK-24425
URL: https://issues.apache.org/jira/browse/SPARK-24425
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.0.2
Reporter: sam
I think this is a regression. In 1.6 we could easily control the number of
output files / tasks: the number of input files determined the number of
partitions, and `coalesce` reduced it as needed. Now I have to use
`repartition` to get the desired number of files / partitions, which is
unnecessarily expensive.
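For reference, a minimal sketch of that 1.6-era pattern (paths and the target
count of 10 are placeholders):
{code:java}
// 1.6: one partition per input file, so coalesce could cheaply hit a target count
val df = sqlContext.read.parquet("/path/to/input") // one task per input file
df.coalesce(10).write.parquet("/path/to/output")   // no shuffle, 10 output files
{code}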
I've tried playing with `spark.sql.files.maxPartitionBytes` and
`spark.sql.files.openCostInBytes` to see if I could force the 1.6 behaviour:
{code:java}
import org.apache.spark.sql.SparkSession

val ss = SparkSession.builder().appName("uber-cp").master(conf.master())
  .config("spark.sql.files.maxPartitionBytes", 1)           // smallest possible target split size
  .config("spark.sql.files.openCostInBytes", Long.MaxValue) // maximal per-file open cost
  .getOrCreate()
{code}
This didn't work: Spark still squashes all my parquet files into fewer partitions.
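To illustrate how the squashing shows up (the path and file count here are
hypothetical):
{code:java}
// Assume a directory containing 100 small parquet files.
val df = ss.read.parquet("/path/to/100-file-dir")
println(df.rdd.getNumPartitions) // 1.6 gave 100 (one per file); 2.x gives fewer
{code}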
I suggest a simple `option` on DataFrameReader that can disable this behaviour
(or enable it; the default should match 1.6).
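Something along these lines; the option name is purely hypothetical:
{code:java}
val df = ss.read
  .option("respectInputPartitions", "true") // hypothetical flag: one partition per input file, as in 1.6
  .parquet("/path/to/input")
{code}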
This relates to https://issues.apache.org/jira/browse/SPARK-5997: if
SPARK-5997 were implemented, this ticket wouldn't really be necessary.