cloud-fan commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned table should not dramatically increase data parallelism
URL: https://github.com/apache/spark/pull/26461#issuecomment-552769026

I agree with the problem mentioned by @viirya, but I'm not sure this config is the right cure. Users still need to know about the large-parallelism problem and set the config carefully.

The file source config `spark.sql.files.maxPartitionBytes` is much simpler to use. It defines how much data you want each task to process, and mostly you don't need to change it for different queries. `spark.default.parallelism` doesn't really affect data source scans AFAIK.

We had a similar problem with setting the number of reducers, and we solved it with the recent adaptive execution work.

I'm OK with having a config for Hive table scans, but we should make it simple to set.
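
To illustrate why a bytes-per-task setting is easier to reason about than a parallelism setting, here is a rough sketch. The helper `estimate_scan_tasks` is hypothetical and not part of Spark. It assumes the task count is simply the input size divided by the per-task byte budget, and it ignores file boundaries, file open cost, and anything else the real planner considers. The 128 MB value is the documented default of `spark.sql.files.maxPartitionBytes`.

```python
# Simplified model, NOT Spark's actual split-planning logic.
# Assumption: every task reads at most `max_partition_bytes` of input.

DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024  # documented default (128 MB)


def estimate_scan_tasks(total_input_bytes, max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Estimate how many scan tasks a given amount of input would produce."""
    if total_input_bytes < 0:
        raise ValueError("total_input_bytes must be non-negative")
    if max_partition_bytes <= 0:
        raise ValueError("max_partition_bytes must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-total_input_bytes // max_partition_bytes)


GIB = 1024 ** 3
small_scan = estimate_scan_tasks(1 * GIB)    # 1 GiB of input
large_scan = estimate_scan_tasks(100 * GIB)  # 100 GiB of input
```

Under this model the same setting serves both queries: the task count grows with the input size while the work per task stays bounded, so users would rarely need to retune it per query.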
