[GitHub] [spark] cloud-fan commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned table should not dramatically increase data parallelism

GitBox Mon, 11 Nov 2019 23:25:44 -0800

cloud-fan commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned 
table should not dramatically increase data parallelism
URL: https://github.com/apache/spark/pull/26461#issuecomment-552769026
 
 
   I agree with the problem mentioned by @viirya, but I'm not sure this config 
is the right cure. Users would still need to know about the excessive-parallelism 
problem and set the config carefully.
   
   The file source config `spark.sql.files.maxPartitionBytes` is much simpler 
to use. It defines how much data you want each task to process, and mostly you 
don't need to change it for different queries.
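   
   To illustrate why a bytes-per-task setting carries over between queries, here 
is a rough sketch of how a size cap translates into a task count. It is a 
simplification, not Spark's actual planning code: the helper name 
`estimate_scan_tasks` is hypothetical, and it ignores file-open cost, packing 
of small files into one partition, and any other adjustments Spark makes. The 
128 MB value is the documented default of `spark.sql.files.maxPartitionBytes`.

```python
import math

# Documented default of spark.sql.files.maxPartitionBytes (128 MB).
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_scan_tasks(file_sizes, max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Hypothetical helper: rough task-count estimate for a file scan.

    Assumes every file is splittable and is cut into chunks of at most
    max_partition_bytes. Real planning also packs small files together and
    accounts for file-open cost, so treat the result as an approximation.
    """
    return sum(
        math.ceil(size / max_partition_bytes)
        for size in file_sizes
        if size > 0
    )


MB = 1024 * 1024
sizes = [1024 * MB, 300 * MB]  # one 1 GB file and one 300 MB file

print(estimate_scan_tasks(sizes))            # 8 + 3 = 11 tasks at 128 MB
print(estimate_scan_tasks(sizes, 256 * MB))  # 4 + 2 = 6 tasks at 256 MB
```

The task count follows from the input size, so the same setting stays 
reasonable whether a query reads one partition or a thousand.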
   
   `spark.default.parallelism` doesn't really affect data source scans AFAIK. We 
do have a similar problem with setting the number of reducers, and we solved it 
with the recent adaptive execution work.
   
   I'm OK with having a config for Hive table scans, but we should make it 
simple to set.


