[GitHub] [spark] viirya commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned table should not dramatically increase data parallelism

GitBox Mon, 11 Nov 2019 23:45:33 -0800

viirya commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned 
table should not dramatically increase data parallelism
URL: https://github.com/apache/spark/pull/26461#issuecomment-552774504
 
 
   > spark.default.parallelism doesn't really affect data source scan AFAIK. We 
do have a similar problem with setting the number of reducers, and we solved it 
with the recent adaptive execution work.
   
   It does have an effect, because the default parallelism feeds into `maxSplitBytes`:
   
   
https://github.com/apache/spark/blob/053dd858d38e6107bc71e0aa3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L86-L95
   
   IIUC, `spark.sql.files.maxPartitionBytes` and `spark.default.parallelism` 
both affect the split size that is used, and therefore the final parallelism of the scan.
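
   To make the interaction concrete, here is a minimal standalone sketch of how I read the linked lines at that commit. It is not the Spark code itself: the function name, the parameter names, and the example file sizes are illustrative assumptions. The values 128 MB and 4 MB are the documented defaults of `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, max_partition_bytes, open_cost_in_bytes,
                    default_parallelism):
    """Illustrative re-statement of the split-size rule (names are hypothetical).

    Each file is padded with the open cost, the total is divided evenly
    across the default parallelism, and the result is clamped between
    the open cost and the configured maximum partition size.
    """
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


# Hypothetical input: ten files of 64 MB each.
files = [64 * MB] * 10

# Same files and same maxPartitionBytes, different default parallelism.
split_low = max_split_bytes(files, 128 * MB, 4 * MB, default_parallelism=8)
split_high = max_split_bytes(files, 128 * MB, 4 * MB, default_parallelism=200)

print(split_low // MB, split_high // MB)
```

   With these assumed inputs, raising the default parallelism from 8 to 200 shrinks the split size down to the open-cost floor, so the same files are cut into many more scan partitions.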
   
   Maybe we can also rely on `maxSplitBytes` in the Hive scan to decide the 
parallelism?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

