[GitHub] [spark] viirya commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned table should not dramatically increase data parallelism

GitBox Mon, 11 Nov 2019 13:39:12 -0800

viirya commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned 
table should not dramatically increase data parallelism
URL: https://github.com/apache/spark/pull/26461#issuecomment-552627966
 
 
   > The optimal value for each table is unknown, isn't it? This PR doesn't 
give any clue for the default value for this conf because of that.
   
   As with `spark.default.parallelism`, there is no optimal value for every 
job either. I initially made `spark.default.parallelism` the default for this 
conf, but in the end I left it optional so that the current behavior can be kept.
   
   I am also considering @cloud-fan's suggestion to use the data size to decide 
whether to add a coalesce.
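
A rough sketch of what a size-based decision could look like, in plain Python rather than Spark code. The function name, its parameters, and the bytes-per-partition target are all hypothetical and are not part of this PR.

```python
import math
from typing import Optional


def coalesce_target(scanned_partitions: int,
                    total_bytes: int,
                    target_bytes_per_partition: int) -> Optional[int]:
    """Return the partition count to coalesce to, or None to leave the scan as is.

    Hypothetical heuristic: aim for roughly `target_bytes_per_partition` bytes
    per partition, and only coalesce when that needs fewer partitions than the
    scan produced.
    """
    if target_bytes_per_partition <= 0:
        raise ValueError("target_bytes_per_partition must be positive")
    # At least one partition, even for an empty table.
    wanted = max(1, math.ceil(total_bytes / target_bytes_per_partition))
    return wanted if wanted < scanned_partitions else None


if __name__ == "__main__":
    gib, mib = 1024 ** 3, 1024 ** 2
    # 10 GiB spread over 10000 scan partitions, targeting 128 MiB each.
    print(coalesce_target(10000, 10 * gib, 128 * mib))
    # The same data in only 50 partitions: nothing to coalesce.
    print(coalesce_target(50, 10 * gib, 128 * mib))
```

With a rule like this, small tables that are split across many Hive partitions would be coalesced, while scans that are already coarse would be left alone.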
   
   > First, each end-users know their data and their query, but the cluster 
operator doesn't. IMO, this is a configuration for the end-users, not the 
cluster operator.
   
   Well, I think end-users usually do not know why there is a union and why the 
job has such high parallelism. This is an implementation detail under the Hive 
Scan node. Users need to dig into the source code, or ask cluster operators, to 
find that out.
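
To illustrate where the number comes from: a union keeps every input partition, so if the scan builds one RDD per Hive partition and unions them, the resulting partition count is the sum over all Hive partitions. The snippet below is only a plain-Python model of that arithmetic, with a made-up function name and made-up split counts. It is not Spark code.

```python
from typing import Sequence


def union_partition_count(splits_per_hive_partition: Sequence[int]) -> int:
    """Partition count after unioning one RDD per Hive partition.

    A union concatenates the partitions of its inputs, so the counts add up.
    """
    if any(n < 0 for n in splits_per_hive_partition):
        raise ValueError("split counts must be non-negative")
    return sum(splits_per_hive_partition)


if __name__ == "__main__":
    # Hypothetical table: 5000 Hive partitions, each read as 2 input splits.
    print(union_partition_count([2] * 5000))
```

Nothing in the user's query mentions these per-partition splits, which is why the resulting task count looks arbitrary from the outside.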
   
   That end-users know their data and queries does not mean they also know where 
the high parallelism comes from. Precisely because they know their data and 
queries, they are more confused: nothing in their data or queries justifies 
such high parallelism.
   
   This is a config that can be used by both end-users and cluster operators. 
Before this, cluster operators could not do anything. It is easy to set a 
config value, but hard to insert a hint into end-users' queries.
   
   > Second, this will enforce for all Hive tables without allowing exceptions. 
That's not good. With Hint, we can do fine-grained tuning per tables and per 
queries.
   
   This sounds like a good point. However, for tables and queries that need 
tuning, you can still change the config value, or disable it and turn to hints.
   
   This config is a guard against the unreasonable number of partitions 
seen when reading a Hive partitioned table.
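
As a minimal sketch of the guard semantics described above, assuming an optional cap that leaves the scan untouched when unset: the function and parameter names here are hypothetical and do not correspond to the actual conf name or code in this PR.

```python
from typing import Optional


def guarded_partition_count(scanned_partitions: int,
                            max_partitions: Optional[int]) -> int:
    """Partition count after applying an optional upper bound.

    None models the conf being unset (current behavior is kept). A coalesce
    without shuffle only ever reduces the count, hence the min().
    """
    if max_partitions is None:
        return scanned_partitions
    if max_partitions <= 0:
        raise ValueError("max_partitions must be positive")
    return min(scanned_partitions, max_partitions)


if __name__ == "__main__":
    print(guarded_partition_count(10000, None))  # conf unset: unchanged
    print(guarded_partition_count(10000, 200))   # capped
    print(guarded_partition_count(50, 200))      # already below the cap
```

Per-table or per-query tuning via hints would still sit on top of this. The cap only bounds the worst case.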


