viirya commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned table should not dramatically increase data parallelism
URL: https://github.com/apache/spark/pull/26461#issuecomment-552627966

> The optimal value for each table is unknown, isn't it? This PR doesn't give any clue for the default value for this conf because of that.

As with `spark.default.parallelism`, we don't have an optimal value for each job either. I was using `spark.default.parallelism` as the default for this conf, but in the end I left it optional so that the current behavior can be kept. I am also considering @cloud-fan's suggestion to use data size to decide whether to add a coalesce.

> First, each end-users know their data and their query, but the cluster operator doesn't. IMO, this is a configuration for the end-users, not the cluster operator.

Well, I think end-users usually do not know why there is a union or why the job has such high parallelism. These are implementation details under the Hive scan node. Users need to dig into the source code, or ask cluster operators, to find out. That end-users know their data and queries does not mean they also know where the high parallelism comes from. Because they know their data and queries, they are even more confused, since nothing in their data or queries justifies such high parallelism.

This config can be used by both end-users and cluster operators. Before this, cluster operators could not do anything. It is easy to set a config value, but hard to insert a hint into end-users' queries.

> Second, this will enforce for all Hive tables without allowing exceptions. That's not good. With Hint, we can do fine-grained tuning per tables and per queries.

This sounds like a good point. However, for tables and queries that need tuning, you can still change the config value, or disable it and turn to hints. This config is a guard against the unreasonable number of partitions seen when reading a Hive partitioned table.
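To illustrate where the high parallelism comes from, here is a minimal sketch in plain Python, not Spark code and not this PR's implementation. It assumes the scan builds one RDD per Hive partition and unions them, so the resulting partition count is the sum over all Hive partitions. The function name, the parameter names, and the `max_partitions` cap are hypothetical stand-ins for the optional conf discussed above.

```python
from typing import List, Optional


def scan_partition_count(splits_per_hive_partition: List[int],
                         max_partitions: Optional[int] = None) -> int:
    """Model the partition count of a scan over a Hive partitioned table.

    splits_per_hive_partition: number of input splits in each Hive partition
        (hypothetical input, for illustration only).
    max_partitions: hypothetical optional cap; None keeps current behavior.
    """
    if max_partitions is not None and max_partitions <= 0:
        raise ValueError("max_partitions must be positive when set")

    # A union of per-partition RDDs has as many partitions as all of its
    # inputs combined.
    total = sum(splits_per_hive_partition)

    if max_partitions is None:
        # Conf not set: behavior is unchanged.
        return total

    # A coalesce without shuffle can only reduce the partition count.
    return min(total, max_partitions)
```

For example, a table with 1000 Hive partitions of 2 splits each yields 2000 partitions with the cap unset, and 200 partitions with a cap of 200.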
