viirya commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned table should not dramatically increase data parallelism
URL: https://github.com/apache/spark/pull/26461#issuecomment-552946268

@HyukjinKwon Thanks for the comment!

> Well, to me I actually agree with Dongjoon's point. Why don't we just explicitly coalesce or hints for that? There are some alternatives like converting Hive table scan to Spark's scan as well.

> coalesce does not necessarily make it faster. On the flip side, users might get surprised by coalesce popping up suddenly.

We encourage users to convert to datasource tables, but there are inconvertible cases. We have configs for datasource table scans, but not for Hive table scans. That means we expect a datasource scan to have a reasonable partition number, but not a Hive scan. For Hive table users, things get troublesome because you need to add coalesce/hints to every query.

I think excessive parallelism draws more attention from end users and causes more confusion. A large number of partitions also wastes time on task scheduling.

> This sounds fine in general but IIRC there have been several tries to merge big Hive partitions if I am not wrong; however, it needed a pretty big change which I don't think is worthy. e.g. #10572

I think this should not be a change as big as that one.
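
To make the point about datasource scan configs concrete, here is a minimal pure-Python sketch of size-based file packing, in the spirit of `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`. It is a simplification for illustration only: it is neither Spark's actual code nor what this PR implements, and the function name `pack_files` and the example table (1000 files of 1 MB) are hypothetical.

```python
from typing import List

MB = 1024 * 1024


def pack_files(file_sizes: List[int],
               max_partition_bytes: int,
               open_cost_bytes: int) -> List[List[int]]:
    """Greedily pack files into scan partitions (hypothetical helper).

    Files are visited largest first. A partition is closed when adding the
    next file would push its accounted size past max_partition_bytes. Each
    file is charged its own size plus a fixed open cost, so many tiny files
    do not all end up in a single partition.
    """
    partitions: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > max_partition_bytes:
            partitions.append(current)
            current = []
            current_size = 0
        current.append(size)
        current_size += size + open_cost_bytes
    if current:
        partitions.append(current)
    return partitions


# Assumed example: a table made of 1000 small files of 1 MB each.
files = [1 * MB] * 1000

# Without any packing, every file becomes at least one task.
unpacked_tasks = len(files)

# With packing (128 MB target, 4 MB open cost), files share tasks.
packed = pack_files(files, max_partition_bytes=128 * MB, open_cost_bytes=4 * MB)

print(f"tasks without packing: {unpacked_tasks}")
print(f"tasks with packing:    {len(packed)}")
```

With these assumed numbers the packed scan needs far fewer tasks than one task per file, which is the kind of bound a Hive table user currently has to reproduce by hand with coalesce/hints on every query.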
