viirya commented on issue #26461: [SPARK-29831][SQL] Scan Hive partitioned 
table should not dramatically increase data parallelism
URL: https://github.com/apache/spark/pull/26461#issuecomment-552946268
 
 
   @HyukjinKwon Thanks for the comment!
   
   > Well, to me I actually agree with Dongjoon's point. Why don't we just 
explicitly coalesce or hints for that? There are some alternatives like 
converting Hive table scan to Spark's scan as well.
   > coalesce does not necessarily make it faster. On the flip side, users 
might get surprised by coalesce popping up suddenly.
   
   We encourage users to convert to datasource tables, but some tables 
cannot be converted.
   
   We have configs that control datasource table scans, but none for Hive 
table scans. That means we expect a datasource scan to have a reasonable 
number of partitions, but not a Hive scan. For Hive table users, this gets 
troublesome because they need to add coalesce/hints to every query.
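   
   To illustrate what those datasource configs do, here is a rough sketch of 
how a file-based datasource scan bounds its partition count. It is a 
simplified model, not Spark's actual code: the function names and the 
`default_parallelism` value are made up for illustration, and only the two 
defaults (128 MiB for `spark.sql.files.maxPartitionBytes`, 4 MiB for 
`spark.sql.files.openCostInBytes`) are taken from Spark.

```python
MIB = 1024 * 1024


def max_split_bytes(file_sizes,
                    max_partition_bytes=128 * MIB,   # spark.sql.files.maxPartitionBytes default
                    open_cost_in_bytes=4 * MIB,      # spark.sql.files.openCostInBytes default
                    default_parallelism=8):          # assumed cluster parallelism
    """Target split size: capped by the config, floored by the open cost."""
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def estimate_partitions(file_sizes,
                        max_partition_bytes=128 * MIB,
                        open_cost_in_bytes=4 * MIB,
                        default_parallelism=8):
    """Split large files, then greedily pack the chunks into partitions."""
    split = max_split_bytes(file_sizes, max_partition_bytes,
                            open_cost_in_bytes, default_parallelism)

    # Cut every file into chunks no larger than the target split size.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(split, size - offset))
            offset += split

    # Pack chunks largest-first; close a partition once the next chunk
    # would push it over the target size.
    chunks.sort(reverse=True)
    partitions = 0
    current = 0
    for chunk in chunks:
        if current > 0 and current + chunk > split:
            partitions += 1
            current = 0
        current += chunk + open_cost_in_bytes
    if current > 0:
        partitions += 1
    return partitions


if __name__ == "__main__":
    small_files = [1 * MIB] * 1000
    print(estimate_partitions(small_files))
```

   Under these assumptions, 1000 files of 1 MiB each are packed into a few 
dozen partitions, whereas a scan that creates at least one task per file 
would launch 1000 tasks or more.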
   
   I think that high parallelism draws more attention from end users and 
causes more confusion. A large number of partitions also wastes time on task 
scheduling.
   
   > This sounds fine in general but IIRC there have been several tries to 
merge big Hive partitions if I am not wrong; however, it needed a pretty big 
change which I don't think is worthy. e.g. #10572
   
   I don't think this needs to be as big a change as that one.
   
   

