fusheng-rd commented on PR #47418:
URL: https://github.com/apache/spark/pull/47418#issuecomment-2245338609

   > I haven't reviewed the code changes in the pr yet, but:
   > 
   > 1. The PR title should reflect the work done in the current PR as much as 
possible, the current title looks more like a Jira title
   > 2. Please ensure the completeness of the PR description, and the following 
parts are also required to be filled in:
   > 
   > ```
   > ### Does this PR introduce _any_ user-facing change?
   > 
   > ### How was this patch tested?
   > 
   > ### Was this patch authored or co-authored using generative AI tooling?
   > ```
   > 
   > 3. In the PR description, it mentions `in 
InsertIntoHadoopFsRelationCommand, it can greatly reduce the pressure on hive 
and improve the efficiency of task execution.` Is this quantifiable?
   
   When the following SQL is executed against a table A that has millions of partitions:
   ```
   INSERT OVERWRITE TABLE A PARTITION(event_day, event_type)
   SELECT id, event_day, event_type FROM B WHERE event_day = '20240712'
   ```
   On the Hive side, this statement fetches all partitions of table A at once.
   That is very likely to cause slow queries in the Hive metastore and can even
   drag down the overall performance of Hive's metadata queries. The job itself
   also runs very slowly.
   
   After this PR, only the specified partition and its sub-partitions are 
fetched, which takes milliseconds to seconds.
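
To illustrate the difference, here is a minimal toy sketch in Python. It is not Spark or Hive code: `ToyMetastore`, `list_partitions`, and the sample partition values are all hypothetical names made up for this example. It only models the idea that a lookup restricted by a partial partition spec returns far fewer partition objects than an unrestricted listing.

```python
from typing import Dict, List, Optional

PartitionSpec = Dict[str, str]


class ToyMetastore:
    """Hypothetical in-memory stand-in for a metastore (not a real Hive API)."""

    def __init__(self, partitions: List[PartitionSpec]) -> None:
        self._partitions = partitions
        self.returned = 0  # total number of partition objects handed back

    def list_partitions(
        self, partial_spec: Optional[PartitionSpec] = None
    ) -> List[PartitionSpec]:
        # An empty or missing spec matches every partition (the "before" case).
        spec = partial_spec or {}
        matched = [
            p for p in self._partitions
            if all(p.get(k) == v for k, v in spec.items())
        ]
        self.returned += len(matched)
        return matched


# Made-up sample data: 30 days x 3 event types = 90 partitions.
days = [f"202407{d:02d}" for d in range(1, 31)]
types = ["click", "view", "buy"]
store = ToyMetastore(
    [{"event_day": d, "event_type": t} for d in days for t in types]
)

# Before: every partition of the table is fetched.
all_parts = store.list_partitions()

# After: only the partitions under event_day = '20240712' are fetched.
only_day = store.list_partitions({"event_day": "20240712"})

print(len(all_parts), len(only_day))
```

In this toy data the restricted lookup touches only the sub-partitions of a single day, so its cost depends on the number of matching partitions and not on the total size of the table.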


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

