ahshahid opened a new pull request, #49155:
URL: https://github.com/apache/spark/pull/49155

   ### What changes were proposed in this pull request?
   In the rule PruneFileSourcePartitions, where a CatalogFileIndex is converted 
into an InMemoryFileIndex for partitioned tables, if the same table is 
referenced multiple times (with identical filters, with different filters, or 
even with empty filters, as happens when the filter string translated for 
pushdown is empty), each leaf relation calls the HMS layer separately to get 
its partition list. 
   This PR collects identical tables and their corresponding partition filters 
and makes a single call to the HMS (Hive Metastore) layer to fetch the minimal 
set of partitions that satisfies every occurrence. Using this base 
InMemoryFileIndex, each table can then apply its own filters (if needed) to get 
the desired InMemoryFileIndex.
   
   For example, if Table A has 2 occurrences, one with filter f1 and the other 
with filter f2:
   1) Table A.     f1
   2) Table A.     f2
   A single call to HMS will be made, passing the filter condition f1 || f2.
   This will result in baseInMemoryFileIndex.
   Then occurrence 1) of Table A can apply filter f1 to this 
baseInMemoryFileIndex to get its own pruned file index, and occurrence 2) can 
likewise apply f2.
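
   The idea can be modeled without Spark or Hive. The following is a minimal 
sketch, not the PR's implementation: `FakeMetastore`, `list_partitions` and 
`prune_shared` are hypothetical names invented for illustration, partitions are 
plain dictionaries, and filters are Python predicates. It also assumes that an 
empty filter (modeled as `None`) forces the combined filter to match every 
partition, since that occurrence needs all of them.

```python
from typing import Callable, Dict, List, Optional

Partition = Dict[str, str]                          # partition column -> value
Predicate = Optional[Callable[[Partition], bool]]   # None models an empty filter


class FakeMetastore:
    """Hypothetical stand-in for HMS that counts partition-listing calls."""

    def __init__(self, partitions: Dict[str, List[Partition]]) -> None:
        self.partitions = partitions
        self.calls = 0

    def list_partitions(self, table: str,
                        keep: Callable[[Partition], bool]) -> List[Partition]:
        self.calls += 1  # each invocation models one round trip to HMS
        return [p for p in self.partitions[table] if keep(p)]


def prune_shared(metastore: FakeMetastore, table: str,
                 filters: List[Predicate]) -> List[List[Partition]]:
    """Fetch partitions once with f1 OR f2 OR ..., then re-filter per occurrence."""
    if any(f is None for f in filters):
        # Assumption: an occurrence with an empty filter needs every partition.
        combined = lambda p: True
    else:
        combined = lambda p: any(f(p) for f in filters)

    # Single metastore call producing the shared "base" partition list.
    base = metastore.list_partitions(table, combined)

    # Each occurrence narrows the shared base list locally with its own filter.
    return [[p for p in base if f is None or f(p)] for f in filters]


hms = FakeMetastore({"A": [{"dt": "2024-01-01"},
                           {"dt": "2024-01-02"},
                           {"dt": "2024-01-03"}]})
f1 = lambda p: p["dt"] == "2024-01-01"
f2 = lambda p: p["dt"] >= "2024-01-03"

pruned_f1, pruned_f2 = prune_shared(hms, "A", [f1, f2])
```

   In this model the two occurrences of table A cost one metastore call 
instead of two, and each still ends up with only the partitions its own filter 
selects.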
   
   ### Why are the changes needed?
   This has been observed as a major performance bottleneck for complex queries 
over tables with a large number of partitions.
   For one particular client, query compilation/execution time increased 
from 20 mins to 6 hrs.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Validated using existing tests, and added new tests in the file 
HivePruneFileSourcePartitionsSuite which validate the reduction in HMS calls. 
The correctness of the results was validated without this change (I will 
modify the test to include result validations).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

