hemanthumashankar0511 opened a new pull request, #6317:
URL: https://github.com/apache/hive/pull/6317

   What changes were proposed in this pull request?
   This PR optimizes the configureJobConf method in MapWork.java to eliminate 
redundant job configuration calls during the map phase initialization.
   
   Modified File: ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java
   
   Logic Change: Introduced a Set<TableDesc> to track tables that have already 
been configured while iterating over partitions.
   
   Mechanism: The code now checks if a TableDesc has already been processed 
before invoking PlanUtils.configureJobConf(tableDesc, job).
   
   Result: The configuration logic, which includes expensive operations like 
loading StorageHandlers via reflection, is now executed only once per unique 
table, rather than once per partition.
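   The pattern described above can be sketched in a standalone form. This is 
not the actual patch: `ConfigureOnce`, the String-based table descriptors, and 
the call counter are stand-ins for Hive's real MapWork, TableDesc, and 
PlanUtils.configureJobConf, used only to illustrate the once-per-unique-table 
behavior.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ConfigureOnce {
    static int configureCalls = 0;

    // Stand-in for PlanUtils.configureJobConf(tableDesc, job): the real call
    // loads storage handlers via reflection and handles credentials.
    static void configureJobConf(String tableDesc) {
        configureCalls++;
    }

    // Models the optimized loop: each partition contributes its table
    // descriptor, but the expensive configuration runs once per unique table.
    static int configureAll(Iterable<String> partitionTableDescs) {
        Set<String> seen = new HashSet<>();
        for (String tableDesc : partitionTableDescs) {
            if (seen.add(tableDesc)) { // add() returns true only on first sight
                configureJobConf(tableDesc);
            }
        }
        return configureCalls;
    }

    public static void main(String[] args) {
        // 10,000 partitions, all belonging to the same table.
        List<String> partitions = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            partitions.add("tbl_orders");
        }
        System.out.println("configureJobConf calls: " + configureAll(partitions)); // prints 1
    }
}
```

   Because Set.add returns false for an element already present, the check and 
the insertion are a single operation, so no separate contains() lookup is 
needed.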
   
   Why are the changes needed?
   Performance Bottleneck in Job Initialization: Currently, the 
MapWork.configureJobConf method iterates over aliasToPartnInfo.values(), which 
contains an entry for every single partition participating in the scan. Inside 
this loop, it calls PlanUtils.configureJobConf for every partition.
   
   The Issue:
   
   Redundancy: If a query reads 10,000 partitions from the same table, 
PlanUtils.configureJobConf is called 10,000 times with the exact same TableDesc.
   
   Expensive Operations: PlanUtils.configureJobConf invokes 
HiveUtils.getStorageHandler, which uses Java Reflection (Class.forName) to load 
the storage handler class. Repeatedly performing reflection and credential 
handling for thousands of identical partition objects adds significant, 
avoidable overhead to the job setup phase.
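   To make the cost argument concrete: a reflective Class.forName lookup is 
the kind of work worth doing once per class rather than once per partition. 
The actual PR deduplicates at the TableDesc level rather than caching classes, 
but this hypothetical sketch (HandlerCache and loadHandler are illustrative 
names, not Hive APIs) shows the same principle of resolving a handler class a 
single time and reusing the result.

```java
import java.util.HashMap;
import java.util.Map;

public class HandlerCache {
    private static final Map<String, Class<?>> CACHE = new HashMap<>();

    // Models the reflective lookup inside HiveUtils.getStorageHandler:
    // Class.forName runs only on the first request for a given class name.
    static Class<?> loadHandler(String className) {
        return CACHE.computeIfAbsent(className, name -> {
            try {
                return Class.forName(name); // reflection: executed once per class
            } catch (ClassNotFoundException e) {
                throw new RuntimeException(e);
            }
        });
    }

    public static void main(String[] args) {
        // Using a JDK class as a placeholder for a storage handler class.
        Class<?> first = loadHandler("java.util.ArrayList");
        Class<?> second = loadHandler("java.util.ArrayList"); // cache hit
        System.out.println(first == second); // prints true
    }
}
```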
   
   Impact of Fix:
   
   Complexity Reduction: Reduces the configuration complexity from O(N) (where 
N is the number of partitions) to O(T) (where T is the number of unique tables).
   
   Scalability: Significantly improves the startup time for jobs scanning large 
numbers of partitions.
   
   Safety: The worst-case scenario (single-partition reads) incurs only the 
negligible cost of a HashSet instantiation and a single add operation, 
preserving existing performance for small jobs.
   
   
   Does this PR introduce any user-facing change?
   No. This is an internal optimization to the MapWork plan generation phase. 
While users may experience faster job startup times for queries involving large 
numbers of partitions, there are no changes to the user interface, SQL syntax, 
or configuration properties.
   
   How was this patch tested?
   The patch was verified using local unit tests in the ql (Query Language) 
module to ensure no regressions were introduced by the optimization.
   
   1. Build Verification: Ran a clean install on the ql module to ensure 
compilation and dependency integrity.
   
   Bash
   mvn clean install -pl ql -am -DskipTests
   2. Unit Testing: Executed relevant tests in the ql module, specifically 
targeting the planning logic components to verify that MapWork configuration 
remains correct.
   
   Bash
   mvn test -pl ql -Dtest=TestMapWork
   mvn test -pl ql -Dtest="org.apache.hadoop.hive.ql.plan.*"
   3. Logic Verification: Verified that the deduplication logic correctly 
handles TableDesc objects and that configureJobConf is still called exactly 
once for each unique table, preserving the correctness of the job configuration 
while removing redundant calls.
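   The property verified in step 3 can be expressed as a small standalone 
check. This is not one of the Hive tests listed above: DedupCheck and the 
String table names are hypothetical stand-ins, asserting only that the 
HashSet-based deduplication configures each unique table exactly once and in 
first-seen order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupCheck {
    static List<String> configured = new ArrayList<>();

    // Records each configuration call so the order and count can be checked.
    static void configureJobConf(String tableDesc) {
        configured.add(tableDesc);
    }

    // Same deduplication shape as the patched loop: skip tables already seen.
    static void configureAll(List<String> partitionTables) {
        Set<String> seen = new HashSet<>();
        for (String table : partitionTables) {
            if (seen.add(table)) {
                configureJobConf(table);
            }
        }
    }

    public static void main(String[] args) {
        // Four partitions spanning two tables, with repeats interleaved.
        configureAll(Arrays.asList("orders", "orders", "customers", "orders"));
        System.out.println(configured); // prints [orders, customers]
    }
}
```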
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]