hemanthumashankar0511 opened a new pull request, #6344:
URL: https://github.com/apache/hive/pull/6344

   
   
   ### What changes were proposed in this pull request?
   
   Added a new method `getDistinctTableDescs()` in `MapWork` that returns the 
unique `TableDesc` objects used by the map task, and updated `configureJobConf` 
to use it.
   
   Before this change, the deduplication logic was inlined in 
`configureJobConf`:
   ```java
   Set<String> processedTables = new HashSet<>();
   for (PartitionDesc partition : aliasToPartnInfo.values()) {
       TableDesc tableDesc = partition.getTableDesc();
       if (tableDesc != null && !processedTables.contains(tableDesc.getTableName())) {
           processedTables.add(tableDesc.getTableName());
           PlanUtils.configureJobConf(tableDesc, job);
       }
   }
   ```
   
   After this change, that logic lives in `getDistinctTableDescs()`, and 
`configureJobConf` simply iterates over its result:
   ```java
   for (TableDesc tableDesc : getDistinctTableDescs()) {
       PlanUtils.configureJobConf(tableDesc, job);
   }
   ```
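
For readers without the diff at hand, here is a minimal sketch of what a method like `getDistinctTableDescs()` could look like. It is not the code from this PR: the `TableDesc` and `PartitionDesc` classes below are simplified, hypothetical stand-ins for the Hive classes, the map is passed as a parameter rather than read from a `MapWork` field, and the `Collection<TableDesc>` return type is an assumption. It only mirrors the "first `TableDesc` per table name wins" behavior of the loop shown above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

final class DistinctTableDescsSketch {

    /** Hypothetical stand-in for Hive's TableDesc; only the name matters here. */
    static final class TableDesc {
        private final String tableName;
        TableDesc(String tableName) { this.tableName = tableName; }
        String getTableName() { return tableName; }
    }

    /** Hypothetical stand-in for Hive's PartitionDesc. */
    static final class PartitionDesc {
        private final TableDesc tableDesc;
        PartitionDesc(TableDesc tableDesc) { this.tableDesc = tableDesc; }
        TableDesc getTableDesc() { return tableDesc; }
    }

    /**
     * Returns one TableDesc per distinct table name, in first-seen order.
     * Entries whose TableDesc is null are skipped, as in the original loop.
     */
    static Collection<TableDesc> getDistinctTableDescs(
            Map<String, PartitionDesc> aliasToPartnInfo) {
        Map<String, TableDesc> byName = new LinkedHashMap<>();
        for (PartitionDesc partition : aliasToPartnInfo.values()) {
            TableDesc tableDesc = partition.getTableDesc();
            if (tableDesc != null) {
                // Keep the first TableDesc seen for each table name.
                byName.putIfAbsent(tableDesc.getTableName(), tableDesc);
            }
        }
        return new ArrayList<>(byName.values());
    }

    public static void main(String[] args) {
        TableDesc shared = new TableDesc("default.test"); // assumed qualified name
        Map<String, PartitionDesc> aliases = new LinkedHashMap<>();
        aliases.put("t1", new PartitionDesc(shared));
        aliases.put("t2", new PartitionDesc(shared));
        for (TableDesc tableDesc : getDistinctTableDescs(aliases)) {
            System.out.println("configure " + tableDesc.getTableName());
        }
    }
}
```

A `LinkedHashMap` is used in the sketch so the result order is deterministic; the original loop's `HashSet` of names gives the same set of tables.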
   
   ### Why are the changes needed?
   
   Callers like `KafkaDagCredentialSupplier` that only care about tables are 
currently forced to loop through all partitions in `aliasToPartnInfo` just to 
get the `TableDesc` objects. A table can have thousands of partitions but only 
one `TableDesc`, so everyone ends up writing the same boilerplate deduplication 
loop.
   
   This method lets callers get the distinct tables directly from `MapWork` 
instead of re-implementing that loop each time.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   I tested this locally by attaching a debugger to the test run and checking 
two scenarios:
   
   **Self-join** — I wanted to make sure deduplication wouldn't accidentally 
skip anything:
   ```sql
   SELECT * FROM test t1 JOIN test t2 USING(a);
   ```
   Confirmed that both aliases point to the exact same `TableDesc` instance in 
memory, so the table is configured only once, as expected.
   
   **Cross-database join** — I wanted to make sure tables with the same name 
from different databases don't collide:
   ```sql
   SELECT * FROM db1.test_cross t1 JOIN db2.test_cross t2 USING(a);
   ```
   Confirmed that `getTableName()` returns fully qualified names like 
`db1.test_cross` and `db2.test_cross` as distinct strings, so both tables get 
configured correctly.
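
The two scenarios above were checked by hand in a debugger. As a rough illustration of the same reasoning, the sketch below replays them against a name-based deduplication. It is a standalone toy, not a Hive test: the alias-to-name maps are hypothetical, and `default.test` is an assumed qualified name for the `test` table. The third case shows why the qualified names matter: with bare names, the two cross-database tables would collapse into one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DedupScenarioCheck {

    /**
     * Returns the table names that would be configured, in first-seen order,
     * when deduplicating by the name string alone.
     */
    static List<String> configuredTables(Map<String, String> aliasToTableName) {
        Set<String> seen = new LinkedHashSet<>();
        for (String name : aliasToTableName.values()) {
            if (name != null) {
                seen.add(name);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Self-join: both aliases refer to the same table.
        Map<String, String> selfJoin = new LinkedHashMap<>();
        selfJoin.put("t1", "default.test");
        selfJoin.put("t2", "default.test");
        System.out.println(configuredTables(selfJoin)); // [default.test]

        // Cross-database join with fully qualified names: both are kept.
        Map<String, String> qualified = new LinkedHashMap<>();
        qualified.put("t1", "db1.test_cross");
        qualified.put("t2", "db2.test_cross");
        System.out.println(configuredTables(qualified)); // [db1.test_cross, db2.test_cross]

        // Hypothetical counter-example: bare names would collide.
        Map<String, String> unqualified = new LinkedHashMap<>();
        unqualified.put("t1", "test_cross");
        unqualified.put("t2", "test_cross");
        System.out.println(configuredTables(unqualified)); // [test_cross]
    }
}
```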


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

