hemanthumashankar0511 opened a new pull request, #6344:
URL: https://github.com/apache/hive/pull/6344
### What changes were proposed in this pull request?
Added a new method `getDistinctTableDescs()` in `MapWork` that returns the
unique `TableDesc` objects used by the map task, and updated `configureJobConf`
to use it.
Before this change, the deduplication logic was inlined in
`configureJobConf`:
```java
Set<String> processedTables = new HashSet<>();
for (PartitionDesc partition : aliasToPartnInfo.values()) {
  TableDesc tableDesc = partition.getTableDesc();
  if (tableDesc != null && !processedTables.contains(tableDesc.getTableName())) {
    processedTables.add(tableDesc.getTableName());
    PlanUtils.configureJobConf(tableDesc, job);
  }
}
```
After this change, that logic lives in `getDistinctTableDescs()`, and
`configureJobConf` simply iterates over its result:
```java
for (TableDesc tableDesc : getDistinctTableDescs()) {
  PlanUtils.configureJobConf(tableDesc, job);
}
```
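The body of `getDistinctTableDescs()` is not shown above. The following is a minimal sketch of what such a method might look like, assuming it keeps the same name-based deduplication as the old loop. The `TableDesc` and `PartitionDesc` classes here are simplified stand-ins for the Hive types, the `List` return type is an assumption, and the map is passed in as a parameter only to keep the example self-contained:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DistinctTableDescsSketch {

  // Simplified stand-in for Hive's TableDesc (illustration only).
  static final class TableDesc {
    private final String tableName;
    TableDesc(String tableName) { this.tableName = tableName; }
    String getTableName() { return tableName; }
  }

  // Simplified stand-in for Hive's PartitionDesc (illustration only).
  static final class PartitionDesc {
    private final TableDesc tableDesc;
    PartitionDesc(TableDesc tableDesc) { this.tableDesc = tableDesc; }
    TableDesc getTableDesc() { return tableDesc; }
  }

  // Returns one TableDesc per distinct table name, in first-seen order.
  // In the PR this is an instance method on MapWork reading its own
  // aliasToPartnInfo field; the parameter here is an assumption.
  static List<TableDesc> getDistinctTableDescs(Map<String, PartitionDesc> aliasToPartnInfo) {
    Map<String, TableDesc> byName = new LinkedHashMap<>();
    for (PartitionDesc partition : aliasToPartnInfo.values()) {
      TableDesc tableDesc = partition.getTableDesc();
      if (tableDesc != null) {
        // Keep the first TableDesc seen for each table name.
        byName.putIfAbsent(tableDesc.getTableName(), tableDesc);
      }
    }
    return new ArrayList<>(byName.values());
  }

  public static void main(String[] args) {
    TableDesc test = new TableDesc("db1.test");
    Map<String, PartitionDesc> aliases = new LinkedHashMap<>();
    aliases.put("t1", new PartitionDesc(test));
    aliases.put("t2", new PartitionDesc(test));
    aliases.put("t3", new PartitionDesc(null)); // skipped: no TableDesc
    System.out.println(getDistinctTableDescs(aliases).size()); // prints 1
  }
}
```

Keying on `getTableName()` rather than on object identity mirrors the original loop, so two distinct `TableDesc` instances that describe the same table would still be configured only once.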
### Why are the changes needed?
Callers such as `KafkaDagCredentialSupplier` that only care about tables
currently have to iterate over every partition in `aliasToPartnInfo` just to
get the `TableDesc` objects. A table can have thousands of partitions but only
one `TableDesc`, so each such caller ends up writing the same deduplication
loop.
This method gives callers a direct way to get the unique tables from
`MapWork` without reimplementing that loop.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I tested this locally by attaching a debugger to the test run and checking
two scenarios:
**Self-join** — I wanted to make sure deduplication wouldn't accidentally
skip anything:
```sql
SELECT * FROM test t1 JOIN test t2 USING(a);
```
Confirmed that both aliases point to the exact same `TableDesc` instance in
memory, so the table only gets configured once as expected.
**Cross-database join** — I wanted to make sure tables with the same name
from different databases don't collide:
```sql
SELECT * FROM db1.test_cross t1 JOIN db2.test_cross t2 USING(a);
```
Confirmed that `getTableName()` returns fully qualified names like
`db1.test_cross` and `db2.test_cross` as distinct strings, so both tables get
configured correctly.
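The two scenarios can also be illustrated without a debugger. The sketch below is not the test used for this patch; it models each alias only by the table name it resolves to, which is the deduplication key. The helper `distinctTableNames` and the database name `default` in the self-join case are assumptions made for the example:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class TableNameDedupScenarios {

  // Hypothetical helper: returns the distinct table names that the
  // aliases resolve to, in first-seen order.
  static Set<String> distinctTableNames(Map<String, String> aliasToTableName) {
    return new LinkedHashSet<>(aliasToTableName.values());
  }

  public static void main(String[] args) {
    // Self-join: both aliases resolve to the same table name.
    Map<String, String> selfJoin = new LinkedHashMap<>();
    selfJoin.put("t1", "default.test");
    selfJoin.put("t2", "default.test");
    System.out.println(distinctTableNames(selfJoin)); // [default.test]

    // Cross-database join: qualified names keep the two tables apart.
    Map<String, String> crossDb = new LinkedHashMap<>();
    crossDb.put("t1", "db1.test_cross");
    crossDb.put("t2", "db2.test_cross");
    System.out.println(distinctTableNames(crossDb)); // [db1.test_cross, db2.test_cross]

    // Counter-example: if names were unqualified, the tables would collide.
    Map<String, String> unqualified = new LinkedHashMap<>();
    unqualified.put("t1", "test_cross");
    unqualified.put("t2", "test_cross");
    System.out.println(distinctTableNames(unqualified)); // [test_cross]
  }
}
```

The last case shows why the cross-database check matters: the deduplication is only safe because `getTableName()` includes the database name.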