Re: [PR] HIVE-29451: Optimize MapWork to configure JobConf once per table [hive]

via GitHub Thu, 19 Feb 2026 07:35:38 -0800


hemanthumashankar0511 commented on PR #6317:
URL: https://github.com/apache/hive/pull/6317#issuecomment-3928039615


   > > > I have a question here:
   > > > ```
   > > > Set<TableDesc> processedTables = new HashSet<>();
   > > > ```
   > > > 
   > > > 
   > > >     
   > > >       
   > > >     
   > > > 
   > > >       
   > > >     
   > > > 
   > > >     
   > > >   
   > > > Why are we storing `TableDesc` object? Can't we just do it via 
`tableDesc.getFullTableName()`, it is like just storing the `tableName` vs the 
`TableDesc`. The `equals` & `hashcode` seams comparatively heavier than a 
`String`
   > > 
   > > 
   > > I stuck with the TableDesc object to be safe with things like self-joins 
(e.g., table A join table A). If we just checked the table name, we’d 
incorrectly skip the second configuration.
   > > Also, since the planner reuses the exact same object instance for 
partitions, the Set check is mostly just comparing memory addresses. This is 
actually faster than hashing and comparing Strings.
   > 
   > I agree with using as light comparison as possible, so I believe it's 
worth discovering String comparison (given maximum collision in case of 
comparing the same TableDesc objects just as many times as many partitions we 
have)
   > 
   > btw: @hemanthumashankar0511 you mentioned self-joins and safety: how is it 
a risk? in case of a self-join, does "we’d incorrectly skip the second 
configuration" true? does it mean different TableDesc objects that should be 
considered twice by this configuration method
   > 
   > regarding: "Set check is mostly just comparing memory addresses" <- this 
can be true though, if you mean "==" comparison:
   > 
   > 
https://github.com/apache/hive/blob/7060d94843fdbc548445db6aac84dd60b44641ee/ql/src/java/org/apache/hadoop/hive/ql/plan/TableDesc.java#L236-L238
   > 
   > ```
   >   public boolean equals(Object o) {
   >     if (o == this) {
   >       return true;
   >     }
   > ```
   > 
   > so from this point of view, this is fine, however regarding "This is 
actually faster than hashing and comparing Strings.", I think hashing will 
happen anyway in case of a Hash-based collection, and in the same bucket is 
where the equals() plays a role, correct me if I'm wrong
   
   @abstractdog Thanks for the correction on the hashing. You are totally 
right, I didn't realize `HashSet` still computes the hash before checking 
equality. I appreciate you pointing that out.
   
   Regarding why I used the `TableDesc` object instead of just the `tableName` 
string, my main worry was accidentally breaking things like column pruning. I'm 
still very new to Hive internals, so I just tried running a quick test to see 
what the planner does. 
   
   I ran an `EXPLAIN FORMATTED` on a simple self-join: 
   ```sql
   SELECT a.key, b.value FROM src a JOIN src b ON a.key = b.key;
   ```
   
   From what I could understand in the output, it looks like it actually 
generates two different configurations for the `src` table:
   
   * For alias **`a`**, the `TableScan` shows `"columns:":["key"]`
   * For alias **`b`**, the `TableScan` shows `"columns:":["key","value"]`
   
   My fear is that if we just check the string `"src"`, the code might 
configure the job for alias **`a`** (only fetching `key`), mark `"src"` as 
done, and then completely skip alias **`b`**. That would mean alias **`b`**  
will never get configured to fetch the `value` column, which might cause an 
issue?
   
   I thought using `Set<TableDesc>` was safer because its `equals()` method 
would notice the column lists are different and make sure both get configured. 
Since the patch speeds things up a lot by skipping thousands of partitions down 
to just the unique tables, I thought object comparison would be safe.
   
   Please don't hesitate to correct me if I'm totally wrong in my understanding 
of this or if I'm misinterpreting the explain plan. I'm still learning how all 
these pieces fit together.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29451: Optimize MapWork to configure JobConf once per table [hive]

Reply via email to