abstractdog commented on PR #6317:
URL: https://github.com/apache/hive/pull/6317#issuecomment-3932359415

> @abstractdog and @ayushtkn, I wanted to follow up properly on both points raised here.
>
> First, @abstractdog, thank you for correcting me on how `HashSet` works! I genuinely didn't realize it always computes `hashCode()` first before even getting to `equals()`. I was wrong to claim the Set check was "mostly just comparing memory addresses," and I really appreciate you taking the time to explain that clearly.
>
> Regarding the self-join safety concern, I decided to actually debug this locally. I attached a debugger to a test run, put a breakpoint inside `configureJobConf`, and inspected the `aliasToPartnInfo` map while executing a self-join query:
>
> ```sql
> SELECT * FROM test t1 JOIN test t2 USING(a);
> ```
>
> When I expanded `aliasToPartnInfo` in the debugger, I could see two entries: one for alias `t1` and one for alias `t2`. Both PartitionDesc objects had their tableDesc field pointing to the exact same @ identity number in the debugger, confirming they are the exact same Java object instance in memory.
>
> So, my original safety argument was wrong! I thought that a self-join might produce two distinct `TableDesc` instances with different column configurations, but that's not what happens. Hive reuses the exact same `TableDesc` instance for all aliases of the same underlying table.
>
> Because of this, `Set<TableDesc>` and `Set<String>` behave identically in this scenario, they both deduplicate correctly without skipping anything.
>
> I am more than happy to switch to using `Set<String>` via `tableDesc.getTableName()` as you suggested. It is definitely lighter, and the behavior is exactly the same. I'll update the patch right away.

Thanks a lot @hemanthumashankar0511 for the detailed analysis, I really appreciate that!
I would like to ask you to confirm one more scenario, or clarify something: you're using `tableDesc.getTableName()`. Does it return the fully qualified name (database plus table), or just the table name?
What if tables with the same name are joined from different databases, like `db1.a JOIN db2.a`? There is a chance it's not a problem because they fall into separate `MapWork`s, but it's still worth a quick check. Thanks in advance!
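
To make the concern concrete, here is a minimal sketch of how a name-keyed `Set<String>` would behave for `db1.a JOIN db2.a`. It uses a hypothetical `TableRef` class, not Hive's `TableDesc`, and makes no claim about what `tableDesc.getTableName()` actually returns; it only shows the difference between keying on a bare table name and keying on a qualified one.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TableNameDedupSketch {

    // Hypothetical stand-in for a table descriptor; this is NOT Hive's TableDesc.
    static final class TableRef {
        final String dbName;
        final String tableName;

        TableRef(String dbName, String tableName) {
            this.dbName = dbName;
            this.tableName = tableName;
        }

        // Assumed format for illustration: "<db>.<table>".
        String qualifiedName() {
            return dbName + "." + tableName;
        }
    }

    // Keyed on the bare table name: same-named tables in different databases collapse.
    static Set<String> dedupeByTableName(List<TableRef> refs) {
        Set<String> seen = new LinkedHashSet<>();
        for (TableRef ref : refs) {
            seen.add(ref.tableName);
        }
        return seen;
    }

    // Keyed on the qualified name: tables from different databases stay distinct.
    static Set<String> dedupeByQualifiedName(List<TableRef> refs) {
        Set<String> seen = new LinkedHashSet<>();
        for (TableRef ref : refs) {
            seen.add(ref.qualifiedName());
        }
        return seen;
    }

    public static void main(String[] args) {
        List<TableRef> refs = Arrays.asList(new TableRef("db1", "a"), new TableRef("db2", "a"));
        System.out.println(dedupeByTableName(refs));     // prints [a]
        System.out.println(dedupeByQualifiedName(refs)); // prints [db1.a, db2.a]
    }
}
```

If the bare-name case applied to the patch, the second table would be skipped as a duplicate whenever both tables ended up in the same set, which is why the qualified-versus-unqualified question matters.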
