hemanthumashankar0511 commented on PR #6317: URL: https://github.com/apache/hive/pull/6317#issuecomment-3928039615
> > > I have a question here: > > > ``` > > > Set<TableDesc> processedTables = new HashSet<>(); > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Why are we storing `TableDesc` object? Can't we just do it via `tableDesc.getFullTableName()`, it is like just storing the `tableName` vs the `TableDesc`. The `equals` & `hashcode` seams comparatively heavier than a `String` > > > > > > I stuck with the TableDesc object to be safe with things like self-joins (e.g., table A join table A). If we just checked the table name, we’d incorrectly skip the second configuration. > > Also, since the planner reuses the exact same object instance for partitions, the Set check is mostly just comparing memory addresses. This is actually faster than hashing and comparing Strings. > > I agree with using as light comparison as possible, so I believe it's worth discovering String comparison (given maximum collision in case of comparing the same TableDesc objects just as many times as many partitions we have) > > btw: @hemanthumashankar0511 you mentioned self-joins and safety: how is it a risk? in case of a self-join, does "we’d incorrectly skip the second configuration" true? does it mean different TableDesc objects that should be considered twice by this configuration method > > regarding: "Set check is mostly just comparing memory addresses" <- this can be true though, if you mean "==" comparison: > > https://github.com/apache/hive/blob/7060d94843fdbc548445db6aac84dd60b44641ee/ql/src/java/org/apache/hadoop/hive/ql/plan/TableDesc.java#L236-L238 > > ``` > public boolean equals(Object o) { > if (o == this) { > return true; > } > ``` > > so from this point of view, this is fine, however regarding "This is actually faster than hashing and comparing Strings.", I think hashing will happen anyway in case of a Hash-based collection, and in the same bucket is where the equals() plays a role, correct me if I'm wrong @abstractdog Thanks for the correction on the hashing. You are totally right, I didn't realize `HashSet` still computes the hash before checking equality. I appreciate you pointing that out. Regarding why I used the `TableDesc` object instead of just the `tableName` string, my main worry was accidentally breaking things like column pruning. I'm still very new to Hive internals, so I just tried running a quick test to see what the planner does. I ran an `EXPLAIN FORMATTED` on a simple self-join: ```sql SELECT a.key, b.value FROM src a JOIN src b ON a.key = b.key; ``` From what I could understand in the output, it looks like it actually generates two different configurations for the `src` table: * For alias **`a`**, the `TableScan` shows `"columns:":["key"]` * For alias **`b`**, the `TableScan` shows `"columns:":["key","value"]` My fear is that if we just check the string `"src"`, the code might configure the job for alias **`a`** (only fetching `key`), mark `"src"` as done, and then completely skip alias **`b`**. That would mean alias **`b`** will never get configured to fetch the `value` column, which might cause an issue? I thought using `Set<TableDesc>` was safer because its `equals()` method would notice the column lists are different and make sure both get configured. Since the patch speeds things up a lot by skipping thousands of partitions down to just the unique tables, I thought object comparison would be safe. Please don't hesitate to correct me if I'm totally wrong in my understanding of this or if I'm misinterpreting the explain plan. I'm still learning how all these pieces fit together. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
