[
https://issues.apache.org/jira/browse/HIVE-24376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285335#comment-17285335
]
Zoltan Haindrich commented on HIVE-24376:
-----------------------------------------
[~jcamachorodriguez] this may also cause issues in some other cases;
consider the following:
* TS[1] and TS[2] both scan the same table
* in JOIN[3] - the output of TS[1] is joined with TS[2]
* TS[2] is filtered by a semijoin input; say it keeps 1% of the original
data
* because the output of TS[2] is small, MapJoinConversion has selected a plain
MapJoin as the best option
* now SWO (SharedWorkOptimizer) merges the TS[1] and TS[2] scans and removes
the semijoin edge
* the mapjoin will encounter the full TS[2] table content, which might be too
much
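
To put rough numbers on the scenario above, here is a minimal sketch. It is not
Hive code: the row count, row width and small-table budget are made-up values,
and `estimated_bytes` / `fits_in_mapjoin` are hypothetical helpers standing in
for whatever size check the MapJoin conversion actually performs.

```python
# Illustrative model only; none of these names or numbers come from Hive.
TABLE_ROWS = 100_000_000                 # assumed rows in the table read by TS[1] and TS[2]
ROW_BYTES = 100                          # assumed average row width in bytes
SEMIJOIN_KEEP_PERCENT = 1                # the semijoin filter keeps 1% of TS[2]
SMALL_TABLE_LIMIT = 256 * 1024 * 1024    # assumed budget for the MapJoin small side


def estimated_bytes(rows: int, keep_percent: int = 100) -> int:
    """Estimated output size of a scan that keeps keep_percent of its rows."""
    return rows * keep_percent // 100 * ROW_BYTES


def fits_in_mapjoin(size_bytes: int) -> bool:
    """True if the small side stays within the assumed MapJoin budget."""
    return size_bytes <= SMALL_TABLE_LIMIT


# Size the MapJoin decision was based on: TS[2] with the semijoin filter.
planned = estimated_bytes(TABLE_ROWS, SEMIJOIN_KEEP_PERCENT)
# Size the MapJoin actually sees once SWO has removed the semijoin edge.
after_swo = estimated_bytes(TABLE_ROWS)

print("planned:", planned, "fits:", fits_in_mapjoin(planned))
print("after SWO:", after_swo, "fits:", fits_in_mapjoin(after_swo))
```

With these assumed values the filtered TS[2] fits the budget, while the
unfiltered one is 100 times larger and does not; the MapJoin choice was made for
the first case but runs against the second.
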
> SharedWorkOptimizer may retain the SJ filter condition during RemoveSemijoin
> mode
> ----------------------------------------------------------------------------------
>
> Key: HIVE-24376
> URL: https://issues.apache.org/jira/browse/HIVE-24376
> Project: Hive
> Issue Type: Improvement
> Reporter: Zoltan Haindrich
> Priority: Major
>
> the mode name is also a bit confusing, but here is what happens:
> {code}
> TS[A1] -> ...
> TS[A2] -> JOIN
> TS[B] -> JOIN
> {code}
> we have an SJ edge between TS[B] -> TS[A2] to communicate information about
> the join keys; let's assume the reduction ratio was r.
> RemoveSemijoin right now does the following:
> * removes the semijoin edge (so TS[A2] will become a full scan)
> * merges TS[A1] and TS[A2]
> w.r.t. reading data from disk, this is great: we accessed A twice, one of
> which was a full scan, and now we only read it once.
> but from a row traffic perspective, TS[A2] emits more rows from now on because
> we no longer have the semijoin reduction of ratio r.
>
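
The trade-off in the quoted description can be sketched as a toy cost model.
Everything here is an assumption made for illustration: `PlanCost`,
`before_merge` and `after_merge` are hypothetical names, the row count is
invented, and r is read as the fraction of TS[A2] rows that survive the semijoin
filter (the description does not say which way the ratio is meant).

```python
# Hypothetical cost model; names and numbers are illustrative, not Hive APIs.
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanCost:
    scans_of_a: int       # how many times table A is read from disk
    rows_into_join: int   # rows the A-side scan sends into the JOIN


def before_merge(rows_in_a: int, kept_per_mille: int) -> PlanCost:
    # TS[A1] and TS[A2] are separate scans; the SJ edge from TS[B] filters TS[A2].
    return PlanCost(scans_of_a=2, rows_into_join=rows_in_a * kept_per_mille // 1000)


def after_merge(rows_in_a: int) -> PlanCost:
    # RemoveSemijoin drops the SJ edge and merges both scans into one full scan.
    return PlanCost(scans_of_a=1, rows_into_join=rows_in_a)


# Assume A has 10M rows and the semijoin filter keeps 5% of them (r = 0.05).
before = before_merge(10_000_000, kept_per_mille=50)
after = after_merge(10_000_000)

print(before)
print(after)
```

Under these assumptions the merge halves the number of scans of A, but the JOIN
receives 20 times as many rows from the A side.
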