Github user viirya commented on the issue:
https://github.com/apache/spark/pull/18956
The reason `PullupCorrelatedPredicates` produces unresolved plan:
The query causing the problem in `subquery_in_having.q` looks like:
select b.key, min(b.value)
from src b
group by b.key
having b.key in ( select a.key
from src a
where a.value > 'val_9' and a.value = min(b.value)
)
order by b.key
;
The optimized plan looks like:
'Sort [key#201 ASC NULLS FIRST], true
+- 'Project [key#201, min(value)#204]
+- 'Filter key#201 IN (list#200 [(value#207 = min(value)#204)])
: +- Project [key#206, value#207]
: +- Filter (value#207 > val_9)
: +- InMemoryRelation [key#206, value#207], true, 5,
StorageLevel(disk, memory, deserialized, 1 replicas), src
: +- HiveTableScan [key#0, value#1],
HiveTableRelation `default`.`src`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0, value#1]
+- Aggregate [key#201], [key#201, min(value#202) AS min(value)#204,
min(value#202) AS min(value#202)#209]
+- InMemoryRelation [key#201, value#202], true, 5,
StorageLevel(disk, memory, deserialized, 1 replicas), src
+- HiveTableScan [key#0, value#1], HiveTableRelation
`default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0,
value#1]
Before `PullupCorrelatedPredicates` rule, the subquery in `ListQuery` looks
like:
Project [key#206]
+- Filter ((value#207 > val_9) && (value#207 = outer(min(value)#204)))
+- InMemoryRelation [key#206, value#207], true, 5,
StorageLevel(disk, memory, deserialized, 1 replicas), src
+- HiveTableScan [key#0, value#1], HiveTableRelation
`default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0,
value#1]
Currently the `In` predicate in `Filter` is resolved.
After the rule, the subquery looks like:
Project [key#206, value#207]
+- Filter (value#207 > val_9)
+- InMemoryRelation [key#206, value#207], true, 5,
StorageLevel(disk, memory, deserialized, 1 replicas), src
+- HiveTableScan [key#0, value#1], HiveTableRelation
`default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0,
value#1]
Notice `key#206` has been added into `Project` and the condition `value#207
= outer(min(value)#204)` has been pulled out to top `Filter`:
'Filter key#201 IN (list#200 [(value#207 = min(value)#204)])
Because `In.checkInputDataTypes` checks if the size of left columns
(key#201) matches the size of subquery output (key#206, value#207), so it fails
and return `false` for `In.resolved`.
The unresolved `Filter` doesn't cause problem before because it will be
converted to `Join` by `RewritePredicateSubquery` rule later. But it has been
detected by this structural integrity check.
By modifying `In.checkInputDataTypes`, we can solve this issue. I'd submit
another PR for it.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]