[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...

viirya Wed, 16 Aug 2017 03:37:12 -0700

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/18956
  
    The reason `PullupCorrelatedPredicates` produces unresolved plan:
    
    The query causing the problem in `subquery_in_having.q` looks like:
    
        select b.key, min(b.value)
        from src b
        group by b.key
        having b.key in ( select a.key
                        from src a
                        where a.value > 'val_9' and a.value = min(b.value)
                        )
        order by b.key
        ;
    
    The optimized plan looks like:
    
        'Sort [key#201 ASC NULLS FIRST], true
        +- 'Project [key#201, min(value)#204]
           +- 'Filter key#201 IN (list#200 [(value#207 = min(value)#204)])
              :  +- Project [key#206, value#207]
              :     +- Filter (value#207 > val_9)
              :        +- InMemoryRelation [key#206, value#207], true, 5, 
StorageLevel(disk, memory, deserialized, 1 replicas), src
              :              +- HiveTableScan [key#0, value#1], 
HiveTableRelation `default`.`src`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0, value#1]
           +- Aggregate [key#201], [key#201, min(value#202) AS min(value)#204, 
min(value#202) AS min(value#202)#209]
           +- InMemoryRelation [key#201, value#202], true, 5, 
StorageLevel(disk, memory, deserialized, 1 replicas), src
           +- HiveTableScan [key#0, value#1], HiveTableRelation 
`default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0, 
value#1]
    
    Before `PullupCorrelatedPredicates` rule, the subquery in `ListQuery` looks 
like:
    
        Project [key#206]
        +- Filter ((value#207 > val_9) && (value#207 = outer(min(value)#204)))
           +- InMemoryRelation [key#206, value#207], true, 5, 
StorageLevel(disk, memory, deserialized, 1 replicas), src
                 +- HiveTableScan [key#0, value#1], HiveTableRelation 
`default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0, 
value#1]
    
    Currently the `In` predicate in `Filter` is resolved.
    
    After the rule, the subquery looks like:
    
        Project [key#206, value#207]
        +- Filter (value#207 > val_9)
           +- InMemoryRelation [key#206, value#207], true, 5, 
StorageLevel(disk, memory, deserialized, 1 replicas), src
                 +- HiveTableScan [key#0, value#1], HiveTableRelation 
`default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0, 
value#1]
    
    Notice `key#206` has been added into `Project` and the condition `value#207 
= outer(min(value)#204)` has been pulled out to top `Filter`:
    
         'Filter key#201 IN (list#200 [(value#207 = min(value)#204)])
    
    Because `In.checkInputDataTypes` checks if the size of left columns 
(key#201) matches the size of subquery output (key#206, value#207), so it fails 
and return `false` for `In.resolved`.
    
    The unresolved `Filter` doesn't cause problem before because it will be 
converted to `Join` by `RewritePredicateSubquery` rule later. But it has been 
detected by this structural integrity check.
    
    By modifying `In.checkInputDataTypes`, we can solve this issue. I'd submit 
another PR for it.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #18956: [SPARK-21726][SQL] Check for structural integrity of the...

Reply via email to