Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/14411
  
    @hvanhovell , thanks for sharing the blog. I will read through it. It's nice to 
see NOT IN implemented this way. I have an idea to do it differently, but let's 
move that discussion to another place.
    
    On the SAMPLE issue you raised, I think we should not flag an error. Here 
is what I tested:
    
    Seq((1,1), (2,2)).toDF("c1","c2").createOrReplaceTempView("t1")
    Seq((1,1), (2,2)).toDF("c3","c4").createOrReplaceTempView("t2")
    
    scala> sql("select * from t1 where exists (select 1 from t2 tablesample(10 
percent) s where c3=c1)").explain(true)
    == Parsed Logical Plan ==
    'Project [*]
    +- 'Filter exists#29
       :  +- 'SubqueryAlias exists#29
       :     +- 'Project [unresolvedalias(1, None)]
       :        +- 'Filter ('c3 = 'c1)
       :           +- 'Sample 0.0, 0.1, false, 159
       :              +- 'UnresolvedRelation `t2`, s
       +- 'UnresolvedRelation `t1`
    
    From the parser output, the correlated predicate in the Filter operation is applied 
after the sampling operation. We should be able to treat the semantics of the sampling 
as a one-time execution whose result is reused for every row from the outer table. 
Using the analogy I used for LIMIT in JIRA SPARK-16804, the SAMPLE operation is not 
on the correlation path, so moving the correlated predicate above the scope of the 
subquery does not change the semantics of the query.
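
    To make that concrete, here is a rough hand-rewrite of the query under the 
interpretation I have in mind (just a sketch, not the actual Catalyst transformation): 
sample t2 once, then apply the former correlated predicate as the join condition on 
the sampled result.
    
    // Sketch only: sample t2 once, then semi-join on the ex-correlated predicate c3 = c1.
    scala> sql("select t1.* from t1 left semi join (select c3 from t2 tablesample(10 percent)) s on s.c3 = t1.c1").explain(true)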
    
    Your thoughts, please!

