Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/14719
  
    @sarutak would your code be able to resolve this ambiguity of `df("a")` in the join condition?
    
    ````
    val df = Seq((1,0), (2,1)).toDF("a","b")
    val df1 = df.filter(df("b") > 0)
    val result = df.filter(df("a") > 0).join(df1, df("a") === df1("a"), "left").select(df1("a"))
    ````
    
    Here is my understanding.
    
    In the first function call,
    
    `df.filter(df("a") > 0)     ... [1]`
    
    Spark implicitly creates a new Dataset. So when it tries to resolve the column `df("a")` in the argument of the join, the Dataset named `df` is not the caller of the join; `df` is actually embedded in the new, unnamed Dataset `[1]`. What we need here is a scheme to record where Datasets are embedded in other Datasets.
    
    Taking the above example, we can draw a tree representing the embedded structure of the Dataset `result`:
    
    ````
          (result)
           select
             |
            [2]
            join
           /   \
         [1]  (df1)
       filter filter
          |     |
         (df)  (df)
    ````
    where `[1]` is the unnamed Dataset mentioned above and `[2]` is another unnamed Dataset:
    
    `df.filter(df("a") > 0).join(df1, df("a") === df1("a"), "left") ... [2]`
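    
    To make that concrete, here is a toy sketch of what such an embedding record could look like. Everything in it (the `DsNode` type, the labels) is made up for illustration only and is not anything that exists in Spark:
    
    ````
    // Purely illustrative: a tiny tree recording which Datasets are embedded
    // in which derived (unnamed) Datasets, mirroring the diagram above.
    case class DsNode(label: String, children: Seq[DsNode] = Nil)

    val resultTree =
      DsNode("result", Seq(                     // select
        DsNode("[2]", Seq(                      // join
          DsNode("[1]", Seq(DsNode("df"))),     // filter(df("a") > 0) over df
          DsNode("df1", Seq(DsNode("df")))))))  // df1 = filter(df("b") > 0) over df
    ````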
    
    When we try to resolve `df("a")` from `[1]`, there is only one `df` under `[1]`, so the resolution is not ambiguous. The problem is that when we try to resolve `df("a")` in the argument of the join operator, even with the embedded tree structure, how do we distinguish the `df("a")` under `[1]` from the `df("a")` under `df1`?
    
    Your breadth-first-search walk may hit the correct `df` first, but it should keep searching for a second `df` at the same level and, if it finds one, raise an exception.
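    
    Roughly, I picture that level-by-level check looking something like the sketch below. It reuses the toy `DsNode` type from the sketch above and is not Spark's actual resolution code:
    
    ````
    // Same toy DsNode type as in the sketch above; again, not Spark code.
    case class DsNode(label: String, children: Seq[DsNode] = Nil)

    // Walk the tree level by level; refuse to resolve when two candidates
    // with the requested label sit at the same depth.
    def resolve(root: DsNode, name: String): DsNode = {
      var level: Seq[DsNode] = Seq(root)
      while (level.nonEmpty) {
        val hits = level.filter(_.label == name)
        if (hits.size > 1)
          throw new RuntimeException(s"Reference to '$name' is ambiguous: ${hits.size} matches at the same level")
        if (hits.size == 1) return hits.head
        level = level.flatMap(_.children)   // descend one level
      }
      throw new NoSuchElementException(s"No Dataset named '$name'")
    }

    // Starting the search at the join node [2], both copies of `df` show up
    // at the same level, so resolving "df" would throw the ambiguity exception:
    val joinNode =
      DsNode("[2]", Seq(
        DsNode("[1]", Seq(DsNode("df"))),
        DsNode("df1", Seq(DsNode("df")))))
    // resolve(joinNode, "df")   // ambiguous -> exception
    // resolve(joinNode, "df1")  // fine: only one `df1` at its level
    ````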
    
    An interesting test scenario to verify this would be the one below:
    
    `val result = df1.join(df.filter(df("a") > 0), df("a") === df1("a"), "right").select(df1("a"))`


