Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/10427#issuecomment-167017518
  
    Let me leave a comment. I tested some cases with this PR and it generally works fine. But I would like to mention one thing that I am pretty sure you already know.
    
    Now that the `Filter` is removed, an operation such as `count()`, which does not require any columns, works differently internally. It looks okay with JDBC but not with other sources such as Parquet.
    
    For example, in the case of Parquet, I implemented this locally, tested it, and found it calculating wrong results. It produced the plan below:
    
    ```
    == Physical Plan ==
    TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#10L])
    +- TungstenExchange SinglePartition, None
       +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#14L])
          +- Project
             +- Scan ParquetRelation[] InputPaths: file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-bc91cad7-b7c6-4ef0-985d-abaad921572d/part=1, PushedFilters: [EqualTo(a,2)]
    ```
    Here, I found that no `requiredColumns` were given (note the empty `ParquetRelation[]`), which made Parquet produce wrong results.
    
    I tried with the original code and it showed the plan below, assigning `requiredColumns` properly.
    ```
    == Physical Plan ==
    TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#10L])
    +- TungstenExchange SinglePartition, None
       +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#14L])
          +- Project
             +- Filter (a#8 = 2)
                +- Scan ParquetRelation[a#8] InputPaths: file:/private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-ef271ec6-95e1-43ae-9b3e-1d4dae6f69c3/part=1, PushedFilters: [EqualTo(a,2)]
    ```
    
    The reason seems to be that the columns referenced in projects and filters are collected into `requiredColumns`, but here it ends up with no columns at all, since the `Project` references no columns and the `Filter` is gone.
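    To illustrate the failure mode, here is a simplified, hypothetical model in plain Python (not Spark's actual code): if a pruned scan builds its output rows purely from `requiredColumns`, then an empty column list means no rows are emitted at all, and a `count()` above the scan sees zero rows.

```python
# Hypothetical, simplified model of a column-pruning scan (plain
# Python, not Spark's actual code), over records matching a = 2.
records = [{"a": 1}, {"a": 2}, {"a": 2}]
matching = [r for r in records if r["a"] == 2]

def pruned_scan(required_columns):
    # Output rows are built only from the requested columns; with an
    # empty `required_columns` this buggy scan produces no rows at all.
    if not required_columns:
        return []  # nothing to project -> no rows -> count() sees 0
    return [tuple(r[c] for c in required_columns) for r in matching]

print(len(pruned_scan(["a"])))  # 2 -> the correct count
print(len(pruned_scan([])))     # 0 -> the wrong count() result
```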
    
    For `JDBCRelation`, I tested and saw `count()` working fine, as it simply selects `1` when `requiredColumns` is empty, producing the plan below.
    
    ```
    == Physical Plan ==
    TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#38L])
    +- TungstenExchange SinglePartition, None
       +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#42L])
          +- Project
             +- Scan JDBCRelation(jdbc:h2:mem:testdb0,TEST.PEOPLE,[Lorg.apache.spark.Partition;@3e906375,{user=testUser, password=testPass, url=jdbc:h2:mem:testdb0, dbtable=TEST.PEOPLE})[] PushedFilters: [EqualTo(NAME,fred)]
    ```
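    By contrast, a JDBC-style scan keeps `count()` correct because, roughly like issuing `SELECT 1 FROM ...`, it still returns one row per matching record even when no columns are required. A simplified, hypothetical sketch in plain Python (not Spark's actual code):

```python
# Hypothetical, simplified model of a JDBC-style scan (plain Python,
# not Spark's actual code), over records matching NAME = 'fred'.
records = [{"NAME": "fred"}, {"NAME": "mary"}, {"NAME": "fred"}]
matching = [r for r in records if r["NAME"] == "fred"]

def jdbc_scan(required_columns):
    # Even with no required columns, one (empty) row is emitted per
    # matching record, so a count() above the scan stays correct.
    if not required_columns:
        return [() for _ in matching]
    return [tuple(r[c] for c in required_columns) for r in matching]

print(len(jdbc_scan([])))  # 2 -> count() is correct without columns
```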


