[GitHub] [spark] rf972 commented on pull request #29695: [SPARK-22390][SPARK-32833][SQL] [WIP]JDBC V2 Datasource aggregate push down

GitBox Wed, 06 Jan 2021 09:18:17 -0800


rf972 commented on pull request #29695:
URL: https://github.com/apache/spark/pull/29695#issuecomment-755437981



   Thanks @huaxingao for the fixes!  We are getting further in our testing, 
but we did find a few issues running TPC-H with the latest patch.  We did our 
best here to translate the failing cases into examples that fit into 
JDBCV2Suite.scala.
   
   We noticed an issue in the case where only some of the filters are pushed 
down, yet the aggregate operation is still pushed down.  This results in an 
exception, and we believe that aggregates should only be pushed down if all of 
the filters can be pushed down.  Here is an example test with a filter 
containing a UDF, which cannot be pushed down:
   
   ```
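   import org.apache.spark.sql.functions.{avg, udf}

   val df1 = sparkSession.table("h2.test.employee")
   // UDFs are opaque to the JDBC source, so filters built on them cannot be pushed down.
   val sub2 = udf { (x: String) => x.substring(0, 3) }
   val name = udf { (x: String) => x.matches("cat|dav|amy") }
   val df2 = df1.select($"SALARY", $"BONUS", sub2($"NAME").as("nsub2"))
                .filter("SALARY > 100")   // plain column filter
                .filter(name($"nsub2"))   // UDF filter: must be evaluated in Spark
                .agg(avg($"SALARY").as("avg_salary"))
   df2.explain(true)
   df2.show()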
   ```
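   
   The rule we are proposing can be sketched with a toy model.  This is not 
Spark's planner code: `Filter`, `ColumnFilter`, `UdfFilter` and 
`PushDownSketch` are hypothetical names used only for illustration, and we 
assume a UDF-based filter is the only kind that cannot be pushed down.

   ```scala
   // Toy model of the proposed rule; none of these types exist in Spark.
   sealed trait Filter
   case class ColumnFilter(sql: String) extends Filter // translatable to SQL
   case class UdfFilter(name: String) extends Filter   // opaque to the data source

   object PushDownSketch {
     // Split filters into (pushable to the source, left for Spark to evaluate).
     def splitFilters(filters: Seq[Filter]): (Seq[Filter], Seq[Filter]) =
       filters.partition {
         case _: ColumnFilter => true
         case _: UdfFilter    => false
       }

     // Push the aggregate only when no filter is left for Spark to evaluate;
     // otherwise the source would aggregate rows that a later filter should drop.
     def canPushAggregate(filters: Seq[Filter]): Boolean =
       splitFilters(filters)._2.isEmpty
   }
   ```

   With the filters from the test above, 
`PushDownSketch.canPushAggregate(Seq(ColumnFilter("SALARY > 100"), UdfFilter("name")))` 
is false, so the aggregate would stay in Spark.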
   
   Another issue we found shows up in the example below, which aggregates 
over an expression (`BONUS * SALARY`) rather than a single column:
   ```
   val df1 = sparkSession.table("h2.test.employee")
   df1.filter($"DEPT" > 0 && $"SALARY" >= 0.05)
      .agg(sum($"BONUS" * $"SALARY"))
      .show()
   ```
   
   We also noticed that aggregates seem to require upper-case column names.  
So while these tests pass,
   ```
   df1.filter($"dept" > 0 && $"salary" > 9000).show()
   df1.filter($"dept" > 0 && $"salary" > 9000).agg(sum($"SALARY")).show()
   ```
   
   other tests, like the one below, fail when the column name in the 
aggregate is lower case:
   `df1.filter($"dept" > 0 && $"salary" > 9000).agg(sum($"salary")).show()`
   
   This case issue is admittedly a nit, but we saw it in our testing, so we 
decided to bring it up.
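   
   For reference, the behavior we expected is a case-insensitive lookup of 
the aggregate column, the same as the filters above already get.  A minimal 
sketch follows, assuming case-insensitive resolution is in effect; 
`ColumnResolutionSketch` and `resolve` are hypothetical names, not part of 
the patch:

   ```scala
   // Toy sketch of case-insensitive column resolution; not Spark's analyzer code.
   object ColumnResolutionSketch {
     // Return the schema's own spelling of `name` if exactly one column matches
     // ignoring case; None if the name is unknown or ambiguous.
     def resolve(schema: Seq[String], name: String): Option[String] = {
       val matches = schema.filter(_.equalsIgnoreCase(name))
       if (matches.size == 1) matches.headOption else None
     }
   }
   ```

   Under this sketch, `resolve(Seq("DEPT", "SALARY", "BONUS"), "salary")` 
gives `Some("SALARY")`, so `sum($"salary")` and `sum($"SALARY")` would refer 
to the same column.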
   
   As always, please let us know if more details are needed.  Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



