rf972 commented on pull request #29695: URL: https://github.com/apache/spark/pull/29695#issuecomment-741810936
Thanks @huaxingao for the fix! We were able to run the TPCH benchmark with your latest code and have these results for almost all of the TPCH queries. Although many of these queries do use aggregates, most cannot yet push them down because other portions of the query, such as joins and UDFs, cannot be pushed down. Our simulation shows that Q18 and Q20 will also benefit from aggregate pushdown once Issue #1 is fixed. These test results were generated using our V2 data source (https://github.com/rf972/s3datasource) and our Docker-based Spark/S3 test environment (https://github.com/peterpuhov-github/dike).

```
TPCH Aggregate Test

Query  No Pushdown    Filter         Filter, Project  Aggregate    Difference
-----  -------------  -------------  ---------------  -----------  ----------
1      754,998,934    744,353,513    152,005,702      152,005,702  0
2      288,283,676    240,552,680    19,640,050       19,640,050   0
3      950,316,971    495,621,641    90,182,777       90,182,777   0
4      925,953,149    761,524,915    179,910,653      179,910,653  0
5      951,729,284    806,714,442    164,092,188      164,092,188  0
6      754,998,934    14,320,106     1,565,944        17           1,565,927
7      951,731,111    426,753,589    91,221,930       91,221,930   0
8      975,686,104    833,050,619    211,038,608      211,038,608  0
9      1,070,094,601  1,047,408,449  258,031,293      258,031,293  0
10     950,319,184    216,925,396    55,738,379       55,738,379   0
11     360,560,535    360,554,115    50,546,842       50,546,842   0
12     925,953,149    203,531,165    37,713,185       37,713,185   0
13     195,318,037    195,318,037    96,823,834       96,823,834   0
14     778,953,541    33,509,818     7,152,453        7,152,453    0
15     1,511,407,582  58,271,128     9,143,413        Issue #1
16     144,139,239    143,184,520    17,410,512       17,410,512   0
17     1,557,907,082  1,510,046,884  164,440,699      164,440,699  0
18     1,705,315,905  1,705,315,905  183,515,611      Issue #1
19     778,953,541    40,553,750     8,311,313        8,311,313    0
20     899,140,386    258,559,476    34,420,567       Issue #1
21     2,437,362,944  2,349,560,183  498,738,573      498,738,573  0
22     219,681,859    217,455,483    17,168,762       Issue #1
```

Issue #1 above is an exception we noticed in Q15, Q18, Q20, and Q22.
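For clarity on how the last column above is read: in every completed row, Difference is simply the Filter, Project value minus the Aggregate value (zero whenever aggregate pushdown saves nothing beyond filter/project pushdown). A minimal sketch, using the Q6 row as a worked example:

```scala
// Sketch: the Difference column as Filter,Project minus Aggregate.
// The Q6 figures below are taken from the table; the helper name is ours.
object DifferenceExample {
  def difference(filterProject: Long, aggregate: Long): Long =
    filterProject - aggregate

  def main(args: Array[String]): Unit = {
    // Q6: 1,565,944 with filter/project pushdown vs 17 with aggregate pushdown
    println(difference(1565944L, 17L)) // prints 1565927, matching the table
  }
}
```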
We came up with these tests to demonstrate the issue:

```scala
val df = sparkSession.table("h2.test.employee")

val query1 = df.select($"DEPT", $"SALARY".as("value"))
  .groupBy($"DEPT")
  .agg(sum($"value").as("total"))
  .filter($"total" > 1000)
query1.show()

val decrease = udf { (x: Double, y: Double) => x - y }

val query2 = df
  .select($"DEPT", decrease($"SALARY", $"BONUS").as("value"), $"SALARY", $"BONUS")
  .groupBy($"DEPT")
  .agg(sum($"value"), sum($"SALARY"), sum($"BONUS"))
query2.show()
```

Please let us know if additional details are needed on this. Thanks!

We are actively looking at using larger data sets to generate meaningful test timings, and will share the timing results as soon as we are done.
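As an aside on how we confirm whether a query actually benefited: we capture the query's `explain()` output and look for a pushed-aggregate marker in the scan node. The sketch below is plain Scala with no Spark dependency; the `PushedAggregates` marker string is an assumption about the DataSource V2 scan output, so check it against the plan text your build actually prints.

```scala
// Sketch: detect aggregate pushdown by scanning captured explain() text.
// Assumption: the V2 scan node reports pushed aggregates with a
// "PushedAggregates" marker (hypothetical; verify against your build).
object PushdownCheck {
  // True if the captured plan text shows a pushed-down aggregate.
  def aggregatePushedDown(planText: String): Boolean =
    planText.contains("PushedAggregates")

  def main(args: Array[String]): Unit = {
    val pushed    = "Scan ... PushedAggregates: [SUM(SALARY)], PushedFilters: []"
    val notPushed = "HashAggregate(keys=[DEPT], functions=[sum(SALARY)])"
    println(aggregatePushedDown(pushed))    // prints true
    println(aggregatePushedDown(notPushed)) // prints false
  }
}
```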
