huaxingao opened a new pull request #34451:
URL: https://github.com/apache/spark/pull/34451


   ### What changes were proposed in this pull request?
   
   Push down Sample to the data source for better performance. If Sample is
   pushed down, it is removed from the logical plan, so Spark no longer applies it.
   
   Current Plan without Sample push down:
   ```
   == Parsed Logical Plan ==
   'Project [*]
   +- 'Sample 0.0, 0.8, false, 157
      +- 'UnresolvedRelation [postgresql, new_table], [], false
   
   == Analyzed Logical Plan ==
   col1: int, col2: int
   Project [col1#163, col2#164]
   +- Sample 0.0, 0.8, false, 157
      +- SubqueryAlias postgresql.new_table
         +- RelationV2[col1#163, col2#164] new_table
   
   == Optimized Logical Plan ==
   Sample 0.0, 0.8, false, 157
   +- RelationV2[col1#163, col2#164] new_table
   
   == Physical Plan ==
   *(1) Sample 0.0, 0.8, false, 157
   +- *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@6dde4769 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: struct<col1:int,col2:int>
   ```
   Plan after Sample push down:
   ```
   == Parsed Logical Plan ==
   'Project [*]
   +- 'Sample 0.0, 0.8, false, 187
      +- 'UnresolvedRelation [postgresql, new_table], [], false
   
   == Analyzed Logical Plan ==
   col1: int, col2: int
   Project [col1#163, col2#164]
   +- Sample 0.0, 0.8, false, 187
      +- SubqueryAlias postgresql.new_table
         +- RelationV2[col1#163, col2#164] new_table
   
   == Optimized Logical Plan ==
   RelationV2[col1#163, col2#164] new_table
   
   == Physical Plan ==
   *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$1@65b57543 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], PushedSample: TABLESAMPLE  0.0 0.8 false 187, ReadSchema: struct<col1:int,col2:int>
   ```
   The new interface is implemented for JDBC as a proof of concept and for
   end-to-end testing. Not all databases support TABLESAMPLE, so this PR
   implements it for PostgreSQL.
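   
   To illustrate how the pushed sample parameters could map onto PostgreSQL's `TABLESAMPLE` syntax, here is a minimal sketch. It is not the code in this PR: the class `PostgresTableSampleSketch` and its method `toClause` are hypothetical names, and the choice of `BERNOULLI` with a `REPEATABLE` seed, as well as collapsing the lower and upper bounds into a single percentage, are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Spark: converts Sample parameters
// (as shown in the plans above) into a PostgreSQL TABLESAMPLE clause.
final class PostgresTableSampleSketch {

    // Returns the clause text, or Optional.empty() when the sample cannot be
    // expressed in PostgreSQL and must stay in Spark.
    static Optional<String> toClause(double lowerBound, double upperBound,
                                     boolean withReplacement, long seed) {
        // PostgreSQL's built-in sampling methods do not sample with replacement.
        if (withReplacement) {
            return Optional.empty();
        }
        // Reject ranges that do not describe a valid fraction.
        if (lowerBound < 0.0 || upperBound > 1.0 || lowerBound > upperBound) {
            return Optional.empty();
        }
        // Assumption: only the width of the range matters, expressed as a percentage.
        double percent = (upperBound - lowerBound) * 100.0;
        return Optional.of(String.format(Locale.ROOT,
            "TABLESAMPLE BERNOULLI (%.4f) REPEATABLE (%d)", percent, seed));
    }
}
```

   With the parameters from the plan above, `toClause(0.0, 0.8, false, 187L)` describes an 80 percent Bernoulli sample with seed 187. A sample with replacement yields an empty result, in which case Sample would have to remain in the Spark plan.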
   
   ### Why are the changes needed?
   Reduce I/O and improve performance. For SAMPLE, e.g. `SELECT * FROM t
   TABLESAMPLE (1 PERCENT)`, Spark currently retrieves all the data from the
   table and then returns 1% of the rows. Pushing Sample down to the data source
   can dramatically reduce the amount of data transferred and improve performance.
   
   ### Does this PR introduce any user-facing change?
   Yes. It adds a new interface, `SupportsPushDownTableSample`.
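   
   The description above does not spell out the interface's methods, so the following is only a sketch of the general shape such a push-down contract might take. The names `SamplePushDownSketch`, `SketchScanBuilder`, and `SamplePushDownRuleSketch` are hypothetical stand-ins, and the method signature is an assumption modelled on the four Sample parameters shown in the plans (lower bound, upper bound, with-replacement flag, seed).

```java
// Hypothetical stand-in for the new interface; the method shape is an assumption.
interface SamplePushDownSketch {
    // Returns true if the data source takes over the sampling.
    boolean pushTableSample(double lowerBound, double upperBound,
                            boolean withReplacement, long seed);
}

// Toy scan builder that accepts samples without replacement.
final class SketchScanBuilder implements SamplePushDownSketch {
    private String pushedSample = null;

    @Override
    public boolean pushTableSample(double lowerBound, double upperBound,
                                   boolean withReplacement, long seed) {
        if (withReplacement) {
            return false; // assumed unsupported by this toy source
        }
        // Remember what was pushed so it can be reported in the scan description.
        pushedSample = "TABLESAMPLE " + lowerBound + " " + upperBound
            + " " + withReplacement + " " + seed;
        return true;
    }

    String pushedSample() {
        return pushedSample;
    }
}

// Toy version of the planning decision: keep Sample in Spark unless it was pushed.
final class SamplePushDownRuleSketch {
    // Returns true when Spark must still apply Sample itself.
    static boolean sparkStillSamples(Object scanBuilder, double lowerBound,
                                     double upperBound, boolean withReplacement,
                                     long seed) {
        if (scanBuilder instanceof SamplePushDownSketch) {
            SamplePushDownSketch source = (SamplePushDownSketch) scanBuilder;
            return !source.pushTableSample(lowerBound, upperBound, withReplacement, seed);
        }
        return true; // source does not support sample push-down
    }
}
```

   In this sketch, a source that accepts the sample lets the planner drop the Sample node, which mirrors the optimized plan shown earlier. A source that declines, or that does not implement the interface at all, leaves sampling to Spark.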
   
   ### How was this patch tested?
   New test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


