[ 
https://issues.apache.org/jira/browse/SPARK-55978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanley Yao updated SPARK-55978:
--------------------------------
    Summary: [Improvement] Block/Split Sampling SQL Augmentation  (was: 
Block/Split Sampling SQL Augmentation)

> [Improvement] Block/Split Sampling SQL Augmentation
> ---------------------------------------------------
>
>                 Key: SPARK-55978
>                 URL: https://issues.apache.org/jira/browse/SPARK-55978
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Stanley Yao
>            Priority: Major
>             Fix For: 4.1.1
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h2. TL;DR
>  * Points out a gap in Spark: fast development exploration lacks an efficient 
> table-sampling method.
>  * Proposes a SQL language augmentation that fills the gap and brings a few 
> more benefits.
>  * We’d like to gather feedback on the proposal.
>  * This doc intentionally omits the technical design because it depends on 
> the language syntax. Once we align on the language augmentation, we will 
> present the detailed design and implementation.
> h2. Problem Statement
> Users running iterative development workflows (e.g., testing queries, tuning 
> joins, validating schemas) routinely scan full tables even when a small 
> fraction of the data would suffice. For a 10 TB table with 50,000 splits, a 
> developer who only needs to verify output shape or spot-check results is 
> forced to read all 50,000 splits. All existing sampling methods read the 
> entire input and only discard rows at the sampling step, after the IO has 
> already been paid. Users can work around this by manually creating a 
> down-sampled table, but that is a time-consuming, clunky process that doesn’t 
> support multiple sample rates. This proposal fills the gap by offering a 
> method that skips part of the IO based on table metadata alone.
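> As a rough illustration of the IO gap (a plain-Python sketch with hypothetical sizes, not Spark code): row-level sampling must touch every split before discarding rows, while split-level sampling can skip whole splits before any read happens.

```python
import random

# Hypothetical table: 50 splits of 1,000 rows each (a small stand-in for
# the 50,000-split example above).
splits = [list(range(i * 1000, (i + 1) * 1000)) for i in range(50)]

def row_level_sample(splits, fraction, seed):
    """Bernoulli-style sampling: every split is read, rows are filtered after IO."""
    rng = random.Random(seed)
    splits_read = 0
    rows = []
    for split in splits:
        splits_read += 1  # full scan: IO is paid for every split
        rows.extend(r for r in split if rng.random() < fraction)
    return rows, splits_read

def split_level_sample(splits, fraction, seed):
    """Block/split-level sampling: whole splits are skipped before any IO."""
    rng = random.Random(seed)
    splits_read = 0
    rows = []
    for split in splits:
        if rng.random() < fraction:  # decision made from metadata alone
            splits_read += 1
            rows.extend(split)
    return rows, splits_read

_, io_row = row_level_sample(splits, 0.1, seed=42)
_, io_split = split_level_sample(splits, 0.1, seed=42)
# row-level always reads every split; split-level reads only the selected ones
print(io_row, io_split)
```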
> h2. Current Situation
> This is the current Spark sampling clause:
> TABLESAMPLE ( { percentage PERCENT | num_rows ROWS | BUCKET fraction OUT OF total } )
> [ REPEATABLE ( seed ) ]
> |*Feature*| *Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino* |
> |*Row level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n 
> PERCENT)|TABLESAMPLE BERNOULLI (n)|
> |*Block/split level*|n/a|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM 
> (n)|
> |*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
> |*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
> |*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
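> The REPEATABLE ( seed ) semantics in the table above can be illustrated with a small stand-in (plain Python, not Spark; the function name and row counts are hypothetical): the same seed must reproduce exactly the same sample.

```python
import random

rows = list(range(10_000))

def bernoulli_sample(rows, percent, seed):
    # Stand-in for TABLESAMPLE (percent PERCENT) REPEATABLE (seed):
    # each row is kept independently with probability percent / 100.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < percent / 100]

a = bernoulli_sample(rows, 5, seed=7)
b = bernoulli_sample(rows, 5, seed=7)
c = bernoulli_sample(rows, 5, seed=8)
assert a == b  # same seed: the sample is reproducible
assert a != c  # different seed: a different (but still ~5%) sample
```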
> h2. Proposed Solution
> We propose adding one more sampling method extension, highlighted in green 
> below:
> TABLESAMPLE *[ SYSTEM | BERNOULLI | <other_extensions> ]* ( { percentage PERCENT | num_rows ROWS | BUCKET fraction OUT OF total } )
> [ REPEATABLE ( seed ) ]
> |*Feature*|*Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino* |
> |*Row level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n 
> PERCENT)|TABLESAMPLE BERNOULLI (n)|
> |*Block/split level*|{color:#00875a}*TABLESAMPLE SYSTEM (n 
> PERCENT)*{color}|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM (n)|
> |*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
> |*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
> |*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
>  
> Meaning of “{*}SYSTEM{*}”:
>  * {*}ANSI definition{*}: “{_}The method of selection is 
> implementation-dependent. Each implementation shall describe the method of 
> selection it uses. The selection is done based on some implementation-defined 
> unit of physical storage (e.g., a page, block, or disk extent), such that 
> each unit has approximately a <sample percentage> / 100 probability of being 
> selected. -ISO/IEC 9075-2 (SQL/Foundation), Section 7.6{_}”
>  * {*}Spark implementation{*}: In general, it maps to “RDD partition”. 
> Different data sources (e.g., Iceberg, DeltaLake, Hudi) may have different 
> natural “blocks”.
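> A metadata-only sketch of how SYSTEM could map to RDD partitions (plain Python, hypothetical numbers; `sample_partition_indices` is illustrative, not an actual Spark API): each partition is selected with approximately the requested probability, before any partition data is read.

```python
import random

def sample_partition_indices(num_partitions, percent, seed):
    # Sketch of SYSTEM semantics over RDD partitions: each partition index
    # is selected with probability percent / 100, decided purely from
    # metadata (the partition count), before any partition data is read.
    rng = random.Random(seed)
    return [i for i in range(num_partitions) if rng.random() < percent / 100]

# ~1 PERCENT of a 50,000-partition table: roughly 1% of partitions planned for IO.
picked = sample_partition_indices(50_000, 1, seed=0)
```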
> h2. Benefit for Spark
>  * Faster exploratory queries, which is critical for large scale jobs, e.g., 
> ML, GenAI.
>  * One more way to achieve query cost savings (i.e., IO reduction).
>  * Enables data-source-level sampling: DSv2 already supports sample push-down 
> (the *SupportsPushDownTableSample* interface), so connectors such as Iceberg, 
> Delta, and Hudi can benefit from native sampling.
>  * ANSI SQL compliance.
> h2. Limitations
>  * Block/split-level sampling is not fair/unbiased in the statistical sense. 
> This is by design: because this sampling method doesn’t read all the data, it 
> prioritizes IO savings over a strictly unbiased sampling distribution.
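> The bias is easy to reproduce when data is clustered by split. A plain-Python sketch (hypothetical data, not Spark code): each split holds a single repeated value, so a split-level sample's mean depends heavily on which few splits happen to be picked.

```python
import random
import statistics

# Data clustered by split: split i holds 1,000 copies of the value i,
# a common pattern when tables are written sorted or partitioned by a key.
splits = [[i] * 1000 for i in range(100)]
true_mean = statistics.mean(v for s in splits for v in s)  # 49.5

def split_sample_mean(seed):
    # Keep each whole split with probability 0.10, then average the rows.
    rng = random.Random(seed)
    chosen = [v for s in splits if rng.random() < 0.10 for v in s]
    return statistics.mean(chosen) if chosen else float("nan")

# Each run keeps ~10 whole splits, so the estimate rides on which few
# cluster values were picked -- noisier than a row-level sample of the
# same fraction, which draws from every split.
estimates = [split_sample_mean(seed) for seed in range(20)]
```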



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
