[
https://issues.apache.org/jira/browse/SPARK-55978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stanley Yao updated SPARK-55978:
--------------------------------
Description:
h2. TL;DR
* Points out a gap in Spark: there is no efficient table-sampling method for
fast development exploration.
* Proposes a SQL language augmentation that fills the gap and brings a few more
benefits.
* We’d like to gather feedback on the proposal.
* This doc intentionally omits the technical design because it depends on the
language syntax. Once we align on the language augmentation, we will present
the detailed design and implementation.
h2. Problem Statement
Users running iterative development workflows (e.g., testing queries, tuning
joins, validating schemas) routinely scan full tables even when a small
fraction of the data would suffice. For a 10 TB table with 50,000 splits, a
developer who only needs to verify output shape or spot-check results is forced
to read all 50,000 splits. All the existing sampling methods read the entire
input and discard rows at the sampling step. Users can work around this by
manually creating a down-sampled table, but that is a time-consuming and clunky
process and doesn’t support multiple sample rates. This proposal fills the gap
by offering a method that skips some IO based on the table metadata.
h2. Current Situation
This is the current Spark sampling clause:
TABLESAMPLE ( { percentage PERCENT | num_rows ROWS | BUCKET fraction OUT OF total } )
[ REPEATABLE ( seed ) ]
|*Feature*|*Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino*|
|*Row level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n PERCENT)|TABLESAMPLE BERNOULLI (n)|
|*Block/split level*|n/a|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM (n)|
|*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
|*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
|*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
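For example, today's row-level sampling reads every split and discards rows afterwards (the table name {{events}} is illustrative):

```sql
-- Row-level sampling: all input is read, then ~10% of rows are kept;
-- the seed makes the sample repeatable across runs.
SELECT * FROM events TABLESAMPLE (10 PERCENT) REPEATABLE (42);

-- Exact row count: returns 1000 rows.
SELECT * FROM events TABLESAMPLE (1000 ROWS);
```

Neither form lets the scan skip splits, which is the gap this proposal addresses.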
h2. Proposed Solution
We propose adding one more sampling method extension, highlighted in green
below.
TABLESAMPLE *[ SYSTEM | BERNOULLI | <other_extensions> ]* ( \{ percentage PERCENT | num_rows ROWS | BUCKET fraction OUT OF total } )
[ REPEATABLE ( seed ) ]
|*Feature*|*Apache Spark*|*ANSI SQL (SQL:2003+)*|*Presto/Trino*|
|*Row level random*|TABLESAMPLE (n PERCENT)|TABLESAMPLE BERNOULLI (n PERCENT)|TABLESAMPLE BERNOULLI (n)|
|*Block/split level*|{color:#00875a}*TABLESAMPLE SYSTEM (n PERCENT)*{color}|TABLESAMPLE SYSTEM (n PERCENT)|TABLESAMPLE SYSTEM (n)|
|*Hash-based bucketing*|TABLESAMPLE (BUCKET x OUT OF y)|n/a|n/a|
|*Exact row count*|TABLESAMPLE (n ROWS)|n/a|n/a|
|*Repeatable seed*|REPEATABLE (seed)|REPEATABLE (seed)|n/a|
Meaning of *SYSTEM*:
* *ANSI definition*: _"The method of selection is implementation-dependent.
Each implementation shall describe the method of selection it uses. The
selection is done based on some implementation-defined unit of physical
storage (e.g., a page, block, or disk extent), such that each unit has
approximately a <sample percentage> / 100 probability of being selected."_
-- ISO/IEC 9075-2 (SQL/Foundation), Section 7.6
* *Spark implementation*: in general, the sampling unit maps to an RDD
partition. Different data sources (e.g., Iceberg, Delta Lake, Hudi) may have
different natural "blocks".
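Under the proposed augmentation, an exploratory query could request split-level sampling directly. The syntax below is illustrative, pending alignment; {{big_table}} is a placeholder name:

```sql
-- Proposed: select ~1% of splits from table metadata and read only those,
-- skipping the IO for the remaining splits entirely.
SELECT * FROM big_table TABLESAMPLE SYSTEM (1 PERCENT) REPEATABLE (7);

-- Existing row-level behavior would remain available, optionally via the
-- explicit BERNOULLI keyword for ANSI alignment.
SELECT * FROM big_table TABLESAMPLE BERNOULLI (1 PERCENT);
```

The key difference is where sampling happens: BERNOULLI filters rows after the scan, while SYSTEM prunes whole splits before the scan.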
h2. Benefits for Spark
* Faster exploratory queries, which is critical for large-scale jobs, e.g., ML
and GenAI.
* One more way to achieve query cost savings (i.e., IO reduction).
* Enables data-source-level sampling. DSv2 already supports sample pushdown
(the *SupportsPushDownTableSample* interface), so the Iceberg, Delta, and Hudi
connectors can benefit from native sampling.
* ANSI SQL compliance.
h2. Limitations
* Block/split-level sampling is not statistically fair or unbiased. This is by
design: because this method does not read all the data, it prioritizes IO
savings over a strictly unbiased sampling distribution.
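As a concrete illustration of this caveat, extrapolating aggregates from a SYSTEM sample can be skewed when splits are uneven (sketch using the proposed syntax; {{big_table}} is a placeholder):

```sql
-- Rough row-count estimate by scaling up a 10% split-level sample.
-- If splits vary greatly in size, or split contents correlate with the
-- data (e.g., date-partitioned tables), this estimate may be badly skewed;
-- use BERNOULLI sampling when statistical fairness matters.
SELECT count(*) * 10 AS approx_total
FROM big_table TABLESAMPLE SYSTEM (10 PERCENT);
```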
> [Improvement] Block/Split Sampling SQL Augmentation
> ---------------------------------------------------
>
> Key: SPARK-55978
> URL: https://issues.apache.org/jira/browse/SPARK-55978
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.1
> Reporter: Stanley Yao
> Priority: Major
> Fix For: 4.1.1
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]