cloud-fan commented on code in PR #54972:
URL: https://github.com/apache/spark/pull/54972#discussion_r3221254647
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala:
##########
@@ -1039,7 +1039,16 @@ abstract class SparkStrategies extends
QueryPlanner[SparkPlan] {
execution.FilterExec(f.typedCondition(f.deserializer),
planLater(f.child)) :: Nil
case e @ logical.Expand(_, _, child) =>
execution.ExpandExec(e.projections, e.output, planLater(child)) :: Nil
- case logical.Sample(lb, ub, withReplacement, seed, child) =>
+ case logical.Sample(lb, ub, withReplacement, seed, child, sampleMethod)
=>
+ if (sampleMethod == logical.SampleMethod.System) {
Review Comment:
This branch is unreachable in normal flow — `V2ScanRelationPushDown` is
non-excludable and always handles SYSTEM samples (either removes the Sample
node or throws). Reaching it would indicate an internal invariant violation
(e.g. a custom `Optimizer` subclass that skipped `V2ScanRelationPushDown`), not
a user error, so a `SparkException.internalError` would be more accurate than a
user-facing `AnalysisException`.
The comment is also a bit misleading: it says "fell through to row-based
sampling" but the actual cause if this is ever reached is "the SYSTEM-handling
optimizer rule didn't run as expected".
##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownTableSample.java:
##########
@@ -29,11 +29,28 @@
public interface SupportsPushDownTableSample extends ScanBuilder {
/**
- * Pushes down SAMPLE to the data source.
+ * Pushes down BERNOULLI (row-level) SAMPLE to the data source.
*/
boolean pushTableSample(
double lowerBound,
double upperBound,
boolean withReplacement,
long seed);
+
+ /**
+ * Pushes down SAMPLE to the data source with the specified sampling method.
+ */
+ default boolean pushTableSample(
+ double lowerBound,
+ double upperBound,
+ boolean withReplacement,
+ long seed,
+ SampleMethod sampleMethod) {
+ if (sampleMethod == SampleMethod.SYSTEM) {
+ // If the data source hasn't overridden this method, it must have not
added support
Review Comment:
Word order — "must not have" reads more naturally than "must have not".
```suggestion
// If the data source hasn't overridden this method, it must not have
added support
```
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##########
@@ -2448,10 +2448,14 @@ class AstBuilder extends DataTypeAstBuilder
* - TABLESAMPLE(x ROWS): Sample the table down to the given number of rows.
* - TABLESAMPLE(x PERCENT) [REPEATABLE (y)]: Sample the table down to the
given percentage with
* seed 'y'. Note that percentages are defined as a number between 0 and 100.
+ * - TABLESAMPLE SYSTEM(x PERCENT): Sample by data source dependent blocks
or file splits.
Review Comment:
Compound modifier should be hyphenated.
```suggestion
* - TABLESAMPLE SYSTEM(x PERCENT): Sample by data-source-dependent blocks
or file splits.
```
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:
##########
@@ -838,6 +844,33 @@ object V2ScanRelationPushDown extends Rule[LogicalPlan]
with PredicateHelper {
}
def pushDownSample(plan: LogicalPlan): LogicalPlan = plan.transform {
+ case sample: Sample if sample.sampleMethod == SampleMethod.System =>
Review Comment:
The SYSTEM and BERNOULLI branches share most of the structure —
`TableSampleInfo` construction, the `PhysicalOperation(_, Nil, sHolder)`
pattern, and the `PushDownUtils.pushTableSample` call — and differ only in
failure handling. Worth consolidating into a single `case sample: Sample =>
sample.child match { ... }` that branches on `sampleMethod` only for the throw
vs. fall-back decision.
As a side effect, the two paths have already started to drift: the seed-TODO
comment lives only on the SYSTEM branch even though both branches use the same
`* 1000` multiplier.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]