[PR] [SPARK-56573][SQL] Widen the default tablesample seed to reduce collisions [spark]

via GitHub Thu, 18 Jun 2026 18:28:41 -0700


wilmerdooley opened a new pull request, #56608:
URL: https://github.com/apache/spark/pull/56608

### What changes were proposed in this pull request?

When a sample or `TABLESAMPLE` runs without an explicit seed, Spark previously
resolved the default seed via `(math.random() * 1000).toLong`, which can produce
only 1000 distinct values (0 to 999). This change replaces that expression at
both call sites with `Utils.random.nextLong()`, which draws from the full `Long`
range:

- `sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala`:
  `SampleExec.resolvedSeed` now defaults to `Utils.random.nextLong()` (and adds
  the `Utils` import).
- `sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala`:
  `pushDownSample` applies the same default so the two code paths stay
  consistent, and removes the now-stale `TODO(SPARK-56573)` comment above the
  call.

The explicit-seed path (`Some(seed)`, including
`TABLESAMPLE ... REPEATABLE(n)` and `DataFrame.sample(seed = ...)`) is
unchanged, as are the seed type (`Long`) and the pushed `SEED(...)` explain
text.
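
As a rough illustration of the before and after behavior described above, here is a minimal standalone sketch. It is not the code in the diff: `DefaultSeedSketch`, `oldDefaultSeed`, `newDefaultSeed`, and `resolveSeed` are hypothetical names, and a private `java.util.Random` stands in for Spark's shared `Utils.random` (an assumption about its type).

```scala
import java.util.Random

object DefaultSeedSketch {
  // Hypothetical stand-in for Spark's shared `Utils.random` instance.
  private val random = new Random()

  // Old default: math.random() lies in [0.0, 1.0), so the result is in 0..999.
  def oldDefaultSeed(): Long = (math.random() * 1000).toLong

  // New default: a 64-bit value spread over the whole Long range.
  def newDefaultSeed(): Long = random.nextLong()

  // An explicit seed always wins; a default is drawn only when none is given.
  def resolveSeed(seed: Option[Long]): Long = seed.getOrElse(newDefaultSeed())
}
```

With this shape, `DefaultSeedSketch.resolveSeed(Some(42L))` returns the caller's seed untouched, which mirrors the unchanged explicit-seed path.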

### Why are the changes needed?

A 1000-value default-seed space means that independent sample queries that do
not set a seed frequently collide on the same seed, which weakens the
statistical independence expected of separate samples. Widening the default to
the full `Long` range reduces those collisions. The in-tree `TODO(SPARK-56573)`
already flagged this and asked for it to be fixed at both call sites.
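
To put a rough number on "collide", the standard birthday-problem calculation can be sketched as below. This is an illustration only, assuming seeds are drawn independently and uniformly from a space of `seedSpace` values; `SeedCollisionSketch` and `collisionProbability` are hypothetical names, and treating the new default as uniform over 2^64 values is an approximation.

```scala
object SeedCollisionSketch {
  // Probability that at least two of `queries` independent, uniformly drawn
  // seeds coincide, for a seed space of `seedSpace` distinct values.
  def collisionProbability(queries: Int, seedSpace: Double): Double = {
    if (queries > seedSpace) {
      1.0 // pigeonhole: more draws than distinct seeds
    } else {
      // log P(no collision) = sum over k = 1 .. queries-1 of log(1 - k / seedSpace).
      // log1p and expm1 keep precision when k / seedSpace is tiny.
      val logNoCollision =
        (1 until queries).map(k => math.log1p(-k / seedSpace)).sum
      -math.expm1(logNoCollision)
    }
  }
}
```

Under these assumptions, `collisionProbability(38, 1000.0)` is just over one half while `collisionProbability(37, 1000.0)` is just under it, so a few dozen seedless queries are enough to make a shared seed more likely than not. With `math.pow(2, 64)` as the seed space, the same 38 queries give a probability that is vanishingly small.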

### Does this PR introduce _any_ user-facing change?

No. Behavior changes only for samples that do not specify a seed, where the
default seed is now drawn from a wider range; results were already
non-deterministic in that case. Explicit-seed and `REPEATABLE` behavior is
unchanged.

### How was this patch tested?

Existing `sql/core` tests that pin sample behavior with explicit seeds
continue to pass. They were run with `build/sbt -Phadoop-3 "sql/testOnly
org.apache.spark.sql.connector.DataSourceV2TableSampleSuite
org.apache.spark.sql.DataFrameStatSuite"`, which covers the DSv2 pushdown path
and the `SampleExec` path. No new test asserts the default-seed range, since
the default seed is non-deterministic by design and a distinct-count assertion
would be flaky.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5.5)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

