wilmerdooley opened a new pull request, #56608:
URL: https://github.com/apache/spark/pull/56608

### What changes were proposed in this pull request?

When a sample or `TABLESAMPLE` runs without an explicit seed, Spark resolved the default seed via `(math.random() * 1000).toLong`, which produces only 1000 distinct values (0 to 999). This change replaces that expression at both call sites with `Utils.random.nextLong()`, which draws from the full `Long` range:

- `sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala`: `SampleExec.resolvedSeed` now defaults to `Utils.random.nextLong()` (and adds the `Utils` import).
- `sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala`: `pushDownSample` applies the same default so the two code paths stay consistent, and removes the now-stale `TODO(SPARK-56573)` comment above the call.

The explicit-seed path (`Some(seed)`, including `TABLESAMPLE ... REPEATABLE(n)` and `DataFrame.sample(seed = ...)`) is unchanged, as are the seed type (`Long`) and the pushed `SEED(...)` explain text.

### Why are the changes needed?

With only 1000 possible default seeds, independent sample queries that do not set a seed often collide on the same seed, which weakens the statistical independence expected of separate samples. Widening the default to the full `Long` range reduces those collisions. The in-tree `TODO(SPARK-56573)` already flagged this and asked for it to be fixed across both call sites.

### Does this PR introduce _any_ user-facing change?

No. Behavior changes only for samples that do not specify a seed, where the default seed is now drawn from a wider range; results were already non-deterministic in that case. Explicit-seed and `REPEATABLE` behavior is unchanged.

### How was this patch tested?

Existing `sql/core` tests that pin sample behavior with explicit seeds continue to pass, run with `build/sbt -Phadoop-3 "sql/testOnly org.apache.spark.sql.connector.DataSourceV2TableSampleSuite org.apache.spark.sql.DataFrameStatSuite"` (the DSv2 pushdown path and the `SampleExec` path).

No new test asserts the default-seed range, since the default seed is non-deterministic by design and a distinct-count assertion would be flaky.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5.5)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
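
The collision argument in the PR description can be illustrated with a small standalone sketch. It is not the Spark code: `DefaultSeedSketch`, `oldDefaultSeed`, `newDefaultSeed`, `resolveSeed`, and `collisionProbability` are hypothetical names, and `scala.util.Random` stands in for Spark's internal `Utils.random`. It assumes default seeds are drawn uniformly and independently, and estimates how likely it is that at least two of `n` unseeded queries share a seed.

```scala
import scala.util.Random

object DefaultSeedSketch {
  // Old default from the PR description: values fall in 0 to 999.
  def oldDefaultSeed(): Long = (math.random() * 1000).toLong

  // New default, sketched with scala.util.Random as a stand-in
  // (assumption) for Spark's internal Utils.random.
  def newDefaultSeed(rng: Random): Long = rng.nextLong()

  // An explicit seed always wins; the default applies only when none is given.
  def resolveSeed(explicit: Option[Long], rng: Random): Long =
    explicit.getOrElse(newDefaultSeed(rng))

  // Birthday-style estimate: probability that at least two of `n`
  // independent, uniform draws from `space` possible seeds coincide.
  def collisionProbability(n: Int, space: Double): Double =
    1.0 - (0 until n).foldLeft(1.0)((p, i) => p * (1.0 - i / space))
}
```

Under these assumptions, `DefaultSeedSketch.collisionProbability(38, 1000.0)` already exceeds one half, while the same call with a seed space of `math.pow(2, 64)` is indistinguishable from zero in double precision.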
