wilmerdooley opened a new pull request, #56608:
URL: https://github.com/apache/spark/pull/56608

   ### What changes were proposed in this pull request?
   
   When a sample or `TABLESAMPLE` runs without an explicit seed, Spark resolved 
the default seed via `(math.random() * 1000).toLong`, which produces only 
1000 distinct values (0 to 999). This change replaces that expression at both 
call sites with `Utils.random.nextLong()`, which draws from the full `Long` 
range:
   
   - `sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala`: 
`SampleExec.resolvedSeed` now defaults to `Utils.random.nextLong()`, and the 
`Utils` import is added.
   - `sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala`: 
`pushDownSample` applies the same default so the two code paths stay 
consistent, and the now-stale `TODO(SPARK-56573)` comment above the call is 
removed.
   
   The explicit-seed path (`Some(seed)`, including `TABLESAMPLE ... 
REPEATABLE(n)` and `DataFrame.sample(seed = ...)`) is unchanged, as are the 
seed type (`Long`) and the pushed `SEED(...)` explain text.
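   
The difference between the two default-seed expressions can be seen in a small standalone sketch. It does not use Spark: `java.util.Random` stands in for `Utils.random` (an assumption made only so the snippet runs on its own), and the object and method names are hypothetical.

```scala
import java.util.Random

object DefaultSeedRange {
  // Old default: math.random() lies in [0.0, 1.0), so the product lies in
  // [0.0, 1000.0) and truncation yields a Long between 0 and 999.
  def oldDefaultSeed(): Long = (math.random() * 1000).toLong

  // New default, sketched with a plain java.util.Random as a stand-in for
  // Spark's Utils.random (assumption; only the nextLong() call is mirrored).
  private val rng = new Random()
  def newDefaultSeed(): Long = rng.nextLong()

  def main(args: Array[String]): Unit = {
    val draws = 100000
    val oldSeeds = Iterator.fill(draws)(oldDefaultSeed()).toSet
    val newSeeds = Iterator.fill(draws)(newDefaultSeed()).toSet
    // The old expression can never yield more than 1000 distinct seeds.
    println(s"old: ${oldSeeds.size} distinct seeds, max = ${oldSeeds.max}")
    println(s"new: ${newSeeds.size} distinct seeds")
  }
}
```

However many draws are taken, the old expression stays within 0 to 999, while the new one spreads across positive and negative `Long` values.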
   
   ### Why are the changes needed?
   
   A 1000-value default-seed space means independent sample queries that do not 
set a seed frequently draw the same seed (any two such queries match with 
probability 1/1000), which weakens the statistical independence expected of 
separate samples. Widening the default to the full `Long` range reduces those 
collisions. The in-tree `TODO(SPARK-56573)` already flagged this and asked for 
it to be fixed across both call sites.
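   
As a rough illustration of how often seeds collide, the birthday bound gives the probability that at least two of `k` seedless queries share a seed. The sketch below assumes idealized uniform, independent draws over `n` values with `k <= n`, and ignores the specifics of any particular generator; the object and method names are hypothetical.

```scala
object SeedCollision {
  // Probability that at least two of k independent uniform draws from a
  // space of n values coincide: 1 - prod_{i=0}^{k-1} (1 - i/n).
  // log1p and expm1 keep precision when i/n is tiny (for example n = 2^64).
  // Assumes k <= n.
  def collisionProbability(k: Int, n: Double): Double = {
    val logNoCollision =
      (0 until k).map(i => java.lang.Math.log1p(-i / n)).sum
    -java.lang.Math.expm1(logNoCollision)
  }

  def main(args: Array[String]): Unit = {
    val oldSpace = 1000.0            // seeds 0 to 999
    val newSpace = math.pow(2, 64)   // idealized full Long range
    for (k <- Seq(10, 38, 100)) {
      println(f"k=$k%3d old=${collisionProbability(k, oldSpace)}%.4f " +
        f"new=${collisionProbability(k, newSpace)}%.3e")
    }
  }
}
```

Under these assumptions, with 1000 possible seeds a repeated seed becomes more likely than not at around 38 seedless queries, whereas over a 2^64 space the same probability is negligible for any realistic query count.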
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Behavior changes only for samples that do not specify a seed, where the 
default seed is now drawn from a wider range; results were already 
non-deterministic in that case. Explicit-seed and `REPEATABLE` behavior is 
unchanged.
   
   ### How was this patch tested?
   
   Existing `sql/core` tests that pin sample behavior with explicit seeds 
continue to pass when run with `build/sbt -Phadoop-3 "sql/testOnly 
org.apache.spark.sql.connector.DataSourceV2TableSampleSuite 
org.apache.spark.sql.DataFrameStatSuite"` (the DSv2 pushdown path and the 
`SampleExec` path). No new test asserts the default-seed range, since the 
default seed is non-deterministic by design and a distinct-count assertion 
would be flaky.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenAI Codex (GPT-5.5)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

