dongjoon-hyun opened a new pull request, #411:
URL: https://github.com/apache/spark-connect-swift/pull/411

   ### What changes were proposed in this pull request?
   
   This PR aims to support `sampleBy` for `DataFrame` by wiring the `StatSampleBy`
   Spark Connect relation through `DataFrameStatFunctions`, exposed via `DataFrame.stat`
   like PySpark/Scala.
   
   ```swift
   public func sampleBy<T: Sendable & Hashable>(_ col: String, _ fractions: [T: Double], _ seed: Int64) async -> DataFrame
   public func sampleBy<T: Sendable & Hashable>(_ col: String, _ fractions: [T: Double]) async -> DataFrame
   ```
   
   `sampleBy` returns a stratified sample without replacement based on the fraction
   given for each stratum. A stratum that is not specified is treated as having a
   fraction of zero. The seed is optional; a random seed is used when it is omitted.
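   
   The semantics described above can be modeled locally in a few lines. The following is
   only an illustrative sketch, not the code in this PR and not Spark's implementation:
   `localSampleBy` and `SplitMix64` are hypothetical names, and the sketch assumes that
   each row is kept by an independent random draw against the fraction of its stratum.
   
   ```swift
   // Illustrative model of the sampleBy semantics; runs without Spark.
   
   /// Small deterministic generator so that a fixed seed reproduces the same sample.
   struct SplitMix64: RandomNumberGenerator {
     private var state: UInt64
     init(seed: UInt64) { state = seed }
     mutating func next() -> UInt64 {
       state &+= 0x9E3779B97F4A7C15
       var z = state
       z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
       z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
       return z ^ (z >> 31)
     }
   }
   
   /// Hypothetical helper: keeps each row with the probability given for its stratum.
   func localSampleBy<Row, Key: Hashable>(
     _ rows: [Row],
     key: (Row) -> Key,
     fractions: [Key: Double],
     seed: Int64? = nil
   ) -> [Row] {
     // Use a random seed when none is given, as the description above states.
     let seedBits = seed.map { UInt64(bitPattern: $0) } ?? UInt64.random(in: 0...UInt64.max)
     var rng = SplitMix64(seed: seedBits)
     return rows.filter { row in
       // A stratum that is not listed is treated as having a fraction of zero.
       let fraction = fractions[key(row)] ?? 0.0
       // Each row is visited once, so a row can appear at most once in the result.
       return Double.random(in: 0..<1, using: &rng) < fraction
     }
   }
   
   // Three strata (key = 0, 1, 2); stratum 2 is intentionally left unspecified.
   let rows = Array(0..<1000).map { (id: $0, key: $0 % 3) }
   let sample = localSampleBy(rows, key: { $0.key }, fractions: [0: 0.1, 1: 0.5], seed: 42)
   ```
   
   In this model no row from stratum 2 is ever kept, and repeating the call with the same
   seed yields the same rows. The sample size per stratum is approximate, not exact.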
   
   ### Why are the changes needed?
   
   To improve API coverage by mirroring PySpark/Scala `DataFrameStatFunctions`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this PR adds a new API, `DataFrame.stat.sampleBy`.
   
   ### How was this patch tested?
   
   Pass the CIs with a new test case, `sampleBy`, in `DataFrameStatFunctionsTests`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Claude Opus 4.8)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

