[GitHub] [spark] huaxingao opened a new pull request #34442: [SPARK-37165][SQL] Add REPEATABLE in TABLESAMPLE to specify seed

GitBox Fri, 29 Oct 2021 09:03:59 -0700


huaxingao opened a new pull request #34442:
URL: https://github.com/apache/spark/pull/34442



   
   ### What changes were proposed in this pull request?
   
   Add a REPEATABLE clause to the TABLESAMPLE SQL syntax so that users can specify a seed.
   
   ### Why are the changes needed?
   
   Current syntax for TABLESAMPLE:
   
   - TABLESAMPLE(x PERCENT)
   - TABLESAMPLE(BUCKET x OUT OF y)
   
   `Dataset.sample` takes a parameter to specify a seed, so SQL should have a way to specify a seed too.
   ```
     def sample(fraction: Double, seed: Long): Dataset[T] = {
       sample(withReplacement = false, fraction = fraction, seed = seed)
     }
   ```
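
To illustrate what the seed parameter does, here is a minimal sketch that draws the same sample twice through the `Dataset.sample(fraction, seed)` overload quoted above. It assumes Spark is on the classpath and runs against a local session; the object name `SeededSampleSketch`, the helper `sameSampleTwice`, the input range, and the seed value are all hypothetical and not part of this PR.

```scala
import org.apache.spark.sql.SparkSession

object SeededSampleSketch {
  // Hypothetical helper: samples roughly 10% of the same Dataset twice with
  // one fixed seed and reports whether both draws returned identical rows.
  def sameSampleTwice(spark: SparkSession, seed: Long): Boolean = {
    val ds = spark.range(0, 1000) // ids 0..999, illustrative input only
    val first = ds.sample(0.1, seed).collect().toSeq
    val second = ds.sample(0.1, seed).collect().toSeq
    first == second
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("seeded-sample-sketch")
      .getOrCreate()
    try {
      println(s"same rows on both runs: ${sameSampleTwice(spark, 42L)}")
    } finally {
      spark.stop()
    }
  }
}
```

With a fixed seed and unchanged input partitioning, the two draws should match; without a seed, each call may pick different rows.
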
   Most DBMSs, e.g. DB2, use REPEATABLE to let users specify a seed, so we will follow the same convention.
   
   <img width="1032" alt="Screen Shot 2021-10-29 at 8 46 04 AM" src="https://user-images.githubusercontent.com/13592258/139465718-285ab5fb-a9cf-4bef-bc32-88301745b12b.png">
   
   
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this PR adds new SQL syntax:
   
   - TABLESAMPLE(x PERCENT) REPEATABLE (seed)
   - TABLESAMPLE(BUCKET x OUT OF y) REPEATABLE (seed)
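
The proposed syntax could be exercised roughly as follows. This is a sketch only, assuming a Spark build that includes this change; the temp view name `t`, the helper `sampleQuery`, the object name, and the seed value are hypothetical and chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

object TableSampleRepeatableSketch {
  // Hypothetical helper: builds a query using the PERCENT form of the
  // proposed syntax, with the seed placed in the REPEATABLE clause.
  def sampleQuery(table: String, percent: Int, seed: Long): String =
    s"SELECT * FROM $table TABLESAMPLE ($percent PERCENT) REPEATABLE ($seed)"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("tablesample-repeatable-sketch")
      .getOrCreate()
    try {
      // Register an illustrative view with ids 0..999.
      spark.range(0, 1000).createOrReplaceTempView("t")
      val query = sampleQuery("t", 10, 42L)
      // Run the same seeded query twice and compare the returned rows.
      val first = spark.sql(query).collect().toSeq
      val second = spark.sql(query).collect().toSeq
      println(s"same rows on both runs: ${first == second}")
    } finally {
      spark.stop()
    }
  }
}
```

The seed plays the same role here as the `seed` argument of `Dataset.sample`: rerunning the query with the same seed over unchanged data is meant to select the same rows.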
   
   ### How was this patch tested?
   New unit tests.
   
   


