Jefffrey commented on PR #17674: URL: https://github.com/apache/datafusion/pull/17674#issuecomment-3388333532
> > Are we planning to implement the seed argument? There wasn't any response to my comments 🙁 @chenkovsky > > @Jefffrey , for seed, it's more complex. In spark, different partition have different seeds. (real_seed = partition_id + user_specified_seed). In datafusion, I think we should use (real_seed = partition_id + record_batch_idx_in_partition + user_specified_seed). But currently, there's no way to get partition_id or record_batch_id. If we don't put partition_id and record_batch_idx_in_partition into seed, all record batches will have same shuffle behavior. I think it's incorrect. I created another issue. #17686 So are we fine with merging this Spark function without implementing seed argument? If this was ported from sail or comet I assume they were fine with this, though I see no mention of if this was the case. So long as we track this in an issue (e.g. in the referenced issue) I guess it's fine for now. My only concern other than that is this could make the tests flaky, as some of the tests rely on the shuffled output being not equal to sorted output. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
