Re: [PR] feat: spark udf array shuffle [datafusion]

via GitHub Sat, 18 Oct 2025 08:58:29 -0700


Jefffrey commented on PR #17674:
URL: https://github.com/apache/datafusion/pull/17674#issuecomment-3388333532


   > > Are we planning to implement the seed argument? There wasn't any 
response to my comments 🙁 @chenkovsky
   > 
   > @Jefffrey , for seed, it's more complex. In spark, different partition 
have different seeds. (real_seed = partition_id + user_specified_seed). In 
datafusion, I think we should use (real_seed = partition_id + 
record_batch_idx_in_partition + user_specified_seed). But currently, there's no 
way to get partition_id or record_batch_id. If we don't put partition_id and 
record_batch_idx_in_partition into seed, all record batches will have same 
shuffle behavior. I think it's incorrect. I created another issue. #17686
   
   So are we fine with merging this Spark function without implementing seed 
argument? If this was ported from sail or comet I assume they were fine with 
this, though I see no mention of if this was the case. So long as we track this 
in an issue (e.g. in the referenced issue) I guess it's fine for now.
   
   My only concern other than that is this could make the tests flaky, as some 
of the tests rely on the shuffled output being not equal to sorted output.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: spark udf array shuffle [datafusion]

Reply via email to