Re: [PR] feat: spark udf array shuffle [datafusion]

via GitHub Sat, 18 Oct 2025 06:16:27 -0700


chenkovsky commented on PR #17674:
URL: https://github.com/apache/datafusion/pull/17674#issuecomment-3388202127


   > Are we planning to implement the seed argument? There wasn't any response 
to my comments 🙁 @chenkovsky
   
   @Jefffrey , for seed, it's more complex. In spark, different partition have 
different seeds. (real_seed = partition_id + user_specified_seed). In 
datafusion, I think we should use (real_seed = partition_id + 
record_batch_idx_in_partition + user_specified_seed). But currently, there's no 
way to get partition_id or record_batch_id.
   If we don't put partition_id and record_batch_idx_in_partition into seed, 
all record batches will have same shuffle behavior. I think it's incorrect. I 
created another issue. #17686


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: spark udf array shuffle [datafusion]

Reply via email to