Yicong-Huang commented on PR #56730:
URL: https://github.com/apache/spark/pull/56730#issuecomment-4792673060

   > > The worker output of the new iterator bench was verified to be
   > > byte-identical to the non-iterator Pandas grouped-agg bench
   >
   > Minor note regarding the PR description, please confirm - in worker.py:
   > the non-iterator SQL_GROUPED_AGG_PANDAS_UDF writes via
   > ArrowStreamGroupSerializer(write_start_stream=True) while the ITER variant
   > uses ArrowStreamAggPandasUDFSerializer; genuinely different output
   > serializers/markers, so the byte streams are not identical. Please update
   > in order to avoid misleading a future reader.
   
   Thanks @uros-b. Yes, at the current stage they use different 
serializers, but the byte streams are designed to be identical. We are 
refactoring those serializers to consolidate them while keeping the byte 
streams identical. This benchmark is a safeguard for that future refactor, to 
verify that it introduces no performance regression. 
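
   For illustration only, a byte-identity check between two serializer outputs could look roughly like the sketch below. It uses only the standard library. The two writer functions are hypothetical stand-ins, not the real `ArrowStreamGroupSerializer` or `ArrowStreamAggPandasUDFSerializer`, and the payload bytes are made up. The idea is to capture each output in memory and report the first differing offset, if any.

```python
import io
from typing import BinaryIO, Callable, Optional

# A "writer" is any callable that dumps its output into a binary stream.
Writer = Callable[[BinaryIO], None]


def first_mismatch(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One stream is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def capture(write: Writer) -> bytes:
    """Run a writer against an in-memory buffer and return what it wrote."""
    buf = io.BytesIO()
    write(buf)
    return buf.getvalue()


def assert_byte_identical(write_a: Writer, write_b: Writer) -> None:
    """Raise AssertionError if the two writers produce different bytes."""
    a, b = capture(write_a), capture(write_b)
    offset = first_mismatch(a, b)
    if offset is not None:
        raise AssertionError(
            f"streams differ at byte offset {offset} "
            f"(lengths {len(a)} and {len(b)})"
        )


# Hypothetical stand-ins for the two serializer paths (assumption: the
# real serializers would be driven here with the same input batches).
def write_grouped(stream: BinaryIO) -> None:
    stream.write(b"\x00\x00\x00\x01payload")


def write_iter(stream: BinaryIO) -> None:
    # Different code path (two writes), same bytes on the wire.
    stream.write(b"\x00\x00\x00\x01")
    stream.write(b"payload")


assert_byte_identical(write_grouped, write_iter)
```

   A check of this shape depends only on the bytes produced, so it would keep working if the two serializers are later consolidated into one.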


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

