metesynnada opened a new pull request, #8666: URL: https://github.com/apache/arrow-datafusion/pull/8666
## Which issue does this PR close?

Closes #.

## Rationale for this change

Currently, the serializer is re-created for every `RecordBatch`, which is wasteful when batch sizes are small. The `duplicate()` method is called here:

https://github.com/apache/arrow-datafusion/blob/1737d49185e9e37c15aa432342604ee559a1069d/datafusion/core/src/datasource/file_format/write/orchestration.rs#L51-L82

It is defined here (for CSV it is used only for the header; for JSON it is even less necessary):

https://github.com/apache/arrow-datafusion/blob/1737d49185e9e37c15aa432342604ee559a1069d/datafusion/core/src/datasource/file_format/csv.rs#L432-L450

Re-creating the serializer for each batch also makes its internal buffer useless, because the buffer is discarded along with the serializer after every batch.

## What changes are included in this PR?

- Rename `BatchSerializer`.
- Make the trait methods take immutable references.
- Define `type SerializerType = Arc<dyn SerializationSchema>`.
- Write the CSV header only with the first batch; the header is set to false for every batch after it.

## Are these changes tested?

Yes, by the existing tests.

## Are there any user-facing changes?

No.
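The idea behind these changes can be illustrated with a minimal, std-only sketch. It is not the code in this PR: `Batch`, `CsvSerializer`, `write_batches`, and the `serialize(&self, batch, initial)` signature are hypothetical stand-ins chosen for illustration, and only the names `SerializationSchema` and `SerializerType` are taken from the description above. The sketch shows one serializer, shared behind an `Arc` and called through `&self`, that emits the CSV header only for the first batch.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow `RecordBatch`: rows of string cells.
pub struct Batch {
    pub rows: Vec<Vec<String>>,
}

/// Serializer trait whose method takes `&self`, so one instance can be
/// reused for every batch instead of being duplicated per batch.
/// The method signature is an assumption made for this sketch.
pub trait SerializationSchema: Send + Sync {
    /// `initial` is true only for the first batch written to a file.
    fn serialize(&self, batch: &Batch, initial: bool) -> Vec<u8>;
}

/// Shared, immutable handle to a serializer.
pub type SerializerType = Arc<dyn SerializationSchema>;

/// Hypothetical CSV serializer holding only immutable configuration.
pub struct CsvSerializer {
    columns: Vec<String>,
    header: bool,
}

impl CsvSerializer {
    pub fn new(columns: &[&str], header: bool) -> Self {
        Self {
            columns: columns.iter().map(|c| c.to_string()).collect(),
            header,
        }
    }
}

impl SerializationSchema for CsvSerializer {
    fn serialize(&self, batch: &Batch, initial: bool) -> Vec<u8> {
        let mut out = String::new();
        // Emit the header once: only if enabled and only for the first batch.
        if self.header && initial {
            out.push_str(&self.columns.join(","));
            out.push('\n');
        }
        for row in &batch.rows {
            out.push_str(&row.join(","));
            out.push('\n');
        }
        out.into_bytes()
    }
}

/// Serializes all batches with the same serializer instance.
pub fn write_batches(serializer: &SerializerType, batches: &[Batch]) -> Vec<u8> {
    let mut file = Vec::new();
    for (i, batch) in batches.iter().enumerate() {
        file.extend(serializer.serialize(batch, i == 0));
    }
    file
}
```

Because `serialize` borrows the serializer immutably, a caller can build one `SerializerType` with `Arc::new(CsvSerializer::new(&["a", "b"], true))` and pass it to `write_batches`, or clone the `Arc` into several tasks, without constructing a new serializer for each batch.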
