metesynnada opened a new pull request, #8666:
URL: https://github.com/apache/arrow-datafusion/pull/8666

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   Currently, the serializer is re-created for each `RecordBatch`, which is 
wasteful, especially with small batch sizes.
   
   The `duplicate()` method is called here:
   
   
https://github.com/apache/arrow-datafusion/blob/1737d49185e9e37c15aa432342604ee559a1069d/datafusion/core/src/datasource/file_format/write/orchestration.rs#L51-L82
   
   where it is defined as follows (for CSV it is only needed to handle the 
header; for JSON it is even less necessary):
   
   
https://github.com/apache/arrow-datafusion/blob/1737d49185e9e37c15aa432342604ee559a1069d/datafusion/core/src/datasource/file_format/csv.rs#L432-L450
   
   This also makes the internal buffer useless, since it is re-created for 
each batch.
   
   ## What changes are included in this PR?
   
   - Rename `BatchSerializer`.
   - Make the trait methods take immutable references.
   - Define `type SerializerType = Arc<dyn SerializationSchema>`.
   - Disable the CSV header for every batch after the first.
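
The changes listed above can be illustrated with a minimal, std-only sketch. It is not the code in this PR: `Batch`, `Serializer`, `CsvSerializer`, `serialize_all`, and the `initial` flag are hypothetical names chosen for illustration, and the real implementation works on Arrow `RecordBatch` values. The sketch assumes that a serializer whose methods take `&self` can be created once, shared behind an `Arc`, and told by the caller whether a batch is the first one, so the CSV header is emitted only once.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a record batch: rows of string fields.
pub struct Batch {
    pub rows: Vec<Vec<String>>,
}

/// Hypothetical serializer trait. `serialize` takes `&self`, so a single
/// instance can be shared across all batches instead of being duplicated.
pub trait Serializer: Send + Sync {
    /// `initial` is true only for the first batch written to an output.
    fn serialize(&self, batch: &Batch, initial: bool) -> Vec<u8>;
}

/// Hypothetical CSV serializer holding only immutable configuration.
pub struct CsvSerializer {
    pub header: Vec<String>,
    pub write_header: bool,
}

impl Serializer for CsvSerializer {
    fn serialize(&self, batch: &Batch, initial: bool) -> Vec<u8> {
        let mut out = String::new();
        // Emit the header only for the first batch, and only if enabled.
        if self.write_header && initial {
            out.push_str(&self.header.join(","));
            out.push('\n');
        }
        for row in &batch.rows {
            out.push_str(&row.join(","));
            out.push('\n');
        }
        out.into_bytes()
    }
}

/// Serialize every batch with one shared serializer instance.
pub fn serialize_all(serializer: Arc<dyn Serializer>, batches: &[Batch]) -> Vec<u8> {
    let mut out = Vec::new();
    for (i, batch) in batches.iter().enumerate() {
        out.extend(serializer.serialize(batch, i == 0));
    }
    out
}
```

Because the serializer holds no per-batch mutable state in this sketch, the header decision moves to the call site, which is the only place that knows whether a batch is the first.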
   
   ## Are these changes tested?
   
   Existing tests.
   
   ## Are there any user-facing changes?
   
   No.
   

