[PR] Make the BatchSerializer behind Arc to avoid unnecessary struct creation [arrow-datafusion]

via GitHub Thu, 28 Dec 2023 04:24:56 -0800


metesynnada opened a new pull request, #8666:
URL: https://github.com/apache/arrow-datafusion/pull/8666

## Which issue does this PR close?

Closes #.

## Rationale for this change

Currently, the serializer is re-created for each `RecordBatch`, which is
wasteful, especially when batch sizes are small.

The `duplicate()` method is called here:

https://github.com/apache/arrow-datafusion/blob/1737d49185e9e37c15aa432342604ee559a1069d/datafusion/core/src/datasource/file_format/write/orchestration.rs#L51-L82

It is defined as shown below. For CSV, duplication is used only for header
handling; for JSON it is even less necessary:

https://github.com/apache/arrow-datafusion/blob/1737d49185e9e37c15aa432342604ee559a1069d/datafusion/core/src/datasource/file_format/csv.rs#L432-L450

This also makes the serializer's internal buffer useless, since the buffer is
re-created for each batch in this setup.
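
To make the cost concrete, here is a minimal sketch of the per-batch duplication pattern described above. It is not the DataFusion code: `BufferedSerializer`, `write_with_duplicates`, and the `&[&str]` "batch" type are hypothetical stand-ins, and the 4096-byte capacity is an arbitrary assumption.

```rust
/// Hypothetical serializer that owns a scratch buffer meant to be reused.
struct BufferedSerializer {
    buffer: Vec<u8>,
    header: bool,
}

impl BufferedSerializer {
    fn new(header: bool) -> Self {
        // Arbitrary capacity, chosen only for illustration.
        Self { buffer: Vec::with_capacity(4096), header }
    }

    /// Builds a fresh instance (and a fresh buffer). The header flag is
    /// handed to the copy and cleared on the original, so only the first
    /// copy writes a header.
    fn duplicate(&mut self) -> Self {
        let new_self = Self::new(self.header);
        self.header = false;
        new_self
    }

    fn serialize(&mut self, rows: &[&str]) -> Vec<u8> {
        self.buffer.clear();
        if self.header {
            self.buffer.extend_from_slice(b"col\n");
            self.header = false;
        }
        for r in rows {
            self.buffer.extend_from_slice(r.as_bytes());
            self.buffer.push(b'\n');
        }
        self.buffer.clone()
    }
}

/// Serializes each batch with a newly duplicated serializer and reports how
/// many serializer instances (and therefore buffers) were created.
fn write_with_duplicates(batches: &[Vec<&str>]) -> (Vec<u8>, usize) {
    let mut template = BufferedSerializer::new(true);
    let mut out = Vec::new();
    let mut created = 1; // the template itself
    for batch in batches {
        let mut per_batch = template.duplicate();
        created += 1;
        out.extend(per_batch.serialize(batch));
    }
    (out, created)
}
```

In this sketch every batch allocates a new buffer that is used exactly once, so the number of allocations grows with the number of batches rather than staying constant.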

## What changes are included in this PR?

- Rename the `BatchSerializer`.
- Make the trait methods take immutable references.
- Define `type SerializerType = Arc<dyn SerializationSchema>`.
- Set the CSV header to false for every batch after the first.
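
The changes listed above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `serialize` signature, the `initial` flag, the `Batch` type (standing in for `RecordBatch`), `CsvSerializer`'s fields, and the `write_batches` and `demo` helpers are all assumptions made for the example.

```rust
use std::sync::Arc;

/// Simplified stand-in for an Arrow `RecordBatch`: rows of string cells.
type Batch = Vec<Vec<String>>;

/// Assumed shape of the trait: `&self` lets one instance be shared behind an
/// `Arc`, and `initial` marks the first batch written to an output.
trait SerializationSchema: Send + Sync {
    fn serialize(&self, batch: &Batch, initial: bool) -> Vec<u8>;
}

type SerializerType = Arc<dyn SerializationSchema>;

/// Hypothetical CSV serializer with no mutable state.
struct CsvSerializer {
    header: Vec<String>,
    with_header: bool,
}

impl SerializationSchema for CsvSerializer {
    fn serialize(&self, batch: &Batch, initial: bool) -> Vec<u8> {
        let mut out = String::new();
        // Emit the header only once, for the first batch.
        if self.with_header && initial {
            out.push_str(&self.header.join(","));
            out.push('\n');
        }
        for row in batch {
            out.push_str(&row.join(","));
            out.push('\n');
        }
        out.into_bytes()
    }
}

/// Serializes every batch with the same shared serializer instance.
fn write_batches(serializer: SerializerType, batches: &[Batch]) -> Vec<u8> {
    let mut file = Vec::new();
    for (i, batch) in batches.iter().enumerate() {
        file.extend(serializer.serialize(batch, i == 0));
    }
    file
}

/// Usage: two single-row batches share one serializer.
fn demo() -> String {
    let serializer: SerializerType = Arc::new(CsvSerializer {
        header: vec!["a".to_string(), "b".to_string()],
        with_header: true,
    });
    let batches = vec![
        vec![vec!["1".to_string(), "2".to_string()]],
        vec![vec!["3".to_string(), "4".to_string()]],
    ];
    String::from_utf8(write_batches(serializer, &batches)).unwrap()
}
```

Because the serializer holds no per-batch state, cloning the `Arc` is enough to share it across tasks, and the caller decides which batch is the first one.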

## Are these changes tested?

Existing tests.

## Are there any user-facing changes?

No.

