[I] Provide parallelized parquet writing outside of Datafusion query execution. [arrow-datafusion]

via GitHub Thu, 07 Mar 2024 10:54:21 -0800


wiedld opened a new issue, #9493:
URL: https://github.com/apache/arrow-datafusion/issues/9493

### Is your feature request related to a problem or challenge?

We would like faster parquet write performance, outside of the Datafusion
execution context.

We currently use the (non-parallelized) ArrowWriter for parquet writing both
within, and outside of, Datafusion query execution. Writing data in the parquet
format is computationally expensive due to the encoding and compression
involved, and can easily become a bottleneck when writing large parquet files.

Datafusion recently introduced a parallelized parquet writer as part of the
COPY TO execution. This writer parallelizes the column writes with minimal
memory overhead: streamed record batches are immediately encoded to compressed
arrow column leaves, and the final serialized parquet is flushed to the sink in
chunks without needing to retain the whole parquet file in memory.
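
To illustrate the idea of parallelizing column writes, here is a minimal,
std-only sketch. It is not the Datafusion or arrow-rs API: `Batch`,
`encode_chunk` and `write_parallel` are hypothetical names, the "encoding" is
just little-endian bytes standing in for real encoding and compression, and the
chunked flushing to a sink is not modeled (the encoded columns are simply
returned).

```rust
use std::sync::mpsc;
use std::thread;

/// Toy stand-in for a record batch: one Vec<i64> per column (hypothetical type).
type Batch = Vec<Vec<i64>>;

/// Stand-in for per-column encoding + compression: plain little-endian bytes.
fn encode_chunk(values: &[i64]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Streams batches to one worker thread per column. Each worker encodes its
/// column chunks as soon as they arrive, so raw batches are not accumulated.
fn write_parallel(
    batches: impl IntoIterator<Item = Batch>,
    num_columns: usize,
) -> Vec<Vec<u8>> {
    let mut senders = Vec::with_capacity(num_columns);
    let mut workers = Vec::with_capacity(num_columns);

    for _ in 0..num_columns {
        let (tx, rx) = mpsc::channel::<Vec<i64>>();
        senders.push(tx);
        workers.push(thread::spawn(move || {
            let mut encoded = Vec::new();
            // The loop ends once the sending side of the channel is dropped.
            for chunk in rx {
                encoded.extend(encode_chunk(&chunk));
            }
            encoded
        }));
    }

    // Fan each incoming batch out to the per-column workers.
    for batch in batches {
        assert_eq!(batch.len(), num_columns, "batch has wrong column count");
        for (tx, column) in senders.iter().zip(batch) {
            tx.send(column).expect("column worker exited early");
        }
    }

    // Close the channels so the workers finish, then collect their output
    // in column order.
    drop(senders);
    workers
        .into_iter()
        .map(|w| w.join().expect("column worker panicked"))
        .collect()
}
```

Calling `write_parallel(batches, 2)` on a stream of two-column batches returns
two byte buffers, one per column, each encoded on its own thread. A real writer
would additionally assemble row groups and flush them to the sink in chunks.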

We conducted a POC to use the existing ParquetSink outside of a Datafusion
query, and assessed the impact. Our specific use case spends 49-59% of its CPU
cycles in parquet writing (i.e. we have a write-heavy benchmark). When we
switched from the baseline (single-threaded ArrowWriter) to the parallelized
parquet writing, we saw a performance improvement of 22-43%. This provides
ample motivation to request a more principled solution that makes parallelized
parquet writing more readily accessible.

### Describe the solution you'd like

The ability to use parallelized parquet writing outside of the Datafusion
query execution. Specifically, we would like to propose some public API which
is not tied to the COPY TO execution operator.

### Describe alternatives you've considered

Our specific POC required the [exposure of the
FileMetaData](https://github.com/wiedld/arrow-datafusion/commit/cf54ed547a487a2323c0eb6d634575d26b340263)
for created parquet files, and had to compensate for a [metadata mutation
performed within
ArrowWriter](https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/parquet/src/arrow/arrow_writer/mod.rs#L141)
(but not ParquetSink). However, the chosen solution should not be conflated
with the POC we performed to assess the potential impact for our use case.
Since we anticipate that many other users may also benefit from parallelized
parquet writes, the solution should consider a broader range of needs.

### Additional context

_No response_

