wiedld opened a new issue, #9493:
URL: https://github.com/apache/arrow-datafusion/issues/9493

   ### Is your feature request related to a problem or challenge?
   
   We would like faster parquet write performance outside of the DataFusion 
execution context.
   
   We currently use the (non-parallelized) ArrowWriter for parquet writing 
both within and outside of DataFusion query execution. Writing data in the 
parquet format is computationally expensive due to the encoding and 
compression involved, and can easily become a bottleneck when writing large 
parquet files.
   
   <img width="858" alt="Screen Shot 2024-03-07 at 10 48 16 AM" 
src="https://github.com/apache/arrow-datafusion/assets/10232835/d03e601a-ccbd-44b5-a0af-db7c0bd5c916">
   
   
   DataFusion recently introduced a parallelized parquet writer as part of the 
COPY TO execution. This writer parallelizes the column writes with minimal 
memory overhead: streamed record batches are immediately encoded to compressed 
arrow column leaves, and the final serialized parquet is flushed to the sink in 
chunks without needing to retain the whole parquet file in memory.
   
   <img width="855" alt="Screen Shot 2024-03-07 at 10 48 44 AM" 
src="https://github.com/apache/arrow-datafusion/assets/10232835/6751e752-f071-4ab3-a6fb-5726ddf98dde">
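
To make the column-parallel idea concrete, here is a minimal, std-only Rust sketch of the general pattern: one worker per column, each receiving that column's slice of every streamed batch and encoding it right away. This is not the DataFusion `ParquetSink` or the arrow-rs writer API. The names `Batch`, `encode_chunk`, and `encode_columns_parallel` are hypothetical, and the "encoding" is a stand-in (little-endian bytes) for real parquet encoding and compression.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical toy "record batch": one Vec<i64> per column, equal lengths.
type Batch = Vec<Vec<i64>>;

/// Stand-in for real encoding + compression: little-endian bytes per value.
fn encode_chunk(values: &[i64]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Encode `batches` using one worker thread per column.
/// Returns one encoded buffer per column, in column order.
fn encode_columns_parallel(num_columns: usize, batches: Vec<Batch>) -> Vec<Vec<u8>> {
    let mut senders = Vec::with_capacity(num_columns);
    let mut workers = Vec::with_capacity(num_columns);

    for _ in 0..num_columns {
        let (tx, rx) = mpsc::channel::<Vec<i64>>();
        senders.push(tx);
        // Each worker encodes chunks as they arrive, so raw batches
        // do not have to be retained once they have been handed off.
        workers.push(thread::spawn(move || {
            let mut encoded = Vec::new();
            for chunk in rx {
                encoded.extend(encode_chunk(&chunk));
            }
            encoded
        }));
    }

    // Stream the batches: split each one by column and hand every
    // column to its dedicated worker.
    for batch in batches {
        assert_eq!(batch.len(), num_columns, "batch has wrong column count");
        for (tx, column) in senders.iter().zip(batch) {
            tx.send(column).expect("column worker hung up");
        }
    }

    // Closing the channels lets each worker finish its loop.
    drop(senders);

    workers
        .into_iter()
        .map(|w| w.join().expect("column worker panicked"))
        .collect()
}
```

Calling `encode_columns_parallel(2, batches)` with two-column batches returns two buffers, one per column, each containing that column's values from all batches in arrival order. A real writer would additionally flush encoded chunks to the sink incrementally rather than returning them all at the end.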
   
   
   We conducted a POC to use the existing ParquetSink outside of a 
DataFusion query and assessed the impact. Our specific use case spends 49-59% 
of its CPU cycles in parquet writing (i.e., we have a write-heavy benchmark). 
When we switched from the baseline (single-threaded ArrowWriter) to 
parallelized parquet writing, we saw a performance improvement of 22-43%. 
This provides ample motivation to request a more principled solution that 
makes parallelized parquet writing more readily accessible.
   
   
   ### Describe the solution you'd like
   
   The ability to use parallelized parquet writing outside of DataFusion 
query execution. Specifically, we would like to propose a public API that 
is not tied to the COPY TO execution operator.
   
   ### Describe alternatives you've considered
   
   Our specific POC required the [exposure of the 
FileMetaData](https://github.com/wiedld/arrow-datafusion/commit/cf54ed547a487a2323c0eb6d634575d26b340263)
 for created parquet files, and had to compensate for a [metadata mutation 
performed within 
ArrowWriter](https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/parquet/src/arrow/arrow_writer/mod.rs#L141)
 (but not ParquetSink). However, the chosen solution should not be conflated 
with the POC we performed to assess the potential impact for our use case. 
Since we anticipate that many other users may also benefit from parallelized 
parquet writes, the solution should consider a broader range of needs.
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
