Shailesh-Kumar-Singh opened a new issue, #9499:
URL: https://github.com/apache/arrow-rs/issues/9499
**Which part is this question about**
Library API, specifically the interaction between ArrowRowGroupWriterFactory
/ ArrowColumnChunk (sync, parallel encoding) and AsyncArrowWriter (async,
sequential encoding).
**Describe your question**
We're building a high-throughput streaming k-way merge for sorted Parquet
files. The write pipeline looks like:
read (rayon decode + channel prefetch) → merge sort → parallel encode
(rayon) → write to disk
We want both parallel column encoding and async disk writes. Currently the
API only allows picking one.
**Path A: Parallel encode, sync write**
```
// `pool`: the shared rayon::ThreadPool (install is a ThreadPool method)
let col_writers = rg_writer_factory.create_column_writers(rg_index)?;
// `leaves_and_writers`: assumed to pair each leaf column with its writer from `col_writers`
let chunks: Vec<ArrowColumnChunk> = pool.install(|| {
    leaves_and_writers
        .into_par_iter()
        .map(|(leaf, mut col_writer)| {
            col_writer.write(&leaf)?;
            col_writer.close()
        })
        .collect::<Result<Vec<_>, _>>()
})?;
// append_to_row_group requires the sync SerializedFileWriter
let mut rg = writer.next_row_group()?;
for chunk in chunks {
    chunk.append_to_row_group(&mut rg)?;
}
rg.close()?;
```
**Path B: Async write, sequential encode**
```
let mut writer = AsyncArrowWriter::try_new(file, schema, Some(props))?;
writer.write(&batch).await?;
writer.close().await?;
```
**The gap:** ArrowColumnChunk (the output of parallel encoding) can only be
appended through sync SerializedFileWriter. There's no async equivalent.
**Question:**
Is there a way to combine parallel encoding with async writes?
**Additional context**
Both read (decode) and write (encode) use a shared rayon pool for
parallelism. The only sync bottleneck is the actual disk write inside
append_to_row_group.
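
For illustration only, the overlap being asked for (encode row group N+1 while row group N is written) can be sketched with the standard library alone. This is not the parquet API: `EncodedChunk`, `encode_column`, `encode_row_group`, and `write_pipeline` are hypothetical stand-ins, scoped threads stand in for the rayon pool, and a dedicated writer thread stands in for whatever would own the sync SerializedFileWriter and call append_to_row_group.

```rust
use std::io::Write;
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for an encoded column chunk (not ArrowColumnChunk).
struct EncodedChunk {
    bytes: Vec<u8>,
}

/// Hypothetical encoder: serializes one column as little-endian i64 bytes.
fn encode_column(values: &[i64]) -> EncodedChunk {
    let mut bytes = Vec::with_capacity(values.len() * 8);
    for v in values {
        bytes.extend_from_slice(&v.to_le_bytes());
    }
    EncodedChunk { bytes }
}

/// Encode every column of one row group in parallel, preserving column order.
fn encode_row_group(columns: &[Vec<i64>]) -> Vec<EncodedChunk> {
    thread::scope(|s| {
        // Spawn all encoders first so they run concurrently, then join in order.
        let handles: Vec<_> = columns
            .iter()
            .map(|col| s.spawn(move || encode_column(col)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("encoder thread panicked"))
            .collect()
    })
}

/// Encode on the calling side; perform the blocking writes on a dedicated thread.
fn write_pipeline<W: Write + Send + 'static>(
    row_groups: Vec<Vec<Vec<i64>>>,
    mut sink: W,
) -> std::io::Result<W> {
    // Bounded channel: at most 2 encoded row groups wait in memory (backpressure).
    let (tx, rx) = sync_channel::<Vec<EncodedChunk>>(2);
    let writer = thread::spawn(move || -> std::io::Result<W> {
        for chunks in rx {
            for chunk in chunks {
                sink.write_all(&chunk.bytes)?;
            }
        }
        sink.flush()?;
        Ok(sink)
    });
    for rg in &row_groups {
        let chunks = encode_row_group(rg);
        // A send error means the writer hit an I/O error and hung up.
        if tx.send(chunks).is_err() {
            break;
        }
    }
    drop(tx); // close the channel so the writer loop terminates
    writer.join().expect("writer thread panicked")
}
```

Calling `write_pipeline(row_groups, Vec::new())` returns the sink with all chunks written in row-group and column order. The bounded channel caps how many encoded row groups are buffered, so a slow disk throttles encoding rather than growing memory without limit.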