jonded94 opened a new issue, #7284:
URL: https://github.com/apache/arrow-rs/issues/7284

   **Which part is this question about**
   Library API / UX
   
   **Describe your question**
   Many functions in the `pyarrow` package already implement multithreading for you.
   As far as I understand, reading a file, for example, is multithreaded by having each column processed by a separate thread.
   
   As far as I know, nothing like that is directly available in this Rust crate. What are users expected to do here?
   
   For example, to simply read a parquet file in a parallelized manner, would one do something like this (see the sketch after the list)?
   1. first look at the schema to find out what the columns are
   2. spawn an async worker for each column that reads from the same file, but with a projection that selects just that one column
   3. collect all RecordBatches from each worker and merge them into one RecordBatch containing all the data
   
   If this crate doesn't already offer something that does this, should it maybe become part of the crate?
   
   How could parallelized writes work? It's not easily possible to just write parquet files containing one column each and then merge them afterwards, right?
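   
   The only workaround I can think of is parallelizing over rows instead of columns, where each thread writes a complete "part" file of its own with `ArrowWriter`, roughly like this (made-up data and file names, remainder rows ignored for brevity):
   
   ```rust
   use std::fs::File;
   use std::sync::Arc;
   
   use arrow::array::{ArrayRef, Int64Array};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Some toy data standing in for real batches.
       let col: ArrayRef = Arc::new(Int64Array::from_iter_values(0i64..1_000_000));
       let batch = RecordBatch::try_from_iter([("x", col)])?;
   
       let num_threads = 4;
       let rows = batch.num_rows() / num_threads;
   
       let handles: Vec<_> = (0..num_threads)
           .map(|i| {
               // RecordBatch::slice is zero-copy, so handing a row chunk to
               // each thread is cheap.
               let chunk = batch.slice(i * rows, rows);
               std::thread::spawn(
                   move || -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
                       let file = File::create(format!("part-{i}.parquet"))?;
                       let mut writer = ArrowWriter::try_new(file, chunk.schema(), None)?;
                       writer.write(&chunk)?;
                       writer.close()?; // finalizes the Parquet footer
                       Ok(())
                   },
               )
           })
           .collect();
   
       for handle in handles {
           handle.join().expect("writer thread panicked")?;
       }
       Ok(())
   }
   ```
   
   But that produces multiple files rather than one, so I'm still wondering whether parallelized writes into a single file are feasible.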

