[
https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney updated ARROW-10303:
---------------------------------
Summary: [Rust] Parallel type transformation in CSV reader (was: Parallel
type transformation in CSV reader)
> [Rust] Parallel type transformation in CSV reader
> -------------------------------------------------
>
> Key: ARROW-10303
> URL: https://issues.apache.org/jira/browse/ARROW-10303
> Project: Apache Arrow
> Issue Type: Wish
> Components: Rust
> Reporter: Sergej Fries
> Priority: Minor
> Labels: CSVReader
> Attachments: tracing.png
>
>
> Currently, when a CSV file is read, a single thread is responsible both for
> reading the file and for transforming the returned string values into the
> correct data types.
> In my case, reading a 2 GB CSV file with a dozen float columns takes ~40
> seconds. Only ~10% of that time is spent reading the file, while ~68% is
> spent transforming the string values into the correct data types.
> My proposal is to parallelize the part responsible for the data type
> transformation.
> This seems quite simple to achieve: after the CSV reader reads a batch, all
> projected columns are transformed one by one using an iterator over a vector
> followed by a map function. I believe that with the rayon crate, the only
> changes would be replacing "iter()" with "par_iter()" and changing
> impl<R: Read> Reader<R>
> into:
> impl<R: Read + std::marker::Sync> Reader<R>
>
> But maybe I am overlooking something crucial (I am quite new to Rust and
> Arrow). Any advice from someone experienced is therefore very welcome!
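
To illustrate the proposal above, here is a minimal sketch of per-column parallel type conversion. It is not the Arrow CSV reader's actual code: the function names `parse_f64_column` and `parse_columns_parallel` are hypothetical, a batch is assumed to be a slice of rows of strings, and unparsable fields are assumed to become nulls (`None`). To stay dependency-free it uses scoped threads from the standard library instead of rayon's "par_iter()", one thread per projected column.

```rust
use std::thread;

/// Parse one column of raw string fields into f64 values.
/// Assumption for this sketch: fields that are missing or fail to parse
/// become `None`.
fn parse_f64_column(rows: &[Vec<String>], col: usize) -> Vec<Option<f64>> {
    rows.iter()
        .map(|row| row.get(col).and_then(|s| s.trim().parse::<f64>().ok()))
        .collect()
}

/// Convert every projected column on its own scoped thread.
/// The result holds one parsed column per entry of `projection`, in order.
fn parse_columns_parallel(
    rows: &[Vec<String>],
    projection: &[usize],
) -> Vec<Vec<Option<f64>>> {
    thread::scope(|scope| {
        // Spawn all workers first so the columns are converted concurrently.
        let handles: Vec<_> = projection
            .iter()
            .map(|&col| scope.spawn(move || parse_f64_column(rows, col)))
            .collect();

        // Join in spawn order, which preserves the projection order.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("column parser thread panicked"))
            .collect()
    })
}
```

Calling `parse_columns_parallel(&rows, &[0, 2])` on an already-read batch returns the parsed values of columns 0 and 2. Because the worker threads only borrow the rows that were already read, this sketch needs no `Sync` bound on the underlying reader; whether the same holds for the real `Reader<R>` depends on what the conversion closure captures there.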
--
This message was sent by Atlassian Jira
(v8.3.4#803005)