[ 
https://issues.apache.org/jira/browse/ARROW-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10303:
---------------------------------
    Summary: [Rust] Parallel type transformation in CSV reader  (was: Parallel 
type transformation in CSV reader)

> [Rust] Parallel type transformation in CSV reader
> -------------------------------------------------
>
>                 Key: ARROW-10303
>                 URL: https://issues.apache.org/jira/browse/ARROW-10303
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Rust
>            Reporter: Sergej Fries
>            Priority: Minor
>              Labels: CSVReader
>         Attachments: tracing.png
>
>
> Currently, when a CSV file is read, a single thread is responsible both for 
> reading the file and for transforming the returned string values into the 
> correct data types.
> In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 
> seconds. Only ~10% of that time is spent reading the file, while ~68% 
> is spent transforming the string values into the correct data types.
> My proposal is to parallelize the part responsible for the data type 
> transformation.
> This seems quite simple to achieve: after the CSV reader reads a 
> batch, all projected columns are transformed one by one using an iterator 
> over a vector followed by a map function. I believe that with the 
> rayon crate, the only changes would be replacing "iter()" with 
> "par_iter()" and
> changing
> impl<R: Read> Reader<R>
> into:
> impl<R: Read + std::marker::Sync> Reader<R>
>  
> But maybe I am overlooking something crucial (I am quite new to Rust and 
> Arrow). Any advice from someone experienced is therefore very welcome!
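
The idea in the description can be sketched as follows. This is not the Arrow CSV reader's code: the function names are hypothetical, the "columns" are plain vectors of string fields, and None stands in for a null value. To stay free of external crates, the sketch uses std::thread::scope with one thread per column instead of rayon. With rayon, the parallel version would presumably reduce to replacing "iter()" with "par_iter()" in the sequential one. Rayon requires the closures it runs to be shareable across threads, which is likely why the proposal adds a Sync bound on R when the closure borrows the reader.

```rust
use std::thread;

/// Parse one column of raw CSV fields into f64 values.
/// Empty or unparsable fields become None (a stand-in for nulls).
fn parse_f64_column(fields: &[&str]) -> Vec<Option<f64>> {
    fields.iter().map(|s| s.trim().parse::<f64>().ok()).collect()
}

/// Sequential version: columns are converted one by one on the calling thread.
fn convert_sequential(columns: &[Vec<&str>]) -> Vec<Vec<Option<f64>>> {
    columns.iter().map(|col| parse_f64_column(col)).collect()
}

/// Parallel version: one scoped thread per column.
/// Results are joined in column order, so the output order is unchanged.
fn convert_parallel(columns: &[Vec<&str>]) -> Vec<Vec<Option<f64>>> {
    thread::scope(|scope| {
        // Spawn all threads first; collecting forces every spawn to happen
        // before any join.
        let handles: Vec<_> = columns
            .iter()
            .map(|col| scope.spawn(move || parse_f64_column(col)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("column parsing thread panicked"))
            .collect()
    })
}
```

Both functions should return the same result for the same input. Spawning one thread per column per batch is only meant to show the shape of the change. A work-stealing pool such as rayon's avoids the repeated thread start-up cost.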



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
