Sergej Fries created ARROW-10303:
------------------------------------

             Summary: Parallel type transformation in CSV reader
                 Key: ARROW-10303
                 URL: https://issues.apache.org/jira/browse/ARROW-10303
             Project: Apache Arrow
          Issue Type: Wish
          Components: Rust
            Reporter: Sergej Fries
         Attachments: tracing.png

Currently, when a CSV file is read, a single thread both reads the file and 
converts the returned string values into the correct data types.

In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 
seconds. Only ~10% of that time is spent reading the file, while ~68% is spent 
converting the string values into the correct data types.

My proposal is to parallelize the part responsible for the data type 
transformation.

This seems quite simple to achieve: after the CSV reader reads a batch, all 
projected columns are converted one by one using an iterator over a vector 
followed by a map call. I believe that with the rayon crate, the only changes 
would be replacing "iter()" with "par_iter()" and

changing

impl<R: Read> Reader<R>

into:

impl<R: Read + std::marker::Sync> Reader<R>
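
To make the idea concrete, here is a minimal sketch of per-column parallel 
conversion. It is not the Arrow reader code: the function names 
(parse_float_column, parse_columns_parallel) and the row representation 
(Vec<Vec<String>>) are hypothetical, every column is assumed to be a float 
column, and unparsable fields are simply mapped to None. It uses scoped threads 
from the standard library instead of rayon so that it has no dependencies; with 
rayon, the explicit spawn/join would presumably collapse into a "par_iter()" 
call over the projection.

```rust
use std::thread;

/// Parse one column of raw CSV fields into f64 values.
/// Fields that are missing or fail to parse become None (an assumption
/// made for this sketch, standing in for a null value).
fn parse_float_column(rows: &[Vec<String>], col: usize) -> Vec<Option<f64>> {
    rows.iter()
        .map(|row| row.get(col).and_then(|s| s.trim().parse::<f64>().ok()))
        .collect()
}

/// Convert every projected column on its own thread.
/// The rows are only borrowed, so the threads share them read-only.
fn parse_columns_parallel(
    rows: &[Vec<String>],
    projection: &[usize],
) -> Vec<Vec<Option<f64>>> {
    thread::scope(|scope| {
        // Spawn one scoped thread per projected column.
        let handles: Vec<_> = projection
            .iter()
            .map(|&col| scope.spawn(move || parse_float_column(rows, col)))
            .collect();

        // Join in spawn order so the output order matches the projection.
        handles
            .into_iter()
            .map(|h| h.join().expect("column parser panicked"))
            .collect()
    })
}
```

Calling parse_columns_parallel(&rows, &[0, 1]) returns one Vec per projected 
column, in projection order. Sharing borrowed data across threads like this is 
only allowed for types that are Sync, which is presumably why the extra bound 
on R shows up once the closure borrows the reader itself.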

 

But maybe I am overlooking something crucial (being quite new to Rust and 
Arrow). Any advice from someone experienced is therefore very welcome!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
