[
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215393#comment-17215393
]
Dror Speiser commented on ARROW-10308:
--------------------------------------
Thanks for the suggestions :) I am indeed getting the files from a third party,
and I'm converting them to Parquet on arrival using Arrow. I'm actually content
with 0.5 GiB/s. I'm here because I saw a tweet by Wes McKinney saying that the
CSV parser in Arrow is "extremely fast". I tweeted back my results and he
suggested that I open an issue.
I would like to note that the numbers don't quite add up. If the CPU usage is
fully accounted for by the work of parsing and building arrays, then a single
core is doing between 0.06 and 0.13 GiB/s, which is very slow.
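As a back-of-envelope check (the busy-core counts of 4 and 8 below are my assumption about how many cores are actually saturated, not a measured number):

```python
# Hypothetical back-of-envelope: per-core throughput implied by the
# aggregate 0.5 GiB/s, assuming only 4-8 cores are actually kept busy.
aggregate_gib_per_s = 0.5

for busy_cores in (4, 8):
    per_core = aggregate_gib_per_s / busy_cores
    print(f"{busy_cores} busy cores -> {per_core:.4f} GiB/s per core")
```

That range (0.0625 to 0.125 GiB/s per core) is where the 0.06-0.13 figure comes from.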
When I run the benchmark without threads I get 0.3 GiB/s, which is reasonable
for a single core. But it also means that the 48 vcpus I have are very far
from achieving a linear speedup, which is in line with my profiling (though
the attached images are for a block size of 1 MB). Do you see a linear speedup
on your machine?
As for processing CSVs being costly in general, I'm not familiar enough with
other libraries to say, but I am familiar with the simdjson library, which
claims to parse JSON files at over 2 GiB/s on a single core. I'm looking at
the code of both projects, hoping I'll be able to contribute something from
simdjson to the CSV parser in Arrow.
> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here:
> https://github.com/drorspei/arrow-csv-benchmark
> Reporter: Dror Speiser
> Priority: Minor
> Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png,
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg,
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads,
> processing data at around 0.5 GiB/s. "Real workloads" means many string,
> float, and all-null columns, and large file sizes (5-10 GiB), though the
> file size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which
> reproduces the speeds I see. Building the Docker image and running it on a
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around
> 0.5 GiB/s.
> This is all also available here:
> https://github.com/drorspei/arrow-csv-benchmark
--
This message was sent by Atlassian Jira
(v8.3.4#803005)