[
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215393#comment-17215393
]
Dror Speiser commented on ARROW-10308:
--------------------------------------
Thanks for the suggestions :) I am indeed getting the files from a third party,
and I'm converting them to Parquet on arrival using Arrow. I'm actually content
with 0.5 GiB/s. I'm here because I saw a tweet by Wes McKinney saying that the
CSV parser in Arrow is "extremely fast". I tweeted back my results and he
suggested that I open an issue.
I would like to note that the numbers don't quite add up. If the CPU usage is
fully accounted for by the work of parsing and building arrays, then a single
core is doing between 0.06 and 0.13 GiB/s, which is very slow.
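As a back-of-envelope check (the busy-core counts of 4 and 8 below are my assumption about how many cores are actually saturated, not a measured number):

```python
# Hypothetical back-of-envelope: per-core throughput implied by the
# aggregate 0.5 GiB/s, assuming only 4-8 cores are actually kept busy.
aggregate_gib_per_s = 0.5

for busy_cores in (4, 8):
    per_core = aggregate_gib_per_s / busy_cores
    print(f"{busy_cores} busy cores -> {per_core:.4f} GiB/s per core")
```

That range (0.0625 to 0.125 GiB/s per core) is where the 0.06-0.13 figure comes from.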
When I run the benchmark without threads I get 0.3 GiB/s, which is reasonable
for a single core. But it also means that the 48 vcpus I have are very far
from achieving a linear speedup, which is in line with my profiling (though
the attached images are for a block size of 1 MB). Do you see a linear speedup
on your machine?
As for processing CSVs being costly in general, I'm not familiar enough with
other libraries to say, but I am familiar with the simdjson library, which
claims to parse JSON files at over 2 GiB/s on a single core. I'm looking at
the code of both projects, hoping I'll be able to contribute something from
simdjson to the CSV parser in Arrow.
> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here:
> https://github.com/drorspei/arrow-csv-benchmark
> Reporter: Dror Speiser
> Priority: Minor
> Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png,
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg,
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads,
> processing data at around 0.5 GiB/s. "Real workloads" means many string,
> float, and all-null columns, and large file sizes (5-10 GiB), though the
> file size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which
> reproduces the speeds I see. Building the Docker image and running it on a
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around
> 0.5 GiB/s.
> This is all also available here:
> https://github.com/drorspei/arrow-csv-benchmark
--
This message was sent by Atlassian Jira
(v8.3.4#803005)