[
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215401#comment-17215401
]
Antoine Pitrou commented on ARROW-10308:
----------------------------------------
"vcpu" doesn't mean anything precise unfortunately. What is the CPU model and
how many *physical* cores are allocated to the virtual machine?
> I am familiar with the simdjson library that claims to parse json files at
> over 2 GiB/s, on a single core
It all depends on what "parsing" entails, what data it is tested on, and what is
done with the data once parsed.
On our internal micro-benchmarks, the Arrow CSV parser runs at around 600 MB/s
(on a single core), but that's data-dependent. I tend to test on data with
narrow column values since that's what "big data" often looks like, and that's
the most difficult case for a CSV parser. It's possible that better speeds can
be achieved on larger column values (such as large binary strings).
But parsing isn't sufficient: you then have to convert the data to Arrow
format, which also means switching from a row-oriented layout to a
column-oriented one. That part probably hits the memory and cache
subsystems quite hard.
> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here:
> https://github.com/drorspei/arrow-csv-benchmark
> Reporter: Dror Speiser
> Priority: Minor
> Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png,
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg,
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads,
> processing data at around 0.5 GiB/s. "Real workloads" means many string, float,
> and all-null columns, and large file sizes (5-10 GiB), though the file size
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of
> the time is spent on shared pointer lock mechanisms (though I'm not sure if
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which
> reproduces the speeds I see. Building the docker image and running it on a
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly
> around 0.5 GiB/s.
> This is all also available here:
> https://github.com/drorspei/arrow-csv-benchmark
--
This message was sent by Atlassian Jira
(v8.3.4#803005)