[
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292214#comment-17292214
]
Dror Speiser commented on ARROW-10308:
--------------------------------------
Hi Diana,
Cool!
I've created a small benchmark that spins up EC2 instances, downloads a NY Taxi
dataset, and runs read_csv with different block sizes. I'll upload the raw data
tomorrow, but meanwhile here is a draft basic analysis notebook on the data
that I already have:
[https://github.com/drorspei/arrow-csv-benchmark/blob/ec2-block-size/analysis.ipynb]
If you look in the containing branch, you will find two files: a script, and a
supporting file for boto code.
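Roughly, the per-instance measurement amounts to something like the sketch below. This is not the actual script from the branch: `sweep_block_sizes`, `read_with_pyarrow`, and `write_sample_csv` are made-up names for illustration, the small synthetic file stands in for the NY Taxi data, and the block sizes are arbitrary. The only pyarrow pieces it relies on are `pyarrow.csv.read_csv` and `pyarrow.csv.ReadOptions(block_size=...)`.

```python
import os
import tempfile
import time


def read_with_pyarrow(path, block_size):
    """Read a CSV with pyarrow using the given block size (in bytes)."""
    # Imported lazily so the timing harness itself has no hard dependency.
    from pyarrow import csv as pa_csv

    options = pa_csv.ReadOptions(block_size=block_size)
    return pa_csv.read_csv(path, read_options=options)


def sweep_block_sizes(path, block_sizes, read_fn=read_with_pyarrow, repeats=3):
    """Return {block_size: best throughput in GiB/s} over `repeats` runs."""
    size_gib = os.path.getsize(path) / 2**30
    results = {}
    for block_size in block_sizes:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            read_fn(path, block_size)
            best = min(best, time.perf_counter() - start)
        # Guard against a zero elapsed time on very coarse timers.
        results[block_size] = size_gib / max(best, 1e-12)
    return results


def write_sample_csv(path, rows=100_000):
    """Write a small synthetic CSV with string, float, and all-null columns."""
    with open(path, "w") as f:
        f.write("name,value,empty\n")
        for i in range(rows):
            f.write(f"row{i},{i * 0.5},\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.csv")
        write_sample_csv(sample)
        # Hypothetical block sizes: 1, 4, and 16 MiB.
        speeds = sweep_block_sizes(sample, [1 << 20, 4 << 20, 16 << 20])
        for block_size, speed in speeds.items():
            print(f"block_size={block_size >> 20} MiB: {speed:.3f} GiB/s")
```

Taking the best of a few repeats is just one way to reduce noise from disk caching; the real benchmark may aggregate differently.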
I'll be super glad to collaborate whenever I can :)
As for whether NUMA is enabled, I think I was running Azure's E48_as_v4, which I see in
the table on this page:
[https://docs.microsoft.com/en-us/azure/virtual-machines/linux/compute-benchmark-scores]
There's a column "NUMA Nodes" which says "6". I'm not familiar with NUMA, so I
don't know what this means. Is there something I could run on the machine to
check?
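One thing I could presumably try, assuming the VM runs Linux and exposes the usual sysfs tree, is counting the `nodeN` directories under `/sys/devices/system/node`. A rough sketch (the function name `count_numa_nodes` is just made up here, and I haven't verified what an Azure guest actually reports):

```python
import os
import re


def count_numa_nodes(sysfs_dir="/sys/devices/system/node"):
    """Return the number of NUMA nodes listed in sysfs, or None if unavailable."""
    try:
        entries = os.listdir(sysfs_dir)
    except OSError:
        # Not Linux, or sysfs is not mounted at the assumed location.
        return None
    # Each NUMA node appears as a directory named node0, node1, ...
    return sum(1 for name in entries if re.fullmatch(r"node\d+", name))


if __name__ == "__main__":
    print("NUMA nodes:", count_numa_nodes())
```

A result greater than 1 would suggest the guest sees several NUMA nodes; `None` means the directory wasn't there to read.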
> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here:
> https://github.com/drorspei/arrow-csv-benchmark
> Reporter: Dror Speiser
> Priority: Minor
> Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png,
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg,
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads,
> processing data around 0.5GiB/s. "Real workloads" means many string, float,
> and all-null columns, and large file size (5-10GiB), though the file size
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of
> the time is spent on shared pointer lock mechanisms (though I'm not sure if
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which
> reproduces the speeds I see. Building the docker image and running it on a
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly
> around 0.5GiB/s.
> This is all also available here:
> https://github.com/drorspei/arrow-csv-benchmark