[
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291016#comment-17291016
]
Diana Clarke commented on ARROW-10308:
--------------------------------------
Hi Dror,
Profiling different file sizes, compositions, and block sizes on various AWS
instances would be fantastic. Based on that data, perhaps we could come up with
a better default value for {{ReadOptions.block_size}}.
I ran your benchmark on my local machine yesterday, experimenting with
different block sizes, and for this particular case, defaulting the
{{block_size}} to {{None}} was definitely not optimal.
Incidentally, I'm currently working on a continuous benchmarking framework to
run Arrow benchmarks on each commit (to safeguard against regressions), but
also to do this kind of research.
The continuous benchmarking repo isn't quite ready to be made public, but once
it is, we can collaborate and add your benchmark to it. It'll be fun!
Thanks for taking the time to run these benchmarks, and for sharing your code
and results.
Much appreciated,
--diana
PS. Were you able to find out if the 48 vcpu machine was NUMA enabled?
> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here:
> https://github.com/drorspei/arrow-csv-benchmark
> Reporter: Dror Speiser
> Priority: Minor
> Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png,
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg,
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads,
> processing data at around 0.5 GiB/s. "Real workloads" means many string, float,
> and all-null columns, and large file sizes (5-10 GiB), though the file size
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of
> the time is spent on shared pointer lock mechanisms (though I'm not sure if
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which
> reproduces the speeds I see. Building the docker image and running it on a
> large Azure machine, I get speeds of around 0.3-1.0 GiB/s, and it's mostly
> around 0.5 GiB/s.
> This is all also available here:
> https://github.com/drorspei/arrow-csv-benchmark
--
This message was sent by Atlassian Jira
(v8.3.4#803005)