[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215483#comment-17215483 ]
Dror Speiser commented on ARROW-10308:
--------------------------------------
Yeah, Azure doesn't tell me how many physical cores are at my disposal, which
makes it hard to compare between setups. But even if it's 12 CPUs with
hyperthreading and bad advertising, there is still a gap to be explained
between single-thread and multi-thread performance.
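For context, this is roughly how I'm measuring that gap (a minimal sketch, not
the attached benchmark script; "data.csv" is a placeholder path, and throughput
is computed from the on-disk file size):
{code:python}
# Minimal sketch: time read_csv with and without threads; "data.csv" is a
# placeholder path, and throughput uses the on-disk file size.
import os
import time
import pyarrow.csv as pacsv

def throughput_gib_s(path, use_threads):
    opts = pacsv.ReadOptions(use_threads=use_threads)
    start = time.perf_counter()
    pacsv.read_csv(path, read_options=opts)
    elapsed = time.perf_counter() - start
    return os.path.getsize(path) / elapsed / 2**30

single = throughput_gib_s("data.csv", use_threads=False)
multi = throughput_gib_s("data.csv", use_threads=True)
print(f"single-thread: {single:.2f} GiB/s")
print(f"multi-thread:  {multi:.2f} GiB/s ({multi / single:.1f}x speedup)")
{code}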
I offer to work on a benchmark that measures reading CSVs of different sizes
and compositions, for a variety of block sizes, run it on a few different
machine sizes on AWS (tiny to xlarge) and Azure, and report the results here;
a rough sketch of the benchmark loop follows below.
Antoine, do you think this is a good idea? Do you have input on what CSV
compositions are found in the wild? You said that narrow columns are common;
how would you quantify this? Personally, I work with finance and real estate
data; I can create "data profiles" for what I see in my own workloads and
share them.
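Here is the rough sketch of what I have in mind (the column mix, row count,
and block sizes below are illustrative placeholders, not the final benchmark
design):
{code:python}
# Rough sketch of the proposed matrix: synthesize a CSV with a given column
# composition (string, float, all-null), then time read_csv across block
# sizes. All sizes and counts here are illustrative placeholders.
import io
import time
import numpy as np
import pandas as pd
import pyarrow.csv as pacsv

def make_csv(n_rows, n_str, n_float, n_null):
    rng = np.random.default_rng(0)
    cols = {}
    for i in range(n_str):
        cols[f"s{i}"] = rng.integers(0, 10**6, n_rows).astype(str)
    for i in range(n_float):
        cols[f"f{i}"] = rng.random(n_rows)
    for i in range(n_null):
        cols[f"n{i}"] = [None] * n_rows
    buf = io.StringIO()
    pd.DataFrame(cols).to_csv(buf, index=False)
    return buf.getvalue().encode()

data = make_csv(1_000_000, n_str=10, n_float=10, n_null=5)
for block_size in (1 << 18, 1 << 20, 1 << 22, 1 << 24):  # 256 KiB .. 16 MiB
    opts = pacsv.ReadOptions(block_size=block_size)
    start = time.perf_counter()
    pacsv.read_csv(io.BytesIO(data), read_options=opts)
    elapsed = time.perf_counter() - start
    print(f"block_size={block_size >> 10:>6} KiB: "
          f"{len(data) / elapsed / 2**30:.2f} GiB/s")
{code}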
> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here:
> https://github.com/drorspei/arrow-csv-benchmark
> Reporter: Dror Speiser
> Priority: Minor
> Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png,
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg,
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads,
> processing data at around 0.5 GiB/s. "Real workloads" means many string,
> float, and all-null columns, and large files (5-10 GiB), though the file
> size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of
> the time is spent on shared pointer lock mechanisms (though I'm not sure if
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which
> reproduces the speeds I see. Building the Docker image and running it on a
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around
> 0.5 GiB/s.
> This is all also available here:
> https://github.com/drorspei/arrow-csv-benchmark
--
This message was sent by Atlassian Jira
(v8.3.4#803005)