[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads

Antoine Pitrou (Jira) Mon, 01 Mar 2021 05:15:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292872#comment-17292872
 ]


Antoine Pitrou commented on ARROW-10308:
----------------------------------------

"NUMA", as in "non-uniform memory access", means that different CPU cores 
have different latencies to different parts of memory. "Six NUMA nodes" 
therefore means there are six groups of cores, each with its own distinct 
memory access latencies.

Note that "memory" is to be taken in a wide sense and can also include some 
shared caches. For example, on my CPU (an AMD Zen 2 CPU), the L3 cache is 
private to clusters of 4 CPU cores (and my CPU has 12 CPU cores).
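
To see how cores are grouped into NUMA nodes on a given Linux machine, 
something like the following sketch may help. It assumes the kernel exposes 
the usual sysfs layout under /sys/devices/system/node; the helper names 
(parse_cpulist, numa_nodes, pin_to_cpus) are made up for illustration and are 
not part of Arrow.

```python
import glob
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node id: [cpu ids]}; empty if the sysfs layout is not present."""
    nodes = {}
    pattern = os.path.join(sysfs_root, "node[0-9]*", "cpulist")
    for path in glob.glob(pattern):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. "node3"
        with open(path) as f:
            nodes[int(node_dir[len("node"):])] = parse_cpulist(f.read())
    return nodes


def pin_to_cpus(cpus):
    """Restrict the current process to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, cpus)


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node {node_id}: {len(cpus)} CPUs -> {cpus}")
```

Calling pin_to_cpus() with the CPU list of a single node before running a 
benchmark is one way to check whether cross-node memory traffic plays a role, 
though whether it helps here is not something this sketch can tell by itself.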

> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
>                 Key: ARROW-10308
>                 URL: https://issues.apache.org/jira/browse/ARROW-10308
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 1.0.1
>         Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>            Reporter: Dror Speiser
>            Priority: Minor
>              Labels: csv, performance
>         Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data around 0.5GiB/s. "Real workloads" means many string, float, 
> and all-null columns, and large file size (5-10GiB), though the file size 
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark
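
For readers without the attachments at hand, the kind of throughput 
measurement described in the issue can be sketched roughly as below. This is 
not the reporter's benchmark script: the data generator, the column names and 
the helper names are assumptions for illustration, and the in-memory input is 
far smaller than the 5-10GiB files mentioned above. Only 
`pyarrow.csv.read_csv` and `pyarrow.csv.ReadOptions(use_threads=...)` are 
actual pyarrow APIs.

```python
import io
import time


def gib_per_second(num_bytes, seconds):
    """Convert a byte count and a duration into GiB/s."""
    return num_bytes / seconds / 2**30


def make_csv(num_rows):
    """Build an in-memory CSV with a string, a float, and an all-null column."""
    lines = ["name,value,empty"]
    lines.extend(f"row{i},{i * 0.5}," for i in range(num_rows))
    return ("\n".join(lines) + "\n").encode("utf-8")


def time_read_csv(data, use_threads=True):
    """Read the CSV bytes with pyarrow and return (num_rows, GiB/s)."""
    # Imported here so the helpers above stay usable without pyarrow installed.
    from pyarrow import csv

    options = csv.ReadOptions(use_threads=use_threads)
    start = time.perf_counter()
    table = csv.read_csv(io.BytesIO(data), read_options=options)
    elapsed = time.perf_counter() - start
    return table.num_rows, gib_per_second(len(data), elapsed)


if __name__ == "__main__":
    data = make_csv(1_000_000)
    for use_threads in (False, True):
        rows, speed = time_read_csv(data, use_threads=use_threads)
        print(f"use_threads={use_threads}: {rows} rows at {speed:.2f} GiB/s")
```

Comparing the single-threaded and multi-threaded numbers gives a first hint 
of how well the read scales with the number of cores on a given machine.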



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
