Dror Speiser created ARROW-10308:
------------------------------------

             Summary: read_csv from python is slow on some workloads
                 Key: ARROW-10308
                 URL: https://issues.apache.org/jira/browse/ARROW-10308
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 1.0.1
         Environment: Machine: Azure, 48 vcpus, 384GiB ram
OS: Ubuntu 18.04
Dockerfile and script: attached, or here: 
https://github.com/drorspei/arrow-csv-benchmark
            Reporter: Dror Speiser
         Attachments: Dockerfile, benchmark-csv.py, profile1.svg, profile2.svg, 
profile3.svg, profile4.svg

Hi!

I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
processing data at around 0.5 GiB/s. "Real workloads" here means many string, 
float, and all-null columns, and large files (5-10 GiB), though the file size 
didn't matter too much.

Moreover, profiling a bit with py-spy, it seems that maybe 30-50% of the time 
is spent on shared-pointer locking (though I'm not sure this is to be 
trusted). I've attached the dumps in SVG format.

I've also attached a script and a Dockerfile to run a benchmark that 
reproduces the speeds I see. Building the Docker image and running it on a 
large Azure machine, I get speeds of around 0.3-1.0 GiB/s, mostly around 
0.5 GiB/s.
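The core of the measurement is just timing `read_csv` against the file size; a
minimal sketch of that (the file name is an assumption, not the exact attached
script):

```python
import os
import time

import pyarrow.csv

path = "workload.csv"  # hypothetical input file
size_gib = os.path.getsize(path) / 2**30

start = time.perf_counter()
table = pyarrow.csv.read_csv(path)
elapsed = time.perf_counter() - start

print(f"{size_gib / elapsed:.2f} GiB/s")

# To see where the time goes, the run can be profiled with py-spy, e.g.:
#   py-spy record -o profile.svg -- python benchmark-csv.py
```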

This is all also available here: https://github.com/drorspei/arrow-csv-benchmark


