[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215396#comment-17215396 ]
Wes McKinney commented on ARROW-10308:
--------------------------------------

I do think we should be doing better here than we are, so it merits some analysis to see whether some default options should change. The results do strike me as peculiar.

> [Python] read_csv from Python is slow on some workloads
> --------------------------------------------------------
>
>                 Key: ARROW-10308
>                 URL: https://issues.apache.org/jira/browse/ARROW-10308
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 1.0.1
>         Environment: Machine: Azure, 48 vCPUs, 384 GiB RAM
>                      OS: Ubuntu 18.04
>                      Dockerfile and script: attached, or here:
>                      https://github.com/drorspei/arrow-csv-benchmark
>            Reporter: Dror Speiser
>            Priority: Minor
>              Labels: csv, performance
>         Attachments: Dockerfile, arrow-csv-benchmark-plot.png, arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, profile3.svg, profile4.svg
>
> Hi!
>
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, processing data at around 0.5 GiB/s. "Real workloads" here means many string, float, and all-null columns, and large files (5-10 GiB), though the file size didn't matter too much.
>
> Moreover, profiling a bit with py-spy, it seems that maybe 30-50% of the time is spent on shared-pointer lock mechanisms (though I'm not sure this is to be trusted). I've attached the dumps in SVG format.
>
> I've also attached a script and a Dockerfile to run a benchmark, which reproduces the speeds I see. Building the Docker image and running it on a large Azure machine, I get speeds of around 0.3-1.0 GiB/s, mostly around 0.5 GiB/s.
>
> This is all also available here: https://github.com/drorspei/arrow-csv-benchmark
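
[Editor's note] Below is a minimal sketch of the kind of throughput measurement the report describes, not the attached benchmark-csv.py. The file path, row count, and column shapes are illustrative assumptions; the generated file is far smaller than the 5-10 GiB inputs in the report, so absolute numbers will differ. It simply generates a CSV with string, float, and all-null columns, then times `pyarrow.csv.read_csv` with default options.

{code:python}
import os
import time

import numpy as np
import pandas as pd
import pyarrow.csv

path = "/tmp/bench.csv"  # hypothetical path, not the reporter's setup
n = 2_000_000            # illustrative row count, far smaller than 5-10 GiB

# String, float, and all-null columns, roughly matching the workload shape
# described in the report.
df = pd.DataFrame({
    "s": np.random.choice(["alpha", "beta", "gamma", "delta"], n),
    "f": np.random.rand(n),
    "nulls": [None] * n,
})
df.to_csv(path, index=False)

size_gib = os.path.getsize(path) / 2**30

# Time the read with default ReadOptions/ConvertOptions and report GiB/s.
start = time.perf_counter()
table = pyarrow.csv.read_csv(path)
elapsed = time.perf_counter() - start
print(f"{size_gib:.3f} GiB in {elapsed:.2f} s -> {size_gib / elapsed:.2f} GiB/s")
{code}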