[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215519#comment-17215519 ]

Antoine Pitrou commented on ARROW-10308:
----------------------------------------

> Antoine, do you think this is a good idea? Do you have input on what csv 
> compositions are found in the wild?

Yes, that sounds like a very good idea. Instead of generating data, I think 
it's better to use actual data. You can find a variety of real-world datasets 
here:
 [https://github.com/awslabs/open-data-registry]

A commonly used dataset for demonstration and benchmarking purposes is the New 
York taxi dataset:
 [https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page]

You may also find datasets of Twitter messages, which would be more text-heavy 
and therefore would stress the CSV reader a bit differently.

Generally, for multi-threaded benchmarking, you want files that are at least 1 GB 
in size. It may be possible to take a smaller file and replicate its contents a 
number of times to reach the desired size, though.
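
For instance, a minimal sketch along those lines (the file names and the 1 GiB 
target are assumptions for illustration, not from this issue) could replicate 
the data rows while writing the header only once:

    import os

    SRC = "small.csv"      # hypothetical source file
    DST = "big.csv"        # hypothetical enlarged copy
    TARGET = 1 << 30       # ~1 GiB, per the suggestion above

    with open(SRC, "rb") as f:
        header = f.readline()  # keep the header row only once
        body = f.read()

    with open(DST, "wb") as out:
        out.write(header)
        while out.tell() < TARGET:
            out.write(body)

    print(f"wrote {os.path.getsize(DST) / 2**30:.2f} GiB to {DST}")

Replicating only the body rather than the whole file avoids repeating the 
header mid-file, which would otherwise show up as spurious string values in 
every column.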

> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
>                 Key: ARROW-10308
>                 URL: https://issues.apache.org/jira/browse/ARROW-10308
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 1.0.1
>         Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>            Reporter: Dror Speiser
>            Priority: Minor
>              Labels: csv, performance
>         Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data around 0.5GiB/s. "Real workloads" means many string, float, 
> and all-null columns, and large file size (5-10GiB), though the file size 
> didn't matter to much.
> Moreover, profiling a little a bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark
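
For reference, a minimal throughput measurement in the spirit of the attached 
benchmark-csv.py (this is a sketch, not the attached script; the file path is 
an assumption) might look like:

    import os
    import time

    import pyarrow.csv

    PATH = "big.csv"  # hypothetical input; any large CSV works

    start = time.perf_counter()
    table = pyarrow.csv.read_csv(PATH)  # defaults: multi-threaded read, type inference
    elapsed = time.perf_counter() - start

    gib = os.path.getsize(PATH) / 2**30
    print(f"{gib:.2f} GiB in {elapsed:.2f} s -> {gib / elapsed:.2f} GiB/s")

read_csv uses multiple threads by default, so a run like this on a large 
machine exercises the multi-threaded path that the numbers above describe.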



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
