[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292895#comment-17292895 ]

Dror Speiser commented on ARROW-10308:
--------------------------------------

Yeah, for sure; I went into the open registry when you posted it the first time, 
and it's definitely _the thing_ to do: go over all of them and measure.

Looking at some random files there, it's not trivial to discern which items are 
CSV files, so registering each data item would be manual work, making this task 
feel more Sisyphean, even though in the long run (and even in the short run) 
most of the time will still be spent on other things...

I've now run the NY taxi benchmark duplicating along the columns axis instead 
of the rows axis, and I think this changes things a bit. I'm starting to think 
that the optimal block size depends on the number of columns. Is this something 
you would expect from the implementation?
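For concreteness, the two benchmark variants can be sketched as below. The helper name and duplication factor are my own; only `block_size` on `pyarrow.csv.ReadOptions` (shown in a comment) is the actual knob being swept:

```python
import csv
import io

def duplicate_axis(rows, header, factor, axis):
    """Build a widened (axis='columns') or lengthened (axis='rows')
    copy of a small table, mimicking the two benchmark variants."""
    if axis == "columns":
        # repeat every column `factor` times, disambiguating names
        new_header = [f"{name}_{i}" for i in range(factor) for name in header]
        new_rows = [row * factor for row in rows]
    else:  # axis == "rows"
        new_header = header
        new_rows = rows * factor
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(new_header)
    writer.writerows(new_rows)
    return buf.getvalue()

# The widened/lengthened CSV would then be read while sweeping the block
# size (pyarrow assumed available in the benchmark environment):
#   from pyarrow import csv as pacsv
#   opts = pacsv.ReadOptions(block_size=1 << 20)  # 1 MiB blocks
#   table = pacsv.read_csv(io.BytesIO(text.encode()), read_options=opts)
```

Timing the `read_csv` call for each (duplication factor, block size) pair is what would expose a column-count dependence in the optimal block size.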

Speaking of the implementation, we were talking about parallelization. I read 
your implementation to get a better idea of what is happening. Almost all of 
the algorithm is parallel, but one step isn't: chunking. Given that it 
processes data more slowly than it can be read from disk, on the larger 
machines it should be the bottleneck of read_csv. I've implemented a second 
version of the chunker that uses a lookup table and SIMD operations (x86 only) 
to apply the state transitions, basically copying the code from this blog post:

[https://branchfree.org/2018/05/25/say-hello-to-my-little-friend-sheng-a-small-but-fast-deterministic-finite-automaton/]
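To illustrate the lookup-table idea in scalar form (the SIMD version in the blog post applies the same transitions with shuffle instructions), here is a minimal sketch of a quote-tracking DFA chunker. The two-state automaton and function names are my own simplification, not Arrow's actual chunker:

```python
# States of the quote-tracking automaton:
OUT, IN_QUOTED = 0, 1

def build_table():
    """Transition table indexed as table[state][byte]."""
    table = [[OUT] * 256, [IN_QUOTED] * 256]
    # most bytes keep the current state; a double quote toggles it
    table[OUT][ord('"')] = IN_QUOTED
    table[IN_QUOTED][ord('"')] = OUT
    return table

def chunk_boundaries(data, table):
    """Return offsets just past each newline that ends a record,
    i.e. a newline seen while outside any quoted field."""
    state = OUT
    boundaries = []
    for i, b in enumerate(data):
        if b == 0x0A and state == OUT:
            boundaries.append(i + 1)
        state = table[state][b]
    return boundaries
```

The point of the table-driven form is that the per-byte work becomes a single indexed load, which is what the SIMD variant then vectorizes.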

The chunker speed indeed went up, but I didn't see performance gains overall, 
so I let it go for the time being.

I think the next step in boosting the speed (other than an adaptive default 
block size) is to make the chunker run in parallel. The Sheng blog post 
references this paper, which shows a way to do this:

[https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/asplos302-mytkowicz.pdf]
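The core trick in that paper, as I understand it, is speculative execution over all possible start states: each chunk is processed independently to produce a small entry-state-to-exit-state map, and only the cheap composition of those maps is sequential. A hedged sketch under my own simplified two-state quote-tracking DFA (not the paper's or Arrow's code):

```python
from concurrent.futures import ThreadPoolExecutor

# Two-state quote-tracking DFA over bytes: state 0 = outside quotes,
# state 1 = inside quotes; a '"' byte toggles the state.
TABLE = [[0] * 256, [1] * 256]
TABLE[0][ord('"')] = 1
TABLE[1][ord('"')] = 0

def state_map(chunk):
    """For each possible entry state, run the DFA over the chunk and
    record the exit state. This needs no knowledge of the true entry
    state, so every chunk can be processed in parallel."""
    out = []
    for start in (0, 1):
        s = start
        for b in chunk:
            s = TABLE[s][b]
        out.append(s)
    return out

def parallel_final_state(data, n_chunks=4):
    size = max(1, len(data) // n_chunks)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        maps = list(pool.map(state_map, chunks))
    # Composing the per-chunk maps is sequential but O(n_chunks),
    # not O(len(data)).
    s = 0
    for m in maps:
        s = m[s]
    return s
```

For a real chunker the per-chunk pass would also record candidate record boundaries per entry state, to be confirmed once the true entry state is known from the composition.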

But take all of this with a large grain of salt: I don't have a profile output 
that actually shows the chunker is the bottleneck, so maybe it's not.

> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
>                 Key: ARROW-10308
>                 URL: https://issues.apache.org/jira/browse/ARROW-10308
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 1.0.1
>         Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>            Reporter: Dror Speiser
>            Priority: Minor
>              Labels: csv, performance
>         Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large file sizes (5-10 GiB), though the 
> file size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
