[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293912#comment-17293912 ]

Antoine Pitrou commented on ARROW-10308:
----------------------------------------

Hmm, I'm a bit surprised by your assessment of chunking. Unless you're enabling 
{{ParseOptions::newlines_in_values}}, chunking is basically a linear search for 
an end-of-line character (you can look for {{NewlineBoundaryFinder}} in 
{{src/arrow/util}}). So it should be extremely fast. I'd be curious about a case 
where this chunking is the bottleneck.

To answer your other question, I do think that optimal block size is going to 
depend on the number of columns. Basically, you want a single block to contain 
enough rows so that administration overhead is minimal compared to converting 
actual column batches. But of course, if the block size grows too much, the CPU 
caches will be less efficient and parallelization may also suffer from a 
coarser-grained partition of data. I can't think of a simple formula that can 
integrate all these concerns and compute the ideal block size for a given 
situation, so I think it's best to let users experiment.

(that said, we *can* change the default block size if we decide that another 
value, e.g. 4 MB, is more likely to be close to the optimum in most cases)

> [Python] read_csv from python is slow on some workloads
> -------------------------------------------------------
>
>                 Key: ARROW-10308
>                 URL: https://issues.apache.org/jira/browse/ARROW-10308
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 1.0.1
>         Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>            Reporter: Dror Speiser
>            Priority: Minor
>              Labels: csv, performance
>         Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, 
> float, and all-null columns, and large files (5-10 GiB), though the file 
> size didn't matter much.
> Moreover, profiling a bit with py-spy, it seems that maybe 30-50% of the 
> time is spent on shared pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds of around 0.3-1.0 GiB/s, mostly around 
> 0.5 GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
