[ https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278429#comment-17278429 ]

Jonathan Keane commented on ARROW-11433:
----------------------------------------

After digging a bit more, I do think that ARROW-6994 is related. I've tried 
[changing the decays|https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L82-L83] 
to something non-zero (e.g. 1000, to match the lines above) and the problem 
goes away.

I've just tried setting both decays to 1000 and re-running the script in 
ARROW-6994 / ARROW-7305 and it doesn't appear to change _too much_:

with {{dirty_decay_ms:1000}} and {{muzzy_decay_ms:1000}}:
{code}
(pyarrow-source1000) jkeane@het python % python test.py
Starting RSS: 52789248
Read CSV RSS: 170905600
Wrote Parquet RSS: 580022272
Waited 1 second RSS: 580022272
Read CSV RSS: 640917504
Wrote Parquet RSS: 645062656
Waited 1 second RSS: 645062656
Read CSV RSS: 705339392
Wrote Parquet RSS: 707686400
Waited 1 second RSS: 707686400
Read CSV RSS: 717934592
Wrote Parquet RSS: 718041088
Waited 1 second RSS: 718041088
Read CSV RSS: 771768320
Wrote Parquet RSS: 772018176
Waited 1 second RSS: 772018176
Read CSV RSS: 774455296
Wrote Parquet RSS: 776101888
Waited 1 second RSS: 776101888
Read CSV RSS: 784887808
Wrote Parquet RSS: 784891904
Waited 1 second RSS: 784891904
Read CSV RSS: 789123072
Wrote Parquet RSS: 789123072
Waited 1 second RSS: 789123072
Read CSV RSS: 789585920
Wrote Parquet RSS: 793587712
Waited 1 second RSS: 793587712
Read CSV RSS: 793608192
Wrote Parquet RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
Waited 1 second RSS: 793931776
{code}

with {{dirty_decay_ms:0}} and {{muzzy_decay_ms:0}}:
{code}
(pyarrow-source) jkeane@het python % python test.py    
Starting RSS: 52600832
Read CSV RSS: 168427520
Wrote Parquet RSS: 277258240
Waited 1 second RSS: 277258240
Read CSV RSS: 336883712
Wrote Parquet RSS: 341028864
Waited 1 second RSS: 341028864
Read CSV RSS: 403419136
Wrote Parquet RSS: 403562496
Waited 1 second RSS: 403562496
Read CSV RSS: 462204928
Wrote Parquet RSS: 462135296
Waited 1 second RSS: 462135296
Read CSV RSS: 521125888
Wrote Parquet RSS: 525246464
Waited 1 second RSS: 525246464
Read CSV RSS: 586534912
Wrote Parquet RSS: 586469376
Waited 1 second RSS: 586469376
Read CSV RSS: 589545472
Wrote Parquet RSS: 589492224
Waited 1 second RSS: 589492224
Read CSV RSS: 595734528
Wrote Parquet RSS: 595947520
Waited 1 second RSS: 595947520
Read CSV RSS: 605380608
Wrote Parquet RSS: 605298688
Waited 1 second RSS: 605298688
Read CSV RSS: 610312192
Wrote Parquet RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
Waited 1 second RSS: 610430976
{code}
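The {{test.py}} script itself isn't reproduced here; a rough stdlib-only sketch of that kind of RSS probe might look like the following. Note that {{ru_maxrss}} is the *peak* rather than the current RSS the output above reports, and its unit is kilobytes on Linux but bytes on macOS, so this is only an approximation of the real harness:

```python
import resource


def peak_rss() -> int:
    # Peak resident set size so far: kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


before = peak_rss()
print("Starting peak RSS:", before)

# Stand-in for the real workload; in the actual script this step would be
# a pyarrow csv.read_csv(...) followed by parquet.write_table(...).
blob = b"x" * 100_000_000
after = peak_rss()
print("After allocation peak RSS:", after)
```

Because the peak never goes back down, a current-RSS reader (e.g. {{psutil.Process().memory_info().rss}}) is what you'd want to actually observe the decay releasing pages over time.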

> [R] Unexpectedly slow results reading csv
> -----------------------------------------
>
>                 Key: ARROW-11433
>                 URL: https://issues.apache.org/jira/browse/ARROW-11433
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Jonathan Keane
>            Priority: Minor
>
> This came up while benchmarking Arrow's CSV reading. As far as I can 
> tell this only impacts R, and only when reading the csv into arrow (but not 
> pulling it into R). It appears that most arrow interactions after the csv is 
> read prevent this behavior from happening.
> What I'm seeing is that on subsequent reads, the time to read gets longer and 
> longer (frequently in a stair-step pattern where every other iteration takes 
> longer).
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> +     print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> +     tab <- NULL
> +   }
> + })
>    user  system elapsed 
>  24.788  19.485   7.216 
>    user  system elapsed 
>  24.952  21.786   9.225 
>    user  system elapsed 
>  25.150  23.039  10.332 
>    user  system elapsed 
>  25.382  31.012  17.995 
>    user  system elapsed 
>  25.309  25.140  12.356 
>    user  system elapsed 
>  25.302  26.975  13.938 
>    user  system elapsed 
>  25.509  34.390  21.134 
>    user  system elapsed 
>  25.674  28.195  15.048 
>    user  system elapsed 
>  25.031  28.094  16.449 
>    user  system elapsed 
>  25.825  37.165  23.379 
> # total time:
>    user  system elapsed 
> 256.178 299.671 175.119 
> {code}
> Interestingly, doing something as unrelated as calling 
> {{arrow:::default_memory_pool()}} (which is [only getting the default memory 
> pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70])
>  alleviates this behavior. Other interactions totally unrelated to the table 
> similarly alleviate it (e.g. {{empty_tab <- Table$create(data.frame())}}), as 
> does proactively invalidating with {{tab$invalidate()}}:
> {code:r}
> > system.time({
> +   for (i in 1:10) {
> +     print(system.time(tab <- 
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> +     pool <- arrow:::default_memory_pool()
> +     tab <- NULL
> +   }
> + })
>    user  system elapsed 
>  25.257  19.475   6.785 
>    user  system elapsed 
>  25.271  19.838   6.821 
>    user  system elapsed 
>  25.288  20.103   6.861 
>    user  system elapsed 
>  25.188  20.290   7.217 
>    user  system elapsed 
>  25.283  20.043   6.832 
>    user  system elapsed 
>  25.194  19.947   6.906 
>    user  system elapsed 
>  25.278  19.993   6.834 
>    user  system elapsed 
>  25.355  20.018   6.833 
>    user  system elapsed 
>  24.986  19.869   6.865 
>    user  system elapsed 
>  25.130  19.878   6.798 
> # total time:
>    user  system elapsed 
> 255.381 210.598  83.109
> {code}
> I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0, and all exhibit the 
> same behavior.
> I checked against pyarrow and do not see the same behavior:
> {code:python}
> from pyarrow import csv
> import time
> for i in range(1, 10):
>     start = time.time()
>     table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
>     print(time.time() - start)
>     del table
> {code}
> results:
> {code}
> 7.586184978485107
> 7.542470932006836
> 7.92852783203125
> 7.647372007369995
> 7.742412805557251
> 8.101378917694092
> 7.7359960079193115
> 7.843957901000977
> 7.6457719802856445
> {code} 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
