[ 
https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-11433:
-----------------------------------
    Description: 
This came up while benchmarking Arrow's CSV reading. As far as I can tell this only impacts R, and only when reading the CSV into an Arrow Table (i.e. not pulling it into R as a data frame). It appears that most arrow interactions after the CSV is read prevent this behavior from happening.

What I'm seeing is that on subsequent reads, the time to read gets longer and longer (frequently in a stair-step pattern where every other iteration takes longer).

{code:r}
> system.time({
+   for (i in 1:10) {
+     print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+     tab <- NULL
+   }
+ })
   user  system elapsed 
 24.788  19.485   7.216 
   user  system elapsed 
 24.952  21.786   9.225 
   user  system elapsed 
 25.150  23.039  10.332 
   user  system elapsed 
 25.382  31.012  17.995 
   user  system elapsed 
 25.309  25.140  12.356 
   user  system elapsed 
 25.302  26.975  13.938 
   user  system elapsed 
 25.509  34.390  21.134 
   user  system elapsed 
 25.674  28.195  15.048 
   user  system elapsed 
 25.031  28.094  16.449 
   user  system elapsed 
 25.825  37.165  23.379 
# total time:
   user  system elapsed 
256.178 299.671 175.119 
{code}

Interestingly, doing something as unrelated as calling {{arrow:::default_memory_pool()}} (which is [only getting the default memory pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]) alleviates this behavior. Other interactions totally unrelated to the table do so as well (e.g. {{empty_tab <- Table$create(data.frame())}}), as does proactively invalidating the table with {{tab$invalidate()}}.

{code:r}
> system.time({
+   for (i in 1:10) {
+     print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+     pool <- arrow:::default_memory_pool()
+     tab <- NULL
+   }
+ })
   user  system elapsed 
 25.257  19.475   6.785 
   user  system elapsed 
 25.271  19.838   6.821 
   user  system elapsed 
 25.288  20.103   6.861 
   user  system elapsed 
 25.188  20.290   7.217 
   user  system elapsed 
 25.283  20.043   6.832 
   user  system elapsed 
 25.194  19.947   6.906 
   user  system elapsed 
 25.278  19.993   6.834 
   user  system elapsed 
 25.355  20.018   6.833 
   user  system elapsed 
 24.986  19.869   6.865 
   user  system elapsed 
 25.130  19.878   6.798 
# total time:
   user  system elapsed 
255.381 210.598  83.109 
> 
{code}
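
For completeness, here is a minimal sketch of how the other two workarounds mentioned above fit into the same loop (same file path as the runs above; I have not re-timed this exact variant):

{code:r}
library(arrow)

system.time({
  for (i in 1:10) {
    print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
    # Either of these table-unrelated interactions keeps the per-iteration time
    # flat, just like fetching the default memory pool does:
    empty_tab <- Table$create(data.frame())  # create an unrelated empty Table
    # tab$invalidate()                       # or proactively invalidate the table
    tab <- NULL
  }
})
{code}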


I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0, and all exhibit the same behavior.

I checked against pyarrow and do not see the same behavior:

{code:python}
from pyarrow import csv
import time

for i in range(1, 10):
    start = time.time()
    table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
    print(time.time() - start)
    del table
{code}

results:

{code}
7.586184978485107
7.542470932006836
7.92852783203125
7.647372007369995
7.742412805557251
8.101378917694092
7.7359960079193115
7.843957901000977
7.6457719802856445
{code} 

> [R] Unexpectedly slow results reading csv
> -----------------------------------------
>
>                 Key: ARROW-11433
>                 URL: https://issues.apache.org/jira/browse/ARROW-11433
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Jonathan Keane
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
