[
https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Keane updated ARROW-11433:
-----------------------------------
Description:
This came up working on benchmarking Arrow's CSV reading. As far as I can tell
this only impacts R, and only when reading the csv into arrow (but not pulling
it in to R). It appears that most arrow interactions after the csv is read will
result in this behavior not happening.
What I'm seeing is that on subsequent reads, the time to read gets longer and
longer (frequently in a stair step pattern where every other iteration takes
longer).
{code:r}
> system.time({
+ for (i in 1:10) {
+ print(system.time(tab <-
read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+ tab <- NULL
+ }
+ })
user system elapsed
24.788 19.485 7.216
user system elapsed
24.952 21.786 9.225
user system elapsed
25.150 23.039 10.332
user system elapsed
25.382 31.012 17.995
user system elapsed
25.309 25.140 12.356
user system elapsed
25.302 26.975 13.938
user system elapsed
25.509 34.390 21.134
user system elapsed
25.674 28.195 15.048
user system elapsed
25.031 28.094 16.449
user system elapsed
25.825 37.165 23.379
# total time:
user system elapsed
256.178 299.671 175.119
{code}
Interestingly, doing something as unrelated as
{{arrow:::default_memory_pool()}} which is [only getting the default memory
pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70].
Other interactions totally unrelated to the table also similarly alleviate
this behavior (e.g. {{empty_tab <- Table$create(data.frame())}}) or proactively
invalidating with {{tab$invalidate()}}
{code:r}
> system.time({
+ for (i in 1:10) {
+ print(system.time(tab <-
read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+ pool <- arrow:::default_memory_pool()
+ tab <- NULL
+ }
+ })
user system elapsed
25.257 19.475 6.785
user system elapsed
25.271 19.838 6.821
user system elapsed
25.288 20.103 6.861
user system elapsed
25.188 20.290 7.217
user system elapsed
25.283 20.043 6.832
user system elapsed
25.194 19.947 6.906
user system elapsed
25.278 19.993 6.834
user system elapsed
25.355 20.018 6.833
user system elapsed
24.986 19.869 6.865
user system elapsed
25.130 19.878 6.798
# total time:
user system elapsed
255.381 210.598 83.109
>
{code}
I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the
same behavior.
I checked against pyarrow, and do not see the same:
{code:python}
from pyarrow import csv
import time
for i in range(1, 10):
start = time.time()
table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
print(time.time() - start)
del table
{code}
results:
{code}
7.586184978485107
7.542470932006836
7.92852783203125
7.647372007369995
7.742412805557251
8.101378917694092
7.7359960079193115
7.843957901000977
7.6457719802856445
{code}
was:
This came up working on benchmarking Arrow's CSV reading. As far as I can tell
this only impacts R, and only when reading the csv into arrow (but not pulling
it in to R). It appears that most arrow interactions after the csv is read will
result in this behavior not happening.
What I'm seeing is that on subsequent reads, the time to read gets longer and
longer (frequently in a stair step pattern where every other iteration takes
longer).
{code:r}
> system.time({
+ for (i in 1:10) {
+ print(system.time(tab <-
read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+ tab <- NULL
+ }
+ })
user system elapsed
24.788 19.485 7.216
user system elapsed
24.952 21.786 9.225
user system elapsed
25.150 23.039 10.332
user system elapsed
25.382 31.012 17.995
user system elapsed
25.309 25.140 12.356
user system elapsed
25.302 26.975 13.938
user system elapsed
25.509 34.390 21.134
user system elapsed
25.674 28.195 15.048
user system elapsed
25.031 28.094 16.449
user system elapsed
25.825 37.165 23.379
# total time:
user system elapsed
256.178 299.671 175.119
{code}
Interestingly, doing something as unrelated as
{{arrow:::default_memory_pool()}} which is [only getting the default memory
pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70].
Other interactions totally unrelated to the table also similarly alleviate
this behavior (e.g. {{empty_tab <- Table$create(data.frame())}})
{code:r}
> system.time({
+ for (i in 1:10) {
+ print(system.time(tab <-
read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+ pool <- arrow:::default_memory_pool()
+ tab <- NULL
+ }
+ })
user system elapsed
25.257 19.475 6.785
user system elapsed
25.271 19.838 6.821
user system elapsed
25.288 20.103 6.861
user system elapsed
25.188 20.290 7.217
user system elapsed
25.283 20.043 6.832
user system elapsed
25.194 19.947 6.906
user system elapsed
25.278 19.993 6.834
user system elapsed
25.355 20.018 6.833
user system elapsed
24.986 19.869 6.865
user system elapsed
25.130 19.878 6.798
# total time:
user system elapsed
255.381 210.598 83.109
>
{code}
I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the
same behavior.
I checked against pyarrow, and do not see the same:
{code:python}
from pyarrow import csv
import time
for i in range(1, 10):
start = time.time()
table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
print(time.time() - start)
del table
{code}
results:
{code}
7.586184978485107
7.542470932006836
7.92852783203125
7.647372007369995
7.742412805557251
8.101378917694092
7.7359960079193115
7.843957901000977
7.6457719802856445
{code}
> [R] Unexpectedly slow results reading csv
> -----------------------------------------
>
> Key: ARROW-11433
> URL: https://issues.apache.org/jira/browse/ARROW-11433
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Jonathan Keane
> Priority: Minor
>
> This came up working on benchmarking Arrow's CSV reading. As far as I can
> tell this only impacts R, and only when reading the csv into arrow (but not
> pulling it in to R). It appears that most arrow interactions after the csv is
> read will result in this behavior not happening.
> What I'm seeing is that on subsequent reads, the time to read gets longer and
> longer (frequently in a stair step pattern where every other iteration takes
> longer).
> {code:r}
> > system.time({
> + for (i in 1:10) {
> + print(system.time(tab <-
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + tab <- NULL
> + }
> + })
> user system elapsed
> 24.788 19.485 7.216
> user system elapsed
> 24.952 21.786 9.225
> user system elapsed
> 25.150 23.039 10.332
> user system elapsed
> 25.382 31.012 17.995
> user system elapsed
> 25.309 25.140 12.356
> user system elapsed
> 25.302 26.975 13.938
> user system elapsed
> 25.509 34.390 21.134
> user system elapsed
> 25.674 28.195 15.048
> user system elapsed
> 25.031 28.094 16.449
> user system elapsed
> 25.825 37.165 23.379
> # total time:
> user system elapsed
> 256.178 299.671 175.119
> {code}
> Interestingly, doing something as unrelated as
> {{arrow:::default_memory_pool()}} which is [only getting the default memory
> pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70].
> Other interactions totally unrelated to the table also similarly alleviate
> this behavior (e.g. {{empty_tab <- Table$create(data.frame())}}) or
> proactively invalidating with {{tab$invalidate()}}
> {code:r}
> > system.time({
> + for (i in 1:10) {
> + print(system.time(tab <-
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + pool <- arrow:::default_memory_pool()
> + tab <- NULL
> + }
> + })
> user system elapsed
> 25.257 19.475 6.785
> user system elapsed
> 25.271 19.838 6.821
> user system elapsed
> 25.288 20.103 6.861
> user system elapsed
> 25.188 20.290 7.217
> user system elapsed
> 25.283 20.043 6.832
> user system elapsed
> 25.194 19.947 6.906
> user system elapsed
> 25.278 19.993 6.834
> user system elapsed
> 25.355 20.018 6.833
> user system elapsed
> 24.986 19.869 6.865
> user system elapsed
> 25.130 19.878 6.798
> # total time:
> user system elapsed
> 255.381 210.598 83.109
> >
> {code}
> I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the
> same behavior.
> I checked against pyarrow, and do not see the same:
> {code:python}
> from pyarrow import csv
> import time
> for i in range(1, 10):
> start = time.time()
> table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
> print(time.time() - start)
> del table
> {code}
> results:
> {code}
> 7.586184978485107
> 7.542470932006836
> 7.92852783203125
> 7.647372007369995
> 7.742412805557251
> 8.101378917694092
> 7.7359960079193115
> 7.843957901000977
> 7.6457719802856445
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)