Jonathan Keane created ARROW-11433:
--------------------------------------
Summary: [R] Unexpectedly slow results reading csv
Key: ARROW-11433
URL: https://issues.apache.org/jira/browse/ARROW-11433
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Jonathan Keane
This came up working on benchmarking Arrow's CSV reading. As far as I can tell
this only impacts R, and only when reading the csv into arrow (but not pulling
it in to R). It appears that most arrow interactions after the csv is read will
result in this behavior not happening.
What I'm seeing is that on subsequent reads, the time to read gets longer and
longer (frequently in a stair step pattern where every other iteration takes
longer).
{code:r}
> system.time({
+ for (i in 1:10) {
+ print(system.time(tab <-
read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+ tab <- NULL
+ }
+ })
user system elapsed
24.788 19.485 7.216
user system elapsed
24.952 21.786 9.225
user system elapsed
25.150 23.039 10.332
user system elapsed
25.382 31.012 17.995
user system elapsed
25.309 25.140 12.356
user system elapsed
25.302 26.975 13.938
user system elapsed
25.509 34.390 21.134
user system elapsed
25.674 28.195 15.048
user system elapsed
25.031 28.094 16.449
user system elapsed
25.825 37.165 23.379
# total time:
user system elapsed
256.178 299.671 175.119
{code}
Interestingly, doing something as unrelated as
{{arrow:::default_memory_pool()}} which is [only getting the default memory
pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70].
Other interactions totally unrelated to the table also similarly alleviate
this behavior (e.g. {{empty_tab <- Table$create(data.frame())}})
{code:r}
> system.time({
+ for (i in 1:10) {
+ print(system.time(tab <-
read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+ pool <- arrow:::default_memory_pool()
+ tab <- NULL
+ }
+ })
user system elapsed
25.257 19.475 6.785
user system elapsed
25.271 19.838 6.821
user system elapsed
25.288 20.103 6.861
user system elapsed
25.188 20.290 7.217
user system elapsed
25.283 20.043 6.832
user system elapsed
25.194 19.947 6.906
user system elapsed
25.278 19.993 6.834
user system elapsed
25.355 20.018 6.833
user system elapsed
24.986 19.869 6.865
user system elapsed
25.130 19.878 6.798
# total time:
user system elapsed
255.381 210.598 83.109
>
{code}
I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the
same behavior.
I checked against pyarrow, and do not see the same:
{code:python}
from pyarrow import csv
import time
for i in range(1, 10):
start = time.time()
table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
print(time.time() - start)
del table
{code}
results:
{code}
7.586184978485107
7.542470932006836
7.92852783203125
7.647372007369995
7.742412805557251
8.101378917694092
7.7359960079193115
7.843957901000977
7.6457719802856445
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)