[
https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277544#comment-17277544
]
Neal Richardson commented on ARROW-11433:
-----------------------------------------
"Only on mac" and "freeing memory" makes me think of ARROW-6994 and the various
issues linked to that (see also
https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L67-L85).
I forget, have you tried using the system memory allocator?
> [R] Unexpectedly slow results reading csv
> -----------------------------------------
>
> Key: ARROW-11433
> URL: https://issues.apache.org/jira/browse/ARROW-11433
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Jonathan Keane
> Priority: Minor
>
> This came up working on benchmarking Arrow's CSV reading. As far as I can
> tell this only impacts R, and only when reading the csv into arrow (but not
> pulling it in to R). It appears that most arrow interactions after the csv is
> read will result in this behavior not happening.
> What I'm seeing is that on subsequent reads, the time to read gets longer and
> longer (frequently in a stair step pattern where every other iteration takes
> longer).
> {code:r}
> > system.time({
> + for (i in 1:10) {
> + print(system.time(tab <-
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + tab <- NULL
> + }
> + })
> user system elapsed
> 24.788 19.485 7.216
> user system elapsed
> 24.952 21.786 9.225
> user system elapsed
> 25.150 23.039 10.332
> user system elapsed
> 25.382 31.012 17.995
> user system elapsed
> 25.309 25.140 12.356
> user system elapsed
> 25.302 26.975 13.938
> user system elapsed
> 25.509 34.390 21.134
> user system elapsed
> 25.674 28.195 15.048
> user system elapsed
> 25.031 28.094 16.449
> user system elapsed
> 25.825 37.165 23.379
> # total time:
> user system elapsed
> 256.178 299.671 175.119
> {code}
> Interestingly, doing something as unrelated as
> {{arrow:::default_memory_pool()}} which is [only getting the default memory
> pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70].
> Other interactions totally unrelated to the table also similarly alleviate
> this behavior (e.g. {{empty_tab <- Table$create(data.frame())}}) or
> proactively invalidating with {{tab$invalidate()}}
> {code:r}
> > system.time({
> + for (i in 1:10) {
> + print(system.time(tab <-
> read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + pool <- arrow:::default_memory_pool()
> + tab <- NULL
> + }
> + })
> user system elapsed
> 25.257 19.475 6.785
> user system elapsed
> 25.271 19.838 6.821
> user system elapsed
> 25.288 20.103 6.861
> user system elapsed
> 25.188 20.290 7.217
> user system elapsed
> 25.283 20.043 6.832
> user system elapsed
> 25.194 19.947 6.906
> user system elapsed
> 25.278 19.993 6.834
> user system elapsed
> 25.355 20.018 6.833
> user system elapsed
> 24.986 19.869 6.865
> user system elapsed
> 25.130 19.878 6.798
> # total time:
> user system elapsed
> 255.381 210.598 83.109
> >
> {code}
> I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the
> same behavior.
> I checked against pyarrow, and do not see the same:
> {code:python}
> from pyarrow import csv
> import time
> for i in range(1, 10):
> start = time.time()
> table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
> print(time.time() - start)
> del table
> {code}
> results:
> {code}
> 7.586184978485107
> 7.542470932006836
> 7.92852783203125
> 7.647372007369995
> 7.742412805557251
> 8.101378917694092
> 7.7359960079193115
> 7.843957901000977
> 7.6457719802856445
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)