[ https://issues.apache.org/jira/browse/ARROW-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicola Crane resolved ARROW-14653.
----------------------------------
Resolution: Fixed
Issue resolved by pull request 11992
[https://github.com/apache/arrow/pull/11992]
> [R] head() hangs on CSV datasets > 600MB
> ----------------------------------------
>
> Key: ARROW-14653
> URL: https://issues.apache.org/jira/browse/ARROW-14653
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Nicola Crane
> Assignee: Nicola Crane
> Priority: Major
> Labels: pull-request-available
> Fix For: 7.0.0
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> I'm calling {{head()}} on a dataset made up of CSV files. I'm doing this
> because I want to preview my dataset before I try anything that's going to
> be more computationally expensive.
> {code:r}
> library(arrow)
> library(dplyr)
> open_dataset("../../data/nyc-raw/", format = "csv") %>%
>   head(1) %>%
>   collect()
> {code}
> I have experimented with different combinations of files in the dataset
> folder, and it seems to work fine when the total file size is under roughly
> 600 MB but hangs if it's above that. This might not even be what the actual
> issue is, but I'm struggling to narrow it down beyond adding extra files to
> the equation.
> I've tried running with the C++ debugger attached, but again, it just
> hangs.
> The files I'm using are the 2020-2021 Yellow Taxi trip records available
> from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
> A bit of investigation has shown me that I can load different subsets of
> the files without problems, but when using all of them, the session hangs.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)