[jira] [Resolved] (ARROW-14653) [R] head() hangs on CSV datasets > 600MB

Nicola Crane (Jira) Mon, 03 Jan 2022 10:36:06 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nicola Crane resolved ARROW-14653.
----------------------------------
    Resolution: Fixed

Issue resolved by pull request 11992
[https://github.com/apache/arrow/pull/11992]

> [R] head() hangs on CSV datasets > 600MB
> ----------------------------------------
>
>                 Key: ARROW-14653
>                 URL: https://issues.apache.org/jira/browse/ARROW-14653
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Nicola Crane
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> I'm calling {{head()}} on a dataset made up of CSV files.  I'm doing this 
> as I want to preview my dataset before I try to do anything with it that's 
> going to be more computationally expensive.
> {code:r}
> library(arrow)
> library(dplyr)
> open_dataset("../../data/nyc-raw/", format = "csv") %>%
>   head(1) %>%
>   collect()
> {code}
> I have experimented with different combinations of files in the dataset 
> folder, and it seems to work fine when the total file size is below roughly 
> 600MB but hangs if it's above that.  This might not even be what the actual 
> issue is, but I'm struggling to narrow it down beyond adding extra files to 
> the equation.
> I've tried running with the C++ debugger attached, but again, it just 
> hangs.
> The files I'm using are the 2020-2021 Yellow Taxi trip records available 
> from: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
> A bit of investigation has shown me that I can load different subsets of 
> the files fine, but when using all of them, the session hangs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
