[jira] [Commented] (ARROW-9676) [R] Error converting Table with nested structs

Nick DiQuattro (Jira) Thu, 13 Aug 2020 16:32:15 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177377#comment-17177377
 ]


Nick DiQuattro commented on ARROW-9676:
---------------------------------------

Makes sense about converting to data.frame not being the issue. I've been 
trying to find a particular row that causes trouble by loading one row at a 
time with the following code:

{{library(arrow)}}
{{library(purrr)}}
{{library(dplyr)}}

{{one_file <- 
read_parquet("part-00001-fd97a5a9-f795-4f28-b09f-798077773be8-c000.snappy.parquet",
 as_data_frame = FALSE)}}

{{convert <- function(index) as.data.frame(one_file$Slice(index, 1))}}
{{safe_con <- safely(convert)}}

{{test <- map(1:10000, safe_con)}}

{{map(test, "error") %>% discard(is_empty)}}
{{detect_index(test, ~!is_empty(.$error))}}

This sometimes captures a row that produces an error similar to the one 
mentioned previously, but when I investigate that row (by running convert() on 
the same index again), it loads fine. :(
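
One way to separate rows that fail every time from rows that fail only intermittently is to attempt each conversion more than once. The following is a minimal sketch, assuming the arrow and purrr packages. The names try_row and scan_rows are hypothetical helpers that are not part of the original comment, and the small in-memory Table stands in for the real Parquet file. Table$Slice() takes a zero-based offset, so the sketch maps 1-based row i to offset i - 1.

```r
library(arrow)
library(purrr)

# Hypothetical helper: convert a single row several times and count how
# many attempts raise an error. Table$Slice() uses a zero-based offset,
# so 1-based row i lives at offset i - 1.
try_row <- function(tbl, i, attempts = 3) {
  convert <- safely(function() as.data.frame(tbl$Slice(i - 1, 1)))
  results <- map(seq_len(attempts), ~ convert())
  sum(map_lgl(results, ~ !is.null(.x$error)))
}

# Hypothetical helper: scan every row and return the failure count per row.
scan_rows <- function(tbl, attempts = 3) {
  map_int(seq_len(tbl$num_rows), ~ try_row(tbl, .x, attempts))
}

# Small in-memory stand-in for the Parquet-backed Table (an assumption
# made only so the sketch is self-contained).
tbl <- Table$create(id = 1:5, label = letters[1:5])
failures <- scan_rows(tbl)

which(failures > 0)                 # rows that failed at least once
which(failures > 0 & failures < 3)  # rows that failed only intermittently
```

A row that fails on every attempt points at the data itself, while a row that fails only some of the time suggests the problem lies elsewhere (for example, memory pressure or state left over from earlier conversions).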

The files come from a PySpark script, run elsewhere in our pipeline, that 
converts newline-delimited JSON to Parquet.

I hate to have wasted your time, but I can't seem to reliably replicate the 
error.  

> [R] Error converting Table with nested structs
> ----------------------------------------------
>
>                 Key: ARROW-9676
>                 URL: https://issues.apache.org/jira/browse/ARROW-9676
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>    Affects Versions: 1.0.0
>         Environment: Amazon Linux, 32gb of ram
>            Reporter: Nick DiQuattro
>            Priority: Major
>
> When trying to collect data from a dataset based on Parquet files with nested 
> structs (the column is a struct containing 2 nested structs) of moderate size 
> (about 1M rows), R crashes. If I add a filter to reduce the number of rows, 
> the data is parsed. If I select out the struct column, it works great (up to 
> 21M rows). My hunch is that the structs resulting in data.frame columns may be 
> the issue. I am curious if there's a way to have arrow import structs as lists 
> instead of data.frames. Thanks for the direction to here [~neilr8133]!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
