John Sheffield created ARROW-7740:
-------------------------------------

             Summary: R arrow::read_json_arrow aborts session with nested ndjson and default as_data_frame=TRUE
                 Key: ARROW-7740
                 URL: https://issues.apache.org/jira/browse/ARROW-7740
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: John Sheffield


Reading a nested ndjson file using arrow::read_json_arrow with the default 
`as_data_frame=TRUE` causes an immediate session crash, but switching to 
`as_data_frame=FALSE` works fine and the resulting arrow object schema is 
correct.

{code:java}
library(tidyr)
library(arrow)
library(jsonlite)
# Create two test datasets: long_df and a variant that nests long_df into
# a dataframe with a list-column 'nest_level1' containing a dataframe
long_df <- tidyr::expand_grid(ABC = LETTERS[1:3], xyz = letters[24:26], num = 1:3)
long_df[["ftr1"]] <- runif(nrow(long_df))
long_df[["ftr2"]] <- rpois(nrow(long_df), 100)
nested_frame_level1 <- tidyr::nest(long_df, nest_level1 = c(num, ftr1, ftr2))
# Write and validate nested ndjson
jsonlite::stream_out(nested_frame_level1, con = file("nested_frame_level1.json"))
readLines("nested_frame_level1.json", n = 2) # check we have valid ndjson here
# This does not cause a session crash
nested_arrow <- arrow::read_json_arrow(file = "nested_frame_level1.json", as_data_frame = FALSE)
nested_arrow$schema
# correctly interprets `nest_level1` as
# `list<item: struct<num: int64, ftr1: double, ftr2: int64>>`
# This causes a session crash
nested_df <- arrow::read_json_arrow(file = "nested_frame_level1.json", as_data_frame = TRUE)
{code}

The R package version of Arrow is the latest CRAN release (arrow * 0.15.1.1, 
2019-11-05, CRAN (R 3.5.2)). I'm running this code on a slightly older R 
version (3.5.1) on macOS 10.14.6 (x86_64, darwin15.6.0), via RStudio 1.2.5001.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
