Zsolt Kegyes-Brassai created ARROW-14583:
--------------------------------------------

             Summary: RStudio IDE crash
                 Key: ARROW-14583
                 URL: https://issues.apache.org/jira/browse/ARROW-14583
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 6.0.0
         Environment: I am using a Windows 10 machine, R 4.1.0, up-to-date R 
packages, and the latest RStudio IDE.
            Reporter: Zsolt Kegyes-Brassai


I was trying the new features introduced in the latest {{arrow (6.0.2)}} 
package, based on examples from the “New Directions for Apache Arrow” talk.

The RStudio IDE crashed and the R session was aborted.

Looking more closely, I found that I had downloaded only 2 years of data 
(2018 & 2019), so after the first filter ({{year == 2015}}) no data remains to 
be processed further.
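
As a stopgap, the year partitions that actually exist on disk can be checked before filtering. The following is only a minimal sketch in base R: {{available_years()}} is a hypothetical helper (not part of arrow), it assumes the year/month directory layout used here, and the demo builds a throwaway directory tree instead of touching the real dataset.

```r
# Hypothetical helper (not part of arrow): list the year partitions found
# directly under a dataset root, assuming a <root>/<year>/<month>/ layout.
available_years <- function(root) {
  dirs <- list.dirs(root, full.names = FALSE, recursive = FALSE)
  # Directory names that are not integers become NA and are dropped.
  years <- suppressWarnings(as.integer(dirs))
  sort(years[!is.na(years)])
}

# Demo on a temporary tree that mimics having only 2018 and 2019 downloaded.
root <- file.path(tempdir(), "nyc-taxi-demo")
dir.create(file.path(root, "2018", "01"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "2019", "01"), recursive = TRUE, showWarnings = FALSE)

years <- available_years(root)
has_2015 <- 2015 %in% years  # FALSE here, so the 2015 query could be skipped
```

This only avoids the empty-input case; it does not address the crash itself.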

After some debugging, in which I replaced the {{collect()}} call, it turns out 
that {{summarize()}} is the function causing the crash.

 
{code:java}
as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", 
                                partitioning = c("year", "month")) %>%
  filter(total_amount > 100 & year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 5000) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()
{code}
 

I would expect to get an error message (without crashing the IDE), which can be 
handled in code.
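
To illustrate what "handled in code" could look like once the failure surfaces as a regular R error, here is a minimal base R sketch. {{collect_or_empty()}} and {{failing_query()}} are hypothetical names used only for illustration; the stand-in query just raises the message reported for the Arrow table case. Note that {{tryCatch()}} cannot intercept an aborted R session, so this is no workaround for the crash itself.

```r
# Hypothetical wrapper: evaluate a query and turn any R error into an empty
# data.frame plus a warning, so that calling code can carry on.
collect_or_empty <- function(expr) {
  tryCatch(
    expr,  # the lazily passed expression is evaluated here, inside tryCatch
    error = function(e) {
      warning("query failed: ", conditionMessage(e), call. = FALSE)
      data.frame()
    }
  )
}

# Stand-in for a failing query (an assumption for this demo); the message is
# the one reported for the Arrow table case.
failing_query <- function() stop("Invalid: Must pass at least one array")

result <- suppressWarnings(collect_or_empty(failing_query()))
is_empty <- nrow(result) == 0
```

In real use, the whole dplyr pipeline ending in {{collect()}} would be passed as the argument in place of {{failing_query()}}.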

An alternative result would be an empty data.frame, as is returned when the 
parquet file is read in as a data.frame. I simulated this situation by setting 
a high {{total_amount}} threshold when filtering. Note: when using an Arrow 
table, an error message is generated instead.

 
{code:java}
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.1
#> Warning: package 'tidyr' was built under R version 4.1.1
#> Warning: package 'readr' was built under R version 4.1.1
library(arrow)
#> Warning: package 'arrow' was built under R version 4.1.1
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
             as_data_frame = FALSE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()

#> Error: Invalid: Must pass at least one array


read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
             as_data_frame = TRUE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()

#> # A tibble: 0 x 3
#> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
{code}



