[jira] [Commented] (ARROW-14583) [R][C++] Crash when summarizing after filtering to no rows on partitioned data

Nicola Crane (Jira) Sun, 07 Nov 2021 08:22:10 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440036#comment-17440036
 ]


Nicola Crane commented on ARROW-14583:
--------------------------------------

I've since experimented further in R and found that, even without filtering, 
group_by + summarise crashes on partitioned data. For example, the code below 
segfaults:

 
{code:r}
library(arrow)
library(dplyr)

write_dataset(group_by(iris, Species), "iris_data")

open_dataset("iris_data") %>%
  group_by(Species) %>%
  summarise(mean(Sepal.Length)) %>%
  collect()
{code}

> [R][C++] Crash when summarizing after filtering to no rows on partitioned data
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-14583
>                 URL: https://issues.apache.org/jira/browse/ARROW-14583
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, R
>    Affects Versions: 6.0.0
>         Environment: I am using a Windows 10 machine, R 4.1.0, up-to-date R 
> packages, and the latest RStudio IDE.
>            Reporter: Zsolt Kegyes-Brassai
>            Assignee: David Li
>            Priority: Major
>              Labels: pull-request-available, query-engine
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Original issue report is below; here's an even more minimal example:
> {code:r}
> library(arrow)
> library(dplyr)
> td <- tempfile()
> dir.create(td)
> # if the data is not partitioned, this won't segfault
> # write_dataset(iris, td) - swap this in and it won't segfault
> write_dataset(group_by(iris, Species), td)
> open_dataset(td) %>%
>   filter(Species == "tulip") %>%
>   group_by(Sepal.Length) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> ----
> I was trying the new features introduced in the latest {{arrow (6.0.2)}} 
> package, based on examples from the “New Directions for Apache Arrow” talk.
> The RStudio IDE crashed and the R session was aborted.
> Looking closely, I found that I had downloaded only 2 years of data (2018 & 
> 2019), so after the first filter ({{year == 2015}}) no data remains to be 
> processed further.
> After some debugging, by replacing the {{collect()}} call, it turns out that 
> {{summarize()}} is the function causing the crash.
>  
> {code:r}
> as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", 
>                                 partitioning = c("year", "month")) %>%
>   filter(total_amount > 100 & year == 2015) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 5000) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> {code}
>  
> I would expect to get an error message (without crashing the IDE) that can 
> be handled in code.
> Alternatively, the result could be an empty data.frame, as happens when the 
> parquet file is read in as a data.frame. I simulated this situation by 
> setting a very high {{total_amount}} threshold when filtering. Note: when 
> using an Arrow table, an error message is generated instead.
>  
> {code:r}
> library(tidyverse)
> #> Warning: package 'tibble' was built under R version 4.1.1
> #> Warning: package 'tidyr' was built under R version 4.1.1
> #> Warning: package 'readr' was built under R version 4.1.1
> library(arrow)
> #> Warning: package 'arrow' was built under R version 4.1.1
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
>              as_data_frame = FALSE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> Error: Invalid: Must pass at least one array
> read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
>              as_data_frame = TRUE) %>%
>   # filter(total_amount > 100) %>%
>   filter(total_amount > 1e10) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = tip_amount / total_amount * 100) %>%
>   group_by(passenger_count) %>%
>   summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
>   filter(n > 500) %>%
>   arrange(desc(avg_tip_pct)) %>%
>   collect()
> #> # A tibble: 0 x 3
> #> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
