[ 
https://issues.apache.org/jira/browse/ARROW-16641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541596#comment-17541596
 ] 

Will Jones edited comment on ARROW-16641 at 5/24/22 4:39 PM:
-------------------------------------------------------------

I don't think there's a compute function that does what you want directly, but 
you should be able to achieve this by flattening the list, testing each 
flattened element for membership, and aggregating by parent index to get one 
keep/drop value per row (a filter vector). 

Is the following example helpful?
{code:r}
library(arrow)
library(dplyr)

# Keep rows of `tab` where any element of `tab$x` is in `valid`
valid <- Array$create(c(2))

tab <- arrow_table(
  x = Array$create(list(c(1, 2), c(3, 2), c(1, 3))),
  y = Array$create(c("a", "b", "c"))
)

tab_exploded <- arrow_table(
  i = call_function("list_parent_indices", tab$x),
  x_flat = call_function("list_flatten", tab$x)
)

to_keep <- tab_exploded %>%
  group_by(i) %>%
  summarise(keep = any(x_flat %in% valid)) %>%
  arrange(i) %>%  # group order is not guaranteed; realign with the rows of `tab`
  compute() %>%
  .$keep

res <- tab[to_keep,]
as_tibble(res)
#> # A tibble: 2 × 2
#>                x y    
#>   <list<double>> <chr>
#> 1            [2] a    
#> 2            [2] b
res$x
#> ChunkedArray
#> [
#>   [
#>     [
#>       1,
#>       2
#>     ],
#>     [
#>       3,
#>       2
#>     ]
#>   ]
#> ]
{code}
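
For the exclusion case in the issue description, a variation on the same idea 
might look like the sketch below. The table `datt`, its columns `id` and 
`flags`, and the sample values are hypothetical stand-ins for the reporter's 
data, not taken from the issue. Instead of aggregating, it collects the parent 
indices of the flagged elements and drops those rows, so rows with empty lists 
are kept. Note that `list_parent_indices` returns 0-based indices.
{code:r}
library(arrow)
library(dplyr)

flags_to_exclude <- c("A", "B")

# Hypothetical sample data: `flags` is a list<string> column
datt <- arrow_table(
  id = Array$create(1:4),
  flags = Array$create(list(c("A", "C"), c("C"), c("B"), c("C", "D")))
)

# One row per list element, tagged with the (0-based) index of its parent row
exploded <- arrow_table(
  i = call_function("list_parent_indices", datt$flags),
  flag = call_function("list_flatten", datt$flags)
)

# Parent indices of rows that contain at least one excluded flag
flagged <- exploded %>%
  filter(flag %in% flags_to_exclude) %>%
  distinct(i) %>%
  collect() %>%
  pull(i)

# Convert 0-based indices to R's 1-based rows and drop them
keep <- !(seq_len(nrow(datt)) %in% (flagged + 1L))
res <- datt[keep, ]

# With the sample data above, the rows with id 2 and 4 should remain
as_tibble(res)
{code}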


was (Author: willjones127):
I don't think there's a compute function that does what you want directly, but 
you should be able to achieve this by flattening the list, doing the filter, 
and aggregating on the indices to get a filter vector. 

Is the following example helpful?
{code:r}
library(arrow)
library(dplyr)

# Keep rows of `tab` where any element of `tab$x` is in `valid`
valid <- Array$create(c(2))

tab <- arrow_table(
  x = Array$create(list(c(1, 2), c(3, 2), c(1, 3))),
  y = Array$create(c("a", "b", "c"))
)

tab_exploded <- arrow_table(
  i = call_function("list_parent_indices", tab$x),
  x_flat = tab$x$chunk(0)$values()
)

to_keep <- tab_exploded %>%
  group_by(i) %>%
  summarise(keep = any(x_flat %in% valid)) %>%
  compute() %>%
  .$keep

res <- tab[to_keep,]
as_tibble(res)
#> # A tibble: 2 × 2
#>                x y    
#>   <list<double>> <chr>
#> 1            [2] a    
#> 2            [2] b
res$x
#> ChunkedArray
#> [
#>   [
#>     [
#>       1,
#>       2
#>     ],
#>     [
#>       3,
#>       2
#>     ]
#>   ]
#> ]
{code}

> [R] How to filter array columns?
> --------------------------------
>
>                 Key: ARROW-16641
>                 URL: https://issues.apache.org/jira/browse/ARROW-16641
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: R
>            Reporter: Vladimir
>            Priority: Minor
>             Fix For: 8.0.0
>
>
> In the parquet data we have, there is a column with the array data type 
> ({*}list<array_element <string>>{*}), which flags records that have different 
> issues. For each record, multiple values could be stored in the column. For 
> example, `{_}[A, B, C]{_}`.
> I'm trying to perform a data filtering step and exclude some flagged records.
> Filtering is trivial for the regular columns that contain just a single 
> value. E.g.,
> {code:r}
> flags_to_exclude <- c("A", "B")
> datt %>% filter(! col %in% flags_to_exclude)
> {code}
> Given the array column, is it possible to exclude records with at least one 
> of the flags from `flags_to_exclude` using the arrow R package?
> I really appreciate any advice you can provide!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
