[ 
https://issues.apache.org/jira/browse/ARROW-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15731:
-----------------------------------
    Description: 
Currently, joining Arrow data that contains a list column raises an error, even 
when the list column is not a join key:



{code}
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
                   jedi = c(FALSE, TRUE))

arrow_table(starwars) %>%
  left_join(jedi) %>%
  collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Data type list<item: string> is not supported in join non-key field
{code}

The ability to join would be a useful enhancement for workflows with tabular 
data where list columns can be common, and for geospatial workflows where 
geometry columns are stored as `list` or `fixed_size_list` (thanks 
[~paleolimbot] for mentioning that use case).

Related discussion here: ARROW-14519

 

  was:
Currently, joining Arrow data that contains a list column raises an error, even 
when the list column is not a join key:



``` r
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
                   jedi = c(FALSE, TRUE))

arrow_table(starwars) %>%
  left_join(jedi) %>%
  collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Data type list<item: string> is not supported in join non-key field
```

The ability to join would be a useful enhancement for workflows with tabular 
data where list columns can be common, and for geospatial workflows where 
geometry columns are stored as `list` or `fixed_size_list` (thanks 
[~paleolimbot] for mentioning that use case).

Related discussion here: https://issues.apache.org/jira/browse/ARROW-14519

 


> [C++] Enable joins when data contains a list column
> ---------------------------------------------------
>
>                 Key: ARROW-15731
>                 URL: https://issues.apache.org/jira/browse/ARROW-15731
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Stephanie Hazlitt
>            Priority: Major
>
> Currently, joining Arrow data that contains a list column raises an error, even 
> when the list column is not a join key:
> {code}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #>     timestamp
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #>     intersect, setdiff, setequal, union
> jedi <- data.frame(name = c("C-3PO", "Luke Skywalker"),
>                    jedi = c(FALSE, TRUE))
> arrow_table(starwars) %>%
>   left_join(jedi) %>%
>   collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Data type list<item: string> is not supported in join non-key field
> {code}
> The ability to join would be a useful enhancement for workflows with tabular 
> data where list columns can be common, and for geospatial workflows where 
> geometry columns are stored as `list` or `fixed_size_list` (thanks 
> [~paleolimbot] for mentioning that use case).
> Related discussion here: ARROW-14519
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
