[ https://issues.apache.org/jira/browse/ARROW-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-9606:
-----------------------------------
    Priority: Major  (was: Critical)

> [C++][Dataset] in expressions don't work with >1 partition levels
> -----------------------------------------------------------------
>
>                 Key: ARROW-9606
>                 URL: https://issues.apache.org/jira/browse/ARROW-9606
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.0.0
>         Environment: This is using the latest GitHub version on Windows,
> but I also reproduce using the CRAN version and on Linux.
> {code}
> sessionInfo()
> #> R version 4.0.2 (2020-06-22)
> #> Platform: x86_64-w64-mingw32/x64 (64-bit)
> #> Running under: Windows 10 x64 (build 19041)
> #>
> #> Matrix products: default
> #>
> #> locale:
> #> [1] LC_COLLATE=English_United Kingdom.1252
> #> [2] LC_CTYPE=English_United Kingdom.1252
> #> [3] LC_MONETARY=English_United Kingdom.1252
> #> [4] LC_NUMERIC=C
> #> [5] LC_TIME=English_United Kingdom.1252
> #>
> #> attached base packages:
> #> [1] stats     graphics  grDevices utils     datasets  methods   base
> #>
> #> other attached packages:
> #> [1] dplyr_1.0.0      arrow_1.0.0.9000
> {code}
>            Reporter: Maarten Demeyer
>            Priority: Major
>
> When filtering nested partitions using %in%, no rows are returned, both for
> Hive and non-Hive partitioning. == and other comparison operators do work,
> and the problem also goes away when only one partition level is declared in
> the schema.
> This is not caused by the dplyr wrappers; the lower-level functions have the
> same problem.
> {code}
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> library(dplyr)
> #>
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #>
> #>     intersect, setdiff, setequal, union
> ## Write files
> pqdir <- file.path(tempdir(), paste(sample(letters, 6), collapse = ""))
> for (foo in 0:1) {
>   for (faa in 0:1) {
>     fdir <- file.path(pqdir, letters[foo + 1], letters[faa + 1])
>     dir.create(fdir, recursive = TRUE)
>     rng <- (foo * 5 + faa + 1):(foo * 5 + faa + 5)
>     write_parquet(data.frame(col = letters[rng]),
>                   file.path(fdir, "file.parquet"))
>   }
> }
> ## What doesn't work: using %in% with both partitions defined
> ds <- open_dataset(pqdir,
>                    partitioning = schema(foo = string(), faa = string()))
> collect(filter(ds, foo %in% "a"))
> #> # A tibble: 0 x 3
> #> # ... with 3 variables: col <chr>, foo <chr>, faa <chr>
> ## == does work
> collect(filter(ds, foo == "a"))
> #> # A tibble: 10 x 3
> #>    col   foo   faa
> #>    <chr> <chr> <chr>
> #>  1 a     a     a
> #>  2 b     a     a
> #>  3 c     a     a
> #>  4 d     a     a
> #>  5 e     a     a
> #>  6 b     a     b
> #>  7 c     a     b
> #>  8 d     a     b
> #>  9 e     a     b
> #> 10 f     a     b
> ## Declaring only one partition does work
> ds <- open_dataset(pqdir, partitioning = schema(foo = string()))
> collect(filter(ds, foo %in% "a"))
> #> # A tibble: 10 x 2
> #>    col   foo
> #>    <chr> <chr>
> #>  1 a     a
> #>  2 b     a
> #>  3 c     a
> #>  4 d     a
> #>  5 e     a
> #>  6 b     a
> #>  7 c     a
> #>  8 d     a
> #>  9 e     a
> #> 10 f     a
> ## The lower-level API has the same problem
> ds <- open_dataset(pqdir,
>                    partitioning = schema(foo = string(), faa = string()))
> flt <- Expression$in_(Expression$field_ref("foo"), Array$create("a"))
> sc <- Scanner$create(ds, filter = flt)
> sc$ToTable()
> #> Table
> #> 0 rows x 3 columns
> #> $col <string>
> #> $foo <string>
> #> $faa <string>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)