[
https://issues.apache.org/jira/browse/ARROW-9606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170652#comment-17170652
]
Joris Van den Bossche commented on ARROW-9606:
----------------------------------------------
I see this as well with a Python reproducer:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
# works OK with a single partition field
table = pa.table({'part': ['a', 'b']*10, 'col': np.arange(20)})
pq.write_to_dataset(table, "test_partitioned_filter_in",
partition_cols=["part"])
dataset = ds.dataset("test_partitioned_filter_in/", partitioning="hive")
In [41]: dataset.to_table(filter=ds.field("part").isin(["a"])).to_pandas()
Out[41]:
col part
0 0 a
1 2 a
2 4 a
3 6 a
4 8 a
5 10 a
6 12 a
7 14 a
8 16 a
9 18 a
# fails with multiple partition columns
table = pa.table({'part1': ['a', 'b']*10, 'part2': [1, 1, 2, 2]*5, 'col':
np.arange(20)})
pq.write_to_dataset(table, "test_partitioned_filter_in2",
partition_cols=["part1", "part2"])
dataset2 = ds.dataset("test_partitioned_filter_in2/", partitioning="hive")
In [45]: dataset2.to_table(filter=ds.field("part1").isin(["a"])).to_pandas()
Out[45]:
Empty DataFrame
Columns: [col, part1, part2]
Index: []
In [46]: dataset2.to_table(filter=ds.field("part1") == "a").to_pandas()
Out[46]:
col part1 part2
0 0 a 1
1 4 a 1
2 8 a 1
3 12 a 1
4 16 a 1
5 2 a 2
6 6 a 2
7 10 a 2
8 14 a 2
9 18 a 2
{code}
> [C++][Dataset] in expressions don't work with >1 partition levels
> -----------------------------------------------------------------
>
> Key: ARROW-9606
> URL: https://issues.apache.org/jira/browse/ARROW-9606
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 1.0.0
> Environment: This is using the latest Github version using windows,
> but I also reproduce using the CRAN version and using Linux.
> {code}
> sessionInfo()
> #> R version 4.0.2 (2020-06-22)
> #> Platform: x86_64-w64-mingw32/x64 (64-bit)
> #> Running under: Windows 10 x64 (build 19041)
> #>
> #> Matrix products: default
> #>
> #> locale:
> #> [1] LC_COLLATE=English_United Kingdom.1252
> #> [2] LC_CTYPE=English_United Kingdom.1252
> #> [3] LC_MONETARY=English_United Kingdom.1252
> #> [4] LC_NUMERIC=C
> #> [5] LC_TIME=English_United Kingdom.1252
> #>
> #> attached base packages:
> #> [1] stats graphics grDevices utils datasets methods base
> #>
> #> other attached packages:
> #> [1] dplyr_1.0.0 arrow_1.0.0.9000
> {code}
> Reporter: Maarten Demeyer
> Priority: Major
> Labels: dataset
> Fix For: 2.0.0
>
>
> When filtering nested partitions using %in%, no rows are returned, both for
> Hive and non-Hive partitioning. == and other comparison operators do work,
> and the problem also goes away when only one partition level is declared in
> the schema.
> This is not caused by the dplyr wrappers, the lower-level functions have the
> same problem.
> {code}
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #> timestamp
> library(dplyr)
> #>
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #>
> #> intersect, setdiff, setequal, union
> ## Write files
> pqdir <- file.path(tempdir(), paste(sample(letters, 6), collapse = ""))
> for (foo in 0:1) {
> for (faa in 0:1) {
> fdir <- file.path(pqdir, letters[foo + 1], letters[faa + 1])
> dir.create(fdir, recursive = TRUE)
> rng <- (foo * 5 + faa + 1):(foo * 5 + faa + 5)
> write_parquet(data.frame(col = letters[rng]),
> file.path(fdir, "file.parquet"))
> }
> }
> ## What doesn't work: using %in% with both partitions defined
> ds <- open_dataset(pqdir,
> partitioning = schema(foo = string(), faa = string()))
> collect(filter(ds, foo %in% "a"))
> #> # A tibble: 0 x 3
> #> # ... with 3 variables: col <chr>, foo <chr>, faa <chr>
> ## == does work
> collect(filter(ds, foo == "a"))
> #> # A tibble: 10 x 3
> #> col foo faa
> #> <chr> <chr> <chr>
> #> 1 a a a
> #> 2 b a a
> #> 3 c a a
> #> 4 d a a
> #> 5 e a a
> #> 6 b a b
> #> 7 c a b
> #> 8 d a b
> #> 9 e a b
> #> 10 f a b
> ## Declaring only one partition does work
> ds <- open_dataset(pqdir, partitioning = schema(foo = string()))
> collect(filter(ds, foo %in% "a"))
> #> # A tibble: 10 x 2
> #> col foo
> #> <chr> <chr>
> #> 1 a a
> #> 2 b a
> #> 3 c a
> #> 4 d a
> #> 5 e a
> #> 6 b a
> #> 7 c a
> #> 8 d a
> #> 9 e a
> #> 10 f a
> ## The lower-level API has the same problem
> ds <- open_dataset(pqdir,
> partitioning = schema(foo = string(), faa = string()))
> flt <- Expression$in_(Expression$field_ref("foo"), Array$create("a"))
> sc <- Scanner$create(ds, filter = flt)
> sc$ToTable()
> #> Table
> #> 0 rows x 3 columns
> #> $col <string>
> #> $foo <string>
> #> $faa <string>
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)