[ https://issues.apache.org/jira/browse/ARROW-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260835#comment-17260835 ]
Neal Richardson commented on ARROW-10485:
-----------------------------------------
"Constructing a {{HivePartitioning}}" is what happens when you omit the
"partitioning" argument.
I agree with [~jms] that this isn't great. It's not unreasonable to expect that
{{open_dataset("mtcarstest", partitioning = "cyl")}} should do the right thing
if the column name matches the hive key. As it turns out, that's not so
trivial, and supporting that introduces some other API ambiguities. We'll need
to think about this some more.
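
For reference, a minimal sketch of the two ways of opening a hive-style dataset that avoid passing a bare column name. It assumes the {{hive_partition()}} helper from the arrow R package; the temporary path and the {{int32()}} type chosen for {{cyl}} are assumptions made for this illustration, not something stated in the report.

```r
library(arrow)
library(dplyr)

# Write a hive-style dataset (directories named cyl=4, cyl=6, cyl=8).
# The temporary path is hypothetical, used only to keep the sketch self-contained.
path <- file.path(tempdir(), "mtcarstest")
write_dataset(mtcars, path, partitioning = "cyl",
              format = "parquet", hive_style = TRUE)

# Option 1: omit `partitioning`, so a HivePartitioning is constructed
# from the key=value directory names.
ds_auto <- open_dataset(path)

# Option 2: describe the hive key explicitly.
# int32() is an assumed type for cyl, chosen for this example.
ds_hive <- open_dataset(path, partitioning = hive_partition(cyl = int32()))

# Count the rows each dataset returns for the same partition filter.
n_auto <- ds_auto %>% filter(cyl == 4) %>% collect() %>% nrow()
n_hive <- ds_hive %>% filter(cyl == 4) %>% collect() %>% nrow()
```

Both counts should match {{sum(mtcars$cyl == 4)}}; a mismatch would suggest the partition key was not parsed as intended.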
> open_dataset(): specifying partition when hive_style =TRUE fails silently
> -------------------------------------------------------------------------
>
> Key: ARROW-10485
> URL: https://issues.apache.org/jira/browse/ARROW-10485
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 2.0.0
> Environment: MacOS Catalina 10.15.7 (19H2), R 4.01, arrow R package
> v2.0.0
> Reporter: John Sheffield
> Assignee: Ben Kietzman
> Priority: Minor
>
> A dataset written with hive_style = TRUE (now the default) has to be opened
> without an explicit definition of the partitions to work as expected. Even
> if the correct partition is specified, any query that filters the dataset on
> the partition field returns 0 rows.
>
> From my perspective as a user, I'd want this to error out specifically (not
> just warn), probably when first calling open_dataset().
> {code:r}
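> library(dplyr)  # provides %>% and collect()
>
> data("mtcars")
> arrow::write_dataset(
>   dataset = mtcars, path = "mtcarstest", partitioning = "cyl",
>   format = "parquet", hive_style = TRUE)
> # Explicit partitioning: the filter below returns 0 rows (as reported)
> mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
> # No partitioning argument: the filter below works as expected
> mtc2 <- arrow::open_dataset("mtcarstest")
> mtc1 %>%
>   dplyr::filter(cyl == 4) %>%
>   collect()
> mtc2 %>%
>   dplyr::filter(cyl == 4) %>%
>   collect()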
> {code}