[ https://issues.apache.org/jira/browse/ARROW-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260817#comment-17260817 ]
Ben Kietzman commented on ARROW-10485:
--------------------------------------
To be clear: the issue here stems from the dataset-on-disk being
hive-partitioned, so paths look like {{/cyl=4/part-0.parquet}}. When
reading with a *directory* partitioning (which is what results when a character
vector is specified), this results in string values like "cyl=4" for field
"cyl"; obviously suspect to a human but technically valid. If you construct a
{{HivePartitioning}} explicitly, this issue should not arise.
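
To make the difference concrete, here is a minimal base-R sketch of how the two schemes might interpret the same path. It does not use the arrow package; {{parse_directory()}} and {{parse_hive()}} are hypothetical helpers written only for illustration, and the real parsing in Arrow is more involved (type inference, escaping, and so on).

```r
# Illustrative sketch only: parse_directory() and parse_hive() are
# hypothetical helpers, not functions from the arrow package.

# Split the directory part of a path into its non-empty segments.
path_segments <- function(path) {
  segments <- strsplit(dirname(path), "/", fixed = TRUE)[[1]]
  segments[nzchar(segments)]
}

# Directory partitioning: the i-th segment is taken verbatim as the value
# of the i-th field name supplied by the caller.
parse_directory <- function(path, field_names) {
  segments <- path_segments(path)
  stats::setNames(as.list(segments[seq_along(field_names)]), field_names)
}

# Hive partitioning: each segment is split at its first "=" into key and
# value. Assumes every segment contains "=".
parse_hive <- function(path) {
  segments <- path_segments(path)
  pos <- regexpr("=", segments, fixed = TRUE)
  keys <- substr(segments, 1, pos - 1)
  values <- substr(segments, pos + 1, nchar(segments))
  stats::setNames(as.list(values), keys)
}

path <- "/cyl=4/part-0.parquet"
directory_result <- parse_directory(path, "cyl")
hive_result <- parse_hive(path)

print(directory_result$cyl)  # the whole segment, "cyl=4"
print(hive_result$cyl)       # only the part after "=", "4"
```

Under the directory interpretation the field "cyl" holds the string "cyl=4", which would explain why a filter such as {{cyl == 4}} matches nothing. In the R package, one way to construct the Hive partitioning explicitly should be {{arrow::open_dataset("mtcarstest", partitioning = arrow::hive_partition(cyl = arrow::float64()))}}; the double type is an assumption based on mtcars.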
> open_dataset(): specifying partition when hive_style =TRUE fails silently
> -------------------------------------------------------------------------
>
> Key: ARROW-10485
> URL: https://issues.apache.org/jira/browse/ARROW-10485
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 2.0.0
> Environment: MacOS Catalina 10.15.7 (19H2), R 4.01, arrow R package
> v2.0.0
> Reporter: John Sheffield
> Assignee: Ben Kietzman
> Priority: Minor
>
> When writing a dataset with hive_style = TRUE, now the default, that dataset
> has to be opened without an explicit definition of the partitions to work as
> expected. Even if the correct partition is specified, any query to the
> dataset on the partition field returns 0 rows.
>
> From my eyes as a user, I'd want this to error out specifically (not just
> warn), probably when first calling open_dataset().
> {code:r}
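> library(dplyr)  # provides %>%, filter() and collect()
> data("mtcars")
> arrow::write_dataset(
>   dataset = mtcars, path = "mtcarstest", partitioning = "cyl",
>   format = "parquet", hive_style = TRUE)
> # Explicit character-vector partitioning: the filter below returns 0 rows
> mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
> # Partitioning left unspecified: works as expected
> mtc2 <- arrow::open_dataset("mtcarstest")
> mtc1 %>%
>   dplyr::filter(cyl == 4) %>%
>   collect()
> mtc2 %>%
>   dplyr::filter(cyl == 4) %>%
>   collect()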
> {code}