[jira] [Commented] (ARROW-10485) open_dataset(): specifying partition when hive_style =TRUE fails silently

Ben Kietzman (Jira) Thu, 07 Jan 2021 12:54:07 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260817#comment-17260817
 ]


Ben Kietzman commented on ARROW-10485:
--------------------------------------

To be clear: the issue here stems from the dataset on disk being 
hive-partitioned, so directories look like /{{cyl=4/part-0.parquet}}. Reading 
it with a *directory* partitioning (which is what results when a character 
vector is specified) takes each whole path segment as the field's value, so 
field "cyl" gets string values like "cyl=4"; obviously suspect to a human but 
technically valid. If you construct a {{HivePartitioning}} explicitly, this 
issue should not arise.
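
The difference can be sketched in base R. This is only an illustration, assuming each scheme reads one path segment at a time; the helper names are hypothetical and are not part of the arrow package.

```r
# Illustrative only: these helpers mimic, in base R, how each scheme reads
# one path segment. They are hypothetical and not arrow functions.

# Directory partitioning: the whole segment is taken as the field's value.
parse_directory_segment <- function(segment, field) {
  out <- list(segment)
  names(out) <- field
  out
}

# Hive partitioning: the segment is split at the first "=" into key and value.
parse_hive_segment <- function(segment) {
  pos <- regexpr("=", segment, fixed = TRUE)
  if (pos < 1) stop("not a hive-style segment: ", segment)
  key <- substr(segment, 1, pos - 1)
  value <- substr(segment, pos + 1, nchar(segment))
  out <- list(value)
  names(out) <- key
  out
}

segment <- "cyl=4"  # first directory level of cyl=4/part-0.parquet

dir_fields <- parse_directory_segment(segment, field = "cyl")
hive_fields <- parse_hive_segment(segment)

# The directory reading keeps the "cyl=" prefix in the value, so a filter
# such as cyl == 4 has nothing to match; the hive reading strips it.
print(dir_fields$cyl)
print(hive_fields$cyl)
```

With the arrow R package itself, the explicit route would presumably be to pass {{hive_partition(cyl = float64())}} as the partitioning argument of open_dataset(); the {{float64()}} type is an assumption based on cyl being a double in mtcars.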
 

> open_dataset(): specifying partition when hive_style =TRUE fails silently
> -------------------------------------------------------------------------
>
>                 Key: ARROW-10485
>                 URL: https://issues.apache.org/jira/browse/ARROW-10485
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 2.0.0
>         Environment: MacOS Catalina 10.15.7 (19H2), R 4.01, arrow R package 
> v2.0.0
>            Reporter: John Sheffield
>            Assignee: Ben Kietzman
>            Priority: Minor
>
> When writing a dataset with hive_style = TRUE, now the default, that dataset 
> has to be opened without an explicit definition of the partitions to work as 
> expected. Even if the correct partition is specified, any query to the 
> dataset on the partition field returns 0 rows.
>  
> From my eyes as a user, I'd want this to error out specifically (not just 
> warn), probably when first calling open_dataset().
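> {code:r}
> library(dplyr)  # provides %>%, filter() and collect()
> data("mtcars")
> # Write mtcars partitioned by cyl into hive-style directories (cyl=4/, ...)
> arrow::write_dataset(
>     dataset = mtcars, path = "mtcarstest", partitioning = "cyl",
>     format = "parquet", hive_style = TRUE)
> # Opened with an explicit partition name
> mtc1 <- arrow::open_dataset("mtcarstest", partitioning = "cyl")
> # Opened without an explicit partition definition
> mtc2 <- arrow::open_dataset("mtcarstest")
> # Returns 0 rows (the reported problem)
> mtc1 %>%
>      dplyr::filter(cyl == 4) %>%
>      collect()
> # Works as expected
> mtc2 %>%
>      dplyr::filter(cyl == 4) %>%
>      collect()
>  {code}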



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
