[ https://issues.apache.org/jira/browse/ARROW-17802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607890#comment-17607890 ]
Nicola Crane commented on ARROW-17802:
--------------------------------------
You can pass a schema to {{open_dataset()}} that lists the columns from all of the files; columns missing from a given file are filled with NA:
{code:r}
library(arrow)
library(dplyr)
# create 2 tables with column `x` in common
tbl1 <- arrow_table(x = 1:3, z = c(TRUE, FALSE, TRUE))
tbl2 <- arrow_table(x = 4:6, y = c("do", "ray", "me"))
# set up temporary directory
tf <- tempfile()
dir.create(tf)
write_parquet(tbl1, file.path(tf, "file_1.parquet"))
write_parquet(tbl2, file.path(tf, "file_2.parquet"))
# open the dataset specifying the schema with both columns
open_dataset(tf, schema = schema(x = int32(), y = string(), z = boolean())) %>%
  collect()
#> # A tibble: 6 × 3
#>       x y     z
#>   <int> <chr> <lgl>
#> 1     1 <NA>  TRUE
#> 2     2 <NA>  FALSE
#> 3     3 <NA>  TRUE
#> 4     4 do    NA
#> 5     5 ray   NA
#> 6     6 me    NA
{code}
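If the goal is to keep only the column(s) shared by every file, one possible approach is to add a {{select()}} before {{collect()}}, so that only those columns are read into memory. The following is a minimal sketch rather than a definitive recipe. It assumes {{x}} is the only shared column, rebuilds the two example files so that it runs on its own, and uses {{common}} as an illustrative variable name:
{code:r}
library(arrow)
library(dplyr)
# recreate the two example files, which share only column `x`
tbl1 <- arrow_table(x = 1:3, z = c(TRUE, FALSE, TRUE))
tbl2 <- arrow_table(x = 4:6, y = c("do", "ray", "me"))
tf <- tempfile()
dir.create(tf)
write_parquet(tbl1, file.path(tf, "file_1.parquet"))
write_parquet(tbl2, file.path(tf, "file_2.parquet"))
# open with the full schema, then keep only the shared column before collecting
common <- open_dataset(tf, schema = schema(x = int32(), y = string(), z = boolean())) %>%
  select(x) %>%
  collect()
{code}
Because {{select()}} is applied to the dataset before {{collect()}}, the columns that are not needed should never be loaded into R.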
> [R] Merging multi file datasets on particular columns that are present in all
> the datasets.
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-17802
> URL: https://issues.apache.org/jira/browse/ARROW-17802
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: N Gautam Animesh
> Priority: Major
>
> While working with multi-file datasets, I wanted to merge specific columns
> that are present in all of the datasets and then work on them, but I was not
> able to do so. Is there a workaround for merging multi-file datasets on
> specific columns?
> Please look into it and let me know if there is anything that addresses this.
> {code:r}
> library(arrow)
> library(dplyr)
> system.time({
>   ds <- open_dataset('C:/Test/Files/test', format = "arrow")
>   df <- ds %>%
>     # merging logic so as to select only specified column(s)
>     collect()
>   # write_dataset(df, 'C:/Test/Files/test', format = "arrow")
> })
> {code}