[
https://issues.apache.org/jira/browse/ARROW-17802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607847#comment-17607847
]
N Gautam Animesh commented on ARROW-17802:
------------------------------------------
Actually, I want to merge two or more Arrow files after reading them with
open_dataset(). By default, it binds them row-wise.
My use case, however, is to merge the files on a specified column that is
common to them. For example: file1 and file2 each contain 5 columns, one of
which they have in common.
I want a resulting data frame that contains 9 columns (5 from file1 and 4 from
file2, since 1 column from file1 is repeated in file2).
I hope this explains my use case.
Do let me know if anything else is required, or if there is any other
workaround that would achieve the same functionality.
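
For illustration, the merge described above corresponds to a key-based join rather than a row-wise bind. The following is only a minimal sketch, assuming the arrow and dplyr packages are installed and that the installed arrow version supports dplyr joins on datasets. The column names (id, a1 ... b4) and the temporary directories are hypothetical and are not taken from the actual files.

```r
library(arrow)
library(dplyr)

# Hypothetical example data: two tables of 5 columns each, sharing only `id`
file1 <- data.frame(
  id = 1:3,
  a1 = c(10, 20, 30),
  a2 = c("x", "y", "z"),
  a3 = c(TRUE, FALSE, TRUE),
  a4 = c(1.5, 2.5, 3.5)
)
file2 <- data.frame(
  id = 1:3,
  b1 = c(100, 200, 300),
  b2 = c("p", "q", "r"),
  b3 = c(FALSE, FALSE, TRUE),
  b4 = c(0.1, 0.2, 0.3)
)

# Write each table to its own temporary directory in Arrow format
dir1 <- tempfile("file1_")
dir2 <- tempfile("file2_")
write_dataset(file1, dir1, format = "arrow")
write_dataset(file2, dir2, format = "arrow")

# Open the two locations as separate datasets instead of one combined dataset
ds1 <- open_dataset(dir1, format = "arrow")
ds2 <- open_dataset(dir2, format = "arrow")

# Join on the shared column, then pull the result into a data frame
merged <- ds1 %>%
  inner_join(ds2, by = "id") %>%
  collect()

ncol(merged)
```

Opening each location separately is the key design choice here: a single open_dataset() call over both locations treats the files as fragments of one table, which is where the row-wise binding comes from. With a join, the shared column appears once, giving the 9 columns described above.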
> [R] Merging multi file datasets on particular columns that are present in all
> the datasets.
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-17802
> URL: https://issues.apache.org/jira/browse/ARROW-17802
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: N Gautam Animesh
> Priority: Major
>
> While working with multi-file datasets, I came across an issue where I wanted
> to merge specific columns from all the datasets and work on them.
> I was not able to do so, and would like to know whether there is any
> workaround for merging multi-file datasets on specific columns.
> Please look into it and let me know if there is anything regarding this.
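> {code:java}
> library(arrow)
> library(dplyr)
>
> system.time({
>   # Open the multi-file dataset lazily; nothing is read into memory yet
>   ds <- open_dataset('C:/Test/Files/test', format = "arrow")
>   # Merging logic would go here, so as to select only the specified column(s)
>   df <- ds %>% collect()
>   # write_dataset(df, 'C:/Test/Files/test', format = "arrow")
> }) {code}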
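
The snippet in the issue collects the whole dataset before any column selection happens. As a rough sketch of the "select only specified column(s)" step, the selection can be placed before collect() so that only those columns are materialised. This assumes the arrow and dplyr packages are installed; the file names, column names, and temporary directory are hypothetical.

```r
library(arrow)
library(dplyr)

# Hypothetical multi-file dataset: two Arrow files with the same schema
dataset_dir <- tempfile("multi_file_")
dir.create(dataset_dir)
write_feather(
  data.frame(id = 1:3, value = c(1.1, 2.2, 3.3), note = c("a", "b", "c")),
  file.path(dataset_dir, "part1.arrow")
)
write_feather(
  data.frame(id = 4:6, value = c(4.4, 5.5, 6.6), note = c("d", "e", "f")),
  file.path(dataset_dir, "part2.arrow")
)

# Open all files in the directory as one dataset (rows are bound across files)
ds <- open_dataset(dataset_dir, format = "arrow")

# Select the columns of interest first, then collect into a data frame
result <- ds %>%
  select(id, value) %>%
  collect()

names(result)
```

Selecting before collect() keeps the unneeded columns out of memory, which matters most when the files are large.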