[jira] [Commented] (ARROW-8748) [R] Implementing methods for combining arrow tables using dplyr::bind_rows and dplyr::bind_cols

Dominic Dennenmoser (Jira) Tue, 12 May 2020 02:42:02 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105284#comment-17105284
 ]


Dominic Dennenmoser commented on ARROW-8748:
--------------------------------------------

Your solution seems a nicer workaround than mine. I will definitely look into 
it. (y)

However, concatenating tables in Arrow memory would have two advantages: a new 
file could be written without loading all tables into R memory, and (already 
processed) tables could be linked into one unit for further processing. I think 
{{dplyr::bind_cols()}} and {{dplyr::bind_rows()}} methods would be a flexible 
extension to {{arrow::open_dataset()}}, which needs a predefined folder 
structure. (Please correct me if I haven't understood it correctly.)
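
For illustration, a minimal sketch of row-wise concatenation that stays in 
Arrow memory. It assumes that the installed arrow release provides 
{{concat_tables()}}, which is not confirmed anywhere in this thread; the table 
names, the example data and the output file are hypothetical.

```r
library(arrow)

# Two small in-memory Arrow tables standing in for already processed tables
# (hypothetical example data; the ID column plays the role of `.id`).
tab_a <- arrow_table(ID = c("a", "a"), x = c(1L, 2L))
tab_b <- arrow_table(ID = c("b", "b"), x = c(3L, 4L))

# Assumption: concat_tables() is available in the installed arrow version.
# It stacks the tables row-wise without converting them to data frames.
combined <- concat_tables(tab_a, tab_b)

# Write the combined table to an Arrow IPC stream file.
out_file <- tempfile(fileext = ".arrow")
write_ipc_stream(combined, sink = out_file)
```

No {{dplyr::collect()}} call is involved here, so the combined table is never 
materialised as a data frame or tibble in the R session.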

> [R] Implementing methods for combining arrow tables using dplyr::bind_rows 
> and dplyr::bind_cols
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8748
>                 URL: https://issues.apache.org/jira/browse/ARROW-8748
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>            Reporter: Dominic Dennenmoser
>            Priority: Major
>              Labels: features, performance
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> First of all, many thanks for your hard work! I was quite excited when you 
> guys implemented some basic functions of the {{dplyr}} package. Is there a 
> way to combine two or more Arrow tables into one by rows or columns? At the 
> moment my workaround looks like this:
> {code:r}
> dplyr::bind_rows(
>    "a" = arrow.table.1 %>% dplyr::collect(),
>    "b" = arrow.table.2 %>% dplyr::collect(),
>    "c" = arrow.table.3 %>% dplyr::collect(),
>    "d" = arrow.table.4 %>% dplyr::collect(),
>    .id = "ID"
>  ) %>% 
>  arrow::write_ipc_stream(sink = "file_name_combined_tables.arrow")
> {code}
> But this is not really a workable solution, because it pulls the data back 
> into the R environment as data frames/tibbles, which might exhaust the 
> available RAM. Perhaps you have a better workaround on hand. 
> It would be great if you guys could implement the {{bind_rows}} and 
> {{bind_cols}} methods provided by {{dplyr}}.
> {code:r}
> dplyr::bind_rows(
>    "a" = arrow.table.1,
>    "b" = arrow.table.2,
>    "c" = arrow.table.3,
>    "d" = arrow.table.4,
>    .id = "ID"
>  ) %>% 
>  arrow::write_ipc_stream(sink = "file_name_combined_tables.arrow")
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
