[
https://issues.apache.org/jira/browse/ARROW-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104730#comment-17104730
]
Neal Richardson commented on ARROW-8748:
----------------------------------------
We could add methods to concatenate Tables in Arrow memory (the function
probably exists in the C++ library). But I'm not sure that's the best solution
to your problem. If you have several Tables and you dump them to a file, you
don't need to concatenate them in memory first. You can use the lower-level
{{RecordBatchStreamWriter}} that {{write_ipc_stream}} wraps. Something like:
{code:r}
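# Open the output file and create a stream writer from the shared schema
file_obj <- FileOutputStream$create(file_name)
writer <- RecordBatchStreamWriter$create(file_obj, batches[[1]]$schema)
# Append each batch in turn; nothing is concatenated in memory
for (batch in batches) {
  writer$write(batch)
}
# Close the writer first, then the underlying file
writer$close()
file_obj$close()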
{code}
See {{?RecordBatchWriter}}.
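As a rough, self-contained sketch of the same idea applied to whole Tables rather than batches: the code below writes two small Tables to one IPC stream file and reads the result back. The table contents, the variable names, and the temporary file are made up for illustration, and the sketch assumes all Tables share exactly the same schema, since the writer is created with a single schema.

```r
library(arrow)

# Two small in-memory Tables with identical schemas (hypothetical data)
tbl_1 <- Table$create(
  data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
)
tbl_2 <- Table$create(
  data.frame(x = 4:5, y = c("d", "e"), stringsAsFactors = FALSE)
)
tables <- list(tbl_1, tbl_2)

# Temporary output path, used here only so the sketch is runnable
file_name <- tempfile(fileext = ".arrows")

# One writer, one schema; each Table is appended to the same stream
file_obj <- FileOutputStream$create(file_name)
writer <- RecordBatchStreamWriter$create(file_obj, tables[[1]]$schema)
for (tbl in tables) {
  writer$write(tbl)
}
writer$close()
file_obj$close()

# Read the combined stream back to check that all rows arrived
combined <- read_ipc_stream(file_name)
```

Note that this does not reproduce the {{.id}} column from {{bind_rows}}; if the source of each row matters, that column would have to be present in every Table before writing.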
> [R] Implementing methods for combining arrow tables using dplyr::bind_rows
> and dplyr::bind_cols
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-8748
> URL: https://issues.apache.org/jira/browse/ARROW-8748
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Dominic Dennenmoser
> Priority: Major
> Labels: features, performance, pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> First of all, many thanks for your hard work! I was quite excited when you
> implemented some basic functions of the {{dplyr}} package. Is there a
> way to combine two or more arrow tables into one by rows or columns? At the
> moment my workaround looks like this:
> {code:r}
> dplyr::bind_rows(
>   "a" = arrow.table.1 %>% dplyr::collect(),
>   "b" = arrow.table.2 %>% dplyr::collect(),
>   "c" = arrow.table.3 %>% dplyr::collect(),
>   "d" = arrow.table.4 %>% dplyr::collect(),
>   .id = "ID"
> ) %>%
>   arrow::write_ipc_stream(sink = "file_name_combined_tables.arrow")
> {code}
> But this is not a good solution, because it pulls the data back into the R
> session as data frames/tibbles, which can exhaust the available RAM. Perhaps
> you have a better workaround at hand. It would be great if you could
> implement the {{bind_rows}} and {{bind_cols}} methods provided by {{dplyr}},
> for example:
> {code:r}
> dplyr::bind_rows(
>   "a" = arrow.table.1,
>   "b" = arrow.table.2,
>   "c" = arrow.table.3,
>   "d" = arrow.table.4,
>   .id = "ID"
> ) %>%
>   arrow::write_ipc_stream(sink = "file_name_combined_tables.arrow")
> {code}
>
>