On Mon, Mar 21, 2022 at 12:20:41PM -0700, Will Jones wrote:
>
> I've created a Jira issue to track rbind implementation in R:
> https://issues.apache.org/jira/browse/ARROW-15989

That sounds great, thanks!

On Mon, Mar 21, 2022 at 12:15 PM Will Jones <will.jones...@gmail.com> wrote:

> I don't think we've implemented rbind yet, unfortunately. We've just
> implemented concat_arrays (also bound to c()) [1], and that will be
> available in the next release (or nightlies right now).

Hm, so the latest R arrow code supports using c() to concatenate
Arrays, but not yet Tables? Does that mean I could use c() to take
one column from each of my two feather files, and transparently make
one big column? But I can NOT use c() to do the same for two entire
Tables yet?

(I'm new to Arrow, and haven't read the C++ code at all yet, so I'm
pretty vague about the differences between Arrow arrays, tables,
datasets, etc...)

> The one way you could "rbind" multiple feather files, if they have the
> same schema, is by constructing a union dataset out of the two or more
> files. This would look something like this:
>
> > ds1 <- arrow::open_dataset("file1.feather", format="feather")
> > ds2 <- arrow::open_dataset("file2.feather", format="feather")
> > ds <- c(ds1, ds2)
> > my_table <- collect(ds)

Interesting. So the UnionDataset created by c(ds1, ds2) is giving me
ALL rows from both Feather files, sort of like an implicit rbind().
The dplyr::collect() converts to a data frame of course.

But the advantage of using the UnionDataset here is that if I can
filter the rows I need using only the dplyr verbs that Arrow supports,
then I can do that BEFORE calling collect(), and delay converting from
the mmap-ed Arrow format until later on?

Are there any limitations of UnionDatasets that make this approach any
worse than if rbind() on Arrow Tables already worked?

-- 
Andrew Piskorski <a...@piskorski.com>