> Hm, so the latest R arrow code supports using c() to concatenate
> Arrays, but not yet Tables? Does that mean I could use c() to take
> one column from each of my two feather files, and transparently make
> one big column? But I can NOT use c() to do the same for two entire
> Tables yet?
Yup, exactly. Arrays and ChunkedArrays are basically the columns of a
table. I suppose it wouldn't be too hard to make a quick rbind
implementation out of that, simply by calling c() on each column and
passing those results plus the original schema to arrow_table.

> The dplyr::collect() converts to a data frame of course.

If you want the result as an Arrow table instead of a data frame, you
can pass as_data_frame=FALSE to collect().

> Are there any limitations of UnionDatasets that make this approach any
> worse than if rbind() on Arrow Tables already worked?

AFAIK, not much. It's just very verbose, especially if you are simply
trying to rbind() two Arrow tables that you've already loaded in memory.

On Tue, Mar 22, 2022 at 1:56 PM Andrew Piskorski <a...@piskorski.com> wrote:

> On Mon, Mar 21, 2022 at 12:20:41PM -0700, Will Jones wrote:
> >
> > I've created a Jira issue to track rbind implementation in R:
> > https://issues.apache.org/jira/browse/ARROW-15989
>
> That sounds great, thanks!
>
> On Mon, Mar 21, 2022 at 12:15 PM Will Jones <will.jones...@gmail.com>
> wrote:
>
> > I don't think we've implemented rbind yet, unfortunately. We've just
> > implemented concat_arrays (also bound to c()) [1], and that will be
> > available in the next release (or nightlies right now).
>
> Hm, so the latest R arrow code supports using c() to concatenate
> Arrays, but not yet Tables? Does that mean I could use c() to take
> one column from each of my two feather files, and transparently make
> one big column? But I can NOT use c() to do the same for two entire
> Tables yet?
>
> (I'm new to Arrow, and haven't read the C++ code at all yet, so I'm
> pretty vague about the differences between Arrow arrays, tables,
> datasets, etc...)
>
> > The one way you could "rbind" multiple feather files, if they have the
> > same schema, is by constructing a union dataset out of the two or more
> > files.
> > This would look something like this:
> >
> > > ds1 <- arrow::open_dataset("file1.feather", format="feather")
> > > ds2 <- arrow::open_dataset("file2.feather", format="feather")
> > > ds <- c(ds1, ds2)
> > > my_table <- collect(ds)
>
> Interesting. So the UnionDataset created by c(ds1, ds2) is giving me
> ALL rows from both Feather files, sort of like an implicit rbind().
>
> The dplyr::collect() converts to a data frame of course. But the
> advantage of using the UnionDataset here, is that if I can filter the
> rows I need using only the dplyr verbs that Arrow supports, then I can
> do that BEFORE calling collect(), and delay converting from the mmap-ed
> Arrow format until later on?
>
> Are there any limitations of UnionDatasets that make this approach any
> worse than if rbind() on Arrow Tables already worked?
>
> --
> Andrew Piskorski <a...@piskorski.com>
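
For what it's worth, the column-wise idea from the reply at the top of this message (call c() on each column, then hand the results plus the original schema to arrow_table) might be sketched roughly as below. Treat it as an untested-in-production sketch under a few assumptions: rbind_tables is a made-up helper name, not an arrow function; it assumes an arrow build recent enough that c() works on ChunkedArrays; and it assumes both tables have identical schemas.

```r
library(arrow)

# Hypothetical helper (not part of the arrow package): row-bind two
# in-memory Arrow Tables that share the same schema.
rbind_tables <- function(t1, t2) {
  # Refuse to combine tables whose schemas differ.
  stopifnot(t1$schema$Equals(t2$schema))

  # t[[name]] returns a ChunkedArray; c() joins the two columns.
  cols <- lapply(names(t1), function(nm) c(t1[[nm]], t2[[nm]]))
  names(cols) <- names(t1)

  # Rebuild a Table from the combined columns plus the original schema.
  do.call(arrow_table, c(cols, list(schema = t1$schema)))
}

# Small illustrative tables (made-up data).
t1 <- arrow_table(x = 1:3, y = c("a", "b", "c"))
t2 <- arrow_table(x = 4:5, y = c("d", "e"))
combined <- rbind_tables(t1, t2)
```

The result stays an Arrow Table, so nothing is converted to a data frame until as.data.frame() or collect() is called on it.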