On Mon, Mar 21, 2022 at 12:20:41PM -0700, Will Jones wrote:
> 
> I've created a Jira issue to track rbind implementation in R:
> https://issues.apache.org/jira/browse/ARROW-15989

That sounds great, thanks!

On Mon, Mar 21, 2022 at 12:15 PM Will Jones <will.jones...@gmail.com> wrote:

> I don't think we've implemented rbind yet, unfortunately. We've just
> implemented concat_arrays (also bound to c()) [1], and that will be
> available in the next release (or nightlies right now).

Hm, so the latest R arrow code supports using c() to concatenate
Arrays, but not yet Tables?  Does that mean I could use c() to take
one column from each of my two feather files, and transparently make
one big column?  But I can NOT use c() to do the same for two entire
Tables yet?
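
For illustration, a minimal sketch of the Array-level concatenation
mentioned above. It assumes an arrow build recent enough to include
concat_arrays() (per the quoted message, the next release or a current
nightly). The small in-memory arrays a1 and a2 are made up and stand in
for one column read from each Feather file. chunked_array() is shown
next to it as a different way to present two arrays as one logical
column, without merging them into a single contiguous Array.

```r
library(arrow)

# Hypothetical stand-ins for the same column taken from two files.
a1 <- Array$create(c(1L, 2L, 3L))
a2 <- Array$create(c(4L, 5L))

# concat_arrays() builds one new contiguous Array from its inputs.
# Per the quoted message, c(a1, a2) is bound to the same operation.
combined <- concat_arrays(a1, a2)

# A ChunkedArray keeps the inputs as separate chunks but exposes them
# as a single logical column.
chunked <- chunked_array(a1, a2)

combined$length()
chunked$num_chunks
```

Whether c() also accepts whole Tables is exactly the open question
here, so the sketch deliberately stops at Arrays.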

(I'm new to Arrow, and haven't read the C++ code at all yet, so I'm
pretty vague about the differences between Arrow arrays, tables,
datasets, etc...)

> The one way you could "rbind" multiple feather files, if they have the
> same schema, is by constructing a union dataset out of the two or more
> files. This would look something like this:
>
> > ds1 <- arrow::open_dataset("file1.feather", format="feather")
> > ds2 <- arrow::open_dataset("file2.feather", format="feather")
> > ds <- c(ds1, ds2)
> > my_table <- collect(ds)

Interesting.  So the UnionDataset created by c(ds1, ds2) is giving me
ALL rows from both Feather files, sort of like an implicit rbind().

dplyr::collect() converts to a data frame, of course.  But is the
advantage of using the UnionDataset here that, if I can filter the
rows I need using only the dplyr verbs that Arrow supports, I can do
that BEFORE calling collect(), and so delay converting from the
mmap-ed Arrow format until later on?
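
A minimal sketch of that filter-before-collect pattern, under some
assumptions: the two temporary Feather files, the column names id and
value, and the threshold 25 are all invented for illustration, and the
sketch says nothing about whether the files are actually memory-mapped.

```r
library(arrow)
library(dplyr)

# Write two small Feather files with the same schema (made-up data).
f1 <- tempfile(fileext = ".feather")
f2 <- tempfile(fileext = ".feather")
write_feather(data.frame(id = 1:3, value = c(10, 20, 30)), f1)
write_feather(data.frame(id = 4:6, value = c(40, 50, 60)), f2)

# Combine the two datasets with c(), as in the quoted example.
ds <- c(open_dataset(f1, format = "feather"),
        open_dataset(f2, format = "feather"))

# filter() and select() only build a query against the dataset;
# nothing becomes an R data frame until collect() is called.
result <- ds %>%
  filter(value > 25) %>%
  select(id, value) %>%
  collect()

nrow(result)
```

Row order in the collected result is not something this sketch relies
on; add arrange() before collect() if a specific order is needed.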

Are there any limitations of UnionDatasets that make this approach any
worse than if rbind() on Arrow Tables already worked?

-- 
Andrew Piskorski <a...@piskorski.com>
