Hi Weston (and dev group),

Speaking of the grouper in the C++ library and writing partitioned data, I have a tangential question, if I may. In the example C++ source I noticed that an Arrow table (and then an in-memory dataset) was created first, and only afterwards was the data written out to a partitioned data store. Two things stood out in that example code:
1) You needed to create an Arrow table/in-memory dataset in order to write partitioned data.
2) The write took place only after the data in the table had been written in its entirety.

The first of those two seems straightforward: the Arrow library writes Arrow tables to disk, so you must have an Arrow table to begin with. It's the second item I'd like to ask about. The short version of my question is simply: is it only possible to write the table once, after it has been created completely (ergo "one and done")? Or can you write the data iteratively, in smaller quantities over successive writes? (I have put a rough sketch of what I have in mind at the bottom of this mail.) I ask because we have a customer use case in which they want to write partitioned Parquet data, and the data could be (and likely will be) rather large. If you must write only once, after the table has been created in its entirety, that could lead to a large memory footprint in that use case.

Thanks!

-----Original Message-----
From: Weston Pace <weston.p...@gmail.com>
Sent: Tuesday, June 13, 2023 2:11 PM
To: dev@arrow.apache.org
Subject: Re: Group rows in a stream of record batches by group id?

Are you looking for something in C++ or python? We have a thing called the "grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h) which (if memory serves) is the heart of the functionality in C++. It would be nice to add some python bindings for this functionality, as this ask comes up from pyarrow users pretty regularly.

The grouper is used in src/arrow/dataset/partition.h to partition a record batch into groups of batches. This is how the dataset writer writes a partitioned dataset. It's a good example of how you would use the grouper for a "one batch in, one batch per group out" use case.

The grouper can also be used in a streaming situation (many batches in, one batch per group out). In fact, the grouper is what is used by the group by node.

I know you recently added [1], and I'm maybe a little uncertain what the difference is between this ask and the capabilities added in [1].

[1] https://github.com/apache/arrow/pull/35514

On Tue, Jun 13, 2023 at 8:23 AM Li Jin <ice.xell...@gmail.com> wrote:
> Hi,
>
> I am trying to write a function that takes a stream of record batches
> (where the last column is group id), and produces k record batches,
> where record batch k_i contains all the rows with group id == i.
>
> Pseudocode is sth like:
>
> def group_rows(batches, k) -> array[RecordBatch]:
>     builders = array[RecordBatchBuilder](k)
>     for batch in batches:
>         # Assuming the last column is the group id
>         group_ids = batch.column(-1)
>         for i in range(batch.num_rows()):
>             k_i = group_ids[i]
>             builders[k_i].append(batch[i])
>
>     batches = array[RecordBatch](k)
>     for i in range(k):
>         batches[i] = builders[i].build()
>     return batches
>
> I wonder if there is some existing code that does something like this?
> (Specifically, I didn't find code that can append a row or rows to a
> RecordBatchBuilder, either one row given a row index or multiple rows
> given a list of row indices.)
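P.S. To make both patterns concrete, here are two rough, untested C++ sketches (the helper names are mine). Neither is pulled from the Arrow examples, and the exact signatures (e.g. Grouper::Consume taking an ExecSpan, MakeGroupings/ApplyGroupings returning list arrays) have moved around between Arrow releases, so please treat these as outlines rather than working code.

First, the "one batch in, one batch per group out" use of arrow::compute::Grouper that Weston describes, which also matches the shape of Li Jin's pseudocode:

#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/exec.h>
#include <arrow/compute/row/grouper.h>

namespace cp = arrow::compute;

// Split one record batch into one batch per group, keyed by its last column.
arrow::Result<std::vector<std::shared_ptr<arrow::RecordBatch>>>
SplitByLastColumn(const std::shared_ptr<arrow::RecordBatch>& batch) {
  // Feed the key column to the grouper; it assigns a uint32 group id to each row.
  std::shared_ptr<arrow::Array> key = batch->column(batch->num_columns() - 1);
  ARROW_ASSIGN_OR_RAISE(auto grouper, cp::Grouper::Make({key->type()}));
  cp::ExecBatch key_batch({key}, batch->num_rows());
  ARROW_ASSIGN_OR_RAISE(arrow::Datum id_datum,
                        grouper->Consume(cp::ExecSpan(key_batch)));
  auto ids = id_datum.array_as<arrow::UInt32Array>();

  // Turn the flat id array into one list of row indices per group.
  ARROW_ASSIGN_OR_RAISE(auto groupings,
                        cp::Grouper::MakeGroupings(*ids, grouper->num_groups()));

  // Gather every column along those groupings; the i-th list of each result
  // holds that column's values for group i.
  std::vector<std::shared_ptr<arrow::ListArray>> grouped(batch->num_columns());
  for (int c = 0; c < batch->num_columns(); ++c) {
    ARROW_ASSIGN_OR_RAISE(
        grouped[c], cp::Grouper::ApplyGroupings(*groupings, *batch->column(c)));
  }

  // Reassemble one record batch per group.
  std::vector<std::shared_ptr<arrow::RecordBatch>> out;
  for (uint32_t g = 0; g < grouper->num_groups(); ++g) {
    std::vector<std::shared_ptr<arrow::Array>> cols;
    for (const auto& col : grouped) cols.push_back(col->value_slice(g));
    out.push_back(
        arrow::RecordBatch::Make(batch->schema(), cols.front()->length(), cols));
  }
  return out;
}

For the streaming situation (many batches in, one batch per group out), my understanding is you would keep a single Grouper alive across batches so the group ids stay consistent, buffer the per-group pieces, and concatenate them at the end.

Second, the iterative-write question I asked above. From my reading of the dataset headers, FileSystemDataset::Write consumes a scanner, and a scanner can be built from a RecordBatchReader, which - if I am reading it right - would let the writer pull batches as they are produced rather than requiring a fully materialized table. Something like this, where the "year" partition column is just a placeholder:

#include <memory>
#include <string>
#include <utility>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;

// Write a partitioned Parquet dataset from a stream of record batches,
// without materializing the whole table in memory first.
arrow::Status WriteStreaming(std::shared_ptr<arrow::RecordBatchReader> reader,
                             std::shared_ptr<arrow::fs::FileSystem> fs,
                             const std::string& base_dir) {
  auto format = std::make_shared<ds::ParquetFileFormat>();

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = std::move(fs);
  write_options.base_dir = base_dir;
  write_options.basename_template = "part-{i}.parquet";
  // Placeholder partition key; substitute the real partition column(s).
  write_options.partitioning = std::make_shared<ds::HivePartitioning>(
      arrow::schema({arrow::field("year", arrow::int16())}));

  // The scanner pulls batches from the reader as the writer consumes them.
  auto builder = ds::ScannerBuilder::FromRecordBatchReader(std::move(reader));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return ds::FileSystemDataset::Write(write_options, scanner);
}

If that second sketch misrepresents how the dataset writer is meant to be used, that is exactly the gap in my understanding I am asking about.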