Hi Jerry,

I asked a similar question months ago, about how to "write the data iteratively in smaller quantities over successive writes?" as hive-partitioned Parquet, and the reply from Weston was extremely helpful to me. The Acero docs are a good starting point <https://arrow.apache.org/docs/cpp/streaming_execution.html>, and here are the related threads on how to use it:

Write data in a streaming fashion when you already have the data stored in a certain format - mailing list thread <https://lists.apache.org/thread/f0ph32xwgn8smo9yx5b7hwrrj90wtbob>

Write data in a streaming fashion when the incoming data uses a push model and you receive it row by row - mailing list thread <https://lists.apache.org/thread/cp2b5jty83q3otjgp2x905scgzg3opnd>
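In pyarrow terms, the pull-based pattern from the first thread looks roughly like the sketch below. It is only a minimal sketch: the schema, the column names, and the batch_source() generator are made up for illustration and stand in for wherever your record batches really come from.

import pyarrow as pa
import pyarrow.dataset as ds

# Illustrative schema; "year" is the hive partition column.
schema = pa.schema([("year", pa.int32()), ("value", pa.float64())])

def batch_source():
    # Stand-in for the real producer (a file reader, a socket, ...).
    # Batches are yielded one at a time, never all at once.
    for year in (2021, 2022, 2023):
        yield pa.RecordBatch.from_arrays(
            [pa.array([year] * 4, pa.int32()),
             pa.array([0.1, 0.2, 0.3, 0.4])],
            schema=schema)

# write_dataset() pulls batches from the reader lazily, so the full
# table never has to exist in memory at once.
reader = pa.RecordBatchReader.from_batches(schema, batch_source())
ds.write_dataset(
    reader,
    "my_dataset",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int32())]),
                                 flavor="hive"))

Memory use is then bounded by the batch size plus whatever the writer buffers per open file. If you genuinely need several separate write_dataset() calls into the same directory, the existing_data_behavior and basename_template parameters control how new files land next to existing ones.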
Hope it helps.

On Wed, Jun 14, 2023 at 9:01 AM Jerry Adair <jerry.ad...@sas.com.invalid> wrote:

> Hi Weston (and dev group),
>
> Speaking of the grouper in the C++ library and writing partitioned data,
> I had a tangential question if I may. I noticed in the example C++ source
> that an Arrow table, then an in-memory dataset, were created first,
> followed by a writing of the data to a partitioned data store. I observed
> two elements in that example code:
>
> 1) You needed to create an Arrow table/in-memory dataset to write
> partitioned data
>
> 2) The writing took place only after the data in the table had been
> created in its entirety.
>
> The first of those two seemed straightforward. The Arrow library will
> write Arrow tables to disk, so you must have an Arrow table to begin
> with. It's about the second item that I'd like to inquire further. The
> short version of my question is simply: is it only possible to write the
> table one time, specifically after the table has been created completely
> (ergo "one and done")? Or can you write the data iteratively in smaller
> quantities over successive writes? I ask because we have a customer use
> case in which they want to write partitioned Parquet data, but the data
> could be (and likely will be) rather large. So if you must write only
> once, after the table has been created in its entirety, that could lead
> to a large memory footprint in that use case.
>
> Thanks!
>
>
> -----Original Message-----
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Tuesday, June 13, 2023 2:11 PM
> To: dev@arrow.apache.org
> Subject: Re: Group rows in a stream of record batches by group id?
>
> Are you looking for something in C++ or Python? We have a thing called
> the "grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h)
> which (if memory serves) is the heart of the functionality in C++. It
> would be nice to add some Python bindings for this functionality, as
> this ask comes up from pyarrow users pretty regularly.
>
> The grouper is used in src/arrow/dataset/partition.h to partition a
> record batch into groups of batches. This is how the dataset writer
> writes a partitioned dataset. It's a good example of how you would use
> the grouper for a "one batch in, one batch per group out" use case.
>
> The grouper can also be used in a streaming situation (many batches in,
> one batch per group out). In fact, the grouper is what is used by the
> group by node. I know you recently added [1] and I'm maybe a little
> uncertain what the difference is between this ask and the capabilities
> added in [1].
>
> [1] https://github.com/apache/arrow/pull/35514
>
> On Tue, Jun 13, 2023 at 8:23 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying to write a function that takes a stream of record batches
> > (where the last column is the group id) and produces k record batches,
> > where record batch k_i contains all the rows with group id == i.
> >
> > Pseudocode is something like:
> >
> > def group_rows(batches, k) -> array[RecordBatch] {
> >     builders = array[RecordBatchBuilder](k)
> >     for batch in batches:
> >         # Assuming the last column is the group id
> >         group_ids = batch.column(-1)
> >         for i in range(batch.num_rows()):
> >             k_i = group_ids[i]
> >             builders[k_i].append(batch[i])
> >
> >     batches = array[RecordBatch](k)
> >     for i in range(k):
> >         batches[i] = builders[i].build()
> >     return batches
> > }
> >
> > I wonder if there is some existing code that does something like this?
> > (Specifically, I didn't find code that can append a row or rows to a
> > RecordBatchBuilder, either one row given a row index or multiple rows
> > given a list of row indices.)
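FWIW, until the grouper gets Python bindings, the "one batch in, one batch per group out" step in that pseudocode can be approximated with plain pyarrow compute calls. A minimal sketch, assuming (like the pseudocode) that the last column holds dense integer group ids 0..k-1 and that the stream is non-empty:

import pyarrow as pa
import pyarrow.compute as pc

def group_rows(batches, k):
    # Buffers of filtered slices, one list per group id.
    per_group = [[] for _ in range(k)]
    schema = None
    for batch in batches:
        schema = batch.schema
        group_ids = batch.column(batch.num_columns - 1)
        for i in range(k):
            # Keep only the rows of this batch whose group id == i.
            per_group[i].append(batch.filter(pc.equal(group_ids, i)))
    # One single-chunk table per group; call .to_batches() on each if
    # you need RecordBatches rather than Tables.
    return [pa.Table.from_batches(parts, schema=schema).combine_chunks()
            for parts in per_group]

Note that this filters every batch k times, whereas the C++ grouper computes the groupings in a single pass, which is exactly why Python bindings for it would be nice to have.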
--
Best regards