Hi Weston (and dev group),

Speaking of the grouper in the C++ library and writing partitioned data, I have a tangential question, if I may. In the example C++ source I noticed that an Arrow table (and then an in-memory dataset) was created first, and only afterwards was the data written out to a partitioned data store. Two things stood out in that example code:
1) You needed to create an Arrow table/in-memory dataset in order to write partitioned data.
2) The write took place only after the data in the table had been written in its entirety.

The first of those two seems straightforward: the Arrow library writes Arrow tables to disk, so you must have an Arrow table to begin with. It's the second item I'd like to ask about. The short version of my question is simply: is it only possible to write the table once, after it has been created completely (ergo "one and done")? Or can you write the data iteratively, in smaller quantities over successive writes? (I have put a rough sketch of what I have in mind at the bottom of this mail.) I ask because we have a customer use case in which they want to write partitioned Parquet data, and the data could be (and likely will be) rather large. If you must write only once, after the table has been created in its entirety, that could lead to a large memory footprint in that use case.

Thanks!

-----Original Message-----
From: Weston Pace <weston.p...@gmail.com>
Sent: Tuesday, June 13, 2023 2:11 PM
To: dev@arrow.apache.org
Subject: Re: Group rows in a stream of record batches by group id?

Are you looking for something in C++ or python? We have a thing called the "grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h) which (if memory serves) is the heart of the functionality in C++. It would be nice to add some python bindings for this functionality, as this ask comes up from pyarrow users pretty regularly.

The grouper is used in src/arrow/dataset/partition.h to partition a record batch into groups of batches. This is how the dataset writer writes a partitioned dataset. It's a good example of how you would use the grouper for a "one batch in, one batch per group out" use case.

The grouper can also be used in a streaming situation (many batches in, one batch per group out). In fact, the grouper is what is used by the group by node.

I know you recently added [1], and I'm maybe a little uncertain what the difference is between this ask and the capabilities added in [1].

[1] https://github.com/apache/arrow/pull/35514

On Tue, Jun 13, 2023 at 8:23 AM Li Jin <ice.xell...@gmail.com> wrote:
> Hi,
>
> I am trying to write a function that takes a stream of record batches
> (where the last column is group id), and produces k record batches,
> where record batch k_i contains all the rows with group id == i.
>
> Pseudocode is sth like:
>
> def group_rows(batches, k) -> array[RecordBatch]:
>     builders = array[RecordBatchBuilder](k)
>     for batch in batches:
>         # Assuming the last column is the group id
>         group_ids = batch.column(-1)
>         for i in range(batch.num_rows()):
>             k_i = group_ids[i]
>             builders[k_i].append(batch[i])
>
>     batches = array[RecordBatch](k)
>     for i in range(k):
>         batches[i] = builders[i].build()
>     return batches
>
> I wonder if there is some existing code that does something like this?
> (Specifically, I didn't find code that can append a row or rows to a
> RecordBatchBuilder, either one row given a row index or multiple rows
> given a list of row indices.)
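P.S. To make both patterns concrete, here are two rough, untested C++ sketches (the helper names are mine). Neither is pulled from the Arrow examples, and the exact signatures (e.g. Grouper::Consume taking an ExecSpan, MakeGroupings/ApplyGroupings returning list arrays) have moved around between Arrow releases, so please treat these as outlines rather than working code.

First, the "one batch in, one batch per group out" use of arrow::compute::Grouper that Weston describes, which also matches the shape of Li Jin's pseudocode:

#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/exec.h>
#include <arrow/compute/row/grouper.h>

namespace cp = arrow::compute;

// Split one record batch into one batch per group, keyed by its last column.
arrow::Result<std::vector<std::shared_ptr<arrow::RecordBatch>>>
SplitByLastColumn(const std::shared_ptr<arrow::RecordBatch>& batch) {
  // Feed the key column to the grouper; it assigns a uint32 group id to each row.
  std::shared_ptr<arrow::Array> key = batch->column(batch->num_columns() - 1);
  ARROW_ASSIGN_OR_RAISE(auto grouper, cp::Grouper::Make({key->type()}));
  cp::ExecBatch key_batch({key}, batch->num_rows());
  ARROW_ASSIGN_OR_RAISE(arrow::Datum id_datum,
                        grouper->Consume(cp::ExecSpan(key_batch)));
  auto ids = id_datum.array_as<arrow::UInt32Array>();

  // Turn the flat id array into one list of row indices per group.
  ARROW_ASSIGN_OR_RAISE(auto groupings,
                        cp::Grouper::MakeGroupings(*ids, grouper->num_groups()));

  // Gather every column along those groupings; the i-th list of each result
  // holds that column's values for group i.
  std::vector<std::shared_ptr<arrow::ListArray>> grouped(batch->num_columns());
  for (int c = 0; c < batch->num_columns(); ++c) {
    ARROW_ASSIGN_OR_RAISE(
        grouped[c], cp::Grouper::ApplyGroupings(*groupings, *batch->column(c)));
  }

  // Reassemble one record batch per group.
  std::vector<std::shared_ptr<arrow::RecordBatch>> out;
  for (uint32_t g = 0; g < grouper->num_groups(); ++g) {
    std::vector<std::shared_ptr<arrow::Array>> cols;
    for (const auto& col : grouped) cols.push_back(col->value_slice(g));
    out.push_back(
        arrow::RecordBatch::Make(batch->schema(), cols.front()->length(), cols));
  }
  return out;
}

For the streaming situation (many batches in, one batch per group out), my understanding is you would keep a single Grouper alive across batches so the group ids stay consistent, buffer the per-group pieces, and concatenate them at the end.

Second, the iterative-write question I asked above. From my reading of the dataset headers, FileSystemDataset::Write consumes a scanner, and a scanner can be built from a RecordBatchReader, which - if I am reading it right - would let the writer pull batches as they are produced rather than requiring a fully materialized table. Something like this, where the "year" partition column is just a placeholder:

#include <memory>
#include <string>
#include <utility>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;

// Write a partitioned Parquet dataset from a stream of record batches,
// without materializing the whole table in memory first.
arrow::Status WriteStreaming(std::shared_ptr<arrow::RecordBatchReader> reader,
                             std::shared_ptr<arrow::fs::FileSystem> fs,
                             const std::string& base_dir) {
  auto format = std::make_shared<ds::ParquetFileFormat>();

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = std::move(fs);
  write_options.base_dir = base_dir;
  write_options.basename_template = "part-{i}.parquet";
  // Placeholder partition key; substitute the real partition column(s).
  write_options.partitioning = std::make_shared<ds::HivePartitioning>(
      arrow::schema({arrow::field("year", arrow::int16())}));

  // The scanner pulls batches from the reader as the writer consumes them.
  auto builder = ds::ScannerBuilder::FromRecordBatchReader(std::move(reader));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return ds::FileSystemDataset::Write(write_options, scanner);
}

If that second sketch misrepresents how the dataset writer is meant to be used, that is exactly the gap in my understanding I am asking about.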