Re: Merge multiple record batches

Bryan Cutler Mon, 19 Feb 2018 14:05:28 -0800

Hi Rares,

I'm not sure which version of Arrow you are using, but pyarrow has a
function to concatenate multiple tables, so the usage would be something
like this:


table_all = pa.concat_tables([table1, table2])

On Wed, Feb 14, 2018 at 4:01 AM, ALBERTO Bocchinfuso <
[email protected]> wrote:

> Hi,
> I don’t think I fully understood your point, but I’ll try to give you the
> answer that looks simplest to me.
> In your code there isn’t any operation on table1 and table2 separately; it
> just looks like you want to merge all those RecordBatches.
> So I think that:
>
>   1.  You can use the to_batches() method documented in the API for
> Table, though I have never tried it myself. This way you create two tables,
> get the batches from those tables, and put the batches together.
>   2.  I would rather store ALL the BATCHES from the two streams in the SAME
> Python LIST, and then create a single table using from_batches() as you
> suggested. That’s because in your code you create two tables even though
> you don’t seem to care about them.
>
> I haven’t tried it, but I think you can go both ways and then tell us
> whether the result is the same and whether one of the two is faster than
> the other.
>
> Alberto
>
> From: Rares Vernica<mailto:[email protected]>
> Sent: Wednesday, 14 February 2018 05:13
> To: [email protected]<mailto:[email protected]>
> Subject: Merge multiple record batches
>
> Hi,
>
> If I have multiple RecordBatchStreamReader inputs, what is the recommended
> way to get all the RecordBatch from all the inputs together, maybe in a
> Table? They all have the same schema. The source for the readers are
> different files.
>
> So, I do something like:
>
> reader1 = pa.open_stream('foo')
> table1 = reader1.read_all()
>
> reader2 = pa.open_stream('bar')
> table2 = reader2.read_all()
>
> # table_all = ???
> # OR maybe I don't need to create table1 and table2
> # table_all = pa.Table.from_batches( ??? )
>
> Thanks!
> Rares
>
>
