I am using Snowflake's bulk ingestion and it works fine. The issue is that the size of each bulk insert is determined by the ADBC Go code, which sends array bind inserts that correlate with Arrow record batches.
Unfortunately the Arrow record batches are not optimized for bulk insert. When reading a bunch of CSV files I end up with 3,000 record batches ranging from 12k to 300k records each; the CSV reader is multi-threaded/multi-core, so the batch sizes are determined by the number and size of the files being read in parallel. It took me over 3 hours to insert this 36-million-row Arrow table into Snowflake using adbc_ingest(). If I reorganize those 3,000 record batches into 36 record batches of 1 million records each, adbc_ingest() takes only 3 minutes. I opened an issue with the sample code I used to optimize the size of the bind inserts sent to Snowflake:
https://github.com/apache/arrow-adbc/issues/1322

In the Snowflake history log I can see the inserts happening in bulk, in 1-million-row chunks:

CREATE OR REPLACE TEMPORARY STAGE SYSTEM$BIND file_format=(type=csv field_optionally_enclosed_by='"')    136ms            Go 1.6.22
INSERT INTO xyz VALUES (?, ?, ?, ?, ?)                                                                    2.9s    82.6MB   Go 1.6.22   1.0M

Before the re-org it would look something like 12K rows per insert:

CREATE OR REPLACE TEMPORARY STAGE SYSTEM$BIND file_format=(type=csv field_optionally_enclosed_by='"')    136ms            Go 1.6.22
INSERT INTO xyz VALUES (?, ?, ?, ?, ?)                                                                    2.9s    82.6MB   Go 1.6.22   12K

Those chunk sizes just happen to match the sizes of the record batches:

record_batches = my_data.to_batches()
record_batches[0].num_rows
>> 12K
record_batches[1].num_rows
>> 13K
record_batches[2].num_rows
>> 112K
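The same re-chunking can probably be done with a couple of pyarrow calls instead of the hand-rolled loop quoted below. A minimal sketch, assuming my_data is the 36-million-row table and adbc_conn is an open ADBC connection as in the code later in this thread (note that combine_chunks() materializes a consolidated copy of the table):

    import pyarrow as pa

    # Merge the ~3,000 small per-column chunks into as few chunks as possible,
    # then re-slice the table into record batches of at most 1 million rows.
    combined = my_data.combine_chunks()
    batches = combined.to_batches(max_chunksize=1_000_000)

    # Rebuild a table whose record batches are the ~1M-row slices, so the
    # ADBC Snowflake driver binds roughly 1 million rows per INSERT.
    rechunked = pa.Table.from_batches(batches, schema=my_data.schema)

    cursor = adbc_conn.cursor()
    cursor.adbc_ingest(table_name="xyz", data=rechunked, mode="append")
    cursor.execute("commit")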
-----Original Message-----
From: David Li <lidav...@apache.org>
Sent: Monday, November 27, 2023 8:20 AM
To: dev@arrow.apache.org
Subject: Re: Is there anyway to resize record batches

Following up here, Dewey pointed out that the "right" way to do this would be to use Snowflake's own bulk ingestion support. I filed
https://github.com/apache/arrow-adbc/issues/1327

On Wed, Nov 22, 2023, at 16:06, Lee, David (PAG) wrote:
> I got this working by re-organizing the vectors into 1 million rows each.
>
> My Snowflake bulk insert now takes 3 minutes vs 3 hours. I'll open a
> ticket in ADBC to improve the interface.
>
> ADBC's adbc_ingest() function needs something similar to
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
> which supports max_partitions=None, max_open_files=None, max_rows_per_file=None,
> min_rows_per_group=None, and max_rows_per_group=None.
>
> import pyarrow as pa
>
> new_recordbatch = {column: [] for column in my_data.schema.names}
> new_batches = []
> rows = 0
>
> for batch in my_data.to_batches():
>     for column in my_data.schema.names:
>         new_recordbatch[column].append(batch[column])
>     if rows < 1000000:
>         rows += batch.num_rows
>     else:
>         print(rows)
>         new_arrays = []
>         for column in my_data.schema.names:
>             new_arrays.append(pa.concat_arrays(new_recordbatch[column]))
>         new_batches.append(pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))
>         new_recordbatch = {column: [] for column in my_data.schema.names}
>         rows = 0
>
> if rows > 0:
>     new_arrays = []
>     for column in my_data.schema.names:
>         new_arrays.append(pa.concat_arrays(new_recordbatch[column]))
>     new_batches.append(pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))
>
> new_table = pa.Table.from_batches(new_batches)
>
> cursor = adbc_conn.cursor()
> cursor.adbc_ingest(table_name="xyz", data=new_table, mode="append")
> cursor.execute("commit")
>
> -----Original Message-----
> From: Aldrin <octalene....@pm.me.INVALID>
> Sent: Wednesday, November 22, 2023 12:36 PM
> To: dev@arrow.apache.org
> Subject: Re: Is there anyway to resize record batches
>
> As far as I understand, that bundles the Arrays into a ChunkedArray,
> which only Table interacts with. It doesn't make a longer Array, and
> depending on what the ADBC Snowflake driver is doing, that may or may
> not help with the number of invocations that are happening.
>
> Also, it's not portable across implementations, since ChunkedArray is
> not part of the specification, though I am optimistic that if you pass
> a ChunkedArray to a different implementation then the C++ implementation
> could consolidate it as a single Array.
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Wednesday, November 22nd, 2023 at 11:58, Jacek Pliszka
> <jacek.plis...@gmail.com> wrote:
>
>> Re 4: you create a ChunkedArray from an Array.
>>
>> BR
>>
>> J
>>
>> Wed, 22 Nov 2023 at 20:48 Aldrin <octalene....@pm.me.invalid> wrote:
>>
>> > Assuming the C++ implementation, Jacek's suggestion (#3 below) is
>> > probably best. Here is some extra context:
>> >
>> > 1. You can slice larger RecordBatches [1].
>> > 2. You can make a larger RecordBatch [2] from columns of smaller
>> >    RecordBatches [3], probably using the correct type of Builder [4]
>> >    and with a bit of resistance from the various types.
>> > 3. As Jacek said, you can wrap smaller RecordBatches together as a
>> >    Table [5], combine the chunks [6], and then convert back to
>> >    RecordBatches using a TableBatchReader [7] if necessary.
>> > 4. I didn't see anything useful in the Compute API for concatenating
>> >    arbitrary Arrays or RecordBatches, but you can use Selection
>> >    functions [8] instead of Slicing for anything that's too big.
>> >
>> > # ------------------------------
>> > # Aldrin
>> >
>> > https://github.com/drin/
>> > https://gitlab.com/octalene
>> > https://keybase.io/octalene
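A pyarrow sketch of option 3 above: Table.to_reader() is roughly the Python counterpart of the C++ TableBatchReader (it needs a reasonably recent pyarrow), and small_batches and the 1,000,000-row target are placeholders rather than values from this thread:

    import pyarrow as pa

    # Wrap the small RecordBatches in a Table (zero-copy), merge each column's
    # chunks, then stream the rows back out at a chosen batch size.
    table = pa.Table.from_batches(small_batches)
    reader = table.combine_chunks().to_reader(max_chunksize=1_000_000)

    for batch in reader:
        ...  # each batch now holds at most 1 million rows

Depending on the driver version, the reader itself may be accepted by adbc_ingest(); if not, the batches can be collected back into a Table with pa.Table.from_batches() first.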
>> > On Wednesday, November 22nd, 2023 at 10:58, Jacek Pliszka
>> > <jacek.plis...@gmail.com> wrote:
>> >
>> > > Hi!
>> > >
>> > > I think some code is needed for clarity. You can concatenate
>> > > tables (and combine_chunks afterwards) or arrays, then pass the
>> > > concatenated one.
>> > >
>> > > Regards,
>> > >
>> > > Jacek
>> > >
>> > > Wed, 22 Nov 2023 at 19:54 Lee, David (PAG) <david....@blackrock.com.invalid> wrote:
>> > >
>> > > > I've got 36 million rows of data which end up as an Arrow table
>> > > > with 3,000 record batches ranging from 12k to 300k rows each.
>> > > > I'm assuming these batches are created by the multithreaded CSV
>> > > > file reader.
>> > >
>> > > > Is there any way to reorganize the data into something like 36
>> > > > batches of 1 million rows each?
>> > >
>> > > > What I'm seeing when we try to load this data using the ADBC
>> > > > Snowflake driver is that each individual batch is executed as a
>> > > > bind array insert in the Snowflake Go driver. 3,000 bind array
>> > > > inserts is taking 3 hours.
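For reference, the concatenation route Jacek describes might look like this in pyarrow (table_a, table_b, chunk_a and chunk_b are placeholders for illustration, not objects from this thread):

    import pyarrow as pa

    # Concatenating tables keeps each column as a ChunkedArray; combine_chunks()
    # then merges those chunks into contiguous arrays.
    big_table = pa.concat_tables([table_a, table_b]).combine_chunks()

    # Arrays can also be concatenated directly into a single Array, which
    # sidesteps the ChunkedArray caveat Aldrin raises above.
    big_array = pa.concat_arrays([chunk_a, chunk_b])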