Zan-L commented on issue #1997:
URL: https://github.com/apache/arrow-adbc/issues/1997#issuecomment-2221262572

   @zeroshade I can confirm something slightly different from what you asked in point 2, but arguably more useful: the same data is split correctly into parquet files of ~10 MB under ADBC 1.0.0, but into only 4 files under 1.1.0 on a 4-core VM. That should rule out the data itself as the cause.
   To answer your specific question: the `data` object, as you can see from the code below, is usually the result of `DeltaTable(path).to_pyarrow_dataset()`. We have 180 tables, some of which could contain large batches, but definitely not all of them; yet the behavior of splitting into only four parquet files on a 4-core VM is universal.
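   For context, here is a minimal sketch of how the `data` argument is typically produced and passed in (the path, connection string, and table name are placeholders; it assumes the `deltalake` package):
   ```python
   import pyarrow.dataset as ds
   from deltalake import DeltaTable

   # Lazily-scanned pyarrow Dataset backed by the Delta table's parquet files.
   data: ds.Dataset = DeltaTable("/path/to/delta_table").to_pyarrow_dataset()

   conn_uri = "<snowflake connection string>"  # placeholder
   # then handed to the ingest helper shown below, e.g.:
   # _arrow_to_snowflake("MY_TABLE", conn_uri, data, "replace")
   ```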
   
   @joellubi I actually wanted to bring that up but held back to avoid adding complexity. First of all, no, I did not provide any ad-hoc parameters. However, I did try to do so in an attempt to fix the bug myself, and failed. Here are the two ways I tried:
   
   1.
   ```python
   import adbc_driver_snowflake.dbapi
   import pyarrow as pa
   import pyarrow.dataset as ds
   from typing import Literal

   def _arrow_to_snowflake(table: str, conn_uri: str, data: ds.Dataset | pa.RecordBatchReader,
                           mode: Literal['append', 'create', 'replace', 'create_append']):
       with adbc_driver_snowflake.dbapi.connect(
           conn_uri,
           autocommit=True,
           # Attempt 1: pass the target file size as a database option.
           db_kwargs={'adbc.snowflake.rpc.ingest_target_file_size': str(2**14)},
       ) as conn, conn.cursor() as cursor:
           cursor.adbc_ingest(table, data, mode)
   ```
   In this setting the data is still split into 4 parquet files; a small number of tables made it to Snowflake, while the rest still failed with OOM. So either this way of setting the parameter never takes effect (it is set but not picked up by the code), or it interferes in some way (which would explain why a few jobs succeeded) but does not stop the final data from being split into one file per processor. (See also the re-batching sketch after attempt 2.)
   
   2.
   ```python
   def _arrow_to_snowflake(table: str, conn_uri: str, data: ds.Dataset | pa.RecordBatchReader,
                           mode: Literal['append', 'create', 'replace', 'create_append']):
       with adbc_driver_snowflake.dbapi.connect(conn_uri, autocommit=True) as conn, conn.cursor() as cursor:
           # Attempt 2: set the option on the underlying AdbcStatement instead.
           cursor.adbc_statement.set_options(
               **{'adbc.snowflake.rpc.ingest_target_file_size': str(2**14)}
           )
           cursor.adbc_ingest(table, data, mode)
   ```
   This is in fact the documented way to tune ingestion via the AdbcStatement, but it raises NotImplementedError at set_options(), contradicting [the doc](https://arrow.apache.org/adbc/current/driver/snowflake.html#bulk-ingestion).
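   As a side note, a possible (and untested) workaround for the OOM would be to re-chunk the dataset into bounded-size record batches before ingesting, so that no single batch is oversized; the `batch_size` and helper name below are just an illustration:
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   def _bounded_reader(dataset: ds.Dataset, batch_size: int = 100_000) -> pa.RecordBatchReader:
       """Wrap a Dataset in a reader that yields batches of at most `batch_size` rows."""
       batches = dataset.to_batches(batch_size=batch_size)
       return pa.RecordBatchReader.from_batches(dataset.schema, batches)

   # e.g. cursor.adbc_ingest(table, _bounded_reader(data), mode)
   ```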

