bascheibler commented on issue #1283: URL: https://github.com/apache/arrow-adbc/issues/1283#issuecomment-1809155815
> The ADBC driver tries to buffer the parts of the dataset concurrently. You could try setting the options to limit the queue size and concurrency to cut down on memory usage. (We could/should also probably limit the overall buffer size based on memory usage, I suspect.) https://arrow.apache.org/adbc/current/driver/snowflake.html#performance

These `AdbcStatement` options really do seem to confirm it's a performance issue. Setting `prefetch_concurrency` to 1 significantly increased the number of tables that fail (~80%); before that, only about 25% were failing. If, on the other hand, I set this parameter to 75 and `result_queue_size` to 500, the error rate drops to between 5% and 10% (I'll keep playing with these parameters to see if I can get to 0%).

The refactored code is below. I had to use the lower-level API to be able to set the statement options.

```
import logging

import adbc_driver_manager
import adbc_driver_snowflake
import pyarrow


def export_table_low_level(schema_name, table_name):
    logging.debug(f"Starting download of {schema_name}.{table_name}")
    query = f"select * from {schema_name}.{table_name}"
    with adbc_driver_snowflake.connect(
        uri=snowflake_uri,  # defined elsewhere in the script
        db_kwargs={
            # Return fixed-point columns as int64/float64 instead of decimal128.
            "adbc.snowflake.sql.client_option.use_high_precision": "false"
        },
    ) as db:
        with adbc_driver_manager.AdbcConnection(db) as conn:
            with adbc_driver_manager.AdbcStatement(conn) as stmt:
                # Limit how many result-set chunks are fetched concurrently.
                stmt.set_options(
                    **{
                        adbc_driver_snowflake.StatementOptions.PREFETCH_CONCURRENCY.value: "1"
                    }
                )
                stmt.set_sql_query(query)
                stream, _ = stmt.execute_query()
                reader = pyarrow.RecordBatchReader._import_from_c(stream.address)
                table = reader.read_all()
                return table
```
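For reference, here is a minimal sketch of how both tuning options from the linked docs page could be set on the same statement, using the values that gave me the lowest error rate so far (75 and 500). This assumes the `StatementOptions` enum also exposes `RESULT_QUEUE_SIZE`; if it doesn't, the raw option key documented on that page (`adbc.rpc.result_queue_size`) should work in its place.

```
# Sketch only: set both performance options inside the
# `with ... as stmt:` block above.
stmt.set_options(
    **{
        adbc_driver_snowflake.StatementOptions.PREFETCH_CONCURRENCY.value: "75",
        # Assumed enum member; the documented raw key is
        # "adbc.rpc.result_queue_size".
        adbc_driver_snowflake.StatementOptions.RESULT_QUEUE_SIZE.value: "500",
    }
)
```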
