I am working on some improvements to bulk ingestion for the Snowflake ADBC
driver[1] and have been investigating existing implementations in related
libraries.

The current driver implementation defers to the gosnowflake library to
handle this. In Snowflake's implementation, uploads are buffered into
chunks of no more than 10 MB before being sent across the network. The
gosnowflake source attributes this limit to the JDBC specification[2]. I
wasn't able to find documentation for the limit, but it got me thinking
about the assumptions ADBC consumers might have about the resources a
driver will use. Should our implementation respect this 10 MB limit? If
not, is there a specific limit we should target?
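
For concreteness, here is a minimal Go sketch of that chunking pattern. It
is not the gosnowflake code; chunkLimit and the upload callback are
placeholders standing in for whatever request actually sends a chunk:

package main

import (
	"fmt"
	"io"
	"strings"
)

// chunkLimit mirrors the 10 MB cap described above.
const chunkLimit = 10 * 1024 * 1024

// uploadInChunks reads from r into a fixed-size buffer and hands each
// full (or final partial) buffer to upload, so no single request body
// exceeds the cap and memory stays bounded at one chunk.
func uploadInChunks(r io.Reader, upload func([]byte) error) error {
	buf := make([]byte, chunkLimit)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			if uerr := upload(buf[:n]); uerr != nil {
				return uerr
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // source exhausted
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// 25 MB of dummy data -> two full 10 MB chunks plus a 5 MB remainder.
	src := strings.NewReader(strings.Repeat("x", 25*1024*1024))
	i := 0
	_ = uploadInChunks(src, func(b []byte) error {
		i++
		fmt.Printf("chunk %d: %d bytes\n", i, len(b))
		return nil // placeholder: a real uploader would send the chunk here
	})
}

Whatever value we pick, peak memory for this pattern scales with the chunk
size, which is why I'd like to know whether consumers have expectations
here.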

Similarly, are there any expectations regarding storage usage? The
snowflake-connector-python write_pandas() implementation uses a different
approach, saving the dataframe to parquet files in a temp directory and
then uploading them[3]. We likely don't want to save all of the data to
disk before uploading, given the data sizes this API is intended to
handle, but even a chunked implementation could produce large files.
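
To make that trade-off concrete, here is a hedged Go sketch of the
file-staging approach. It is not the connector's code; the byte slices
stand in for serialized parquet chunks and upload is a placeholder:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// stageAndUpload writes each chunk to its own file under a temporary
// directory and uploads the files one at a time, removing each file after
// a successful upload so that at most one staged chunk sits on disk.
func stageAndUpload(chunks [][]byte, upload func(path string) error) error {
	dir, err := os.MkdirTemp("", "adbc-ingest-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir) // clean up the staging directory even on error

	for i, chunk := range chunks {
		path := filepath.Join(dir, fmt.Sprintf("chunk-%d.parquet", i))
		if err := os.WriteFile(path, chunk, 0o600); err != nil {
			return err
		}
		if err := upload(path); err != nil {
			return err
		}
		if err := os.Remove(path); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Placeholder data; real chunks would be parquet-encoded record batches.
	chunks := [][]byte{[]byte("parquet chunk 0"), []byte("parquet chunk 1")}
	_ = stageAndUpload(chunks, func(path string) error {
		fmt.Println("uploading", path)
		return nil // placeholder: a real uploader would PUT the file to a stage
	})
}

Even with per-file cleanup like this, how large any single staged file may
grow still comes down to whatever limit we settle on above.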

Is there a set of guidelines on limits for memory, storage, or other
resource usage for ADBC drivers? If not, should there be?

Thanks,
Joel Lubinitsky

[1] https://github.com/apache/arrow-adbc/issues/1327
[2] https://github.com/snowflakedb/gosnowflake/blob/master/bind_uploader.go#L21
[3] https://github.com/snowflakedb/snowflake-connector-python/blob/main/src/snowflake/connector/pandas_tools.py#L168
