I am working on some improvements to bulk ingestion for the Snowflake ADBC driver[1] and have been investigating existing implementations in related libraries.

The current driver implementation defers to the gosnowflake library to handle this. In Snowflake's implementation, uploads are buffered into chunks of no more than 10 MB before being sent across the network. They claim this limit comes from the JDBC specification[2]. I wasn't able to find documentation for this limit, but it got me thinking about the assumptions ADBC consumers might make about the resources a driver will use. Should our implementation respect this 10 MB limit? If not, is there a specific limit we should target?

Similarly, are there any expectations regarding storage usage? The snowflake-connector-python write_pandas() implementation takes a different approach, saving the dataframe to Parquet files in a temporary directory and then uploading them[3]. Given the size of data this API is intended to handle, we likely don't want to write everything to disk before uploading, but even a chunked implementation could produce large files.

Is there a set of guidelines on limits for memory, storage, or other resource usage for ADBC drivers? If not, should there be?

Thanks,
Joel Lubinitsky

[1] https://github.com/apache/arrow-adbc/issues/1327
[2] https://github.com/snowflakedb/gosnowflake/blob/master/bind_uploader.go#L21
[3] https://github.com/snowflakedb/snowflake-connector-python/blob/main/src/snowflake/connector/pandas_tools.py#L168
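
For concreteness, here is a rough Go sketch of the kind of chunked buffering described above (standard library only; uploadChunk and the 10 MB constant are illustrative placeholders, not the actual gosnowflake or ADBC driver API):

package main

import (
    "fmt"
    "io"
    "strings"
)

// maxChunkSize mirrors the 10 MB limit used by gosnowflake's bind uploader.
const maxChunkSize = 10 * 1024 * 1024

// uploadChunk stands in for whatever actually ships a chunk to the stage;
// here it just reports the chunk size.
func uploadChunk(n int, chunk []byte) error {
    fmt.Printf("uploading chunk %d: %d bytes\n", n, len(chunk))
    return nil
}

// uploadInChunks reads from r and uploads the data in pieces of at most
// maxChunkSize bytes, so only one chunk is ever buffered in memory.
func uploadInChunks(r io.Reader) error {
    buf := make([]byte, maxChunkSize)
    for n := 0; ; n++ {
        read, err := io.ReadFull(r, buf)
        if read > 0 {
            if uerr := uploadChunk(n, buf[:read]); uerr != nil {
                return uerr
            }
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            return nil
        }
        if err != nil {
            return err
        }
    }
}

func main() {
    // Simulate a 25 MB input stream; expect three chunks (10 + 10 + 5 MB).
    data := strings.NewReader(strings.Repeat("x", 25*1024*1024))
    if err := uploadInChunks(data); err != nil {
        fmt.Println("error:", err)
    }
}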
The current driver implementation defers to the gosnowflake library to handle this. In Snowflake's implementation, uploads are buffered into chunks no more than 10 MB in size before sending across the network. They claim this limit comes from the JDBC specification[2]. I wasn't able to find documentation for this limit, but it got me thinking about potential assumptions consumers of ADBC might have regarding resources that a driver would utilize. Should our implementation respect this 10 MB limit? If not, is there any specific limit we should target? Similarly, are there any expectations regarding storage usage? The snowflake-connector-python write_pandas() implementation uses a different approach, saving the dataframe to parquet files in a temp directory and then uploading them[3]. We likely don't want to save all data to disk before uploading given the size of data this API is intended to handle, but even a chunked implementation could produce large files. Is there a set of guidelines on limits for memory, storage, or other resource usage for ADBC drivers? If not, should there be? Thanks, Joel Lubinitsky [1] https://github.com/apache/arrow-adbc/issues/1327 [2] https://github.com/snowflakedb/gosnowflake/blob/master/bind_uploader.go#L21 [3] https://github.com/snowflakedb/snowflake-connector-python/blob/main/src/snowflake/connector/pandas_tools.py#L168