ryan-syed commented on issue #1332: URL: https://github.com/apache/arrow-adbc/issues/1332#issuecomment-1846191163
To elaborate: when a connection is made with the current driver without a database name, no `USE DATABASE` call is issued, so any query against `INFORMATION_SCHEMA` has to be prefixed with a database name. Without the prefix, as implemented currently, such queries fail with the following error: `Cannot perform SELECT. This session does not have a current database. Call 'USE DATABASE', or use a qualified name`.

We were able to reduce the time taken by the `GetObjects` call by removing the cursor implementation; however, this inadvertently relies on an initial database name being provided so that the `INFORMATION_SCHEMA` queries can be made. There is also a bug: `INFORMATION_SCHEMA.TABLES` without the database name prefix only returns tables in the initial database that is set, and the same applies to `INFORMATION_SCHEMA.COLUMNS`. I was looking at adding the cursor back so that the `INFORMATION_SCHEMA` queries carry the database name prefix; that would slow the calls down, but there are still other optimizations that can be made, as discussed in the [proposal](#proposal) below.
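As a rough illustration of the prefixing approach, here is a minimal sketch (the `quoteIdent` helper and the exact query text are illustrative, not the driver's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent is a placeholder for proper Snowflake identifier quoting:
// identifiers are double-quoted and embedded double quotes are doubled.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// qualifiedTablesQuery builds the INFORMATION_SCHEMA.TABLES query scoped to an
// explicit catalog, so it works even when the session has no current database.
func qualifiedTablesQuery(catalog string) string {
	return fmt.Sprintf(
		"SELECT table_catalog, table_schema, table_name, table_type FROM %s.INFORMATION_SCHEMA.TABLES",
		quoteIdent(catalog))
}

func main() {
	// Without the catalog prefix this query fails when no current database is set:
	// "Cannot perform SELECT. This session does not have a current database."
	fmt.Println(qualifiedTablesQuery("FOO_DB"))
}
```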
## A bit about the current GetObjects API:

It fetches the catalogs, schemas, tables, table constraints (currently not implemented for the Snowflake Go driver), and columns. It does this by making the following calls:

| SQL | Notes | Depth |
|-----|-------|-------|
| [SHOW TERSE DATABASES](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/snowflake/connection.go#L247) | Used to get the matching _**catalogNames**_; the driver does the filtering using [PatternsToRegexp](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/internal/shared_utils.go#L46) and [catalogPatternMatch](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/internal/shared_utils.go#L209C12-L209C12) | `ObjectDepthCatalogs` |
| [SELECT CATALOG_NAME, SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/snowflake/connection.go#L295C2-L295C2) | Used to get the **_catalogName_** and corresponding **_schemaNames_** | `ObjectDepthDBSchemas` |
| [SELECT table_catalog, table_schema, table_name, table_type FROM INFORMATION_SCHEMA.TABLES](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/snowflake/connection.go#L509C12-L509C101) | Used to get the tableName and tableType for the corresponding catalogs and schemas | `ObjectDepthTables` |
| [SELECT table_catalog, table_schema, table_name, column_name, ordinal_position, is_nullable::boolean, data_type, numeric_precision, numeric_precision_radix, numeric_scale, is_identity::boolean, identity_generation, identity_increment, character_maximum_length, character_octet_length, datetime_precision, comment FROM dbname.INFORMATION_SCHEMA.COLUMNS](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/snowflake/connection.go#L571C6-L577C63) | Used to get the column metadata | `ObjectDepthAll` or `ObjectDepthColumns` |

And the missing implementation:

| SQL | Notes |
|-----|-------|
| `SELECT * FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS` | Will be used to get the [table constraints](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/snowflake/connection.go#L177), i.e. primary keys and foreign keys |

## Contrast to ODBC:

The following analysis is based on the [SIMBA ODBC Snowflake driver](https://docs.snowflake.com/en/developer-guide/odbc/odbc-download), [ODBC Test](https://learn.microsoft.com/en-us/sql/odbc/odbc-test?view=sql-server-ver16), and QueryHistory for client-generated calls in Snowflake. I believe the `GetObjects` call is equivalent to the following ODBC APIs:

| ODBC API | SQL | Notes | Equivalent ADBC |
|----------|-----|-------|-----------------|
| [SQLTables](https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqltables-function?view=sql-server-ver16) | `show objects /* ODBC:TableMetadataSource */in account` | Gets the name, catalogName, schemaName, kind (VIEW or TABLE), and comment, and filters within the driver for the given catalog, schema, and table patterns | `GetObjects` with `ObjectDepthTables` |
| [SQLColumns](https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqlcolumns-function?view=sql-server-ver16) | `show /* ODBC:ColumnMetadataSource */ columns in database "foo_db"` or `show /* ODBC:ColumnMetadataSource */ columns in schema "foo_db"."bar_schema"` | Which call is issued depends on whether the schema name is a pattern | Differs from ADBC in that the catalogName for `SQLColumns` can't be a pattern; roughly equivalent to `GetObjects` with `ObjectDepthAll` or `ObjectDepthColumns` |
| [SQLPrimaryKeys](https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqlprimarykeys-function?view=sql-server-ver16) | `show primary keys /* ODBC:PrimaryKeysMetadataSource */ in table "foo_db"."bar_schema"` | Gets the primary keys | `GetObjects` with `ObjectDepthTables`, _though not currently implemented for the Snowflake Go driver_ |
| [SQLForeignKeys](https://learn.microsoft.com/en-us/sql/odbc/reference/syntax/sqlforeignkeys-function?view=sql-server-ver16) | `show exported keys /* ODBC:ForeignKeysMetadataSource */ in account` | Gets the foreign keys | `GetObjects` with `ObjectDepthTables`, _though not currently implemented for the Snowflake Go driver_ |

Therefore, the `GetObjects` API masquerades as a single API and performs the functionality of **_four_** ODBC APIs based on the depth provided. Because of this, if you only want the foreign keys, `GetObjects` still ends up making several other calls.

## Proposal:

Instead of making four separate calls, the design could be changed to make just one SQL call, for the highest depth requested, and infer all of the information for the shallower depths from it.

**Example:** For the worst-case scenario, with the depth as `ObjectDepthAll` or `ObjectDepthColumns`, we can make only the [`SELECT … FROM dbname.INFORMATION_SCHEMA.COLUMNS`](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/snowflake/connection.go#L571C6-L577C63) call listed above and derive the information for tables, schemas, and catalogs from the data returned by this single call.
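A minimal sketch of what that derivation could look like (the `columnRow` type and `buildHierarchy` helper are hypothetical, not the driver's code), assuming the rows come from the single `COLUMNS` query:

```go
package main

import "fmt"

// columnRow holds the subset of INFORMATION_SCHEMA.COLUMNS fields needed
// to reconstruct the catalog -> schema -> table hierarchy.
type columnRow struct {
	Catalog, Schema, Table, Column string
	Ordinal                        int
}

// buildHierarchy groups column rows into nested maps keyed by catalog,
// then schema, then table, so the shallower depths can be answered from
// one result set. Note: catalogs/schemas that contain no columns will not
// appear in a COLUMNS-only result.
func buildHierarchy(rows []columnRow) map[string]map[string]map[string][]columnRow {
	h := make(map[string]map[string]map[string][]columnRow)
	for _, r := range rows {
		schemas, ok := h[r.Catalog]
		if !ok {
			schemas = make(map[string]map[string][]columnRow)
			h[r.Catalog] = schemas
		}
		tables, ok := schemas[r.Schema]
		if !ok {
			tables = make(map[string][]columnRow)
			schemas[r.Schema] = tables
		}
		tables[r.Table] = append(tables[r.Table], r)
	}
	return h
}

func main() {
	rows := []columnRow{
		{"FOO_DB", "PUBLIC", "ORDERS", "ID", 1},
		{"FOO_DB", "PUBLIC", "ORDERS", "AMOUNT", 2},
		{"FOO_DB", "BAR_SCHEMA", "USERS", "ID", 1},
	}
	for catalog, schemas := range buildHierarchy(rows) {
		for schema, tables := range schemas {
			for table, cols := range tables {
				fmt.Printf("%s.%s.%s: %d column(s)\n", catalog, schema, table, len(cols))
			}
		}
	}
}
```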
Now, to decide on the underlying API, there are two ways we can go about this:

### Underlying API Proposal 1: Make `SHOW` calls similar to the ODBC driver

#### Pros:
1. `SHOW` calls are faster.
2. The command does not require a running [warehouse to execute](https://docs.snowflake.com/en/sql-reference/sql/show#:~:text=The%20command%20does%20not%20require%20a%20running%20warehouse%20to%20execute). Therefore, probably no cost is incurred for metadata calls.

#### Cons:
1. We need to filter (patterns) and sort within the driver (see the pattern-filtering sketch at the end of this comment).
   * Filtering is already done, though, using [PatternsToRegexp](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/internal/shared_utils.go#L46) and [catalogPatternMatch](https://github.com/apache/arrow-adbc/blob/cf3b5c0ec6279d2c3f8d48328cfe6dd92e250116/go/adbc/driver/internal/shared_utils.go#L209C12-L209C12).
2. The command returns a maximum of 10K records for the specified object type, as dictated by the access privileges of the role used to execute the command; any records above the [10K](https://docs.snowflake.com/en/sql-reference/sql/show#:~:text=The%20command%20returns%20a%20maximum%20of%2010K%20records%20for%20the%20specified%20object%20type%2C%20as%20dictated%20by%20the%20access%20privileges%20for%20the%20role%20used%20to%20execute%20the%20command%3B%20any%20records%20above%20the%2010K%20limit%20are%20not%20returned%2C%20even%20with%20a%20filter%20applied.) limit are not returned, even with a filter applied.
   * This may become a limitation when there are far too many objects in the database, and is something to be aware of.

### Underlying API Proposal 2: Continue with the `SELECT` calls

#### Pros:
1. Existing implementation.
2. No limit on the data returned.
   * Not sure why this is not an issue for the ODBC driver (probably a known limitation).

#### Cons:
1. `SELECT` calls are slower than `SHOW` calls, so metadata fetches will be slower than ODBC.
   * In the current API spec, ADBC calls will be slower than ODBC anyway, because `GetObjects` is equivalent to four ODBC APIs, as previously mentioned.

**_Please let me know your thoughts about the changes in design, and which API would be better between `SHOW` and `SELECT` for metadata calls._**

## Additional Questions:
1. What is the use case for the `GetObjects` API being so dense in terms of the data fetched?
2. Should we break it up into several APIs, similar to ODBC? I noticed another thread on this: https://github.com/apache/arrow-adbc/issues/621.
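For reference, the client-side filtering that Proposal 1 relies on could look roughly like the following: a sketch of converting a SQL `LIKE` pattern into a regular expression and applying it to names returned by `SHOW`. This is illustrative only and is not the existing `PatternsToRegexp` implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// likeToRegexp converts a SQL LIKE pattern (% and _) into an anchored,
// case-insensitive regular expression. Escape sequences are not handled here.
func likeToRegexp(pattern string) (*regexp.Regexp, error) {
	var b strings.Builder
	b.WriteString("(?i)^")
	for _, r := range pattern {
		switch r {
		case '%':
			b.WriteString(".*")
		case '_':
			b.WriteString(".")
		default:
			b.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	b.WriteString("$")
	return regexp.Compile(b.String())
}

func main() {
	re, err := likeToRegexp("FOO%_DB")
	if err != nil {
		panic(err)
	}
	// Filter names as they might come back from SHOW TERSE DATABASES.
	for _, name := range []string{"FOO_MY_DB", "FOODB", "BAR_DB"} {
		fmt.Println(name, re.MatchString(name))
	}
}
```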
