westonpace opened a new pull request, #13582: URL: https://github.com/apache/datafusion/pull/13582
## Which issue does this PR close? Closes #10339 ## Rationale for this change Many catalogs are remote (and/or disk based) and offer only asynchronous APIs. For example, [Polaris](https://github.com/apache/polaris), [Unity](https://github.com/unitycatalog/unitycatalog), and [Hive](https://hive.apache.org/). Integrating with this catalogs is impossible since something like `ctx.sql("SELECT * FROM db.schm.tbl")` first enters an async context (`sql`) then a synchronous context (calling the catalog provider to resolve `db`) and then we need to go into an asynchronous context to interact with the catalog and this async -> sync -> async path is generally forbidden. ## What changes are included in this PR? The heart of the change is making all non-trivial methods async in `CatalogProvider`, `SchemaProvider`, and `CatalogProviderList`. These changes had rather far-reaching ramifications discussed more below. ## Are these changes tested? Yes, in the sense that these traits were all tested previously. I did not add any new testing. ## Are there any user-facing changes? Yes, there are significant user-facing changes beyond the obvious change that these public traits are now async. ### Notable but expected The following methods are now async and were previously sync ``` SessionContext::register_catalog SessionContext::catalog_names SessionContext::catalog SessionContext::register_table SessionContext::deregister_table SessionContext::table_exist ``` ### Perhaps surprising The `SessionStateBuilder::build` method and `SessionStateBuilder::new_from_existing` methods are now async and the `From` impls to go between `SessionState` and `SessionStateBuilder` were removed. The `new_from_existing` change is because the method does a (now async) lookup into the existing catalog list (which may be remote) to determine if a default catalog exists so that it knows if it needs to create a new one or not. The `build` change is because, if no default catalog exists, then a default catalog is created and registered with the (potentially remote) catalog list. The `SessionContext::register_batch` and `SessionContext::register_table` methods are used frequently and it may make sense to think of them as synchronous since in-memory tables and batches are not typically thought of as something a "catalog" provides. It would be possible for `SessionContext` to have both "a catalog" (which is async) and "a collection of in-memory session-specific tables" (which is sync) and thus keep these methods synchronous. However, that felt like more complexity than justified by this change. ### Caveats In most cases I simply propagated the async. In some benchmarks I am using `now_or_never().expect()` to avoid the need to create a tokio runtime. In a few of the `SessionContext` factory functions, where a custom catalog cannot possibly be provided (e.g. `SessionContext::default`) I use `now_or_never` since it is safe to assume the default in-memory catalog is synchronous. ### Details for review The only non-trivial implementation changes were in `SessionState`. There is a `RwLock` there and it required some fiddling (and a few Arc clones of the catalog list) to avoid holding the lock across an await boundary. This could be potentially simplified by changing the `RwLock` to an async lock but I'm not sure that's justified yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org