Lately, I’ve been working on implementing the new SQL logical plans. I’m currently blocked on the plans that require table metadata operations. For example, CTAS will be implemented as a create table and a write using DSv2 (and a drop table if anything goes wrong). That requires something to expose the create and drop table actions: a table catalog.
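To make that concrete, here is a minimal Scala sketch of the CTAS flow described above. The TableCatalog and Table names, the createTable/dropTable signatures, and the write callback are placeholders I’m using for illustration (the real API is whatever gets agreed on below), but they show why the create and drop actions have to come from a catalog rather than from the source:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.catalyst.TableIdentifier
  import org.apache.spark.sql.types.StructType

  // Placeholder for a table handle; in the proposal this would expose the
  // table's DSv2 ReadSupport/WriteSupport.
  trait Table

  // Illustrative catalog interface with just the actions CTAS needs.
  trait TableCatalog {
    def createTable(
        ident: TableIdentifier,
        schema: StructType,
        properties: Map[String, String]): Table
    def dropTable(ident: TableIdentifier): Boolean
  }

  object CtasSketch {
    // CTAS as create + write + drop-on-failure. The write callback stands in
    // for the DSv2 write path against the returned Table.
    def createTableAsSelect(
        catalog: TableCatalog,
        ident: TableIdentifier,
        query: DataFrame,
        write: (Table, DataFrame) => Unit): Unit = {
      val table = catalog.createTable(ident, query.schema, Map.empty[String, String])
      try {
        write(table, query)
      } catch {
        case e: Throwable =>
          // roll back the metadata change if the write fails
          catalog.dropTable(ident)
          throw e
      }
    }
  }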
Initially, I opened #21306 <https://github.com/apache/spark/pull/21306> to get a table catalog from the data source, but that’s a bad idea because it conflicts with future multi-catalog support. Sources are an implementation of a read and write API that can be shared between catalogs. For example, you could have prod and test HMS catalogs that both use the Parquet source. The Parquet source shouldn’t determine whether a CTAS statement creates a table in prod or test. That means that CTAS and other plans for DataSourceV2 need a way to determine which catalog to use.

Proposal

I propose we add support for multiple catalogs now, in support of the DataSourceV2 work, to avoid hacky work-arounds.

First, I think we need to add catalog to TableIdentifier so tables are identified by catalog.db.table, not just db.table. This would make it easy to specify the intended catalog for SQL statements, like CREATE TABLE cat.db.table AS ..., and in the DataFrame API: df.write.saveAsTable("cat.db.table") or spark.table("cat.db.table").

Second, we will need an API for catalogs to implement. The SPIP on APIs for Table Metadata <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#> already proposed the API for create/alter/drop table operations. The only part that is missing is how to register catalogs, instead of using DataSourceV2 to instantiate them. I think we should configure catalogs through Spark config properties, like this:

  spark.sql.catalog.<name> = <impl-class>
  spark.sql.catalog.<name>.<property> = <value>

When a catalog is referenced by name, Spark would instantiate the specified class using a no-arg constructor. The instance would then be configured by passing a map of the remaining pairs in the spark.sql.catalog.<name>.* namespace to a configure method, with the namespace prefix removed and an extra “name” parameter set to the catalog name. This would support external sources like JDBC, which have common options like a driver, hostname, and port.

Backward-compatibility

The current spark.catalog / ExternalCatalog would be used when the catalog element of a TableIdentifier is left blank. That would provide backward-compatibility. We could optionally allow users to control the default table catalog with a property.

Relationship between catalogs and data sources

In the proposed table catalog API, actions return a Table object that exposes the DSv2 ReadSupport and WriteSupport traits. Table catalogs would share data source implementations by returning Table instances that use the correct data source. V2 sources would no longer need to be loaded by reflection; the catalog would be loaded instead.

Tables created using format("source") or USING source in SQL specify the data source implementation directly. This “format” should be passed to the source as a table property. The existing ExternalCatalog will need to implement the new TableCatalog API for v2 sources, and would continue to use that property to determine the table’s data source or format implementation. Other table catalog implementations would be free to interpret the format string as they choose, or to use it to choose a data source implementation as the default catalog does.

rb

--
Ryan Blue
Software Engineer
Netflix