Hi Wenchen, I’ll add my responses inline. The answers are based on the proposed TableCatalog API:

- SPIP: Spark table metadata <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>
- PR #21306 <https://github.com/apache/spark/pull/21306>

On Wed, Nov 28, 2018 at 6:41 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> Thanks for hosting the discussion! I think the table catalog is super
> useful, but since this is the first time we allow users to extend the
> catalog, it's better to write down some details, from end-user APIs to
> internal management.
>
> 1. How would end-users register/unregister catalogs with the SQL API and
> the Scala/Java API?

In the PR, users or administrators create catalogs by setting properties in the SQL conf. To create and configure a test catalog implemented by SomeCatalogClass, it looks like this:

  spark.sql.catalog.test = com.example.SomeCatalogClass
  spark.sql.catalog.test.config-var = value

For example, we have our own catalog, metacat, and we pass it a service URI and a property that tells it whether to use “prod” or “test” tables.
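These are plain SQL conf properties under the proposal, so (assuming that part of the design doesn’t change) they can be set the same way as any other conf, for example in spark-defaults.conf or when building a session. A small sketch, reusing the placeholder catalog name and class from above:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("catalog-config-example")
    // Register a catalog named "test" by naming its implementation class.
    .config("spark.sql.catalog.test", "com.example.SomeCatalogClass")
    // Properties under the catalog's prefix are passed to that catalog.
    .config("spark.sql.catalog.test.config-var", "value")
    .getOrCreate()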
> 2. How would end-users manage catalogs? like LIST CATALOGS, USE CATALOG xyz?

Users and administrators can configure catalogs using properties like I mentioned above. We could also implement SQL statements like the ones you describe here. Presto uses SHOW CATALOGS [LIKE prefix].

> 3. How to separate the abilities of a catalog? Can we create a bunch of
> mixin traits for the catalog API, like SupportsTable, SupportsFunction,
> SupportsView, etc.?

What I’ve proposed is a base class, CatalogProvider <https://github.com/apache/spark/pull/21306/files#diff-81c54123a7549b07a9d627353d9cbf95>, that all catalogs inherit from. A CatalogProvider can be loaded as I described above and is passed its configuration through an initialize method.

Catalog implementations would also implement interfaces that carry a set of methods for some task. What I’ve proposed is TableCatalog <https://github.com/apache/spark/pull/21306/files#diff-a06043294c1e2c49a34aa0356f9e5450>, which exposes the methods from the Table metadata APIs SPIP.

When a TableCatalog is used in a DDL statement like DROP TABLE, for example, an analysis rule matches the raw SQL plan, resolves/loads the catalog, and checks that it is a TableCatalog. Then it passes on a logical plan with the right catalog type:

  case class DropTable(catalog: TableCatalog, table: TableIdentifier, ifExists: Boolean) extends Command

> 4. How should Spark resolve identifiers with a catalog name? How to resolve
> ambiguity? What if the catalog doesn't support databases? Can users write
> `catalogName.tblName` directly?

In #21978 <https://github.com/apache/spark/pull/21978>, I proposed CatalogTableIdentifier, which carries a catalog, a database, and a table name. The easiest and safest answer is to fill in a “current” catalog when the catalog part is missing (just like the “current” database) and to always interpret a two-part identifier as database and table, never catalog and table. How Spark decides to do this is really orthogonal to the catalog API.

> 5. Where does Spark store the catalog list? In an in-memory map?

SparkSession tracks catalog instances. Each catalog is loaded once (unless we add some statement to reload it) and cached in the session. The session is also how the current global catalog is accessed. A sketch of what that loading could look like is below.

Another reason why catalogs are session-specific is that they can hold important session-specific state. For example, Iceberg’s catalog caches tables when they are loaded so that the same snapshot of a table is used for all reads in a query. Not all table formats support this, so it is optional.
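To make that session-scoped loading concrete, here is a rough sketch of how a session could lazily instantiate and cache the catalogs named in the conf. This is only an illustration, not code from the PR: the Catalogs helper is hypothetical, and CatalogProvider/initialize are simplified stand-ins for what PR #21306 proposes.

  import scala.collection.mutable
  import org.apache.spark.sql.internal.SQLConf

  // Simplified stand-in for the plugin interface proposed in PR #21306.
  trait CatalogProvider {
    def initialize(options: Map[String, String]): Unit
  }

  // Hypothetical per-session helper: each SparkSession would own one of these,
  // so catalog instances (and any state they cache) are scoped to the session.
  class Catalogs(conf: SQLConf) {
    private val loaded = mutable.HashMap.empty[String, CatalogProvider]

    def load(name: String): CatalogProvider = synchronized {
      loaded.getOrElseUpdate(name, {
        // spark.sql.catalog.<name> names the implementation class ...
        val className = conf.getConfString(s"spark.sql.catalog.$name")
        // ... and spark.sql.catalog.<name>.* properties are passed to initialize.
        val prefix = s"spark.sql.catalog.$name."
        val options = conf.getAllConfs
          .filter { case (key, _) => key.startsWith(prefix) }
          .map { case (key, value) => key.stripPrefix(prefix) -> value }
        val catalog = Class.forName(className)
          .getDeclaredConstructor()
          .newInstance()
          .asInstanceOf[CatalogProvider]
        catalog.initialize(options)
        catalog
      })
    }
  }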
> 6. How to support atomic CTAS?

The plan we’ve discussed is to create tables with “staged” changes (see the SPIP doc). When the write operation commits, all of the changes are committed at once. I’m flexible on this and I think we have room for other options as well. The current proposal only covers non-atomic CTAS.

> 7. The data/schema of a table may change over time, so when should Spark
> determine the table content? During analysis or planning?

Spark loads the table from a catalog during resolution rules, just like it does with the global catalog now.

> 8. ...
>
> Since the catalog API is not only developer-facing but also user-facing,
> I think it's better to have a doc explaining what the developers should
> care about and what the end users should care about. The doc is also good
> for future reference, and can be used in release notes.

If you think the SPIP that I posted to the list in April needs extra information, please let me know.

rb

-- 
Ryan Blue
Software Engineer
Netflix