Hi Wenchen, I’ll add my responses inline. The answers are based on the proposed TableCatalog API:

- SPIP: Spark table metadata <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d>
- PR #21306 <https://github.com/apache/spark/pull/21306>

On Wed, Nov 28, 2018 at 6:41 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> Thanks for hosting the discussion! I think the table catalog is super
> useful, but since this is the first time we allow users to extend the
> catalog, it's better to write down some details, from end-user APIs to
> internal management.
>
> 1. How would end-users register/unregister catalogs with the SQL API and
> the Scala/Java API?

In the PR, users or administrators create catalogs by setting properties in the SQL conf. To create and configure a test catalog implemented by SomeCatalogClass, it looks like this:

  spark.sql.catalog.test = com.example.SomeCatalogClass
  spark.sql.catalog.test.config-var = value

For example, we have our own catalog, metacat, and we pass it a service URI and a property that tells it whether to use “prod” or “test” tables.
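These are plain SQL conf properties under the proposal, so (assuming that part of the design doesn’t change) they can be set the same way as any other conf, for example in spark-defaults.conf or when building a session. A small sketch, reusing the placeholder catalog name and class from above:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("catalog-config-example")
    // Register a catalog named "test" by naming its implementation class.
    .config("spark.sql.catalog.test", "com.example.SomeCatalogClass")
    // Properties under the catalog's prefix are passed to that catalog.
    .config("spark.sql.catalog.test.config-var", "value")
    .getOrCreate()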
> 2. How would end-users manage catalogs? like LIST CATALOGS, USE CATALOG xyz?

Users and administrators can configure catalogs using properties like I mentioned above. We could also implement SQL statements like the ones you describe here. Presto uses SHOW CATALOGS [LIKE prefix].

> 3. How to separate the abilities of a catalog? Can we create a bunch of
> mixin traits for the catalog API, like SupportsTable, SupportsFunction,
> SupportsView, etc.?

What I’ve proposed is a base class, CatalogProvider <https://github.com/apache/spark/pull/21306/files#diff-81c54123a7549b07a9d627353d9cbf95>, that all catalogs inherit from. A CatalogProvider can be loaded as I described above and is passed its configuration through an initialize method.

Catalog implementations would also implement interfaces that carry a set of methods for some task. What I’ve proposed is TableCatalog <https://github.com/apache/spark/pull/21306/files#diff-a06043294c1e2c49a34aa0356f9e5450>, which exposes the methods from the Table metadata APIs SPIP.

When a TableCatalog is used in a DDL statement like DROP TABLE, for example, an analysis rule matches the raw SQL plan, resolves/loads the catalog, and checks that it is a TableCatalog. Then it passes on a logical plan with the right catalog type:

  case class DropTable(catalog: TableCatalog, table: TableIdentifier, ifExists: Boolean) extends Command

> 4. How should Spark resolve identifiers with a catalog name? How to resolve
> ambiguity? What if the catalog doesn't support databases? Can users write
> `catalogName.tblName` directly?

In #21978 <https://github.com/apache/spark/pull/21978>, I proposed CatalogTableIdentifier, which carries a catalog, a database, and a table name. The easiest and safest answer is to fill in a “current” catalog when the catalog part is missing (just like the “current” database) and to always interpret a two-part identifier as database and table, never catalog and table. How Spark decides to do this is really orthogonal to the catalog API.

> 5. Where does Spark store the catalog list? In an in-memory map?

SparkSession tracks catalog instances. Each catalog is loaded once (unless we add some statement to reload it) and cached in the session. The session is also how the current global catalog is accessed. A sketch of what that loading could look like is below.

Another reason why catalogs are session-specific is that they can hold important session-specific state. For example, Iceberg’s catalog caches tables when they are loaded so that the same snapshot of a table is used for all reads in a query. Not all table formats support this, so it is optional.
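To make that session-scoped loading concrete, here is a rough sketch of how a session could lazily instantiate and cache the catalogs named in the conf. This is only an illustration, not code from the PR: the Catalogs helper is hypothetical, and CatalogProvider/initialize are simplified stand-ins for what PR #21306 proposes.

  import scala.collection.mutable
  import org.apache.spark.sql.internal.SQLConf

  // Simplified stand-in for the plugin interface proposed in PR #21306.
  trait CatalogProvider {
    def initialize(options: Map[String, String]): Unit
  }

  // Hypothetical per-session helper: each SparkSession would own one of these,
  // so catalog instances (and any state they cache) are scoped to the session.
  class Catalogs(conf: SQLConf) {
    private val loaded = mutable.HashMap.empty[String, CatalogProvider]

    def load(name: String): CatalogProvider = synchronized {
      loaded.getOrElseUpdate(name, {
        // spark.sql.catalog.<name> names the implementation class ...
        val className = conf.getConfString(s"spark.sql.catalog.$name")
        // ... and spark.sql.catalog.<name>.* properties are passed to initialize.
        val prefix = s"spark.sql.catalog.$name."
        val options = conf.getAllConfs
          .filter { case (key, _) => key.startsWith(prefix) }
          .map { case (key, value) => key.stripPrefix(prefix) -> value }
        val catalog = Class.forName(className)
          .getDeclaredConstructor()
          .newInstance()
          .asInstanceOf[CatalogProvider]
        catalog.initialize(options)
        catalog
      })
    }
  }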
> 6. How to support atomic CTAS?

The plan we’ve discussed is to create tables with “staged” changes (see the SPIP doc). When the write operation commits, all of the changes are committed at once. I’m flexible on this and I think we have room for other options as well. The current proposal only covers non-atomic CTAS.

> 7. The data/schema of a table may change over time, so when should Spark
> determine the table content? During analysis or planning?

Spark loads the table from a catalog during resolution rules, just like it does with the global catalog now.

> 8. ...
>
> Since the catalog API is not only developer-facing but also user-facing,
> I think it's better to have a doc explaining what the developers should
> care about and what the end users should care about. The doc is also good
> for future reference, and can be used in release notes.

If you think the SPIP that I posted to the list in April needs extra information, please let me know.

rb

-- 
Ryan Blue
Software Engineer
Netflix