Lately, I’ve been working on implementing the new SQL logical plans. I’m
currently blocked on the plans that require table metadata operations. For
example, CTAS will be implemented as a create table and a
write using DSv2 (and a drop table if anything goes wrong). That requires
something to expose the create and drop table actions: a table catalog.

Initially, I opened #21306 <https://github.com/apache/spark/pull/21306> to
get a table catalog from the data source, but that’s a bad idea because it
conflicts with future multi-catalog support. Sources are an implementation
of a read and write API that can be shared between catalogs. For example,
you could have prod and test HMS catalogs that both use the Parquet source.
The Parquet source shouldn’t determine whether a CTAS statement creates a
table in prod or test.

That means that CTAS and the other DataSourceV2 plans need a way to
determine which catalog to use.

Proposal

I propose that we add support for multiple catalogs now to unblock the
DataSourceV2 work and avoid hacky work-arounds.

First, I think we need to add catalog to TableIdentifier so tables are
identified by catalog.db.table, not just db.table. This would make it easy
to specify the intended catalog for SQL statements, like CREATE TABLE
cat.db.table AS ..., and in the DataFrame API:
df.write.saveAsTable("cat.db.table") or spark.table("cat.db.table").
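
To make that concrete, here is a rough sketch of the identifier change. The
field name and default are illustrative, not a settled API:

// Hypothetical catalog-aware identifier; catalog defaults to None so that
// existing db.table references keep their current meaning.
case class TableIdentifier(
    table: String,
    database: Option[String] = None,
    catalog: Option[String] = None) {

  // Renders catalog.db.table, db.table, or just table, depending on what is set.
  def unquotedString: String =
    (catalog.toSeq ++ database.toSeq :+ table).mkString(".")
}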

Second, we will need an API for catalogs to implement. The SPIP on APIs for
Table Metadata
<https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#>
already proposed the API for create/alter/drop table operations. The only
missing piece is how catalogs are registered, rather than being instantiated
through DataSourceV2.
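
As a rough Scala sketch of the shape of that API (names and signatures are
simplified here; the SPIP is the real reference, and Table and TableChange
are placeholders so the sketch stands on its own):

import org.apache.spark.sql.types.StructType

// Placeholders; the real definitions would come from the SPIP.
trait Table
sealed trait TableChange

// Illustrative only: the catalog exposes the table metadata operations that
// the new logical plans (CTAS, DROP TABLE, ...) would call.
trait TableCatalog {
  def loadTable(ident: TableIdentifier): Table

  def createTable(
      ident: TableIdentifier,
      schema: StructType,
      properties: Map[String, String]): Table

  def alterTable(ident: TableIdentifier, changes: Seq[TableChange]): Table

  def dropTable(ident: TableIdentifier): Boolean
}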

I think we should configure catalogs through Spark config properties, like
this:

spark.sql.catalog.<name> = <impl-class>
spark.sql.catalog.<name>.<property> = <value>

When a catalog is referenced by name, Spark would instantiate the specified
class using a no-arg constructor. Spark would then configure the instance by
calling a configure method, passing the catalog name and a map of the
remaining pairs in the spark.sql.catalog.<name>.* namespace with the
namespace prefix stripped. This would support external sources like JDBC,
which have common options like driver, hostname, and port.
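
Here is a rough sketch of what that loading and configuration step could look
like. CatalogPlugin and initialize are placeholder names, not a proposal for
the final API:

// Hypothetical plugin interface: implementations need a no-arg constructor
// and are then configured with their name plus the scoped options.
trait CatalogPlugin {
  def initialize(name: String, options: Map[String, String]): Unit
}

// Sketch of resolving spark.sql.catalog.<name> settings into a live catalog.
def loadCatalog(name: String, conf: Map[String, String]): CatalogPlugin = {
  val implClass = conf.getOrElse(s"spark.sql.catalog.$name",
    throw new IllegalArgumentException(s"No catalog registered as $name"))

  // Collect the spark.sql.catalog.<name>.* pairs with the prefix removed.
  val prefix = s"spark.sql.catalog.$name."
  val options = conf.collect {
    case (key, value) if key.startsWith(prefix) =>
      key.stripPrefix(prefix) -> value
  }

  val catalog = Class.forName(implClass)
    .getConstructor().newInstance().asInstanceOf[CatalogPlugin]
  catalog.initialize(name, options)
  catalog
}

For a JDBC catalog, the scoped options map is where driver, hostname, and
port would end up.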

Backward-compatibility

The current spark.catalog / ExternalCatalog would be used when the catalog
element of a TableIdentifier is left blank. That would provide
backward-compatibility. We could optionally allow users to control the
default table catalog with a property.
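
For example, resolution could fall back like this (the property name
spark.sql.default-catalog is just a placeholder):

// Sketch only: unqualified identifiers resolve to a configurable default
// catalog, and otherwise to the existing session ExternalCatalog.
def catalogFor(ident: TableIdentifier, conf: Map[String, String]): String =
  ident.catalog
    .orElse(conf.get("spark.sql.default-catalog"))
    .getOrElse("session")  // the current spark.catalog / ExternalCatalog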

Relationship between catalogs and data sources

In the proposed table catalog API, actions return a Table object that
exposes the DSv2 ReadSupport and WriteSupport traits. Table catalogs would
share data source implementations by returning Table instances that use the
correct data source. V2 sources would no longer need to be loaded by
reflection; the catalog would be loaded instead.
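
Expanding the Table placeholder from the sketch above (again illustrative,
not a final API):

import org.apache.spark.sql.types.StructType

// The Table handle returned by a catalog carries the metadata, and concrete
// tables mix in the DSv2 read/write traits.
trait Table {
  def name: String
  def schema: StructType
  def properties: Map[String, String]
}

// For example, an HMS-backed catalog could return Parquet-backed tables:
//   class ParquetTable(...) extends Table with ReadSupport with WriteSupport
// so a plan gets metadata and read/write support from the same object.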

Tables created using format("source") or USING source in SQL specify the
data source implementation directly. This “format” should be passed to the
source as a table property. The existing ExternalCatalog will need to
implement the new TableCatalog API for v2 sources and would continue to use
the property to determine the table’s data source or format implementation.
Other table catalog implementations would be free to interpret the format
string however they choose, or to use it to select a data source
implementation just as the default catalog does.
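
For instance (the "provider" property key is an assumption, not an agreed
name):

// Sketch: CREATE TABLE t USING parquet, or df.write.format("parquet"),
// ends up recorded as an ordinary table property.
val tableProperties: Map[String, String] = Map("provider" -> "parquet")

// The session catalog would read the property back to pick the Parquet
// implementation; a JDBC catalog could interpret it differently or ignore it.
val format = tableProperties.getOrElse("provider", "default")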

rb
-- 
Ryan Blue
Software Engineer
Netflix
