Re: [DISCUSS] Identifiers with multi-catalog support
Thanks for reviewing this! I'll create an SPIP doc and issue for it and call a vote.

On Tue, Jan 22, 2019 at 11:41 AM Matt Cheah wrote:
> +1 for n-part namespace as proposed. Agree that a short SPIP would be appropriate for this. Perhaps also a JIRA ticket?
>
> -Matt Cheah
Re: [DISCUSS] Identifiers with multi-catalog support
+1 for n-part namespace as proposed. Agree that a short SPIP would be appropriate for this. Perhaps also a JIRA ticket?

-Matt Cheah

From: Felix Cheung
Date: Sunday, January 20, 2019 at 4:48 PM
To: "rb...@netflix.com", Spark Dev List <dev@spark.apache.org>
Subject: Re: [DISCUSS] Identifiers with multi-catalog support

+1 I like Ryan's last mail. Thank you for putting it clearly (should be a spec/SPIP!)

I agree with and understand the need for a 3-part id. However, I don't think we should assume that it must be, or can only be, 3 parts. Once the catalog is identified (i.e. the first part), the catalog should be responsible for resolving the namespace or schema, etc. Agree also that a path is a good idea to add, to support the file-based variant. Should the separator be optional (perhaps in *space*) to keep this extensible (it might not always be '.')?

Also, this whole scheme will need to play nice with column identifiers as well.
Re: [DISCUSS] Identifiers with multi-catalog support
+1 I like Ryan's last mail. Thank you for putting it clearly (should be a spec/SPIP!)

I agree with and understand the need for a 3-part id. However, I don't think we should assume that it must be, or can only be, 3 parts. Once the catalog is identified (i.e. the first part), the catalog should be responsible for resolving the namespace or schema, etc. Agree also that a path is a good idea to add, to support the file-based variant. Should the separator be optional (perhaps in *space*) to keep this extensible (it might not always be '.')?

Also, this whole scheme will need to play nice with column identifiers as well.

From: Ryan Blue
Sent: Thursday, January 17, 2019 11:38 AM
To: Spark Dev List
Subject: Re: [DISCUSS] Identifiers with multi-catalog support

Any discussion on how Spark should manage identifiers when multiple catalogs are supported?

I know this is an area where a lot of people are interested in making progress, and it is a blocker for both multi-catalog support and CTAS in DSv2.
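One way to read Felix's separator question, sketched below with hypothetical names (this is an illustration, not anything proposed in the thread): if Spark always hands the catalog the raw identifier parts and the catalog alone decides how to join them, then '.' is never baked into the contract.

object SeparatorSketch {
  // Hypothetical: the plugin receives parts, never a pre-joined string.
  trait ExternalCatalog {
    def nativeName(parts: Seq[String]): String
  }

  // A file-based catalog might join with '/' rather than '.'.
  object PathCatalog extends ExternalCatalog {
    def nativeName(parts: Seq[String]): String = parts.mkString("/")
  }

  def main(args: Array[String]): Unit =
    println(PathCatalog.nativeName(Seq("warehouse", "sales", "events"))) // warehouse/sales/events
}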
Re: [DISCUSS] Identifiers with multi-catalog support
Any discussion on how Spark should manage identifiers when multiple catalogs are supported?

I know this is an area where a lot of people are interested in making progress, and it is a blocker for both multi-catalog support and CTAS in DSv2.
Re: [DISCUSS] Identifiers with multi-catalog support
I think that the solution to this problem is to mix the two approaches by supporting 3 identifier parts: catalog, namespace, and name, where namespace can be an n-part identifier:

type Namespace = Seq[String]
case class CatalogIdentifier(space: Namespace, name: String)

This allows catalogs to work with the hierarchy of the external store, but the catalog API only requires a few discovery methods to list namespaces and to list each type of object in a namespace.

def listNamespaces(): Seq[Namespace]
def listNamespaces(space: Namespace, prefix: String): Seq[Namespace]
def listTables(space: Namespace): Seq[CatalogIdentifier]
def listViews(space: Namespace): Seq[CatalogIdentifier]
def listFunctions(space: Namespace): Seq[CatalogIdentifier]

The methods to list tables, views, or functions would only return identifiers for the type queried, not namespaces or the other objects.

The SQL parser would be updated so that identifiers are parsed to UnresolvedIdentifier(parts: Seq[String]), and resolution would work like this pseudo-code:

def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, CatalogIdentifier) = {
  val maybeCatalog = sparkSession.catalog(ident.parts.head)
  ident.parts match {
    case Seq(catalogName, *space, name) if maybeCatalog.isDefined =>
      (maybeCatalog.get, CatalogIdentifier(space, name))
    case Seq(*space, name) =>
      (sparkSession.defaultCatalog, CatalogIdentifier(space, name))
  }
}

I think this is a good approach because it allows Spark users to reference or discover any name in the hierarchy of an external store, it uses a few well-defined methods for discovery, and it makes name hierarchy a user concern.

- SHOW (DATABASES|SCHEMAS|NAMESPACES) would return the result of listNamespaces()
- SHOW NAMESPACES LIKE a.b% would return the result of listNamespaces(Seq("a"), "b")
- USE a.b would set the current namespace to Seq("a", "b")
- SHOW TABLES would return the result of listTables(currentNamespace)

Also, I think that we could generalize this a little more to support path-based tables by adding a path to CatalogIdentifier, either as a namespace or as a separate optional string. Then, the identifier passed to a catalog would work for either a path-based table or a catalog table, without needing a path-based catalog API.

Thoughts?
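To make the pseudo-code above concrete, here is a self-contained Scala sketch of the proposed resolution. The catalog registry, the CatalogPlugin stand-in, and the example names are hypothetical, not Spark's actual API:

object ResolutionSketch {
  type Namespace = Seq[String]
  case class CatalogIdentifier(space: Namespace, name: String)
  case class UnresolvedIdentifier(parts: Seq[String])

  trait CatalogPlugin { def name: String }
  case class NamedCatalog(name: String) extends CatalogPlugin

  // Hypothetical stand-ins for the session's catalog registry and default catalog.
  val catalogs: Map[String, CatalogPlugin] = Map("prod" -> NamedCatalog("prod"))
  val defaultCatalog: CatalogPlugin = NamedCatalog("default")

  def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, CatalogIdentifier) = {
    require(ident.parts.nonEmpty, "empty identifier")
    catalogs.get(ident.parts.head) match {
      // First part names a registered catalog: the rest is namespace + name.
      case Some(catalog) if ident.parts.size > 1 =>
        val rest = ident.parts.tail
        (catalog, CatalogIdentifier(rest.init, rest.last))
      // Otherwise the whole identifier resolves against the default catalog.
      case _ =>
        (defaultCatalog, CatalogIdentifier(ident.parts.init, ident.parts.last))
    }
  }

  def main(args: Array[String]): Unit = {
    // prod.sales.raw.events -> (prod, CatalogIdentifier(List(sales, raw), events))
    println(resolveIdentifier(UnresolvedIdentifier(Seq("prod", "sales", "raw", "events"))))
    // db.table, with no catalog named "db" -> default catalog, namespace List(db)
    println(resolveIdentifier(UnresolvedIdentifier(Seq("db", "table"))))
  }
}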
Re: [DISCUSS] Identifiers with multi-catalog support
Thanks for writing this up. Just to show why option 1 is not sufficient: MySQL and Postgres are the two most popular open source database systems, and both support database → schema → table 3-part identification, so Spark supporting only 2-part name passing to the data source (option 1) isn't sufficient.

For the issues you brought up w.r.t. nesting - what's the challenge in supporting it? I can also see us not supporting it for now (no nesting allowed; the leaf - 1 level can only contain leaf tables), and adding support for nesting in the future.
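As a concrete illustration of that point (all names below are made up), a Postgres-style hierarchy maps onto the n-part proposal without any extra catalog configuration:

object PostgresMapping {
  type Namespace = Seq[String]
  case class CatalogIdentifier(space: Namespace, name: String)

  // postgres_prod.salesdb.public.orders: Spark resolves "postgres_prod" as the
  // catalog; the remaining parts pass through to the source unchanged.
  val orders = CatalogIdentifier(space = Seq("salesdb", "public"), name = "orders")

  // Under the "no nesting for now" restriction, listNamespaces(orders.space)
  // would return Nil: a namespace that contains tables contains nothing else.
}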
[DISCUSS] Identifiers with multi-catalog support
In the DSv2 sync up, we tried to discuss the Table metadata proposal but were side-tracked on its use of TableIdentifier. There were good points about how Spark should identify tables, views, functions, etc., and I want to start a discussion here.

Identifiers are orthogonal to the TableCatalog proposal, which can be updated to use whatever identifier class we choose. That proposal is concerned with what information should be passed to define a table, and how to pass that information.

The main question for *this* discussion is: *how should Spark identify tables, views, and functions when it supports multiple catalogs?*

There are two main approaches:

1. Use a 3-part identifier, catalog.database.table
2. Use an identifier with an arbitrary number of parts

*Option 1: use 3-part identifiers*

The argument for option #1 is that it is simple. If an external data store has additional logical hierarchy layers, then that hierarchy would be mapped to multiple catalogs in Spark. Spark can support show tables and show databases without much trouble. This is the approach used by Presto, so there is some precedent for it.

The drawback is that mapping a more complex hierarchy into Spark requires more configuration. If an external DB has a 3-level hierarchy — say, for example, schema.database.table — then option #1 requires users to configure a catalog for each top-level structure, each schema. When a new schema is added, it is not automatically accessible.

Catalog implementations could use session options to provide a rough work-around by changing the plugin's "current" schema. I think this is an anti-pattern, so another strike against this option is that it encourages bad practices.

*Option 2: use n-part identifiers*

That drawback for option #1 is the main argument for option #2: Spark should allow users to easily interact with the native structure of an external store. For option #2, a full identifier would be an arbitrary-length list of identifiers. For the example above, using catalog.schema.database.table is allowed. An identifier would be something like this:

case class CatalogIdentifier(parts: Seq[String])

The problem with option #2 is how to implement a listing and discovery API for operations like SHOW TABLES. If the catalog API requires a list(ident: CatalogIdentifier), what does it return? There is no guarantee that the listed objects are tables and not nested namespaces. How would Spark handle arbitrary nesting that differs across catalogs?

Hopefully, I've captured the design question well enough for a productive discussion. Thanks!

rb

--
Ryan Blue
Software Engineer
Netflix
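For reference, a minimal Scala sketch (hypothetical types, nothing from the proposal) of why listing is awkward under option #2: every listed entry needs a type tag, and SHOW TABLES must filter and potentially recurse with no bound on depth.

object FlatListingSketch {
  case class CatalogIdentifier(parts: Seq[String])

  // Without separate listTables/listNamespaces methods, a single list() has to
  // tag every entry, and callers must branch on the tag.
  sealed trait ListedObject
  case class ListedTable(ident: CatalogIdentifier) extends ListedObject
  case class ListedNamespace(ident: CatalogIdentifier) extends ListedObject

  trait FlatCatalog {
    def list(ident: CatalogIdentifier): Seq[ListedObject]
  }

  // SHOW TABLES under a namespace must filter out nested namespaces, and it
  // cannot know how deep tables might appear without recursing.
  def showTables(catalog: FlatCatalog, space: CatalogIdentifier): Seq[CatalogIdentifier] =
    catalog.list(space).collect { case ListedTable(ident) => ident }
}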