Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-22 Thread Ryan Blue
Thanks for reviewing this! I'll create an SPIP doc and issue for it and
call a vote.

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-22 Thread Matt Cheah
+1 for n-part namespace as proposed. Agree that a short SPIP would be
appropriate for this. Perhaps also a JIRA ticket?

-Matt Cheah

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-20 Thread Felix Cheung
+1, I like Ryan's last mail. Thank you for putting it clearly (it should be a
spec/SPIP!)

I agree with and understand the need for a 3-part id. However, I don't think we
should assume that it must be, or can only be, 3 parts. Once the catalog is
identified (i.e., the first part), the catalog should be responsible for
resolving the namespace, schema, etc. I also agree that a path is a good idea to
add to support the file-based variant. Should the separator be optional (perhaps
in *space) to keep this extensible? It might not always be '.'.

Also, this whole scheme will need to play nicely with column identifiers as well.

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-17 Thread Ryan Blue
Any discussion on how Spark should manage identifiers when multiple
catalogs are supported?

I know this is an area where a lot of people are interested in making
progress, and it is a blocker for both multi-catalog support and CTAS in
DSv2.


Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-13 Thread Ryan Blue
I think that the solution to this problem is to mix the two approaches by
supporting 3 identifier parts: catalog, namespace, and name, where
namespace can be an n-part identifier:

type Namespace = Seq[String]
case class CatalogIdentifier(space: Namespace, name: String)

This allows catalogs to work with the hierarchy of the external store, but
the catalog API only requires a few discovery methods to list namespaces
and to list each type of object in a namespace.

def listNamespaces(): Seq[Namespace]
def listNamespaces(space: Namespace, prefix: String): Seq[Namespace]
def listTables(space: Namespace): Seq[CatalogIdentifier]
def listViews(space: Namespace): Seq[CatalogIdentifier]
def listFunctions(space: Namespace): Seq[CatalogIdentifier]

The methods to list tables, views, or functions would only return identifiers
for the type queried, not namespaces or the other objects.
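
As a rough illustration (the namespaces and table names below are made up, and
cat stands for some configured catalog plugin), the discovery calls might look
like:

cat.listNamespaces()                  // Seq(Seq("us"), Seq("eu"))
cat.listNamespaces(Seq("us"), "sa")   // Seq(Seq("us", "sales"))
cat.listTables(Seq("us", "sales"))    // Seq(CatalogIdentifier(Seq("us", "sales"), "orders"),
                                      //     CatalogIdentifier(Seq("us", "sales"), "returns"))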

The SQL parser would be updated so that identifiers are parsed to
UnresolvedIdentifier(parts: Seq[String]), and resolution would work like this
pseudo-code:

def resolveIdentifier(ident: UnresolvedIdentifier): (CatalogPlugin, CatalogIdentifier) = {
  // Look up the first identifier part as a catalog name; None if no catalog
  // with that name is configured.
  val maybeCatalog = sparkSession.catalog(ident.parts.head)
  ident.parts match {
    // The first part names a configured catalog: the rest is namespace + name.
    case Seq(_, rest @ _*) if maybeCatalog.isDefined && rest.nonEmpty =>
      (maybeCatalog.get, CatalogIdentifier(rest.init, rest.last))
    // Otherwise, resolve the whole identifier against the default catalog.
    case parts =>
      (sparkSession.defaultCatalog, CatalogIdentifier(parts.init, parts.last))
  }
}
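
For example (assuming a catalog named "prod" is registered and "db" is not the
name of any catalog), resolution would behave like:

resolveIdentifier(UnresolvedIdentifier(Seq("prod", "db", "events")))
//   => (the "prod" catalog, CatalogIdentifier(Seq("db"), "events"))
resolveIdentifier(UnresolvedIdentifier(Seq("db", "events")))
//   => (the default catalog, CatalogIdentifier(Seq("db"), "events"))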

I think this is a good approach because it allows Spark users to reference or
discover any name in the hierarchy of an external store, it uses a few
well-defined methods for discovery, and it makes name hierarchy a user concern.

   - SHOW (DATABASES|SCHEMAS|NAMESPACES) would return the result of listNamespaces()
   - SHOW NAMESPACES LIKE a.b% would return the result of listNamespaces(Seq("a"), "b")
   - USE a.b would set the current namespace to Seq("a", "b")
   - SHOW TABLES would return the result of listTables(currentNamespace)

Also, I think that we could generalize this a little more to support
path-based tables by adding a path to CatalogIdentifier, either as a
namespace or as a separate optional string. Then, the identifier passed to
a catalog would work for either a path-based table or a catalog table,
without needing a path-based catalog API.
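
One possible shape for that, just as a sketch (the path field and its name are
not part of the proposal above):

case class CatalogIdentifier(
    space: Namespace,
    name: String,
    path: Option[String] = None)

// a catalog table:     CatalogIdentifier(Seq("prod", "db"), "events")
// a path-based table:  CatalogIdentifier(Seq.empty, "events", Some("s3://bucket/warehouse/events"))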

Thoughts?


Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-13 Thread Reynold Xin
Thanks for writing this up. Just to show why option 1 is not sufficient: MySQL
and Postgres are the two most popular open source database systems, and both
support 3-part identification (database → schema → table), so Spark passing only
a 2-part name to the data source (option 1) isn't sufficient.
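
As a concrete illustration (the names below are made up), a fully qualified
Postgres table needs all three of its own parts to reach the source:

// Postgres object:  sales_db.public.orders   (database.schema.table)
// Option 1, catalog.database.table: after the catalog part is consumed, only
//   two parts ("public", "orders") can be passed on, so the source's own
//   database level cannot be expressed without a catalog per database.
// Option 2, n-part: Seq("pg", "sales_db", "public", "orders") keeps every level.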

For the issues you brought up w.r.t. nesting: what's the challenge in supporting
it? I can also see us not supporting it for now (no nesting allowed; the level
directly above the leaves can only contain leaf tables), and adding support for
nesting in the future.


[DISCUSS] Identifiers with multi-catalog support

2019-01-13 Thread Ryan Blue
In the DSv2 sync up, we tried to discuss the Table metadata proposal but
were side-tracked on its use of TableIdentifier. There were good points
about how Spark should identify tables, views, functions, etc, and I want
to start a discussion here.

Identifiers are orthogonal to the TableCatalog proposal, which can be updated to
use whatever identifier class we choose. That proposal is concerned with what
information should be passed to define a table, and how to pass that
information.

The main question for *this* discussion is: *how should Spark identify
tables, views, and functions when it supports multiple catalogs?*

There are two main approaches:

   1. Use a 3-part identifier, catalog.database.table
   2. Use an identifier with an arbitrary number of parts

*Option 1: use 3-part identifiers*

The argument for option #1 is that it is simple. If an external data store
has additional logical hierarchy layers, then that hierarchy would be
mapped to multiple catalogs in Spark. Spark can support show tables and
show databases without much trouble. This is the approach used by Presto,
so there is some precedent for it.

The drawback is that mapping a more complex hierarchy into Spark requires
more configuration. If an external DB has a 3-level hierarchy — say, for
example, schema.database.table — then option #1 requires users to configure
a catalog for each top-level structure, each schema. When a new schema is
added, it is not automatically accessible.

Catalog implementations could use session options to provide a rough
work-around by changing the plugin’s “current” schema. I think this is an
anti-pattern, so another strike against this option is that it encourages
bad practices.

*Option 2: use n-part identifiers*

That drawback of option #1 is the main argument for option #2: Spark
should allow users to easily interact with the native structure of an
external store. For option #2, a full identifier would be an
arbitrary-length list of identifiers. For the example above, using
catalog.schema.database.table is allowed. An identifier would be something
like this:

case class CatalogIdentifier(parts: Seq[String])

The problem with option #2 is how to implement a listing and discovery API,
for operations like SHOW TABLES. If the catalog API requires a list(ident:
CatalogIdentifier), what does it return? There is no guarantee that the
listed objects are tables and not nested namespaces. How would Spark handle
arbitrary nesting that differs across catalogs?
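
A minimal sketch of the problem (the list method here is hypothetical, only to
illustrate the question):

// def list(ident: CatalogIdentifier): Seq[CatalogIdentifier]
//
// list(CatalogIdentifier(Seq("a"))) might return
//   Seq(CatalogIdentifier(Seq("a", "b")), CatalogIdentifier(Seq("a", "t1")))
// where a.b is a nested namespace and a.t1 is a table; from the result alone,
// SHOW TABLES cannot tell which is which.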

Hopefully, I’ve captured the design question well enough for a productive
discussion. Thanks!

rb
-- 
Ryan Blue
Software Engineer
Netflix