Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
I don’t think Spark ever claims to be 100% Hive compatible.

I just found some relevant documentation

on this, where Databricks claims that “Apache Spark SQL in Databricks is
designed to be compatible with the Apache Hive, including metastore
connectivity, SerDes, and UDFs.”

There is a strong expectation that Spark is compatible with Hive, and the
Spark community has made the claim. That isn’t a claim of 100%
compatibility, but no one suggested that there was 100% compatibility.

On Wed, Oct 7, 2020 at 11:54 AM Ryan Blue  wrote:

> I don’t think Spark ever claims to be 100% Hive compatible.
>
> By accepting the EXTERNAL keyword in some circumstances, Spark is
> providing compatibility with Hive DDL. Yes, there are places where it
> breaks. The question is whether we should deliberately break what a Hive
> catalog could implement, when we know what Hive’s behavior is.
>
> CREATE EXTERNAL TABLE is not a Hive-specific feature
>
> Great. So there are other catalogs that could use it. Why should Spark
> choose to limit Hive’s interpretation of this keyword?
>
> While it is great that we seem to agree that Spark shouldn’t do this — now
> that Nessie was pointed out — I’m concerned that you still seem to think
> this is a choice that Spark could reasonably make. *Spark cannot
> arbitrarily choose how to interpret DDL for an external catalog*.
>
> You may not consider this arbitrary because there are other examples where
> location is required. But the Hive community made the choice that these
> clauses are orthogonal, so it is clearly a choice of the external system,
> and it is not Spark’s role to dictate how an external database should
> behave.
>
> I think the Nessie use case is good enough to justify the decoupling of
> EXTERNAL and LOCATION.
>
> It appears that we have consensus. This will be passed to catalogs, which
> can implement the behavior that they choose.
>
> On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan  wrote:
>
>> > I have some hive queries that I want to run on Spark.
>>
>> Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
>> LOCATION can't help you too much here. If you do have this use case, we
>> need a much wider discussion about how to achieve it.
>>
>> For this particular topic, we need concrete use cases like Nessie
>> . It will be great to see more
>> concrete use cases, but I think the Nessie use case is good enough to
>> justify the decoupling of EXTERNAL and LOCATION.
>>
>> BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
>> have it. That's why I think Hive-compatibility alone is not a reasonable
>> use case. For your reference:
>> 1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
>> clause as Spark does: doc
>> 
>> 2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION
>> clause as Spark does: doc
>> 
>> 3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or
>> FILE_NAME option: doc
>> 
>> 4. SQL Server also supports CREATE EXTERNAL TABLE: doc
>> 
>>
>> > with which Spark claims to be compatible
>>
>> I don't think Spark ever claims to be 100% Hive compatible. In fact, we
>> diverged from Hive intentionally in several places, where we think the Hive
>> behavior was not reasonable and we shouldn't follow it.
>>
>> On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue  wrote:
>>
>>> how about LOCATION without EXTERNAL? Currently Spark treats it as an
>>> external table.
>>>
>>> I think there is some confusion about what Spark has to handle.
>>> Regardless of what Spark allows as DDL, these tables can exist in a Hive
>>> MetaStore that Spark connects to, and the general expectation is that Spark
>>> doesn’t change the meaning of table configuration. There are notable bugs
>>> where Spark has different behavior, but that is the expectation.
>>>
>>> In this particular case, we’re talking about what can be expressed in
>>> DDL that is sent to an external catalog. Spark could (unwisely) choose to
>>> disallow some DDL combinations, but the table is implemented through a
>>> plugin so the interpretation is up to the plugin. Spark has no role in
>>> choosing how to treat this table, unless it is loaded through Spark’s
>>> built-in catalog; in which case, see above.
>>>
>>> I don’t think Hive compatibility itself is a “use case”.
>>>
>>> Why?
>>>
>>> Hive is an external database that defines its own behavior and with
>>> which Spark claims to be compatible. If Hive isn’t a valid use case, 

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
I don’t think Spark ever claims to be 100% Hive compatible.

By accepting the EXTERNAL keyword in some circumstances, Spark is providing
compatibility with Hive DDL. Yes, there are places where it breaks. The
question is whether we should deliberately break what a Hive catalog could
implement, when we know what Hive’s behavior is.

CREATE EXTERNAL TABLE is not a Hive-specific feature

Great. So there are other catalogs that could use it. Why should Spark
choose to limit Hive’s interpretation of this keyword?

While it is great that we seem to agree that Spark shouldn’t do this — now
that Nessie was pointed out — I’m concerned that you still seem to think
this is a choice that Spark could reasonably make. *Spark cannot
arbitrarily choose how to interpret DDL for an external catalog*.

You may not consider this arbitrary because there are other examples where
location is required. But the Hive community made the choice that these
clauses are orthogonal, so it is clearly a choice of the external system,
and it is not Spark’s role to dictate how an external database should
behave.

I think the Nessie use case is good enough to justify the decoupling of
EXTERNAL and LOCATION.

It appears that we have consensus. This will be passed to catalogs, which
can implement the behavior that they choose.

On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan  wrote:

> > I have some hive queries that I want to run on Spark.
>
> Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
> LOCATION can't help you too much here. If you do have this use case, we
> need a much wider discussion about how to achieve it.
>
> For this particular topic, we need concrete use cases like Nessie
> . It will be great to see more
> concrete use cases, but I think the Nessie use case is good enough to
> justify the decoupling of EXTERNAL and LOCATION.
>
> BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
> have it. That's why I think Hive-compatibility alone is not a reasonable
> use case. For your reference:
> 1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
> clause as Spark does: doc
> 
> 2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION
> clause as Spark does: doc
> 
> 3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or FILE_NAME
> option: doc
> 
> 4. SQL Server also supports CREATE EXTERNAL TABLE: doc
> 
>
> > with which Spark claims to be compatible
>
> I don't think Spark ever claims to be 100% Hive compatible. In fact, we
> diverged from Hive intentionally in several places, where we think the Hive
> behavior was not reasonable and we shouldn't follow it.
>
> On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue  wrote:
>
>> how about LOCATION without EXTERNAL? Currently Spark treats it as an
>> external table.
>>
>> I think there is some confusion about what Spark has to handle.
>> Regardless of what Spark allows as DDL, these tables can exist in a Hive
>> MetaStore that Spark connects to, and the general expectation is that Spark
>> doesn’t change the meaning of table configuration. There are notable bugs
>> where Spark has different behavior, but that is the expectation.
>>
>> In this particular case, we’re talking about what can be expressed in DDL
>> that is sent to an external catalog. Spark could (unwisely) choose to
>> disallow some DDL combinations, but the table is implemented through a
>> plugin so the interpretation is up to the plugin. Spark has no role in
>> choosing how to treat this table, unless it is loaded through Spark’s
>> built-in catalog; in which case, see above.
>>
>> I don’t think Hive compatibility itself is a “use case”.
>>
>> Why?
>>
>> Hive is an external database that defines its own behavior and with which
>> Spark claims to be compatible. If Hive isn’t a valid use case, then why is
>> EXTERNAL supported at all?
>>
>> On Wed, Oct 7, 2020 at 10:17 AM Holden Karau 
>> wrote:
>>
>>>
>>>
>>> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan  wrote:
>>>
 I don't think Hive compatibility itself is a "use case".

>>> Ok let's add on top of this: I have some hive queries that I want to run
>>> on Spark. I believe that makes it a use case.
>>>
 The Nessie  example you
 mentioned is a reasonable use case to me: some frameworks/applications want
 to create external tables without user-specified location, so that they can
 manage the table directory themselves and implement fancy features.

 That said, now I agree it's better to decouple EXTERNAL and 

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
> I have some hive queries that I want to run on Spark.

Spark is not compatible with Hive in many places. Decoupling EXTERNAL and
LOCATION can't help you too much here. If you do have this use case, we
need a much wider discussion about how to achieve it.

For this particular topic, we need concrete use cases like Nessie
. It will be great to see more
concrete use cases, but I think the Nessie use case is good enough to
justify the decoupling of EXTERNAL and LOCATION.

BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases
have it. That's why I think Hive-compatibility alone is not a reasonable
use case. For your reference:
1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION
clause as Spark does: doc

2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION clause
as Spark does: doc

3. Db2 supports CREATE EXTERNAL TABLE and requires DATAOBJECT or FILE_NAME
option: doc

4. SQL Server also supports CREATE EXTERNAL TABLE: doc


> with which Spark claims to be compatible

I don't think Spark ever claims to be 100% Hive compatible. In fact, we
diverged from Hive intentionally in several places, where we think the Hive
behavior was not reasonable and we shouldn't follow it.

On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue  wrote:

> how about LOCATION without EXTERNAL? Currently Spark treats it as an
> external table.
>
> I think there is some confusion about what Spark has to handle. Regardless
> of what Spark allows as DDL, these tables can exist in a Hive MetaStore
> that Spark connects to, and the general expectation is that Spark doesn’t
> change the meaning of table configuration. There are notable bugs where
> Spark has different behavior, but that is the expectation.
>
> In this particular case, we’re talking about what can be expressed in DDL
> that is sent to an external catalog. Spark could (unwisely) choose to
> disallow some DDL combinations, but the table is implemented through a
> plugin so the interpretation is up to the plugin. Spark has no role in
> choosing how to treat this table, unless it is loaded through Spark’s
> built-in catalog; in which case, see above.
>
> I don’t think Hive compatibility itself is a “use case”.
>
> Why?
>
> Hive is an external database that defines its own behavior and with which
> Spark claims to be compatible. If Hive isn’t a valid use case, then why is
> EXTERNAL supported at all?
>
> On Wed, Oct 7, 2020 at 10:17 AM Holden Karau  wrote:
>
>>
>>
>> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan  wrote:
>>
>>> I don't think Hive compatibility itself is a "use case".
>>>
>> Ok let's add on top of this: I have some hive queries that I want to run
>> on Spark. I believe that makes it a use case.
>>
>>> The Nessie  example you
>>> mentioned is a reasonable use case to me: some frameworks/applications want
>>> to create external tables without user-specified location, so that they can
>>> manage the table directory themselves and implement fancy features.
>>>
>>> That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
>>> should clearly document that, EXTERNAL and LOCATION are only applicable for
>>> file-based data sources, and catalog implementation should fail if the
>>> table has EXTERNAL or LOCATION property, but the table provider is not
>>> file-based.
>>>
>>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as
>>> an external table. Hive gives warning when you create managed tables with
>>> custom location, which means this behavior is not recommended. Shall we
>>> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>>
>>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue 
>>> wrote:
>>>
 Wenchen, why are you ignoring Hive as a “reasonable use case”?

 The keyword came from Hive and we all agree that a Hive catalog with
 Hive behavior can’t be implemented if Spark chooses to couple this with
 LOCATION. Why is this use case not a justification?

 Also, the option to keep behavior the same as before is not mutually
 exclusive with passing EXTERNAL to catalogs. Spark can continue to
 have the same behavior in its catalog. But Spark cannot just choose to
 break compatibility with external systems by deciding when to fail certain
 combinations of DDL options. Choosing not to allow external without
 location when it is valid for Hive prevents building a compatible catalog.

 There are many reasons to build a Hive-compatible catalog. A great
 recent example is Nessie 

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
how about LOCATION without EXTERNAL? Currently Spark treats it as an
external table.

I think there is some confusion about what Spark has to handle. Regardless
of what Spark allows as DDL, these tables can exist in a Hive MetaStore
that Spark connects to, and the general expectation is that Spark doesn’t
change the meaning of table configuration. There are notable bugs where
Spark has different behavior, but that is the expectation.

In this particular case, we’re talking about what can be expressed in DDL
that is sent to an external catalog. Spark could (unwisely) choose to
disallow some DDL combinations, but the table is implemented through a
plugin so the interpretation is up to the plugin. Spark has no role in
choosing how to treat this table, unless it is loaded through Spark’s
built-in catalog; in which case, see above.

I don’t think Hive compatibility itself is a “use case”.

Why?

Hive is an external database that defines its own behavior and with which
Spark claims to be compatible. If Hive isn’t a valid use case, then why is
EXTERNAL supported at all?

On Wed, Oct 7, 2020 at 10:17 AM Holden Karau  wrote:

>
>
> On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan  wrote:
>
>> I don't think Hive compatibility itself is a "use case".
>>
> Ok let's add on top of this: I have some hive queries that I want to run
> on Spark. I believe that makes it a use case.
>
>> The Nessie  example you mentioned
>> is a reasonable use case to me: some frameworks/applications want to create
>> external tables without user-specified location, so that they can manage
>> the table directory themselves and implement fancy features.
>>
>> That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
>> should clearly document that, EXTERNAL and LOCATION are only applicable for
>> file-based data sources, and catalog implementation should fail if the
>> table has EXTERNAL or LOCATION property, but the table provider is not
>> file-based.
>>
>> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
>> external table. Hive gives warning when you create managed tables with
>> custom location, which means this behavior is not recommended. Shall we
>> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>>
>> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue 
>> wrote:
>>
>>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>>
>>> The keyword came from Hive and we all agree that a Hive catalog with
>>> Hive behavior can’t be implemented if Spark chooses to couple this with
>>> LOCATION. Why is this use case not a justification?
>>>
>>> Also, the option to keep behavior the same as before is not mutually
>>> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
>>> the same behavior in its catalog. But Spark cannot just choose to break
>>> compatibility with external systems by deciding when to fail certain
>>> combinations of DDL options. Choosing not to allow external without
>>> location when it is valid for Hive prevents building a compatible catalog.
>>>
>>> There are many reasons to build a Hive-compatible catalog. A great
>>> recent example is Nessie , which
>>> enables branching and tagging table states across several table formats and
>>> aims to be compatible with Hive.
>>>
>>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan  wrote:
>>>
 > As someone who's had the job of porting different SQL dialects to
 Spark, I'm also very much in favor of keeping EXTERNAL

 Just to be clear: no one is proposing to remove EXTERNAL. The 2 options
 we are discussing are:
 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists
 with LOCATION (or path option).
 2. Always allow EXTERNAL, and decouple it with LOCATION.

 I'm fine with option 2 if there are reasonable use cases. I think it's
 always safer to keep the behavior the same as before. If we want to change
 the behavior and follow option 2, we need use cases to justify it.

 For now, the only use case I see is for Hive compatibility and allow
 EXTERNAL TABLE without user-specified LOCATION. Are there any more use
 cases we are targeting?

 On Wed, Oct 7, 2020 at 5:06 AM Holden Karau 
 wrote:

> As someone who's had the job of porting different SQL dialects to
> Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
> suggestion of leaving it up to the catalogs on how to handle this makes
> sense.
>
> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue 
> wrote:
>
>> I would summarize both the problem and the current state differently.
>>
>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table 
>> with
>> EXTERNAL unless LOCATION is also present. *This “hidden feature”
>> breaks compatibility with 

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Holden Karau
On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan  wrote:

> I don't think Hive compatibility itself is a "use case".
>
Ok let's add on top of this: I have some hive queries that I want to run on
Spark. I believe that makes it a use case.

> The Nessie  example you mentioned
> is a reasonable use case to me: some frameworks/applications want to create
> external tables without user-specified location, so that they can manage
> the table directory themselves and implement fancy features.
>
> That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
> should clearly document that, EXTERNAL and LOCATION are only applicable for
> file-based data sources, and catalog implementation should fail if the
> table has EXTERNAL or LOCATION property, but the table provider is not
> file-based.
>
> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
> external table. Hive gives warning when you create managed tables with
> custom location, which means this behavior is not recommended. Shall we
> "infer" EXTERNAL from LOCATION although it's not Hive compatible?
>
> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue 
> wrote:
>
>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>
>> The keyword came from Hive and we all agree that a Hive catalog with Hive
>> behavior can’t be implemented if Spark chooses to couple this with
>> LOCATION. Why is this use case not a justification?
>>
>> Also, the option to keep behavior the same as before is not mutually
>> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
>> the same behavior in its catalog. But Spark cannot just choose to break
>> compatibility with external systems by deciding when to fail certain
>> combinations of DDL options. Choosing not to allow external without
>> location when it is valid for Hive prevents building a compatible catalog.
>>
>> There are many reasons to build a Hive-compatible catalog. A great recent
>> example is Nessie , which enables
>> branching and tagging table states across several table formats and aims to
>> be compatible with Hive.
>>
>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan  wrote:
>>
>>> > As someone who's had the job of porting different SQL dialects to
>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>
>>> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options
>>> we are discussing are:
>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists
>>> with LOCATION (or path option).
>>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>>
>>> I'm fine with option 2 if there are reasonable use cases. I think it's
>>> always safer to keep the behavior the same as before. If we want to change
>>> the behavior and follow option 2, we need use cases to justify it.
>>>
>>> For now, the only use case I see is for Hive compatibility and allow
>>> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
>>> cases we are targeting?
>>>
>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau 
>>> wrote:
>>>
 As someone who's had the job of porting different SQL dialects to
 Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
 suggestion of leaving it up to the catalogs on how to handle this makes
 sense.

 On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue 
 wrote:

> I would summarize both the problem and the current state differently.
>
> Currently, Spark parses the EXTERNAL keyword for compatibility with
> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
> EXTERNAL unless LOCATION is also present. *This “hidden feature”
> breaks compatibility with Hive SQL* because all combinations of
> EXTERNAL and LOCATION are valid in Hive, but creating an external
> table with a default location is not allowed by Spark. Note that Spark 
> must
> still handle these tables because it shares a metastore with Hive, which
> can still create them.
>
> Now catalogs can be plugged in, the question is whether to pass the
> fact that EXTERNAL was in the CREATE TABLE statement to the v2
> catalog handling a create command, or to suppress it and apply Spark’s 
> rule
> that LOCATION must be present.
>
> If it is not passed to the catalog, then a Hive catalog cannot
> implement the behavior of Hive SQL, even though Spark added the keyword 
> for
> Hive compatibility. The Spark catalog can interpret EXTERNAL however
> Spark chooses to, but I think it is a poor choice to force different
> behavior on other catalogs.
>
> Wenchen has also argued that the purpose of this is to standardize
> behavior across catalogs. But hiding EXTERNAL would not accomplish
> that goal. Whether to physically delete data is a choice that is up to the
> catalog. Some catalogs have no “external” concept and will 

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
I don't think Hive compatibility itself is a "use case". The Nessie
 example you mentioned is a
reasonable use case to me: some frameworks/applications want to create
external tables without user-specified location, so that they can manage
the table directory themselves and implement fancy features.

That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
should clearly document that, EXTERNAL and LOCATION are only applicable for
file-based data sources, and catalog implementation should fail if the
table has EXTERNAL or LOCATION property, but the table provider is not
file-based.

BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
external table. Hive gives warning when you create managed tables with
custom location, which means this behavior is not recommended. Shall we
"infer" EXTERNAL from LOCATION although it's not Hive compatible?

On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue  wrote:

> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>
> The keyword came from Hive and we all agree that a Hive catalog with Hive
> behavior can’t be implemented if Spark chooses to couple this with
> LOCATION. Why is this use case not a justification?
>
> Also, the option to keep behavior the same as before is not mutually
> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
> the same behavior in its catalog. But Spark cannot just choose to break
> compatibility with external systems by deciding when to fail certain
> combinations of DDL options. Choosing not to allow external without
> location when it is valid for Hive prevents building a compatible catalog.
>
> There are many reasons to build a Hive-compatible catalog. A great recent
> example is Nessie , which enables
> branching and tagging table states across several table formats and aims to
> be compatible with Hive.
>
> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan  wrote:
>
>> > As someone who's had the job of porting different SQL dialects to
>> Spark, I'm also very much in favor of keeping EXTERNAL
>>
>> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options
>> we are discussing are:
>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists
>> with LOCATION (or path option).
>> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>>
>> I'm fine with option 2 if there are reasonable use cases. I think it's
>> always safer to keep the behavior the same as before. If we want to change
>> the behavior and follow option 2, we need use cases to justify it.
>>
>> For now, the only use case I see is for Hive compatibility and allow
>> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
>> cases we are targeting?
>>
>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau  wrote:
>>
>>> As someone who's had the job of porting different SQL dialects to Spark,
>>> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>>> suggestion of leaving it up to the catalogs on how to handle this makes
>>> sense.
>>>
>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue 
>>> wrote:
>>>
 I would summarize both the problem and the current state differently.

 Currently, Spark parses the EXTERNAL keyword for compatibility with
 Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
 EXTERNAL unless LOCATION is also present. *This “hidden feature”
 breaks compatibility with Hive SQL* because all combinations of
 EXTERNAL and LOCATION are valid in Hive, but creating an external
 table with a default location is not allowed by Spark. Note that Spark must
 still handle these tables because it shares a metastore with Hive, which
 can still create them.

 Now catalogs can be plugged in, the question is whether to pass the
 fact that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
 handling a create command, or to suppress it and apply Spark’s rule that
 LOCATION must be present.

 If it is not passed to the catalog, then a Hive catalog cannot
 implement the behavior of Hive SQL, even though Spark added the keyword for
 Hive compatibility. The Spark catalog can interpret EXTERNAL however
 Spark chooses to, but I think it is a poor choice to force different
 behavior on other catalogs.

 Wenchen has also argued that the purpose of this is to standardize
 behavior across catalogs. But hiding EXTERNAL would not accomplish
 that goal. Whether to physically delete data is a choice that is up to the
 catalog. Some catalogs have no “external” concept and will always drop data
 when a table is dropped. The ability to keep underlying data files is
 specific to a few catalogs, and whether that is controlled by EXTERNAL,
 the LOCATION clause, or something else is still up to the catalog
 implementation.

 I don’t think that there is a good reason 

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Ryan Blue
Wenchen, why are you ignoring Hive as a “reasonable use case”?

The keyword came from Hive and we all agree that a Hive catalog with Hive
behavior can’t be implemented if Spark chooses to couple this with LOCATION.
Why is this use case not a justification?

Also, the option to keep behavior the same as before is not mutually
exclusive with passing EXTERNAL to catalogs. Spark can continue to have the
same behavior in its catalog. But Spark cannot just choose to break
compatibility with external systems by deciding when to fail certain
combinations of DDL options. Choosing not to allow external without
location when it is valid for Hive prevents building a compatible catalog.

There are many reasons to build a Hive-compatible catalog. A great recent
example is Nessie , which enables
branching and tagging table states across several table formats and aims to
be compatible with Hive.

On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan  wrote:

> > As someone who's had the job of porting different SQL dialects to Spark,
> I'm also very much in favor of keeping EXTERNAL
>
> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options we
> are discussing are:
> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists with
> LOCATION (or path option).
> 2. Always allow EXTERNAL, and decouple it with LOCATION.
>
> I'm fine with option 2 if there are reasonable use cases. I think it's
> always safer to keep the behavior the same as before. If we want to change
> the behavior and follow option 2, we need use cases to justify it.
>
> For now, the only use case I see is for Hive compatibility and allow
> EXTERNAL TABLE without user-specified LOCATION. Are there any more use
> cases we are targeting?
>
> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau  wrote:
>
>> As someone who's had the job of porting different SQL dialects to Spark,
>> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>> suggestion of leaving it up to the catalogs on how to handle this makes
>> sense.
>>
>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue 
>> wrote:
>>
>>> I would summarize both the problem and the current state differently.
>>>
>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
>>> compatibility with Hive SQL* because all combinations of EXTERNAL and
>>> LOCATION are valid in Hive, but creating an external table with a
>>> default location is not allowed by Spark. Note that Spark must still handle
>>> these tables because it shares a metastore with Hive, which can still
>>> create them.
>>>
>>> Now catalogs can be plugged in, the question is whether to pass the fact
>>> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
>>> handling a create command, or to suppress it and apply Spark’s rule that
>>> LOCATION must be present.
>>>
>>> If it is not passed to the catalog, then a Hive catalog cannot implement
>>> the behavior of Hive SQL, even though Spark added the keyword for Hive
>>> compatibility. The Spark catalog can interpret EXTERNAL however Spark
>>> chooses to, but I think it is a poor choice to force different behavior on
>>> other catalogs.
>>>
>>> Wenchen has also argued that the purpose of this is to standardize
>>> behavior across catalogs. But hiding EXTERNAL would not accomplish that
>>> goal. Whether to physically delete data is a choice that is up to the
>>> catalog. Some catalogs have no “external” concept and will always drop data
>>> when a table is dropped. The ability to keep underlying data files is
>>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>>> the LOCATION clause, or something else is still up to the catalog
>>> implementation.
>>>
>>> I don’t think that there is a good reason to force catalogs to break
>>> compatibility with Hive SQL, while making it appear as though DDL is
>>> compatible. Because removing EXTERNAL would be a breaking change to the
>>> SQL parser, I think the best option is to pass it to v2 catalogs so the
>>> catalog can decide how to handle it.
>>>
>>> rb
>>>
>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan  wrote:
>>>
 Hi all,

 I'd like to start a discussion thread about this topic, as it blocks an
 important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
 syntax.

 A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
 feature in Spark for Hive compatibility.

 When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
 TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL
 can't be specified.

 When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified
 if LOCATION clause or path option is present. For example, `CREATE
 EXTERNAL TABLE ... STORED AS parquet` is not allowed as 

Re: Official support of CREATE EXTERNAL TABLE

2020-10-07 Thread Wenchen Fan
> As someone who's had the job of porting different SQL dialects to Spark,
I'm also very much in favor of keeping EXTERNAL

Just to be clear: no one is proposing to remove EXTERNAL. The 2 options we
are discussing are:
1. Keep the behavior the same as before, i.e. EXTERNAL must co-exists with
LOCATION (or path option).
2. Always allow EXTERNAL, and decouple it with LOCATION.

I'm fine with option 2 if there are reasonable use cases. I think it's
always safer to keep the behavior the same as before. If we want to change
the behavior and follow option 2, we need use cases to justify it.

For now, the only use case I see is for Hive compatibility and allow
EXTERNAL TABLE without user-specified LOCATION. Are there any more use
cases we are targeting?

On Wed, Oct 7, 2020 at 5:06 AM Holden Karau  wrote:

> As someone who's had the job of porting different SQL dialects to Spark,
> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
> suggestion of leaving it up to the catalogs on how to handle this makes
> sense.
>
> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue 
> wrote:
>
>> I would summarize both the problem and the current state differently.
>>
>> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
>> SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
>> compatibility with Hive SQL* because all combinations of EXTERNAL and
>> LOCATION are valid in Hive, but creating an external table with a
>> default location is not allowed by Spark. Note that Spark must still handle
>> these tables because it shares a metastore with Hive, which can still
>> create them.
>>
>> Now catalogs can be plugged in, the question is whether to pass the fact
>> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
>> handling a create command, or to suppress it and apply Spark’s rule that
>> LOCATION must be present.
>>
>> If it is not passed to the catalog, then a Hive catalog cannot implement
>> the behavior of Hive SQL, even though Spark added the keyword for Hive
>> compatibility. The Spark catalog can interpret EXTERNAL however Spark
>> chooses to, but I think it is a poor choice to force different behavior on
>> other catalogs.
>>
>> Wenchen has also argued that the purpose of this is to standardize
>> behavior across catalogs. But hiding EXTERNAL would not accomplish that
>> goal. Whether to physically delete data is a choice that is up to the
>> catalog. Some catalogs have no “external” concept and will always drop data
>> when a table is dropped. The ability to keep underlying data files is
>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>> the LOCATION clause, or something else is still up to the catalog
>> implementation.
>>
>> I don’t think that there is a good reason to force catalogs to break
>> compatibility with Hive SQL, while making it appear as though DDL is
>> compatible. Because removing EXTERNAL would be a breaking change to the
>> SQL parser, I think the best option is to pass it to v2 catalogs so the
>> catalog can decide how to handle it.
>>
>> rb
>>
>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan  wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start a discussion thread about this topic, as it blocks an
>>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>>> syntax.
>>>
>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>>> feature in Spark for Hive compatibility.
>>>
>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>> TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL
>>> can't be specified.
>>>
>>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
>>> LOCATION clause or path option is present. For example, `CREATE
>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>> LOCATION clause or path option. This is not 100% Hive compatible.
>>>
>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>>> was, or we can officially support it.
>>>
>>> Please let us know your thoughts:
>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>>> you used it in production before? For what use cases?
>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>> TABLE? It seems to me that it only makes sense for file source, as the
>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>> catalogs like jdbc, cassandra, etc.
>>>
>>> For more details, please refer to the long discussion in
>>> https://github.com/apache/spark/pull/28026
>>>
>>> Thanks,
>>> Wenchen
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  

Re: Official support of CREATE EXTERNAL TABLE

2020-10-06 Thread Russell Spitzer
I don't feel differently than I did on the thread linked above, I think
treating "External" as a table option is still the safest way to go about
things. For the Cassandra catalog this option wouldn't appear on our
whitelist of allowed options, the same as "path" and other options that
don't apply to C.

On Tue, Oct 6, 2020 at 3:54 PM Ryan Blue  wrote:

> I would summarize both the problem and the current state differently.
>
> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
> SQL, but Spark’s built-in catalog doesn’t allow creating a table with
> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
> compatibility with Hive SQL* because all combinations of EXTERNAL and
> LOCATION are valid in Hive, but creating an external table with a default
> location is not allowed by Spark. Note that Spark must still handle these
> tables because it shares a metastore with Hive, which can still create them.
>
> Now catalogs can be plugged in, the question is whether to pass the fact
> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
> handling a create command, or to suppress it and apply Spark’s rule that
> LOCATION must be present.
>
> If it is not passed to the catalog, then a Hive catalog cannot implement
> the behavior of Hive SQL, even though Spark added the keyword for Hive
> compatibility. The Spark catalog can interpret EXTERNAL however Spark
> chooses to, but I think it is a poor choice to force different behavior on
> other catalogs.
>
> Wenchen has also argued that the purpose of this is to standardize
> behavior across catalogs. But hiding EXTERNAL would not accomplish that
> goal. Whether to physically delete data is a choice that is up to the
> catalog. Some catalogs have no “external” concept and will always drop data
> when a table is dropped. The ability to keep underlying data files is
> specific to a few catalogs, and whether that is controlled by EXTERNAL,
> the LOCATION clause, or something else is still up to the catalog
> implementation.
>
> I don’t think that there is a good reason to force catalogs to break
> compatibility with Hive SQL, while making it appear as though DDL is
> compatible. Because removing EXTERNAL would be a breaking change to the
> SQL parser, I think the best option is to pass it to v2 catalogs so the
> catalog can decide how to handle it.
>
> rb
>
> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan  wrote:
>
>> Hi all,
>>
>> I'd like to start a discussion thread about this topic, as it blocks an
>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>> syntax.
>>
>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>> feature in Spark for Hive compatibility.
>>
>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
>> ... USING parquet`, the parser fails and tells you that EXTERNAL can't
>> be specified.
>>
>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
>> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
>> TABLE ... STORED AS parquet` is not allowed as there is no LOCATION
>> clause or path option. This is not 100% Hive compatible.
>>
>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>> was, or we can officially support it.
>>
>> Please let us know your thoughts:
>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>> you used it in production before? For what use cases?
>> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
>> It seems to me that it only makes sense for file source, as the table
>> directory can be managed. I'm not sure how to interpret EXTERNAL in
>> catalogs like jdbc, cassandra, etc.
>>
>> For more details, please refer to the long discussion in
>> https://github.com/apache/spark/pull/28026
>>
>> Thanks,
>> Wenchen
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Official support of CREATE EXTERNAL TABLE

2020-10-06 Thread Holden Karau
As someone who's had the job of porting different SQL dialects to Spark,
I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
suggestion of leaving it up to the catalogs on how to handle this makes
sense.

On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue  wrote:

> I would summarize both the problem and the current state differently.
>
> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
> SQL, but Spark’s built-in catalog doesn’t allow creating a table with
> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
> compatibility with Hive SQL* because all combinations of EXTERNAL and
> LOCATION are valid in Hive, but creating an external table with a default
> location is not allowed by Spark. Note that Spark must still handle these
> tables because it shares a metastore with Hive, which can still create them.
>
> Now catalogs can be plugged in, the question is whether to pass the fact
> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
> handling a create command, or to suppress it and apply Spark’s rule that
> LOCATION must be present.
>
> If it is not passed to the catalog, then a Hive catalog cannot implement
> the behavior of Hive SQL, even though Spark added the keyword for Hive
> compatibility. The Spark catalog can interpret EXTERNAL however Spark
> chooses to, but I think it is a poor choice to force different behavior on
> other catalogs.
>
> Wenchen has also argued that the purpose of this is to standardize
> behavior across catalogs. But hiding EXTERNAL would not accomplish that
> goal. Whether to physically delete data is a choice that is up to the
> catalog. Some catalogs have no “external” concept and will always drop data
> when a table is dropped. The ability to keep underlying data files is
> specific to a few catalogs, and whether that is controlled by EXTERNAL,
> the LOCATION clause, or something else is still up to the catalog
> implementation.
>
> I don’t think that there is a good reason to force catalogs to break
> compatibility with Hive SQL, while making it appear as though DDL is
> compatible. Because removing EXTERNAL would be a breaking change to the
> SQL parser, I think the best option is to pass it to v2 catalogs so the
> catalog can decide how to handle it.
>
> rb
>
> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan  wrote:
>
>> Hi all,
>>
>> I'd like to start a discussion thread about this topic, as it blocks an
>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>> syntax.
>>
>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>> feature in Spark for Hive compatibility.
>>
>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
>> ... USING parquet`, the parser fails and tells you that EXTERNAL can't
>> be specified.
>>
>> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
>> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
>> TABLE ... STORED AS parquet` is not allowed as there is no LOCATION
>> clause or path option. This is not 100% Hive compatible.
>>
>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>> was, or we can officially support it.
>>
>> Please let us know your thoughts:
>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>> you used it in production before? For what use cases?
>> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
>> It seems to me that it only makes sense for file source, as the table
>> directory can be managed. I'm not sure how to interpret EXTERNAL in
>> catalogs like jdbc, cassandra, etc.
>>
>> For more details, please refer to the long discussion in
>> https://github.com/apache/spark/pull/28026
>>
>> Thanks,
>> Wenchen
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Official support of CREATE EXTERNAL TABLE

2020-10-06 Thread Ryan Blue
I would summarize both the problem and the current state differently.

Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
SQL, but Spark’s built-in catalog doesn’t allow creating a table with
EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
compatibility with Hive SQL* because all combinations of EXTERNAL and
LOCATION are valid in Hive, but creating an external table with a default
location is not allowed by Spark. Note that Spark must still handle these
tables because it shares a metastore with Hive, which can still create them.

Now catalogs can be plugged in, the question is whether to pass the fact
that EXTERNAL was in the CREATE TABLE statement to the v2 catalog handling
a create command, or to suppress it and apply Spark’s rule that LOCATION
must be present.

If it is not passed to the catalog, then a Hive catalog cannot implement
the behavior of Hive SQL, even though Spark added the keyword for Hive
compatibility. The Spark catalog can interpret EXTERNAL however Spark
chooses to, but I think it is a poor choice to force different behavior on
other catalogs.

Wenchen has also argued that the purpose of this is to standardize behavior
across catalogs. But hiding EXTERNAL would not accomplish that goal.
Whether to physically delete data is a choice that is up to the catalog.
Some catalogs have no “external” concept and will always drop data when a
table is dropped. The ability to keep underlying data files is specific to
a few catalogs, and whether that is controlled by EXTERNAL, the LOCATION
clause, or something else is still up to the catalog implementation.

I don’t think that there is a good reason to force catalogs to break
compatibility with Hive SQL, while making it appear as though DDL is
compatible. Because removing EXTERNAL would be a breaking change to the SQL
parser, I think the best option is to pass it to v2 catalogs so the catalog
can decide how to handle it.

rb

On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan  wrote:

> Hi all,
>
> I'd like to start a discussion thread about this topic, as it blocks an
> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
> syntax.
>
> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
> feature in Spark for Hive compatibility.
>
> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
> ... USING parquet`, the parser fails and tells you that EXTERNAL can't be
> specified.
>
> When we write Hive CREATE TABLE syntax, the EXTERNAL can be specified if
> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
> TABLE ... STORED AS parquet` is not allowed as there is no LOCATION
> clause or path option. This is not 100% Hive compatible.
>
> As we are unifying the CREATE TABLE SQL syntax, one problem is how to deal
> with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it was,
> or we can officially support it.
>
> Please let us know your thoughts:
> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
> you used it in production before? For what use cases?
> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
> It seems to me that it only makes sense for file source, as the table
> directory can be managed. I'm not sure how to interpret EXTERNAL in
> catalogs like jdbc, cassandra, etc.
>
> For more details, please refer to the long discussion in
> https://github.com/apache/spark/pull/28026
>
> Thanks,
> Wenchen
>


-- 
Ryan Blue
Software Engineer
Netflix