I don't think Hive compatibility itself is a "use case". The Nessie
<https://projectnessie.org/tools/hive/> example you mentioned is a
reasonable use case to me: some frameworks/applications want to create
external tables without a user-specified location, so that they can manage
the table directory themselves and implement fancy features.
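
For example, such a framework might issue something like the statement below
and let the catalog choose and own the table directory (a rough sketch with
made-up names, assuming EXTERNAL without LOCATION is accepted, which is the
open question here):

// No LOCATION: the catalog picks the directory, and the framework can
// manage (move, snapshot, clean up) that directory itself later.
spark.sql("CREATE EXTERNAL TABLE events (id BIGINT, ts TIMESTAMP) STORED AS PARQUET")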

That said, I now agree it's better to decouple EXTERNAL from LOCATION. We
should clearly document that EXTERNAL and LOCATION are only applicable to
file-based data sources, and catalog implementations should fail if the
table has the EXTERNAL or LOCATION property but the table provider is not
file-based.
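
As a rough sketch of what such a check might look like for a hypothetical
JDBC-backed catalog built on the v2 TableCatalog API (the "external"
property key is an assumption, since whether Spark passes EXTERNAL through
to catalogs is exactly what we're discussing; the class name and method
bodies are placeholders):

import java.util
import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog, TableChange}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class JdbcDemoCatalog extends TableCatalog {
  private var catalogName: String = _

  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit =
    catalogName = name
  override def name(): String = catalogName

  override def createTable(
      ident: Identifier,
      schema: StructType,
      partitions: Array[Transform],
      properties: util.Map[String, String]): Table = {
    // Reject options that only make sense for file-based tables: "external"
    // (hypothetical key for the EXTERNAL keyword) and TableCatalog.PROP_LOCATION.
    if (properties.containsKey("external") ||
        properties.containsKey(TableCatalog.PROP_LOCATION)) {
      throw new IllegalArgumentException(
        s"EXTERNAL/LOCATION are not applicable to JDBC table $ident")
    }
    ??? // create the remote table and return a Table implementation
  }

  // Remaining TableCatalog methods elided with placeholders.
  override def listTables(namespace: Array[String]): Array[Identifier] = ???
  override def loadTable(ident: Identifier): Table = ???
  override def alterTable(ident: Identifier, changes: TableChange*): Table = ???
  override def dropTable(ident: Identifier): Boolean = ???
  override def renameTable(oldIdent: Identifier, newIdent: Identifier): Unit = ???
}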

BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
external table. Hive gives a warning when you create a managed table with a
custom location, which means this behavior is not recommended. Shall we
"infer" EXTERNAL from LOCATION even though it's not Hive-compatible?

On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>
> The keyword came from Hive and we all agree that a Hive catalog with Hive
> behavior can’t be implemented if Spark chooses to couple this with
> LOCATION. Why is this use case not a justification?
>
> Also, the option to keep behavior the same as before is not mutually
> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
> the same behavior in its catalog. But Spark cannot just choose to break
> compatibility with external systems by deciding when to fail certain
> combinations of DDL options. Choosing not to allow EXTERNAL without
> LOCATION when it is valid for Hive prevents building a compatible catalog.
>
> There are many reasons to build a Hive-compatible catalog. A great recent
> example is Nessie <https://projectnessie.org/tools/hive/>, which enables
> branching and tagging table states across several table formats and aims to
> be compatible with Hive.
>
> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> > As someone who's had the job of porting different SQL dialects to
>> Spark, I'm also very much in favor of keeping EXTERNAL
>>
>> Just to be clear: no one is proposing to remove EXTERNAL. The two options
>> we are discussing are:
>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist
>> with LOCATION (or the path option).
>> 2. Always allow EXTERNAL, and decouple it from LOCATION.
>>
>> I'm fine with option 2 if there are reasonable use cases. I think it's
>> always safer to keep the behavior the same as before. If we want to change
>> the behavior and follow option 2, we need use cases to justify it.
>>
>> For now, the only use case I see is Hive compatibility: allowing
>> EXTERNAL TABLE without a user-specified LOCATION. Are there any more use
>> cases we are targeting?
>>
>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> As someone who's had the job of porting different SQL dialects to Spark,
>>> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>>> suggestion of leaving it up to the catalogs on how to handle this makes
>>> sense.
>>>
>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> I would summarize both the problem and the current state differently.
>>>>
>>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>>> EXTERNAL unless LOCATION is also present. *This “hidden feature”
>>>> breaks compatibility with Hive SQL* because all combinations of
>>>> EXTERNAL and LOCATION are valid in Hive, but creating an external
>>>> table with a default location is not allowed by Spark. Note that Spark must
>>>> still handle these tables because it shares a metastore with Hive, which
>>>> can still create them.
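>>>>
>>>> As a concrete illustration (a sketch only, with made-up table names and
>>>> assuming a Hive-enabled SparkSession named `spark`):
>>>>
>>>> // Valid in both Hive and Spark's built-in catalog:
>>>> spark.sql("CREATE EXTERNAL TABLE a (id INT) STORED AS PARQUET LOCATION '/warehouse/a'")
>>>> // Valid in Hive (it falls back to the default warehouse location),
>>>> // but rejected by Spark today:
>>>> spark.sql("CREATE EXTERNAL TABLE b (id INT) STORED AS PARQUET")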
>>>>
>>>> Now that catalogs can be plugged in, the question is whether to pass the
>>>> fact that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
>>>> handling a create command, or to suppress it and apply Spark’s rule that
>>>> LOCATION must be present.
>>>>
>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>> implement the behavior of Hive SQL, even though Spark added the keyword for
>>>> Hive compatibility. The Spark catalog can interpret EXTERNAL however
>>>> Spark chooses to, but I think it is a poor choice to force different
>>>> behavior on other catalogs.
>>>>
>>>> Wenchen has also argued that the purpose of this is to standardize
>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>> that goal. Whether to physically delete data is a choice that is up to the
>>>> catalog. Some catalogs have no “external” concept and will always drop data
>>>> when a table is dropped. The ability to keep underlying data files is
>>>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>>>> the LOCATION clause, or something else is still up to the catalog
>>>> implementation.
>>>>
>>>> I don’t think that there is a good reason to force catalogs to break
>>>> compatibility with Hive SQL, while making it appear as though DDL is
>>>> compatible. Because removing EXTERNAL would be a breaking change to
>>>> the SQL parser, I think the best option is to pass it to v2 catalogs so the
>>>> catalog can decide how to handle it.
>>>>
>>>> rb
>>>>
>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to start a discussion thread about this topic, as it blocks
>>>>> an important feature that we target for Spark 3.1: unifying the CREATE TABLE
>>>>> SQL syntax.
>>>>>
>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>>>>> feature in Spark for Hive compatibility.
>>>>>
>>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>>> TABLE ... USING parquet`, the parser fails and tells you that
>>>>> EXTERNAL can't be specified.
>>>>>
>>>>> When you write Hive CREATE TABLE syntax, EXTERNAL can be specified only
>>>>> if a LOCATION clause or path option is present. For example, `CREATE
>>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>>>> LOCATION clause or path option. This is not 100% Hive-compatible.
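>>>>>
>>>>> In other words (a quick sketch, assuming a Hive-enabled SparkSession
>>>>> named `spark` and a throwaway table name):
>>>>>
>>>>> // Native syntax: the parser rejects EXTERNAL outright.
>>>>> spark.sql("CREATE EXTERNAL TABLE t (id INT) USING parquet")
>>>>> // Hive syntax: EXTERNAL is accepted, but only together with LOCATION.
>>>>> spark.sql("CREATE EXTERNAL TABLE t (id INT) STORED AS PARQUET LOCATION '/tmp/t'")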
>>>>>
>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>>>>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>>>>> was, or we can officially support it.
>>>>>
>>>>> Please let us know your thoughts:
>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>> Have you used it in production before? For what use cases?
>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>> TABLE? It seems to me that it only makes sense for file sources, where the
>>>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>>>> catalogs like JDBC, Cassandra, etc.
>>>>>
>>>>> For more details, please refer to the long discussion in
>>>>> https://github.com/apache/spark/pull/28026
>>>>>
>>>>> Thanks,
>>>>> Wenchen
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
