Wenchen, why are you ignoring Hive as a “reasonable use case”?

The keyword came from Hive and we all agree that a Hive catalog with Hive
behavior can’t be implemented if Spark chooses to couple this with LOCATION.
Why is this use case not a justification?

Also, the option to keep behavior the same as before is not mutually
exclusive with passing EXTERNAL to catalogs. Spark can continue to have the
same behavior in its catalog. But Spark cannot just choose to break
compatibility with external systems by deciding when to fail certain
combinations of DDL options. Choosing not to allow EXTERNAL without
LOCATION when it is valid for Hive prevents building a compatible catalog.
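
To make the point concrete, here is a minimal sketch of two catalogs receiving the same parsed statement. It is not Spark code: `CreateTable`, `SparkLikeCatalog`, `HiveLikeCatalog`, and the `warehouse` default path are all hypothetical names made up for illustration. It only shows that keeping Spark's current rule in one catalog does not require hiding EXTERNAL from another.

```python
# Hypothetical sketch: none of these names are Spark or Hive APIs.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CreateTable:
    """Parsed CREATE TABLE options relevant to this thread."""
    name: str
    external: bool = False          # EXTERNAL keyword was present
    location: Optional[str] = None  # LOCATION clause, if any


class SparkLikeCatalog:
    """Keeps the existing rule: EXTERNAL requires a LOCATION."""

    def create_table(self, ddl: CreateTable) -> dict:
        if ddl.external and ddl.location is None:
            raise ValueError("EXTERNAL requires LOCATION in this catalog")
        return {"name": ddl.name, "external": ddl.external,
                "location": ddl.location}


class HiveLikeCatalog:
    """Accepts every combination of EXTERNAL and LOCATION."""

    def __init__(self, warehouse: str) -> None:
        # Assumed default root used when no LOCATION is given.
        self.warehouse = warehouse

    def create_table(self, ddl: CreateTable) -> dict:
        location = ddl.location or f"{self.warehouse}/{ddl.name}"
        return {"name": ddl.name, "external": ddl.external,
                "location": location}
```

Given `CreateTable("t", external=True)`, the first catalog rejects the statement and the second creates an external table at its default location. Both behaviors are only possible if the flag reaches the catalog.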

There are many reasons to build a Hive-compatible catalog. A great recent
example is Nessie <https://projectnessie.org/tools/hive/>, which enables
branching and tagging table states across several table formats and aims to
be compatible with Hive.

On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> > As someone who's had the job of porting different SQL dialects to Spark,
> I'm also very much in favor of keeping EXTERNAL
>
> Just to be clear: no one is proposing to remove EXTERNAL. The 2 options we
> are discussing are:
> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist with
> LOCATION (or path option).
> 2. Always allow EXTERNAL, and decouple it from LOCATION.
>
> I'm fine with option 2 if there are reasonable use cases. I think it's
> always safer to keep the behavior the same as before. If we want to change
> the behavior and follow option 2, we need use cases to justify it.
>
> For now, the only use case I see is Hive compatibility: allowing
> EXTERNAL TABLE without a user-specified LOCATION. Are there any more use
> cases we are targeting?
>
> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> As someone who's had the job of porting different SQL dialects to Spark,
>> I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>> suggestion of leaving it up to the catalogs on how to handle this makes
>> sense.
>>
>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> I would summarize both the problem and the current state differently.
>>>
>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
>>> compatibility with Hive SQL* because all combinations of EXTERNAL and
>>> LOCATION are valid in Hive, but creating an external table with a
>>> default location is not allowed by Spark. Note that Spark must still handle
>>> these tables because it shares a metastore with Hive, which can still
>>> create them.
>>>
>>> Now that catalogs can be plugged in, the question is whether to pass the fact
>>> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
>>> handling a create command, or to suppress it and apply Spark’s rule that
>>> LOCATION must be present.
>>>
>>> If it is not passed to the catalog, then a Hive catalog cannot implement
>>> the behavior of Hive SQL, even though Spark added the keyword for Hive
>>> compatibility. The Spark catalog can interpret EXTERNAL however Spark
>>> chooses to, but I think it is a poor choice to force different behavior on
>>> other catalogs.
>>>
>>> Wenchen has also argued that the purpose of this is to standardize
>>> behavior across catalogs. But hiding EXTERNAL would not accomplish that
>>> goal. Whether to physically delete data is a choice that is up to the
>>> catalog. Some catalogs have no “external” concept and will always drop data
>>> when a table is dropped. The ability to keep underlying data files is
>>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>>> the LOCATION clause, or something else is still up to the catalog
>>> implementation.
>>>
>>> I don’t think that there is a good reason to force catalogs to break
>>> compatibility with Hive SQL, while making it appear as though DDL is
>>> compatible. Because removing EXTERNAL would be a breaking change to the
>>> SQL parser, I think the best option is to pass it to v2 catalogs so the
>>> catalog can decide how to handle it.
>>>
>>> rb
>>>
>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to start a discussion thread about this topic, as it blocks an
>>>> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
>>>> syntax.
>>>>
>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>>>> feature in Spark for Hive compatibility.
>>>>
>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>> TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL
>>>> can't be specified.
>>>>
>>>> When we write Hive CREATE TABLE syntax, EXTERNAL can be specified
>>>> only if a LOCATION clause or path option is present. For example, `CREATE
>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>>> LOCATION clause or path option. This is not 100% Hive compatible.
>>>>
>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>>>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>>>> was, or we can officially support it.
>>>>
>>>> Please let us know your thoughts:
>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>>>> you used it in production before? For what use cases?
>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>> TABLE? It seems to me that it only makes sense for file sources, as the
>>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>>> catalogs like jdbc, cassandra, etc.
>>>>
>>>> For more details, please refer to the long discussion in
>>>> https://github.com/apache/spark/pull/28026
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
