On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I don't think Hive compatibility itself is a "use case".
>
OK, let's add on top of this: I have some Hive queries that I want to run on
Spark. I believe that makes it a use case.

> The Nessie <https://projectnessie.org/tools/hive/> example you mentioned
> is a reasonable use case to me: some frameworks/applications want to create
> external tables without user-specified location, so that they can manage
> the table directory themselves and implement fancy features.
>
> That said, I now agree it's better to decouple EXTERNAL and LOCATION. We
> should clearly document that EXTERNAL and LOCATION are only applicable to
> file-based data sources, and that catalog implementations should fail if a
> table has the EXTERNAL or LOCATION property but the table provider is not
> file-based.
>
> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an
> external table. Hive gives a warning when you create a managed table with a
> custom location, which suggests this behavior is not recommended. Shall we
> "infer" EXTERNAL from LOCATION even though it's not Hive compatible?
>
> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>
>> The keyword came from Hive and we all agree that a Hive catalog with Hive
>> behavior can’t be implemented if Spark chooses to couple this with
>> LOCATION. Why is this use case not a justification?
>>
>> Also, the option to keep behavior the same as before is not mutually
>> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
>> the same behavior in its catalog. But Spark cannot just choose to break
>> compatibility with external systems by deciding when to fail certain
>> combinations of DDL options. Choosing not to allow EXTERNAL without
>> LOCATION when that combination is valid in Hive prevents building a compatible catalog.
>>
>> There are many reasons to build a Hive-compatible catalog. A great recent
>> example is Nessie <https://projectnessie.org/tools/hive/>, which enables
>> branching and tagging table states across several table formats and aims to
>> be compatible with Hive.
>>
>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> > As someone who's had the job of porting different SQL dialects to
>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>
>>> Just to be clear: no one is proposing to remove EXTERNAL. The two options
>>> we are discussing are:
>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist
>>> with LOCATION (or the path option).
>>> 2. Always allow EXTERNAL, and decouple it from LOCATION.
>>>
>>> I'm fine with option 2 if there are reasonable use cases. I think it's
>>> always safer to keep the behavior the same as before. If we want to change
>>> the behavior and follow option 2, we need use cases to justify it.
>>>
>>> For now, the only use case I see is Hive compatibility: allowing
>>> EXTERNAL TABLE without a user-specified LOCATION. Are there any more use
>>> cases we are targeting?
>>>
>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> As someone who's had the job of porting different SQL dialects to
>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
>>>> suggestion of leaving it up to the catalogs on how to handle this makes
>>>> sense.
>>>>
>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> I would summarize both the problem and the current state differently.
>>>>>
>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with
>>>>> EXTERNAL unless LOCATION is also present. *This “hidden feature”
>>>>> breaks compatibility with Hive SQL* because all combinations of
>>>>> EXTERNAL and LOCATION are valid in Hive, but creating an external
>>>>> table with a default location is not allowed by Spark. Note that Spark 
>>>>> must
>>>>> still handle these tables because it shares a metastore with Hive, which
>>>>> can still create them.
>>>>>
>>>>> Now that catalogs can be plugged in, the question is whether to pass the
>>>>> fact that EXTERNAL was in the CREATE TABLE statement to the v2
>>>>> catalog handling the create command, or to suppress it and apply Spark’s
>>>>> rule that LOCATION must be present.
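>>>>>
>>>>> As a rough sketch of what passing it through could look like (the
>>>>> property names below are only assumptions for illustration, not a
>>>>> settled API): the parser could surface the keyword in the properties map
>>>>> that TableCatalog.createTable already receives, and a Hive-compatible
>>>>> catalog could act on it:
>>>>>
>>>>>   import java.util.{Map => JMap}
>>>>>
>>>>>   // Hypothetical helper a Hive-compatible v2 catalog could call inside
>>>>>   // createTable(); "external" and "location" are assumed property keys.
>>>>>   object ExternalTableSupport {
>>>>>     def resolveTableType(properties: JMap[String, String]): (Boolean, Option[String]) = {
>>>>>       val isExternal = properties.getOrDefault("external", "false").toBoolean
>>>>>       val location   = Option(properties.get("location"))
>>>>>       // Hive semantics: EXTERNAL is valid with or without LOCATION; when the
>>>>>       // location is missing the catalog picks a default directory, but the
>>>>>       // data is still kept on DROP TABLE.
>>>>>       (isExternal, location)
>>>>>     }
>>>>>   }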
>>>>>
>>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>>> implement the behavior of Hive SQL, even though Spark added the keyword 
>>>>> for
>>>>> Hive compatibility. The Spark catalog can interpret EXTERNAL however
>>>>> Spark chooses to, but I think it is a poor choice to force different
>>>>> behavior on other catalogs.
>>>>>
>>>>> Wenchen has also argued that the purpose of this is to standardize
>>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>>> that goal. Whether to physically delete data is a choice that is up to the
>>>>> catalog. Some catalogs have no “external” concept and will always drop 
>>>>> data
>>>>> when a table is dropped. The ability to keep underlying data files is
>>>>> specific to a few catalogs, and whether that is controlled by EXTERNAL,
>>>>> the LOCATION clause, or something else is still up to the catalog
>>>>> implementation.
>>>>>
>>>>> I don’t think that there is a good reason to force catalogs to break
>>>>> compatibility with Hive SQL, while making it appear as though DDL is
>>>>> compatible. Because removing EXTERNAL would be a breaking change to
>>>>> the SQL parser, I think the best option is to pass it to v2 catalogs so 
>>>>> the
>>>>> catalog can decide how to handle it.
>>>>>
>>>>> rb
>>>>>
>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'd like to start a discussion thread about this topic, as it blocks
>>>>>> an important feature that we target for Spark 3.1: unify the CREATE TABLE
>>>>>> SQL syntax.
>>>>>>
>>>>>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a
>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>
>>>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>>>> TABLE ... USING parquet`, the parser fails and tells you that
>>>>>> EXTERNAL can't be specified.
>>>>>>
>>>>>> When you write Hive CREATE TABLE syntax, EXTERNAL can be specified only
>>>>>> if a LOCATION clause or path option is present. For example, `CREATE
>>>>>> EXTERNAL TABLE ... STORED AS parquet` is not allowed as there is no
>>>>>> LOCATION clause or path option. This is not 100% Hive compatible.
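>>>>>>
>>>>>> To spell out the combinations described above (illustrative only, run
>>>>>> from spark-shell; the table names are placeholders):
>>>>>>
>>>>>>   spark.sql("CREATE TABLE t1 (id INT) USING parquet")              // ok: managed table
>>>>>>   spark.sql("CREATE EXTERNAL TABLE t2 (id INT) USING parquet")     // fails: EXTERNAL not allowed in native syntax
>>>>>>   spark.sql("CREATE EXTERNAL TABLE t3 (id INT) STORED AS parquet") // fails: no LOCATION clause or path option
>>>>>>   spark.sql("CREATE EXTERNAL TABLE t4 (id INT) STORED AS parquet LOCATION '/tmp/t4'")  // ok: external table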
>>>>>>
>>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>>>>>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>>>>>> was, or we can officially support it.
>>>>>>
>>>>>> Please let us know your thoughts:
>>>>>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>>> Have you used it in production before? For what use cases?
>>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>>> TABLE? It seems to me that it only makes sense for file sources, as the
>>>>>> table directory can be managed. I'm not sure how to interpret EXTERNAL in
>>>>>> catalogs like JDBC, Cassandra, etc.
>>>>>>
>>>>>> For more details, please refer to the long discussion in
>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
