As someone who's had the job of porting different SQL dialects to Spark,
I'm also very much in favor of keeping EXTERNAL, and I think Ryan's
suggestion of leaving it up to the catalogs to decide how to handle it
makes sense.

On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I would summarize both the problem and the current state differently.
>
> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
> SQL, but Spark’s built-in catalog doesn’t allow creating a table with
> EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
> compatibility with Hive SQL* because all combinations of EXTERNAL and
> LOCATION are valid in Hive, but creating an external table with a default
> location is not allowed by Spark. Note that Spark must still handle these
> tables because it shares a metastore with Hive, which can still create them.
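>
> To make the gap concrete, a quick illustration against a SparkSession
> `spark` with Hive support enabled (table names and schemas are made up):
>
>     // Accepted by Spark's built-in catalog: EXTERNAL together with LOCATION.
>     spark.sql("CREATE EXTERNAL TABLE ext_with_loc (id INT) STORED AS parquet " +
>       "LOCATION '/warehouse/ext_with_loc'")
>
>     // Valid in Hive (the table just gets a default location under the
>     // warehouse directory), but rejected by Spark's built-in catalog:
>     spark.sql("CREATE EXTERNAL TABLE ext_default_loc (id INT) STORED AS parquet")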
>
> Now that catalogs can be plugged in, the question is whether to pass the fact
> that EXTERNAL was in the CREATE TABLE statement to the v2 catalog
> handling a create command, or to suppress it and apply Spark’s rule that
> LOCATION must be present.
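>
> If it were passed along, a v2 catalog would presumably see it in the
> properties map handed to createTable. A rough sketch of the decision a
> Hive-compatible catalog could then make (the "external" property key is
> just an assumption for illustration; the exact mechanism is what's under
> discussion here):
>
>     import java.util.{Map => JMap}
>     import org.apache.spark.sql.connector.catalog.TableCatalog
>
>     // Called from a catalog's createTable(ident, schema, partitions, properties)
>     // to decide whether the table is external and where its data should live.
>     def resolveStorage(properties: JMap[String, String],
>                        defaultPath: String): (Boolean, String) = {
>       // Hypothetical "external" entry carrying the EXTERNAL keyword.
>       val isExternal =
>         "true".equalsIgnoreCase(properties.getOrDefault("external", "false"))
>       // LOCATION, when present, arrives as the reserved "location" property.
>       val location =
>         Option(properties.get(TableCatalog.PROP_LOCATION)).getOrElse(defaultPath)
>       // Hive semantics: EXTERNAL with no LOCATION is still valid; it uses the
>       // default path, and dropping the table later keeps the data files.
>       (isExternal, location)
>     }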
>
> If it is not passed to the catalog, then a Hive catalog cannot implement
> the behavior of Hive SQL, even though Spark added the keyword for Hive
> compatibility. The Spark catalog can interpret EXTERNAL however Spark
> chooses to, but I think it is a poor choice to force different behavior on
> other catalogs.
>
> Wenchen has also argued that the purpose of this is to standardize
> behavior across catalogs. But hiding EXTERNAL would not accomplish that
> goal. Whether to physically delete data is a choice that is up to the
> catalog. Some catalogs have no “external” concept and will always drop data
> when a table is dropped. The ability to keep underlying data files is
> specific to a few catalogs, and whether that is controlled by EXTERNAL,
> the LOCATION clause, or something else is still up to the catalog
> implementation.
>
> I don’t think that there is a good reason to force catalogs to break
> compatibility with Hive SQL, while making it appear as though DDL is
> compatible. Because removing EXTERNAL would be a breaking change to the
> SQL parser, I think the best option is to pass it to v2 catalogs so the
> catalog can decide how to handle it.
>
> rb
>
> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'd like to start a discussion thread about this topic, as it blocks an
>> important feature we're targeting for Spark 3.1: unifying the CREATE TABLE
>> SQL syntax.
>>
>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
>> feature in Spark for Hive compatibility.
>>
>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
>> ... USING parquet`, the parser fails and tells you that EXTERNAL can't
>> be specified.
>>
>> When you write Hive CREATE TABLE syntax, EXTERNAL can only be specified if a
>> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
>> TABLE ... STORED AS parquet` is not allowed because there is no LOCATION
>> clause or path option. This is not 100% Hive compatible.
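>>
>> To spell that out against a SparkSession `spark` with Hive support enabled
>> (names are made up):
>>
>>     // Native syntax: the parser rejects EXTERNAL outright.
>>     spark.sql("CREATE EXTERNAL TABLE t1 (id INT) USING parquet")
>>
>>     // Hive syntax: EXTERNAL is accepted only when a location is given.
>>     spark.sql("CREATE EXTERNAL TABLE t2 (id INT) STORED AS parquet " +
>>       "LOCATION '/tmp/t2'")                                        // accepted
>>     spark.sql("CREATE EXTERNAL TABLE t3 (id INT) STORED AS parquet")  // rejected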
>>
>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to
>> deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it
>> was, or we can officially support it.
>>
>> Please let us know your thoughts:
>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
>> you used it in production before? For what use cases?
>> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
>> It seems to me that it only makes sense for file sources, where the table
>> directory can be managed by the catalog. I'm not sure how to interpret
>> EXTERNAL in catalogs like JDBC, Cassandra, etc.
>>
>> For more details, please refer to the long discussion in
>> https://github.com/apache/spark/pull/28026
>>
>> Thanks,
>> Wenchen
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
