As someone who's had the job of porting different SQL dialects to Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's suggestion of leaving it up to the catalogs to decide how to handle it makes sense.
On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I would summarize both the problem and the current state differently.
>
> Currently, Spark parses the EXTERNAL keyword for compatibility with Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks compatibility with Hive SQL* because all combinations of EXTERNAL and LOCATION are valid in Hive, but creating an external table with a default location is not allowed by Spark. Note that Spark must still handle these tables because it shares a metastore with Hive, which can still create them.
>
> Now that catalogs can be plugged in, the question is whether to pass the fact that EXTERNAL was in the CREATE TABLE statement to the v2 catalog handling a create command, or to suppress it and apply Spark’s rule that LOCATION must be present.
>
> If it is not passed to the catalog, then a Hive catalog cannot implement the behavior of Hive SQL, even though Spark added the keyword for Hive compatibility. The Spark catalog can interpret EXTERNAL however Spark chooses to, but I think it is a poor choice to force different behavior on other catalogs.
>
> Wenchen has also argued that the purpose of this is to standardize behavior across catalogs. But hiding EXTERNAL would not accomplish that goal. Whether to physically delete data is a choice that is up to the catalog. Some catalogs have no “external” concept and will always drop data when a table is dropped. The ability to keep underlying data files is specific to a few catalogs, and whether that is controlled by EXTERNAL, the LOCATION clause, or something else is still up to the catalog implementation.
>
> I don’t think there is a good reason to force catalogs to break compatibility with Hive SQL while making it appear as though DDL is compatible.
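[The point above, that drop semantics are a per-catalog choice and EXTERNAL is only one way a catalog might control them, can be illustrated with a toy model. This is not Spark's actual TableCatalog API; the class and method names below are invented for illustration only.]

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Table:
    name: str
    external: bool            # was EXTERNAL present in CREATE TABLE?
    location: Optional[str]   # explicit LOCATION clause, if any

class HiveLikeCatalog:
    """Keeps the data files of external tables; deletes data for managed ones."""
    def drop(self, t: Table) -> str:
        return "metadata only" if t.external else "metadata and data"

class JdbcLikeCatalog:
    """Has no 'external' concept: dropping a table always drops its rows."""
    def drop(self, t: Table) -> str:
        return "metadata and data"

t = Table("logs", external=True, location="s3://bucket/logs")
print(HiveLikeCatalog().drop(t))   # metadata only
print(JdbcLikeCatalog().drop(t))   # metadata and data
```

[Passing EXTERNAL through to the catalog lets each implementation make this choice itself; suppressing it would force the Hive-like behavior out of reach.]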
> Because removing EXTERNAL would be a breaking change to the SQL parser, I think the best option is to pass it to v2 catalogs so the catalog can decide how to handle it.
>
> rb
>
> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Hi all,
>>
>> I'd like to start a discussion thread about this topic, as it blocks an important feature that we target for Spark 3.1: unifying the CREATE TABLE SQL syntax.
>>
>> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden feature in Spark for Hive compatibility.
>>
>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL can't be specified.
>>
>> When you write Hive CREATE TABLE syntax, EXTERNAL can be specified only if a LOCATION clause or path option is present. For example, `CREATE EXTERNAL TABLE ... STORED AS parquet` is not allowed because there is no LOCATION clause or path option. This is not 100% Hive compatible.
>>
>> As we are unifying the CREATE TABLE SQL syntax, one problem is how to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it was, or we can officially support it.
>>
>> Please let us know your thoughts:
>> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have you used it in production before? For what use cases?
>> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE? It seems to me that it only makes sense for file sources, where the table directory can be managed. I'm not sure how to interpret EXTERNAL in catalogs like JDBC, Cassandra, etc.
>> For more details, please refer to the long discussion in https://github.com/apache/spark/pull/28026
>>
>> Thanks,
>> Wenchen
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
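[The compatibility gap discussed in this thread can be distilled into a small acceptance table. This is a sketch of the rules as described above, not Spark's parser code: Hive accepts every combination of EXTERNAL and LOCATION, while Spark's built-in catalog rejects EXTERNAL without LOCATION.]

```python
def hive_accepts(external: bool, has_location: bool) -> bool:
    # Per the thread: all four combinations are valid Hive DDL.
    return True

def spark_accepts(external: bool, has_location: bool) -> bool:
    # The "hidden feature": EXTERNAL is only legal when LOCATION
    # (or a path option) is also present.
    return has_location if external else True

# The one combination the two dialects disagree on:
print(hive_accepts(True, False), spark_accepts(True, False))  # True False
```

[Passing EXTERNAL through to v2 catalogs would let a Hive catalog keep the first behavior while Spark's own catalog keeps the second.]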