Sorry these notes were delayed. Here’s what we talked about in the last DSv2 sync.
*Attendees*: Ryan Blue, John Zhuge, Burak Yavuz, Gengliang Wang, Terry Kim, Wenchen Fan, Xin Ren, Srabasti Banerjee, Priyanka Gomatam

*Topics*:
- Follow up on renaming append to insert in v2 API
- Changes to CatalogPlugin for v2 session catalog implementations
- Check on blockers
- Remove SaveMode: remove special case after file sources are disabled?
- Reorganize packages
- Open PRs
  - DataFrameWriterV2: https://github.com/apache/spark/pull/25354
  - SHOW TABLES: https://github.com/apache/spark/pull/25247
  - https://github.com/apache/spark/pull/25507

*Discussion*:
- Insert in DataFrameWriter v2 API:
  - Ryan: After reviewing the doc that Russel sent last time (https://en.wikipedia.org/wiki/Merge_(SQL)), it doesn't look like there is precedent for insert implementing upsert without an additional clause like ON DUPLICATE KEY UPDATE or ON CONFLICT .... I think that means insert should not be used for upsert, and it is correct to use the verb "append" in the new API.
  - Wenchen: Spark already supports INSERT OVERWRITE, which has no precedent other than Hive.
  - Ryan: Good point. INSERT OVERWRITE is a partition-level replace. If we think of single-key stores as partitioned by row key, then the dynamic INSERT OVERWRITE behavior is appropriate.
  - Ryan: One other reason to change "append" to "insert" is to match SQL. Should we consider renaming for consistency with SQL?
  - Burak: SQL inserts are by position, while DataFrameWriterV2 appends are by name. DataFrameWriter (v1) uses position for insertInto, so there is precedent that insert is by position.
  - Ryan: I agree with that logic. It makes more sense to use append to distinguish the behavior.
  - Consensus was to keep the "append" verb, but we will discuss again when Russel is back.
  - Burak: (Continuing from the DataFrameWriterV2 discussion) The v2 writer looks fine, other than that the partition functions are close to the built-in expression functions (years vs. the built-in year).
  - Consensus was to use "partitioning.years" for partitioning functions.
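The by-name vs. by-position distinction above can be sketched as follows. This is an illustrative sketch only: the v2 method names follow the shape proposed in PR #25354 and the "partitioning.years" consensus above, so the final released API may differ, and the table names are placeholders.

```scala
// Illustrative sketch; API names follow the PR #25354 proposal and the
// "partitioning.years" consensus and may differ in the released API.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()
val df = spark.table("db.source")

// v1 writer: insertInto resolves columns by POSITION, like SQL INSERT.
df.write.insertInto("db.target")

// v2 writer: append resolves columns by NAME, which is the reason for
// keeping the distinct verb "append" rather than renaming it to "insert".
df.writeTo("db.target").append()

// Partition transform functions live in a "partitioning" namespace so that
// years(...) does not collide with the built-in expression function year(...).
import org.apache.spark.sql.functions.partitioning
df.writeTo("db.events")
  .partitionedBy(partitioning.years(col("ts")))
  .create()
```

This requires a running Spark session, so it is shown here only to make the resolution semantics concrete.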
- Changes to CatalogPlugin for v2 session catalog implementations:
  - Wenchen: This adds a new config for overriding the v2 session catalog, and a new abstract class that must be implemented.
  - Ryan: Why a new config? If we intend for a user to be able to override this, then we already have a mechanism to configure it using the "session" catalog name.
  - Discussion on the pros and cons of using a different config; consensus was to use the existing CatalogPlugin config.
  - Ryan: It looks like this uses TableCatalog for the actual API and passes in the built-in V2SessionCatalog. That sounds like a good idea to me, instead of introducing a new API.
  - Burak: What about databases named "session"?
  - Ryan: Catalogs take precedence over databases, so the session catalog will be used for "session.table".
  - Burak: Sounds like this is going to break existing queries, then.
  - Ryan: I think the general rule that catalogs should take precedence is right. It would be worse to allow users creating databases to break other users' catalogs; we avoid that problem with catalogs because they are limited to jobs when users create them and are otherwise an administrator option. But session is a special case because this is Spark building a catalog into all environments… I think that's a good reason to name it something that we think is unlikely to conflict.
  - Discussion came up with several alternatives (*session*, built_in, etc.), but consensus settled on "spark_catalog". That's more descriptive and much less likely to conflict with existing databases.
- Remove SaveMode: this was done by Wenchen's commit that disabled file sources.

-- 
Ryan Blue
Software Engineer
Netflix
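To make the override mechanism concrete, here is a minimal sketch of plugging a custom implementation in through the existing CatalogPlugin config, using the agreed "spark_catalog" name. The class name is a hypothetical placeholder, and the exact config key is an assumption following the usual catalog-name convention.

```scala
// Minimal sketch: override the v2 session catalog through the existing
// CatalogPlugin config mechanism, using the agreed "spark_catalog" name.
// com.example.MySessionCatalog is a hypothetical TableCatalog implementation.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.catalog.spark_catalog", "com.example.MySessionCatalog")
  .getOrCreate()

// Catalog names take precedence over database names, so this resolves through
// the configured catalog rather than a database that happens to share the name.
spark.sql("SELECT * FROM spark_catalog.db.t").show()
```

Because "spark_catalog" is unlikely to collide with an existing database name, queries that previously resolved database-qualified tables keep working.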