Sorry all, I'm out on vacation till the week after next, I'll chime in then unless we need consensus sooner
On Fri, Aug 30, 2019, 4:05 PM Ryan Blue <rb...@netflix.com.invalid> wrote: > Sorry these notes were delayed. Here’s what we talked about in the last > DSv2 sync. > > *Attendees*: > > Ryan Blue > John Zhuge > Burak Yavuz > Gengliang Wang > Terry Kim > Wenchen Fan > Xin Ren > Srabasti Banerjee > Priyanka Gomatam > > *Topics*: > > - Follow up on renaming append to insert in v2 API > - Changes to CatalogPlugin for v2 session catalog implementations > - Check on blockers > - Remove SaveMode - remove special case after file sources are > disabled? > - Reorganize packages > - Open PRs > - DataFrameWriterV2: https://github.com/apache/spark/pull/25354 > - SHOW TABLES: https://github.com/apache/spark/pull/25247 > - https://github.com/apache/spark/pull/25507 > > *Discussion*: > > - Insert in DataFrameWriter v2 API: > - Ryan: After reviewing the doc that Russel sent > <https://en.wikipedia.org/wiki/Merge_(SQL>) last time, it doesn’t > look like there is precedent for insert implementing upsert, without an > additional clause like ON DUPLICATE KEY UPDATE or ON CONFLICT .... > I think that means that insert should not be used for upsert and it is > correct to use the verb “append” in the new API. > - Wenchen: Spark already supports INSERT OVERWRITE that has no > precedent other than Hive > - Ryan: Good point. INSERT OVERWRITE is a partition-level replace. > If we think of single-key stores as partitioned by row key, then the > dynamic INSERT OVERWRITE behavior is appropriate. > - Ryan: One other reason to change “append” to “insert” is to match > SQL. Should we consider renaming for consistency with SQL? > - Burak: SQL inserts are by position and DataFrameWriterV2 appends > are by name. DataFrameWriter (v1) uses position for insertInto, so > there is > a precedent that insert is by position. > - Ryan: I agree with that logic. It makes more sense to use append > to distinguish behavior. > - Consensus was to keep the “append” verb, but will discuss when > Russel is back. > - Burak: (Continuing from DataFrameWriterV2 discussion) The v2 > writer looks fine other than the partition functions are close to > built-in > expression functions (year vs years). > - Consensus was to use “partitioning.years” for partitioning > functions. > - Changes to CatalogPlugin for v2 session catalog implementations > - Wenchen: this adds a new config for overriding v2 session > catalog, and a new abstract class that must be implemented > - Ryan: Why a new config? If we intend for a user to be able to > override this, then we already have a mechanism to configure it using > the > “session” catalog name. > - Discussion on pros and cons of using a different config, > consensus was to use the existing CatalogPlugin config > - Ryan: Looks like this uses TableCatalog for the actual API and > passes in the built-in V2SessionCatalog. That sounds like a good idea to > me, instead of introducing a new API. > - Burak: What about databases named “session”? > - Ryan: Catalogs take precedence over databases, so the session > catalog will be used for “session.table”. > - Burak: Sounds like this is going to break existing queries then. > - Ryan: I think the general rule that catalogs should take > precedence is right. It would be worse to allow users creating > databases to > break other users’ catalogs — we avoid that problem with catalogs > because > they are limited to jobs when users create them and are otherwise an > administrator option. But, session is a special case because this is > Spark > building a catalog into all environments… I think that’s a good reason > to > name it something that we think is unlikely to conflict. > - Discussion came up with several alternatives (*session*, > built_in, etc) but consensus settled on “spark_catalog”. That’s more > descriptive and much less likely to conflict with existing databases. > - Remove SaveMode: this was done by Wenchen’s commit that disabled > file sources. > > -- > Ryan Blue > Software Engineer > Netflix >