Sorry these notes were delayed. Here’s what we talked about in the last
DSv2 sync.
*Attendees*:
Ryan Blue
John Zhuge
Burak Yavuz
Gengliang Wang
Terry Kim
Wenchen Fan
Xin Ren
Srabasti Banerjee
Priyanka Gomatam
*Topics*:
- Follow up on renaming append to insert in v2 API
- Changes to CatalogPlugin for v2 session catalog implementations
- Check on blockers
- Remove SaveMode - remove special case after file sources are
disabled?
- Reorganize packages
- Open PRs
- DataFrameWriterV2: https://github.com/apache/spark/pull/25354
- SHOW TABLES: https://github.com/apache/spark/pull/25247
- https://github.com/apache/spark/pull/25507
*Discussion*:
- Insert in DataFrameWriter v2 API:
- Ryan: After reviewing the doc that Russel sent
<https://en.wikipedia.org/wiki/Merge_(SQL)> last time, it doesn’t look
like there is precedent for insert implementing upsert without an
additional clause like ON DUPLICATE KEY UPDATE or ON CONFLICT .... I
think that means insert should not be used for upsert, and it is
correct to use the verb “append” in the new API.
- Wenchen: Spark already supports INSERT OVERWRITE, which has no
precedent other than Hive.
- Ryan: Good point. INSERT OVERWRITE is a partition-level replace. If
we think of single-key stores as partitioned by row key, then the
dynamic INSERT OVERWRITE behavior is appropriate.
- Ryan: One other reason to change “append” to “insert” would be
consistency with SQL. Should we consider renaming?
- Burak: SQL inserts are by position and DataFrameWriterV2 appends are
by name. DataFrameWriter (v1) uses position for insertInto, so there is
a precedent that insert is by position.
- Ryan: I agree with that logic. It makes more sense to use “append” to
distinguish the by-name behavior.
- Consensus was to keep the “append” verb, but we will discuss again
when Russel is back. A short sketch of the two behaviors follows this
discussion.
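A minimal Scala sketch of the by-position vs. by-name distinction. The
writeTo/append calls follow the API proposed in the DataFrameWriterV2
PR above, so the exact names may still change; the table and catalog
names are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dsv2-append").getOrCreate()
    val df = spark.range(10).toDF("event_id")

    // v1: insertInto matches DataFrame columns to the table by POSITION,
    // like a SQL INSERT.
    df.write.insertInto("db.events")

    // v2 (proposed): append matches DataFrame columns to the table by
    // NAME, which is the behavior the "append" verb distinguishes.
    df.writeTo("catalog.db.events").append()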
- Burak: (Continuing from the DataFrameWriterV2 discussion) The v2
writer looks fine, except that the partition function names are close
to the built-in expression function names (year vs. years).
- Consensus was to use “partitioning.years” for partitioning functions
(sketch below).
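A sketch of the agreed naming. Only the “partitioning.years” spelling
was decided here; the import path for the partitioning object and the
table schema are assumptions.

    import org.apache.spark.sql.functions.{col, partitioning}

    // Reuses the SparkSession from the sketch above; "ts" is an assumed
    // timestamp column.
    val events = spark.sql("SELECT current_timestamp() AS ts, 1L AS event_id")

    // partitioning.years is a partition transform (cluster rows by the
    // year of ts), distinct from functions.year, which extracts the
    // year value from a timestamp.
    events.writeTo("catalog.db.events")
      .partitionedBy(partitioning.years(col("ts")))
      .create()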
- Changes to CatalogPlugin for v2 session catalog implementations
- Wenchen: This adds a new config for overriding the v2 session
catalog, and a new abstract class that must be implemented.
- Ryan: Why a new config? If we intend for a user to be able to
override this, then we already have a mechanism to configure it using the
“session” catalog name.
- After discussing the pros and cons of a separate config, consensus
was to use the existing CatalogPlugin config.
- Ryan: Looks like this uses TableCatalog for the actual API and
passes in the built-in V2SessionCatalog. That sounds like a good idea to
me, instead of introducing a new API.
- Burak: What about databases named “session”?
- Ryan: Catalogs take precedence over databases, so the session
catalog will be used for “session.table”.
- Burak: Sounds like this is going to break existing queries then.
- Ryan: I think the general rule that catalogs should take precedence
is right. It would be worse to allow users creating databases to break
other users’ catalogs; we avoid that problem with catalogs because they
are limited to jobs when users create them and are otherwise an
administrator option. But “session” is a special case, because this is
Spark building a catalog into all environments. I think that’s a good
reason to name it something unlikely to conflict.
- Discussion came up with several alternatives (*session*, built_in,
etc.), but consensus settled on “spark_catalog”. That’s more
descriptive and much less likely to conflict with existing databases
(see the config sketch below).
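A sketch of the configuration under this consensus, using the existing
CatalogPlugin mechanism (spark.sql.catalog.<name>); the implementation
classes here are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // A custom catalog, registered under its own name; the name and
      // class are placeholders.
      .config("spark.sql.catalog.mycatalog", "com.example.MyTableCatalog")
      // The v2 session catalog override uses the same mechanism with
      // the agreed "spark_catalog" name, instead of a separate config.
      .config("spark.sql.catalog.spark_catalog",
        "com.example.MySessionCatalog")
      .getOrCreate()

    // Catalog names take precedence over database names, so this
    // resolves through the custom catalog, not through a database
    // named "mycatalog".
    spark.sql("SELECT * FROM mycatalog.db.events").show()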
- Remove SaveMode: this was done by Wenchen’s commit that disabled file
sources.
--
Ryan Blue
Software Engineer
Netflix