Re: DSv2 sync notes - 21 August 2019

Russell Spitzer Fri, 30 Aug 2019 16:43:31 -0700

Sorry all, I'm out on vacation till the week after next, I'll chime in then
unless we need consensus sooner


On Fri, Aug 30, 2019, 4:05 PM Ryan Blue <[email protected]> wrote:

> Sorry these notes were delayed. Here’s what we talked about in the last
> DSv2 sync.
>
> *Attendees*:
>
> Ryan Blue
> John Zhuge
> Burak Yavuz
> Gengliang Wang
> Terry Kim
> Wenchen Fan
> Xin Ren
> Srabasti Banerjee
> Priyanka Gomatam
>
> *Topics*:
>
>    - Follow up on renaming append to insert in v2 API
>    - Changes to CatalogPlugin for v2 session catalog implementations
>    - Check on blockers
>       - Remove SaveMode - remove special case after file sources are
>       disabled?
>       - Reorganize packages
>    - Open PRs
>       - DataFrameWriterV2: https://github.com/apache/spark/pull/25354
>       - SHOW TABLES: https://github.com/apache/spark/pull/25247
>       - https://github.com/apache/spark/pull/25507
>
> *Discussion*:
>
>    - Insert in DataFrameWriter v2 API:
>       - Ryan: After reviewing the doc that Russel sent
>       <https://en.wikipedia.org/wiki/Merge_(SQL>) last time, it doesn’t
>       look like there is precedent for insert implementing upsert, without an
>       additional clause like ON DUPLICATE KEY UPDATE or ON CONFLICT ....
>       I think that means that insert should not be used for upsert and it is
>       correct to use the verb “append” in the new API.
>       - Wenchen: Spark already supports INSERT OVERWRITE that has no
>       precedent other than Hive
>       - Ryan: Good point. INSERT OVERWRITE is a partition-level replace.
>       If we think of single-key stores as partitioned by row key, then the
>       dynamic INSERT OVERWRITE behavior is appropriate.
>       - Ryan: One other reason to change “append” to “insert” is to match
>       SQL. Should we consider renaming for consistency with SQL?
>       - Burak: SQL inserts are by position and DataFrameWriterV2 appends
>       are by name. DataFrameWriter (v1) uses position for insertInto, so 
> there is
>       a precedent that insert is by position.
>       - Ryan: I agree with that logic. It makes more sense to use append
>       to distinguish behavior.
>       - Consensus was to keep the “append” verb, but will discuss when
>       Russel is back.
>       - Burak: (Continuing from DataFrameWriterV2 discussion) The v2
>       writer looks fine other than the partition functions are close to 
> built-in
>       expression functions (year vs years).
>       - Consensus was to use “partitioning.years” for partitioning
>       functions.
>    - Changes to CatalogPlugin for v2 session catalog implementations
>       - Wenchen: this adds a new config for overriding v2 session
>       catalog, and a new abstract class that must be implemented
>       - Ryan: Why a new config? If we intend for a user to be able to
>       override this, then we already have a mechanism to configure it using 
> the
>       “session” catalog name.
>       - Discussion on pros and cons of using a different config,
>       consensus was to use the existing CatalogPlugin config
>       - Ryan: Looks like this uses TableCatalog for the actual API and
>       passes in the built-in V2SessionCatalog. That sounds like a good idea to
>       me, instead of introducing a new API.
>       - Burak: What about databases named “session”?
>       - Ryan: Catalogs take precedence over databases, so the session
>       catalog will be used for “session.table”.
>       - Burak: Sounds like this is going to break existing queries then.
>       - Ryan: I think the general rule that catalogs should take
>       precedence is right. It would be worse to allow users creating 
> databases to
>       break other users’ catalogs — we avoid that problem with catalogs 
> because
>       they are limited to jobs when users create them and are otherwise an
>       administrator option. But, session is a special case because this is 
> Spark
>       building a catalog into all environments… I think that’s a good reason 
> to
>       name it something that we think is unlikely to conflict.
>       - Discussion came up with several alternatives (*session*,
>       built_in, etc) but consensus settled on “spark_catalog”. That’s more
>       descriptive and much less likely to conflict with existing databases.
>    - Remove SaveMode: this was done by Wenchen’s commit that disabled
>    file sources.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: DSv2 sync notes - 21 August 2019

Reply via email to