Sorry these notes were delayed. Here’s what we talked about in the last
DSv2 sync.
*Attendees*:
Ryan Blue
John Zhuge
Burak Yavuz
Gengliang Wang
Terry Kim
Wenchen Fan
Xin Ren
Srabasti Banerjee
Priyanka Gomatam
*Topics*:
- Follow up on renaming append to insert in v2 API
- Changes to CatalogPlugin for v2 session catalog implementations
- Check on blockers
- Remove SaveMode - remove special case after file sources are
disabled?
- Reorganize packages
- Open PRs
- DataFrameWriterV2: https://github.com/apache/spark/pull/25354
- SHOW TABLES: https://github.com/apache/spark/pull/25247
- https://github.com/apache/spark/pull/25507
*Discussion*:
- Insert in DataFrameWriter v2 API:
- Ryan: After reviewing the doc that Russel sent
<https://en.wikipedia.org/wiki/Merge_(SQL)> last time, it doesn’t look
like there is precedent for insert implementing upsert without an
additional clause like ON DUPLICATE KEY UPDATE or ON CONFLICT .... I
think that means insert should not be used for upsert, and it is
correct to use the verb “append” in the new API.
- Wenchen: Spark already supports INSERT OVERWRITE, which has no
precedent other than Hive.
- Ryan: Good point. INSERT OVERWRITE is a partition-level replace. If
we think of single-key stores as partitioned by row key, then the
dynamic INSERT OVERWRITE behavior is appropriate.
- Ryan: One other reason to change “append” to “insert” would be
consistency with SQL. Should we consider renaming?
- Burak: SQL inserts are by position and DataFrameWriterV2 appends are
by name. DataFrameWriter (v1) uses position for insertInto, so there is
a precedent that insert is by position.
- Ryan: I agree with that logic. It makes more sense to use “append” to
distinguish the by-name behavior.
- Consensus was to keep the “append” verb, but we will discuss again
when Russel is back. A short sketch of the two behaviors follows this
discussion.
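A minimal Scala sketch of the by-position vs. by-name distinction. The
writeTo/append calls follow the API proposed in the DataFrameWriterV2
PR above, so the exact names may still change; the table and catalog
names are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dsv2-append").getOrCreate()
    val df = spark.range(10).toDF("event_id")

    // v1: insertInto matches DataFrame columns to the table by POSITION,
    // like a SQL INSERT.
    df.write.insertInto("db.events")

    // v2 (proposed): append matches DataFrame columns to the table by
    // NAME, which is the behavior the "append" verb distinguishes.
    df.writeTo("catalog.db.events").append()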
- Burak: (Continuing from the DataFrameWriterV2 discussion) The v2
writer looks fine, except that the partition function names are close
to the built-in expression function names (year vs. years).
- Consensus was to use “partitioning.years” for partitioning functions
(sketch below).
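A sketch of the agreed naming. Only the “partitioning.years” spelling
was decided here; the import path for the partitioning object and the
table schema are assumptions.

    import org.apache.spark.sql.functions.{col, partitioning}

    // Reuses the SparkSession from the sketch above; "ts" is an assumed
    // timestamp column.
    val events = spark.sql("SELECT current_timestamp() AS ts, 1L AS event_id")

    // partitioning.years is a partition transform (cluster rows by the
    // year of ts), distinct from functions.year, which extracts the
    // year value from a timestamp.
    events.writeTo("catalog.db.events")
      .partitionedBy(partitioning.years(col("ts")))
      .create()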
- Changes to CatalogPlugin for v2 session catalog implementations
- Wenchen: This adds a new config for overriding the v2 session
catalog, and a new abstract class that must be implemented.
- Ryan: Why a new config? If we intend for a user to be able to
override this, then we already have a mechanism to configure it using the
“session” catalog name.
- After discussing the pros and cons of a separate config, consensus
was to use the existing CatalogPlugin config.
- Ryan: Looks like this uses TableCatalog for the actual API and
passes in the built-in V2SessionCatalog. That sounds like a good idea to
me, instead of introducing a new API.
- Burak: What about databases named “session”?
- Ryan: Catalogs take precedence over databases, so the session
catalog will be used for “session.table”.
- Burak: Sounds like this is going to break existing queries then.
- Ryan: I think the general rule that catalogs should take precedence
is right. It would be worse to allow users creating databases to break
other users’ catalogs; we avoid that problem with catalogs because they
are limited to jobs when users create them and are otherwise an
administrator option. But “session” is a special case, because this is
Spark building a catalog into all environments. I think that’s a good
reason to name it something unlikely to conflict.
- Discussion came up with several alternatives (*session*, built_in,
etc.), but consensus settled on “spark_catalog”. That’s more
descriptive and much less likely to conflict with existing databases
(see the config sketch below).
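A sketch of the configuration under this consensus, using the existing
CatalogPlugin mechanism (spark.sql.catalog.<name>); the implementation
classes here are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // A custom catalog, registered under its own name; the name and
      // class are placeholders.
      .config("spark.sql.catalog.mycatalog", "com.example.MyTableCatalog")
      // The v2 session catalog override uses the same mechanism with
      // the agreed "spark_catalog" name, instead of a separate config.
      .config("spark.sql.catalog.spark_catalog",
        "com.example.MySessionCatalog")
      .getOrCreate()

    // Catalog names take precedence over database names, so this
    // resolves through the custom catalog, not through a database
    // named "mycatalog".
    spark.sql("SELECT * FROM mycatalog.db.events").show()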
- Remove SaveMode: this was done by Wenchen’s commit that disabled file
sources.
--
Ryan Blue
Software Engineer
Netflix