Here are my notes from the DSv2 sync last night. As always, if you have corrections, please reply with them. And if you’d like to be included on the invite to participate in the next sync (6 March), send me an email.
Here’s a quick summary of the topics where we had consensus last night:
- The behavior of v1 sources needs to be documented so we can come up with a migration plan
- Spark 3.0 should include DSv2, even if it would delay the release (pending community discussion and vote)
- Design for the v2 catalog plugin system
- V2 catalog approach of separate TableCatalog, FunctionCatalog, and ViewCatalog interfaces
- Common v2 table metadata should be schema, partitioning, and a string map of properties, leaving out sorting for now (ready to vote on the metadata SPIP)

*Topics*:
- Issues raised by the ORC v2 commit
- Migration to v2 sources
- Roadmap and current blockers
- Catalog plugin system
- Catalog API separate interfaces approach
- Catalog API metadata (schema, partitioning, and properties)
- Public catalog API proposal

*Notes*:
- Issues raised by the ORC v2 commit
  - Ryan: Disabled the change to use v2 by default in the PR for overwrite plans: the tests rely on CTAS, which is not implemented in v2.
  - Wenchen: Suggested using a StagedTable to work around CTAS not being finished. TableProvider could create a staged table.
  - Ryan: Using StagedTable doesn’t make sense to me. It was intended to solve a different problem (atomicity). Adding an interface to create a staged table either requires the same metadata as CTAS or requires a blank staged table, which isn’t the same concept: these staged tables would behave entirely differently from the ones for atomic operations. Better to spend time getting CTAS done and work through the long-term plan than to hack around it.
  - Second issue raised by the ORC work: how to support tables that use different validations.
  - Ryan: What Gengliang’s PRs are missing is a clear definition of which tables require different validation and what that validation should be. In some cases, CTAS is validated against existing data [Ed: this is PreprocessTableCreation] and in some cases, Append has no validation because the table doesn’t exist. What isn’t clear is when these validations are applied.
  - Ryan: Without knowing exactly how v1 works, we can’t mirror that behavior in v2. Building a way to turn off validation is going to be needed, but it is insufficient without knowing when to apply it.
  - Ryan: We also don’t know if it will make sense to maintain all of these rules to mimic v1 behavior. In v1, CTAS and Append can both write to existing tables, but they use different rules to validate. What are the differences between them? It is unlikely that Spark will support both as options, if that is even possible. [Ed: see the later discussion on migration that continues this.]
  - Gengliang: Using SaveMode is an option.
  - Ryan: Using SaveMode only appears to fix this, but doesn’t actually test v2. Using SaveMode appears to work because it disables all validation and uses code from v1 that will “create” tables by writing. But this isn’t helpful for the v2 goal of having defined and reliable behavior.
  - Gengliang: SaveMode is not correctly translated. Append could mean AppendData or CTAS.
  - Ryan: This is why we need to focus on finishing the v2 plans: so we can correctly translate the SaveMode into the right plan. That depends on having a catalog for CTAS and for checking whether a table exists.
  - Wenchen: The catalog doesn’t support path tables, so how does this help?
  - Ryan: The multi-catalog identifiers proposal includes a way to pass paths as CatalogIdentifiers [Ed: see PathIdentifier]. This allows a catalog implementation to handle path-based tables. The identifier will also have a method to test whether the identifier is a path identifier, and catalogs are not required to support path identifiers. (See the sketch below.)
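To make the path-identifier idea concrete, here is a minimal Scala sketch. Every name in it (PathAwareIdentifier, isPathIdentifier, ExampleNamedTableCatalog) is made up for illustration; this is not the API from the multi-catalog identifiers proposal, just the general shape being described.

```scala
// Illustrative sketch only: these names are hypothetical and are not the
// API defined in the multi-catalog identifiers proposal.
trait PathAwareIdentifier {
  def namespace: Seq[String]     // e.g. Seq("prod", "db") for a catalog table
  def name: String               // the table name, or the raw path for a path table
  def isPathIdentifier: Boolean  // true when this identifier wraps a file path
}

// A catalog that only supports named tables can simply reject path identifiers.
class ExampleNamedTableCatalog {
  def loadTable(ident: PathAwareIdentifier): Unit = {
    if (ident.isPathIdentifier) {
      throw new UnsupportedOperationException(
        s"Path-based tables are not supported: ${ident.name}")
    }
    // ... resolve the table from (ident.namespace, ident.name) ...
  }
}
```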
- Migration to v2 sources
  - Hyukjin: Once the ORC upgrade is done, how will we move from v1 to v2?
  - Ryan: We will need to develop v1 and v2 in parallel. There are many code paths in v1 and we don’t know exactly what they do. We first need to know what they do and make a migration plan after that.
  - Hyukjin: What if there are many behavior differences? Will this require an API to opt in to each one?
  - Ryan: Without knowing how v1 behaves, we can only speculate. But I don’t think that we will want to support many of these special cases. That is a lot of work and maintenance.
  - Gengliang: When can we change the default to v2? Until we change the default, v2 is not tested. The v2 work is blocked by this.
  - Ryan: v2 work should not be blocked by finishing CTAS and other plans. This can proceed in parallel.
  - Matt: We don’t need to use the existing tests; we can add tests for v2 below the DF writer level.
  - Gengliang: But those tests would not be end-to-end.
  - Ryan: For end-to-end tests, we should add a new DataFrame write API. That is going to be needed to move entirely to v2 and drop the v1 behavior hacks anyway. Adding it now fixes both problems.
  - Matt: Supports the idea of adding the DF v2 write API now.
  - *Consensus for documenting the behavior of v1* (Gengliang will work on this because it affects his work)
- Roadmap:
  - Matt (I think): The community should commit to finishing the planned work on DSv2 for Spark 3.0.
  - Ryan: Agree, we can’t wait forever and a lot of this work has been pending for a year now. If this doesn’t make it into 3.0, we will need to consider other options.
  - Felix: The goal should be 3.0, even if it requires delaying the release.
  - *Consensus: Spark 3.0 should include DSv2, even if it requires delaying the release.* Ryan will start a discussion thread about committing to DSv2 in Spark 3.0.
  - Matt: What work is outstanding for DSv2?
  - Ryan: Addition of the TableCatalog API, the catalog plugin system, and the CTAS implementation.
  - Matt: What blocks those things?
  - Ryan: The next blocker is agreement on the catalog plugin system, the catalog API approach (separate TableCatalog, FunctionCatalog, etc.), and TableCatalog metadata.
  - *Consensus formed for the catalog plugin system* (as previously discussed)
  - *Consensus formed for the catalog API approach*
  - *Consensus formed for TableCatalog metadata in the SPIP*
    - Ryan will start a vote thread for this SPIP
  - Ryan: The metadata SPIP also includes a public API that isn’t required. Will move it to an implementation sketch so it is informational.
  - Wenchen: InternalRow is written in Scala and needs a stable API.
  - Ryan: Can we do the InternalRow fix later?
  - Wenchen: Yes, not a blocker.
  - Ryan: Table metadata also contains sort information.
  - Wenchen: Bucketing contains sort information, but it isn’t used because it applies only to single files.
  - *Consensus formed on not including sorts in v2 table metadata.* (A rough sketch of the agreed metadata shape is at the end of this email.)

*Attendees*:
Ryan Blue
John Zhuge
Dongjoon Hyun
Felix Cheung
Gengliang Wang
Hyukjin Kwon
Jacky Lee
Jamison Bennett
Matt Cheah
Yifei Huang
Russell Spitzer
Wenchen Fan
Yuanjian Li

--
Ryan Blue
Software Engineer
Netflix
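Sketch referenced in the metadata notes above: a minimal, non-authoritative picture of the agreed table metadata shape (schema, partitioning, and a string map of properties, with sort order left out). The class and field names below are made up for illustration; the SPIP defines the actual interfaces.

```scala
import org.apache.spark.sql.types.StructType

// Illustrative sketch only: the class and field names are hypothetical and are
// not the interfaces defined in the metadata SPIP.
case class ExampleTableMetadata(
    schema: StructType,               // column names and types
    partitioning: Seq[String],        // simplified here to partition column names
    properties: Map[String, String])  // free-form string properties; no sort order
```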