Here are my notes from the DSv2 sync last night. As always, if you have corrections, please reply with them. And if you’d like to be included on the invite to participate in the next sync (6 March), send me an email.
Here’s a quick summary of the topics where we had consensus last night:
- The behavior of v1 sources needs to be documented so we can come up with a migration plan
- Spark 3.0 should include DSv2, even if it would delay the release (pending community discussion and vote)
- Design for the v2 catalog plugin system
- V2 catalog approach of separate TableCatalog, FunctionCatalog, and ViewCatalog interfaces
- Common v2 table metadata should be schema, partitioning, and a string map of properties, leaving out sorting for now (ready to vote on the metadata SPIP)

*Topics*:
- Issues raised by the ORC v2 commit
- Migration to v2 sources
- Roadmap and current blockers
- Catalog plugin system
- Catalog API separate interfaces approach
- Catalog API metadata (schema, partitioning, and properties)
- Public catalog API proposal

*Notes*:
- Issues raised by the ORC v2 commit
  - Ryan: Disabled the change to use v2 by default in the PR for overwrite plans: the tests rely on CTAS, which is not implemented in v2.
  - Wenchen: Suggested using a StagedTable to work around CTAS not being finished. TableProvider could create a staged table.
  - Ryan: Using StagedTable doesn’t make sense to me. It was intended to solve a different problem (atomicity). Adding an interface to create a staged table either requires the same metadata as CTAS or requires a blank staged table, which isn’t the same concept: these staged tables would behave entirely differently from the ones for atomic operations. Better to spend time getting CTAS done and work through the long-term plan than to hack around it.
  - Second issue raised by the ORC work: how to support tables that use different validations.
  - Ryan: What Gengliang’s PRs are missing is a clear definition of which tables require different validation and what that validation should be. In some cases, CTAS is validated against existing data [Ed: this is PreprocessTableCreation] and in some cases, Append has no validation because the table doesn’t exist. What isn’t clear is when these validations are applied.
  - Ryan: Without knowing exactly how v1 works, we can’t mirror that behavior in v2. Building a way to turn off validation is going to be needed, but it is insufficient without knowing when to apply it.
  - Ryan: We also don’t know if it will make sense to maintain all of these rules to mimic v1 behavior. In v1, CTAS and Append can both write to existing tables, but they use different rules to validate. What are the differences between them? It is unlikely that Spark will support both as options, if that is even possible. [Ed: see the later discussion on migration that continues this.]
  - Gengliang: Using SaveMode is an option.
  - Ryan: Using SaveMode only appears to fix this, but doesn’t actually test v2. Using SaveMode appears to work because it disables all validation and uses code from v1 that will “create” tables by writing. But this isn’t helpful for the v2 goal of having defined and reliable behavior.
  - Gengliang: SaveMode is not correctly translated. Append could mean AppendData or CTAS.
  - Ryan: This is why we need to focus on finishing the v2 plans: so we can correctly translate the SaveMode into the right plan. That depends on having a catalog for CTAS and for checking whether a table exists.
  - Wenchen: The catalog doesn’t support path tables, so how does this help?
  - Ryan: The multi-catalog identifiers proposal includes a way to pass paths as CatalogIdentifiers [Ed: see PathIdentifier]. This allows a catalog implementation to handle path-based tables. The identifier will also have a method to test whether the identifier is a path identifier, and catalogs are not required to support path identifiers. (See the sketch below.)
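To make the path-identifier idea concrete, here is a minimal Scala sketch. Every name in it (PathAwareIdentifier, isPathIdentifier, ExampleNamedTableCatalog) is made up for illustration; this is not the API from the multi-catalog identifiers proposal, just the general shape being described.

```scala
// Illustrative sketch only: these names are hypothetical and are not the
// API defined in the multi-catalog identifiers proposal.
trait PathAwareIdentifier {
  def namespace: Seq[String]     // e.g. Seq("prod", "db") for a catalog table
  def name: String               // the table name, or the raw path for a path table
  def isPathIdentifier: Boolean  // true when this identifier wraps a file path
}

// A catalog that only supports named tables can simply reject path identifiers.
class ExampleNamedTableCatalog {
  def loadTable(ident: PathAwareIdentifier): Unit = {
    if (ident.isPathIdentifier) {
      throw new UnsupportedOperationException(
        s"Path-based tables are not supported: ${ident.name}")
    }
    // ... resolve the table from (ident.namespace, ident.name) ...
  }
}
```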
- Migration to v2 sources
  - Hyukjin: Once the ORC upgrade is done, how will we move from v1 to v2?
  - Ryan: We will need to develop v1 and v2 in parallel. There are many code paths in v1 and we don’t know exactly what they do. We first need to know what they do and make a migration plan after that.
  - Hyukjin: What if there are many behavior differences? Will this require an API to opt in to each one?
  - Ryan: Without knowing how v1 behaves, we can only speculate. But I don’t think that we will want to support many of these special cases. That is a lot of work and maintenance.
  - Gengliang: When can we change the default to v2? Until we change the default, v2 is not tested. The v2 work is blocked by this.
  - Ryan: v2 work should not be blocked by finishing CTAS and other plans. This can proceed in parallel.
  - Matt: We don’t need to use the existing tests; we can add tests for v2 below the DF writer level.
  - Gengliang: But those tests would not be end-to-end.
  - Ryan: For end-to-end tests, we should add a new DataFrame write API. That is going to be needed to move entirely to v2 and drop the v1 behavior hacks anyway. Adding it now fixes both problems.
  - Matt: Supports the idea of adding the DF v2 write API now.
  - *Consensus for documenting the behavior of v1* (Gengliang will work on this because it affects his work)
- Roadmap:
  - Matt (I think): The community should commit to finishing the planned work on DSv2 for Spark 3.0.
  - Ryan: Agree, we can’t wait forever and a lot of this work has been pending for a year now. If this doesn’t make it into 3.0, we will need to consider other options.
  - Felix: The goal should be 3.0, even if it requires delaying the release.
  - *Consensus: Spark 3.0 should include DSv2, even if it requires delaying the release.* Ryan will start a discussion thread about committing to DSv2 in Spark 3.0.
  - Matt: What work is outstanding for DSv2?
  - Ryan: Addition of the TableCatalog API, the catalog plugin system, and the CTAS implementation.
  - Matt: What blocks those things?
  - Ryan: The next blocker is agreement on the catalog plugin system, the catalog API approach (separate TableCatalog, FunctionCatalog, etc.), and TableCatalog metadata.
  - *Consensus formed for the catalog plugin system* (as previously discussed)
  - *Consensus formed for the catalog API approach*
  - *Consensus formed for TableCatalog metadata in the SPIP*
    - Ryan will start a vote thread for this SPIP
  - Ryan: The metadata SPIP also includes a public API that isn’t required. Will move it to an implementation sketch so it is informational.
  - Wenchen: InternalRow is written in Scala and needs a stable API.
  - Ryan: Can we do the InternalRow fix later?
  - Wenchen: Yes, not a blocker.
  - Ryan: Table metadata also contains sort information.
  - Wenchen: Bucketing contains sort information, but it isn’t used because it applies only to single files.
  - *Consensus formed on not including sorts in v2 table metadata.* (A rough sketch of the agreed metadata shape is at the end of this email.)

*Attendees*:
Ryan Blue
John Zhuge
Dongjoon Hyun
Felix Cheung
Gengliang Wang
Hyukjin Kwon
Jacky Lee
Jamison Bennett
Matt Cheah
Yifei Huang
Russell Spitzer
Wenchen Fan
Yuanjian Li

--
Ryan Blue
Software Engineer
Netflix
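Sketch referenced in the metadata notes above: a minimal, non-authoritative picture of the agreed table metadata shape (schema, partitioning, and a string map of properties, with sort order left out). The class and field names below are made up for illustration; the SPIP defines the actual interfaces.

```scala
import org.apache.spark.sql.types.StructType

// Illustrative sketch only: the class and field names are hypothetical and are
// not the interfaces defined in the metadata SPIP.
case class ExampleTableMetadata(
    schema: StructType,               // column names and types
    partitioning: Seq[String],        // simplified here to partition column names
    properties: Map[String, String])  // free-form string properties; no sort order
```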