DataSourceV2 sync notes - 10 July 2019

Ryan Blue Fri, 19 Jul 2019 17:21:01 -0700

Here are my notes from the last sync. If you’d like to be added to the
invite or have topics, please let me know.


*Attendees*:

Ryan Blue
Matt Cheah
Yifei Huang
Jose Torres
Burak Yavuz
Gengliang Wang
Michael Artz
Russel Spitzer

*Topics*:

   - Existing PRs
      - V2 session catalog: https://github.com/apache/spark/pull/24768
      - REPLACE and RTAS: https://github.com/apache/spark/pull/24798
      - DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
      - ALTER TABLE: https://github.com/apache/spark/pull/24937
      - INSERT INTO: https://github.com/apache/spark/pull/24832
   - Stats integration
   - CTAS and DataFrameWriter behavior

*Discussion*:

   - ALTER TABLE PR is ready to commit (and was after the sync)
   - REPLACE and RTAS PR: waiting on more reviews
   - INSERT INTO PR: Ryan will review
   - DESCRIBE TABLE has test failures, Matt will fix
   - V2 session catalog:
      - How will v2 catalog be configured?
      - Ryan: This is up for discussion because it currently uses a table
      property. I think it needs to be configurable
      - Burak: Agree that it should be configurable
      - Ryan: Does this need to be determined now, or can we solve this
      after getting the functionality in?
      - Jose: let’s get it in and fix it later
   - Stats integration:
      - Matt: has anyone looked at stats integration? What needs to be done?
      - Ryan: stats are part of the Scan API. Configure a scan with
      ScanBuilder and then get stats from it. The problem is that this happens
      when converting to physical plan, after the optimizer. But the optimizer
      determines what gets broadcasted. A work-around Netflix uses is
to run push
      down in the stats code. This runs push-down twice and was rejected from
      Spark, but is important for performance. We should add a
property to enable
      this.
      - Ryan: The larger problem is that stats are used in the optimizer,
      but push-down happens when converting to physical plan. This is also
      related to our earlier discussions about when join types are
chosen. Fixing
      this is a big project
   - CTAS and DataFrameWriter behavior
      - Burak: DataFrameWriter uses CTAS where it shouldn’t. It is
      difficult to predict v1 behavior
      - Ryan: Agree, v1 DataFrameWriter does not have clear behavior. We
      suggest a replacement with clear verbs for each SQL action:
append/insert,
      overwrite, overwriteDynamic, create (table), replace (table)
      - Ryan: Prototype available here:
      https://gist.github.com/rdblue/6bc140a575fdf266beb2710ad9dbed8f

-- 
Ryan Blue
Software Engineer
Netflix

DataSourceV2 sync notes - 10 July 2019

Reply via email to