Here are my notes from the last sync. If you’d like to be added to the invite or have topics, please let me know.
*Attendees*: Ryan Blue Matt Cheah Yifei Huang Jose Torres Burak Yavuz Gengliang Wang Michael Artz Russel Spitzer *Topics*: - Existing PRs - V2 session catalog: https://github.com/apache/spark/pull/24768 - REPLACE and RTAS: https://github.com/apache/spark/pull/24798 - DESCRIBE TABLE: https://github.com/apache/spark/pull/25040 - ALTER TABLE: https://github.com/apache/spark/pull/24937 - INSERT INTO: https://github.com/apache/spark/pull/24832 - Stats integration - CTAS and DataFrameWriter behavior *Discussion*: - ALTER TABLE PR is ready to commit (and was after the sync) - REPLACE and RTAS PR: waiting on more reviews - INSERT INTO PR: Ryan will review - DESCRIBE TABLE has test failures, Matt will fix - V2 session catalog: - How will v2 catalog be configured? - Ryan: This is up for discussion because it currently uses a table property. I think it needs to be configurable - Burak: Agree that it should be configurable - Ryan: Does this need to be determined now, or can we solve this after getting the functionality in? - Jose: let’s get it in and fix it later - Stats integration: - Matt: has anyone looked at stats integration? What needs to be done? - Ryan: stats are part of the Scan API. Configure a scan with ScanBuilder and then get stats from it. The problem is that this happens when converting to physical plan, after the optimizer. But the optimizer determines what gets broadcasted. A work-around Netflix uses is to run push down in the stats code. This runs push-down twice and was rejected from Spark, but is important for performance. We should add a property to enable this. - Ryan: The larger problem is that stats are used in the optimizer, but push-down happens when converting to physical plan. This is also related to our earlier discussions about when join types are chosen. Fixing this is a big project - CTAS and DataFrameWriter behavior - Burak: DataFrameWriter uses CTAS where it shouldn’t. It is difficult to predict v1 behavior - Ryan: Agree, v1 DataFrameWriter does not have clear behavior. We suggest a replacement with clear verbs for each SQL action: append/insert, overwrite, overwriteDynamic, create (table), replace (table) - Ryan: Prototype available here: https://gist.github.com/rdblue/6bc140a575fdf266beb2710ad9dbed8f -- Ryan Blue Software Engineer Netflix