Here are my notes from the last sync. If you’d like to be added to the
invite or have topics, please let me know.
*Attendees*:
Ryan Blue
Matt Cheah
Yifei Huang
Jose Torres
Burak Yavuz
Gengliang Wang
Michael Artz
Russel Spitzer
*Topics*:
- Existing PRs
- V2 session catalog: https://github.com/apache/spark/pull/24768
- REPLACE and RTAS: https://github.com/apache/spark/pull/24798
- DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
- ALTER TABLE: https://github.com/apache/spark/pull/24937
- INSERT INTO: https://github.com/apache/spark/pull/24832
- Stats integration
- CTAS and DataFrameWriter behavior
*Discussion*:
- ALTER TABLE PR is ready to commit (and was after the sync)
- REPLACE and RTAS PR: waiting on more reviews
- INSERT INTO PR: Ryan will review
- DESCRIBE TABLE has test failures, Matt will fix
- V2 session catalog:
- How will v2 catalog be configured?
- Ryan: This is up for discussion because it currently uses a table
property. I think it needs to be configurable
- Burak: Agree that it should be configurable
- Ryan: Does this need to be determined now, or can we solve this
after getting the functionality in?
- Jose: let’s get it in and fix it later
- Stats integration:
- Matt: has anyone looked at stats integration? What needs to be done?
- Ryan: stats are part of the Scan API. Configure a scan with
ScanBuilder and then get stats from it. The problem is that this happens
when converting to physical plan, after the optimizer. But the optimizer
determines what gets broadcasted. A work-around Netflix uses is
to run push
down in the stats code. This runs push-down twice and was rejected from
Spark, but is important for performance. We should add a
property to enable
this.
- Ryan: The larger problem is that stats are used in the optimizer,
but push-down happens when converting to physical plan. This is also
related to our earlier discussions about when join types are
chosen. Fixing
this is a big project
- CTAS and DataFrameWriter behavior
- Burak: DataFrameWriter uses CTAS where it shouldn’t. It is
difficult to predict v1 behavior
- Ryan: Agree, v1 DataFrameWriter does not have clear behavior. We
suggest a replacement with clear verbs for each SQL action:
append/insert,
overwrite, overwriteDynamic, create (table), replace (table)
- Ryan: Prototype available here:
https://gist.github.com/rdblue/6bc140a575fdf266beb2710ad9dbed8f
--
Ryan Blue
Software Engineer
Netflix