Hi Ryan,

Thanks for summarizing and sending out the meeting notes! Unfortunately, I
missed the last sync, but the topics are really interesting, especially the
stats integration.

The ideal solution I can think of is to refactor the optimizer/planner and
move all stats-based optimization to the physical plan phase (or do it
during planning). That needs a lot of design work, though, and I'm not sure
we can finish it in the near future.
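
For context, here is a minimal sketch of how a v2 source reports stats
today, assuming the Spark 3.x connector API shape (SupportsReportStatistics
on the built Scan; package names on the current master differ slightly).
MyScan and estimatedBytes are hypothetical. The point is that
estimateStatistics() is only called on the Scan produced by the ScanBuilder,
which today happens when converting to the physical plan, after the
optimizer has already run:

    import java.util.OptionalLong
    import org.apache.spark.sql.connector.read.{Scan, Statistics,
      SupportsReportStatistics}
    import org.apache.spark.sql.types.StructType

    // Hypothetical v2 scan: stats are reported by the *built* scan, so they
    // already reflect whatever was pushed into the ScanBuilder.
    class MyScan(schema: StructType, estimatedBytes: Long)
        extends Scan with SupportsReportStatistics {

      override def readSchema(): StructType = schema

      // Called during physical planning; the optimizer never sees this.
      override def estimateStatistics(): Statistics = new Statistics {
        override def sizeInBytes(): OptionalLong =
          OptionalLong.of(estimatedBytes)
        override def numRows(): OptionalLong = OptionalLong.empty()
      }
    }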

Alternatively, we can do operator pushdown at the logical plan phase via the
optimizer. This is not ideal, but I think it is a better workaround than
doing pushdown twice. Parquet nested column pruning is also done at the
logical plan phase, so I don't expect serious problems if we do operator
pushdown there.
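
To make that alternative concrete, here is a rough sketch of what a
logical-phase pushdown rule could do. LPlan, LFilter, LRelation and the rule
itself are toy stand-ins for Catalyst's plan nodes and rules, not Spark
code; only ScanBuilder, SupportsPushDownFilters and the source Filter type
are the real connector interfaces (Spark 3.x package names):

    import org.apache.spark.sql.connector.read.{ScanBuilder,
      SupportsPushDownFilters}
    import org.apache.spark.sql.sources.{Filter => SourceFilter}

    // Toy logical plan, standing in for Catalyst's LogicalPlan.
    sealed trait LPlan
    case class LRelation(builder: ScanBuilder) extends LPlan
    case class LFilter(filters: Seq[SourceFilter], child: LPlan) extends LPlan

    object PushFiltersAtLogicalPhase {
      def apply(plan: LPlan): LPlan = plan match {
        case LFilter(filters, LRelation(b: SupportsPushDownFilters)) =>
          // pushFilters returns the filters the source could NOT handle,
          // so the logical Filter keeps only that remainder.
          val remaining = b.pushFilters(filters.toArray).toSeq
          if (remaining.isEmpty) LRelation(b)
          else LFilter(remaining, LRelation(b))
        case other => other
      }
    }

Because the builder is configured while we are still in the logical plan,
any stats asked for afterwards would already reflect the pushed filters,
which avoids running pushdown a second time in the stats code.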

This is only about the internal implementation, so we can fix it at any
time. But it may hurt data source v2 performance a lot, so we'd better fix
it sooner rather than later.


On Sat, Jul 20, 2019 at 8:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Here are my notes from the last sync. If you’d like to be added to the
> invite or have topics, please let me know.
>
> *Attendees*:
>
> Ryan Blue
> Matt Cheah
> Yifei Huang
> Jose Torres
> Burak Yavuz
> Gengliang Wang
> Michael Artz
> Russel Spitzer
>
> *Topics*:
>
>    - Existing PRs
>       - V2 session catalog: https://github.com/apache/spark/pull/24768
>       - REPLACE and RTAS: https://github.com/apache/spark/pull/24798
>       - DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
>       - ALTER TABLE: https://github.com/apache/spark/pull/24937
>       - INSERT INTO: https://github.com/apache/spark/pull/24832
>    - Stats integration
>    - CTAS and DataFrameWriter behavior
>
> *Discussion*:
>
>    - ALTER TABLE PR is ready to commit (and was after the sync)
>    - REPLACE and RTAS PR: waiting on more reviews
>    - INSERT INTO PR: Ryan will review
>    - DESCRIBE TABLE has test failures, Matt will fix
>    - V2 session catalog:
>       - How will v2 catalog be configured?
>       - Ryan: This is up for discussion because it currently uses a table
>       property. I think it needs to be configurable
>       - Burak: Agree that it should be configurable
>       - Ryan: Does this need to be determined now, or can we solve this
>       after getting the functionality in?
>       - Jose: let’s get it in and fix it later
>    - Stats integration:
>       - Matt: has anyone looked at stats integration? What needs to be
>       done?
>       - Ryan: stats are part of the Scan API. Configure a scan with
>       ScanBuilder and then get stats from it. The problem is that this happens
>       when converting to physical plan, after the optimizer. But the optimizer
>       determines what gets broadcasted. A work-around Netflix uses is to run
>       push-down in the stats code. This runs push-down twice and was rejected
>       from Spark, but is important for performance. We should add a property
>       to enable this.
>       - Ryan: The larger problem is that stats are used in the optimizer,
>       but push-down happens when converting to physical plan. This is also
>       related to our earlier discussions about when join types are chosen.
>       Fixing this is a big project.
>    - CTAS and DataFrameWriter behavior
>       - Burak: DataFrameWriter uses CTAS where it shouldn’t. It is
>       difficult to predict v1 behavior
>       - Ryan: Agree, v1 DataFrameWriter does not have clear behavior. We
>       suggest a replacement with clear verbs for each SQL action:
>       append/insert, overwrite, overwriteDynamic, create (table), replace
>       (table)
>       - Ryan: Prototype available here:
>       https://gist.github.com/rdblue/6bc140a575fdf266beb2710ad9dbed8f
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
