Re: dsv2 remaining work

Ryan Blue Thu, 13 Dec 2018 09:15:14 -0800

We discussed this issue in the sync. I'll be sending out a summary later
today, but we came to a conclusion on some of these.

For #1, there are 2 parts: the design and the implementation. We agreed
that the design should not include SaveMode. The implementation may include
SaveMode until we can replace it with Overwrite, #2. We decided to create a
release-blocking issue to remove SaveMode so we will not include the
redesign to DataSourceV2 in a release unless SaveMode has been removed from
the read/write API (not the public API).
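
For illustration only, here is a minimal sketch of what "design without SaveMode" could look like: the mode is translated into an explicit operation before it reaches the source. All names here (Append, Overwrite, resolve_write) are hypothetical and are not Spark classes; the handling of the two remaining modes is an assumption, since they depend on table creation, which is discussed under #3.

```python
from dataclasses import dataclass
from enum import Enum


class SaveMode(Enum):
    """Stand-in for the four modes of the existing SaveMode enum."""
    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR_IF_EXISTS = "error_if_exists"
    IGNORE = "ignore"


@dataclass(frozen=True)
class Append:
    """Hypothetical explicit plan: add rows to an existing table."""
    table: str


@dataclass(frozen=True)
class Overwrite:
    """Hypothetical explicit plan: replace the data in an existing table."""
    table: str


def resolve_write(mode: SaveMode, table: str):
    """Translate a SaveMode into an explicit plan so the source never sees the mode."""
    if mode is SaveMode.APPEND:
        return Append(table)
    if mode is SaveMode.OVERWRITE:
        return Overwrite(table)
    # Assumption: the remaining modes depend on whether the table exists,
    # so they need table-creation support (for example CTAS) rather than a write plan.
    raise NotImplementedError(f"{mode.name} requires table-creation support")
```

With this shape, a source only has to implement append and overwrite, and the existence checks implied by the other two modes live outside the write path.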

Let's continue discussions on #3. I don't think removing SaveMode needs to
be blocked by this because the justification for keeping SaveMode was to
not break existing tests. Existing tests only rely on overwrite. I agree
that CTAS is important and I'd prefer to get that in before a release as
well, though we didn't talk about that.
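
As a reference point for the #3 discussion, here is a rough sketch of the idea in the quoted message: a table identifier that is just a string array, with the file system treated as a catalog with arbitrary levels of databases. The function name table_path, the root argument, and the validation rules are assumptions made for this example, not a proposed API.

```python
from pathlib import PurePosixPath
from typing import Sequence


def table_path(root: str, identifier: Sequence[str]) -> PurePosixPath:
    """Map a multi-part identifier to a directory under a hypothetical warehouse root.

    Every part except the last plays the role of a (nested) database;
    the last part is the table name.
    """
    if not identifier:
        raise ValueError("identifier must have at least one part")
    for part in identifier:
        # Assumption: reject parts that would escape or restructure the path.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid identifier part: {part!r}")
    return PurePosixPath(root, *identifier)
```

The same function handles a 1-part, 2-part, or deeper name, which is the property a fixed 2 or 3 part identifier would not give.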

rb

On Wed, Dec 12, 2018 at 4:58 PM Reynold Xin <r...@databricks.com> wrote:

> Unfortunately I can't make it to the DSv2 sync today. Sending an email
> with my thoughts instead. I spent a few hours thinking about this. It's
> evident that progress has been slow, because this is an important API and
> people from different perspectives have very different requirements, and
> the priorities are weighted very differently (e.g. issues that are super
> important to one might be not as important to another, and people just talk
> past each other arguing why one ignored a broader issue in a PR or
> proposal).
>
> I think the only real way to make progress is to decouple the efforts into
> major areas, and make progress somewhat independently. Of course, some care
> is needed to take care of
>
> Here's one attempt at listing some of the remaining big rocks:
>
> 1. Basic write API -- with the current SaveMode.
>
> 2. Add Overwrite (or Replace) logical plan, and the associated API in
> Table.
>
> 3. Add APIs for per-table metadata operations (note that I'm not calling
> it a catalog API here). Create/drop/alter table goes here. We also need to
> figure out how to do this for the file system sources in which there is no
> underlying catalog. One idea is to treat the file system as a catalog (with
> arbitrary levels of databases). To do that, it'd be great if the identifier
> for a table is not a fixed 2 or 3 part name, but just a string array.
>
> 4. Remove SaveMode. This is blocked on at least 1 + 2, and potentially 3.
>
> 5. Design a stable, fast, smaller surface row format to replace the
> existing InternalRow (and all the internal data types), which is internal
> and unstable. This can be further decoupled into the design for each data
> type.
>
> The above are the big ones I can think of. I probably missed some, but a
> lot of other smaller things can be improved on later.

-- 
Ryan Blue
Software Engineer
Netflix
