We discussed this issue in the sync. I'll be sending out a summary later today, but we came to a conclusion on some of these.
For #1, there are 2 parts: the design and the implementation. We agreed that the design should not include SaveMode. The implementation may include SaveMode until we can replace it with Overwrite, #2.

We decided to create a release-blocking issue to remove SaveMode, so we will not include the redesign to DataSourceV2 in a release unless SaveMode has been removed from the read/write API (not the public API).

Let's continue discussions on #3. I don't think removing SaveMode needs to be blocked by this because the justification for keeping SaveMode was to not break existing tests. Existing tests only rely on overwrite.

I agree that CTAS is important and I'd prefer to get that in before a release as well, though we didn't talk about that.

rb

On Wed, Dec 12, 2018 at 4:58 PM Reynold Xin <r...@databricks.com> wrote:

> Unfortunately I can't make it to the DSv2 sync today. Sending an email
> with my thoughts instead. I spent a few hours thinking about this. It's
> evident that progress has been slow, because this is an important API and
> people from different perspectives have very different requirements, and
> the priorities are weighted very differently (e.g. issues that are super
> important to one might not be as important to another, and people just talk
> past each other arguing why one ignored a broader issue in a PR or
> proposal).
>
> I think the only real way to make progress is to decouple the efforts into
> major areas, and make progress somewhat independently. Of course, some care
> is needed to take care of
>
> Here's one attempt at listing some of the remaining big rocks:
>
> 1. Basic write API -- with the current SaveMode.
>
> 2. Add Overwrite (or Replace) logical plan, and the associated API in
> Table.
>
> 3. Add APIs for per-table metadata operations (note that I'm not calling
> it a catalog API here). Create/drop/alter table goes here. We also need to
> figure out how to do this for the file system sources in which there is no
> underlying catalog. One idea is to treat the file system as a catalog (with
> arbitrary levels of databases). To do that, it'd be great if the identifier
> for a table is not a fixed 2 or 3 part name, but just a string array.
>
> 4. Remove SaveMode. This is blocked on at least 1 + 2, and potentially 3.
>
> 5. Design a stable, fast, smaller surface row format to replace the
> existing InternalRow (and all the internal data types), which is internal
> and unstable. This can be further decoupled into the design for each data
> type.
>
> The above are the big ones I can think of. I probably missed some, but a
> lot of other smaller things can be improved on later.

--
Ryan Blue
Software Engineer
Netflix
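
For reference, item 3 in the quoted list suggests identifying a table by a string array rather than a fixed 2 or 3 part name, so that a file system can act as a catalog with arbitrary levels of databases. A minimal sketch of that idea follows. It is written in Python for brevity rather than in Spark's own languages, and `TableIdentifier`, `namespace`, `name`, and `from_path` are hypothetical names chosen for illustration, not part of any Spark API.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class TableIdentifier:
    """Hypothetical identifier holding any number of name parts."""

    parts: Tuple[str, ...]

    def __post_init__(self):
        # Reject empty identifiers and empty name parts.
        if not self.parts or any(not p for p in self.parts):
            raise ValueError("identifier needs at least one non-empty part")

    @property
    def namespace(self) -> Tuple[str, ...]:
        # Every part except the last acts as a database level.
        return self.parts[:-1]

    @property
    def name(self) -> str:
        # The last part is the table name.
        return self.parts[-1]


def from_path(path: str) -> TableIdentifier:
    """Treat each directory level of a path as one database level (assumption)."""
    # Drop the root component so only directory and table names remain.
    parts = tuple(p for p in PurePosixPath(path).parts if p.strip("/"))
    return TableIdentifier(parts)


ident = from_path("/warehouse/sales/2018/events")
print(ident.namespace)  # ('warehouse', 'sales', '2018')
print(ident.name)       # events
```

Under this sketch a conventional two-part name such as `db.table` is just the special case `TableIdentifier(("db", "table"))`, so catalog-backed and path-backed sources would share one identifier shape.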