Thanks for responding!

I’ve been coming up with a list of the high-level operations that are
needed. I think all of them come down to 5 questions about what’s happening:

   - Does the target table exist?
   - If it does exist, should it be dropped?
   - If not, should it get created?
   - Should data get written to the table?
   - Should data get deleted from the table?

Using those, you can list out all the potential operations. Here’s a flow
chart that makes it easier to think about:

Table exists?
 |
 +-- No --> Create table?
 |           +-- No  --> Noop
 |           +-- Yes --> Write data?
 |                        +-- Yes --> CTAS
 |                        +-- No  --> CreateTable
 |
 +-- Yes --> Drop table?
              +-- Yes --> Create table?
              |            +-- No  --> DropTable
              |            +-- Yes --> Write data?
              |                         +-- Yes --> RTAS
              |                         +-- No  --> ReplaceTable
              +-- No  --> Write data?
                           +-- Yes --> Delete data?
                           |            +-- Yes --> ReplaceData
                           |            +-- No  --> InsertInto
                           +-- No  --> Delete data?
                                        +-- Yes --> DeleteFrom
                                        +-- No  --> Noop
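To make the chart concrete, here's a minimal sketch of the five questions as a single decision function. The names are hypothetical and mirror the chart; none of this is Spark API:

```scala
// Sketch: each leaf of the chart above as a case object, and the five
// questions as one decision function. Names are illustrative only.
sealed trait Operation
case object Noop extends Operation
case object CreateTable extends Operation
case object CTAS extends Operation
case object DropTable extends Operation
case object ReplaceTable extends Operation
case object RTAS extends Operation
case object InsertInto extends Operation
case object DeleteFrom extends Operation
case object ReplaceData extends Operation

def decide(exists: Boolean, drop: Boolean, create: Boolean,
           write: Boolean, delete: Boolean): Operation =
  if (!exists) {
    // the table doesn't exist, so dropping/deleting don't apply
    if (!create) Noop
    else if (write) CTAS
    else CreateTable
  } else if (drop) {
    if (!create) DropTable
    else if (write) RTAS
    else ReplaceTable
  } else {
    // table exists and is kept: only data operations remain
    (write, delete) match {
      case (true, true)   => ReplaceData
      case (true, false)  => InsertInto
      case (false, true)  => DeleteFrom
      case (false, false) => Noop
    }
  }
```

Writing it this way makes it obvious the five questions are exhaustive: every combination of answers lands on exactly one operation.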

Some of these can be broken down into other operations (replace table =
drop & create), but I think it is valuable to consider each one and think
about whether it should be atomic. CTAS is a create and an insert that
guarantees the table exists only if the insert succeeded. Should we also
support RTAS=ReplaceTableAsSelect (drop, create, insert) and make a similar
guarantee that the original table will be dropped if and only if the write
succeeds?

As a sanity check, most of these operations correspond to SQL statements
for tables:

   - CreateTable = CREATE TABLE t
   - DropTable = DROP TABLE t
   - ReplaceTable = DROP TABLE t; CREATE TABLE t (no transaction needed?)
   - CTAS = CREATE TABLE t AS SELECT ...
   - RTAS = ??? (we could add REPLACE TABLE t AS ...)

Or for data:

   - DeleteFrom = DELETE FROM t WHERE ...
   - InsertInto = INSERT INTO t SELECT ...
   - ReplaceData = INSERT OVERWRITE t PARTITION (p) SELECT ..., or
     BEGIN; DELETE FROM t; INSERT INTO t SELECT ...; COMMIT;

The last one, ReplaceData, is interesting because only one specific case is
currently supported, and it requires that the table is partitioned.
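The intended semantics of ReplaceData can be sketched as a delete followed by an insert over an in-memory stand-in for a table; the function name and the predicate-based "partition" matching are illustrative, not Spark API:

```scala
import scala.collection.mutable.Buffer

// Sketch of ReplaceData semantics: remove the rows covered by the
// replacement (e.g. the overwritten partitions), then insert the new rows.
// Ideally the two steps would be committed atomically by the source.
def replaceData[A](table: Buffer[A], matches: A => Boolean, newRows: Seq[A]): Unit = {
  // DELETE FROM t WHERE matches(row)
  table --= table.filter(matches)
  // INSERT INTO t SELECT ...
  table ++= newRows
}
```

With a whole-table predicate this degenerates to the BEGIN; DELETE FROM t; INSERT INTO t ...; COMMIT; form above, and with a partition predicate it matches INSERT OVERWRITE ... PARTITION (p).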

I think we need to consider all of these operations while building
DataSourceV2. We still need to define what v2 sources should do.

Also, I would like to see a way to provide weak guarantees easily and
another way for v2 sources to implement stronger guarantees. For example,
CTAS can be implemented as a create, then an insert, with a drop if the
insert fails. That covers most cases and is easy to implement. But some
table formats can provide stronger guarantees. Iceberg supports atomic
create-and-insert, so a table never exists unless its write succeeded; the
guarantee doesn't rely on the driver staying alive to roll back after a failure.
If we implement the basics (create, insert, drop-on-failure) in Spark, I
think we will end up with more data sources that have reliable behavior.
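The weak-guarantee CTAS described above (create, then insert, then drop on failure) is small enough to sketch. The Catalog trait and helper name here are stand-ins, not Spark's API:

```scala
// Stand-in for a table catalog; not Spark's actual interface.
trait Catalog {
  def createTable(name: String): Unit
  def dropTable(name: String): Unit
}

// Weak-guarantee CTAS: create the table, run the insert, and drop the
// table on failure. If the driver dies between create and drop, the
// empty table leaks -- which is exactly why atomic CTAS is stronger.
def ctasWithCleanup(catalog: Catalog, table: String)(insert: => Unit): Unit = {
  catalog.createTable(table)
  try {
    insert
  } catch {
    case e: Throwable =>
      // best-effort cleanup: the table should not survive a failed write
      catalog.dropTable(table)
      throw e
  }
}
```

A source that can do better simply wouldn't use this path; it would commit the create and the write together.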

Would anyone be interested in an improvement proposal for this? It would be
great to document this and build consensus around Spark’s expected
behavior. I can write it up.

rb

On Fri, Feb 2, 2018 at 3:23 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> So here are my recommendations for moving forward, with DataSourceV2 as a
>> starting point:
>>
>>    1. Use well-defined logical plan nodes for all high-level operations:
>>    insert, create, CTAS, overwrite table, etc.
>>    2. Use rules that match on these high-level plan nodes, so that it
>>    isn’t necessary to create rules to match each eventual code path
>>    individually
>>    3. Define Spark’s behavior for these logical plan nodes. Physical
>>    nodes should implement that behavior, but all CREATE TABLE OVERWRITE 
>> should
>>    (eventually) make the same guarantees.
>>    4. Specialize implementation when creating a physical plan, not
>>    logical plans.
>>
>> I realize this is really long, but I’d like to hear thoughts about this.
>> I’m sure I’ve left out some additional context, but I think the main idea
>> here is solid: lets standardize logical plans for more consistent behavior
>> and easier maintenance.
>>
> Context aside, I really like these rules! I think having query planning be
> the boundary for specialization makes a lot of sense.
>
> (RunnableCommand might also be my fault though.... sorry! :P)
>



-- 
Ryan Blue
Software Engineer
Netflix
