I think those are fair concerns; I was mostly just updating tests for RC2
and adding "append" everywhere.
Code like
spark.sql(s"SELECT a, b from $ks.test1")
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "test_insert1", "keyspace" -> ks))
  .save()
now fails at runtime, whereas it would have succeeded before. This is
probably not a huge issue, since the majority of actual usages aren't
writing to empty tables.
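For anyone following along, the user-side fix (and what the test updates boil down to) is just to set the save mode explicitly, so the behavior no longer depends on which default the source picks. A sketch, reusing the same hypothetical `ks` keyspace and table names from the snippet above:

```scala
import org.apache.spark.sql.SaveMode

// Setting the mode explicitly sidesteps the DSv1/DSv2 default-mode
// difference: with SaveMode.Append the write succeeds against an
// existing table under either code path.
spark.sql(s"SELECT a, b from $ks.test1")
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "test_insert1", "keyspace" -> ks))
  .mode(SaveMode.Append)
  .save()
```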
I think my main concern here is that a lot of our old demos and tutorials
were:
* Make the table outside of Spark
* Write to the table with Spark
Now obviously this can be done in a single operation in Spark, so that's
probably the best path forward. The old pathway is pretty awkward; I just
didn't really want it to break if it didn't have to, but I think having
different defaults is definitely not intuitive.
I think the majority of other use cases are "append" anyway, so it's not a
big pain for anyone beyond demos and users just trying things out.
Thanks for commenting,
Russ
On Wed, May 20, 2020 at 5:00 PM Ryan Blue <[email protected]> wrote:
> The context on this is that it was confusing that the mode changed, which
> introduced different behaviors for the same user code when moving from v1
> to v2. Burak pointed this out and I agree that it's weird that if your
> dependency changes from v1 to v2, your compiled Spark job starts appending
> instead of erroring out when the table exists.
>
> The work-around is to implement a new trait, SupportsCatalogOptions, that
> allows you to extract a table identifier and catalog name from the options
> in the DataFrameReader. That way, you can re-route to your catalog so that
> Spark correctly uses a CreateTableAsSelect statement for ErrorIfExists
> mode.
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsCatalogOptions.java
>
> On Wed, May 20, 2020 at 2:50 PM Russell Spitzer <[email protected]>
> wrote:
>
>>
>> While the ScalaDocs for DataFrameWriter say
>>
>> /**
>> * Specifies the behavior when data or table already exists. Options include:
>> * <ul>
>> * <li>`SaveMode.Overwrite`: overwrite the existing data.</li>
>> * <li>`SaveMode.Append`: append the data.</li>
>> * <li>`SaveMode.Ignore`: ignore the operation (i.e. no-op).</li>
>> * <li>`SaveMode.ErrorIfExists`: throw an exception at runtime.</li>
>> * </ul>
>> * <p>
>> * When writing to data source v1, the default option is `ErrorIfExists`.
>> * When writing to data source v2, the default option is `Append`.
>> *
>> * @since 1.4.0
>> */
>>
>>
>> As far as I can tell, using DataFrameWriter with a TableProvider
>> DataSource V2 will still default to ErrorIfExists, which breaks existing
>> code since DSV2 cannot support ErrorIfExists mode. I noticed in the history
>> of DataFrameWriter there were versions which differentiated between DSV2
>> and DSV1 and set the mode accordingly, but this seems to no longer be the
>> case. Was this intentional? I feel like if we could
>> have the default be based on the source, then upgrading code from DSV1 ->
>> DSV2 would be much easier for users.
>>
>> I'm currently testing this on RC2
>>
>>
>> Any thoughts?
>>
>> Thanks for your time as usual,
>> Russ
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>