Another related issue for backwards compatibility: the check in DataSource.scala,
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L415-L416
will get triggered even when the class is a valid DataSourceV2 source that is
merely being used in a DataSourceV1 context. For example, if I run

    val createDDL = s"""CREATE TABLE IF NOT EXISTS $registerName
                       |USING org.apache.spark.sql.cassandra
                       |OPTIONS (
                       | $optionString
                       |)""".stripMargin
    spark.sql(createDDL)

against the default catalog, I get the exception

    org.apache.spark.sql.AnalysisException: org.apache.spark.sql.cassandra is not a valid Spark SQL Data Source.;

I can understand that perhaps we no longer want DSV2 sources to be able to
create session catalog entries, but I think at a bare minimum we should change
the error message in the case of a V2 source to recommend the DSV2 API.
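
A minimal sketch of the kind of check I mean (hypothetical, not an actual
patch; it assumes code living inside Spark's own sql package, where
AnalysisException can be constructed):

    import org.apache.spark.sql.AnalysisException
    import org.apache.spark.sql.connector.catalog.TableProvider

    // Hypothetical helper: if a class fails V1 resolution but is a valid V2
    // TableProvider, say so explicitly instead of the generic message.
    def failResolution(className: String, providingClass: Class[_]): Nothing = {
      if (classOf[TableProvider].isAssignableFrom(providingClass)) {
        throw new AnalysisException(
          s"$className is a DataSourceV2 source and cannot be used here; " +
            "use the DataSourceV2 catalog API instead.")
      } else {
        throw new AnalysisException(
          s"$className is not a valid Spark SQL Data Source.")
      }
    }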

On Wed, May 20, 2020 at 5:07 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> I think those are fair concerns; I was mostly just updating tests for RC2
> and adding in "append" everywhere.
>
> Code like
>
>     spark.sql(s"SELECT a, b from $ks.test1")
>       .write
>       .format("org.apache.spark.sql.cassandra")
>       .options(Map("table" -> "test_insert1", "keyspace" -> ks))
>       .save()
>
> now fails at runtime, while it would have succeeded before. This is
> probably not a huge issue, since the majority of actual usages aren't
> writing to empty tables.
>
> I think my main concern here is that a lot of our old demos and tutorials
> were:
>
> * Make the table outside of Spark
> * Write to the table with Spark
>
> Obviously that can now be done in a single operation in Spark, so that's
> probably the best path forward. The old pathway is pretty awkward; I just
> didn't really want it to break if it didn't have to. But I think having
> different defaults is definitely not intuitive.
>
> I think the majority of other use cases are "append" anyway, so it's not a
> big pain for non-demo / just-trying-things-out users.
>
> Thanks for commenting,
> Russ
>
>
> On Wed, May 20, 2020 at 5:00 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> The context on this is that it was confusing that the mode changed, which
>> introduced different behaviors for the same user code when moving from v1
>> to v2. Burak pointed this out, and I agree that it's weird that if your
>> dependency changes from v1 to v2, your compiled Spark job starts appending
>> instead of erroring out when the table exists.
>>
>> The work-around is to implement a new trait, SupportsCatalogOptions, that
>> allows you to extract a table identifier and catalog name from the options
>> in the DataFrameReader. That way, you can re-route to your catalog so that
>> Spark correctly uses a CreateTableAsSelect statement for ErrorIfExists
>> mode.
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsCatalogOptions.java
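>>
>> On the connector side that looks roughly like this (a minimal sketch; the
>> class name and the option names "keyspace", "table", and "catalog" are
>> hypothetical, and the TableProvider methods are stubbed out since they
>> depend entirely on the connector):
>>
>>     import java.util
>>     import org.apache.spark.sql.connector.catalog.{Identifier, SupportsCatalogOptions, Table}
>>     import org.apache.spark.sql.connector.expressions.Transform
>>     import org.apache.spark.sql.types.StructType
>>     import org.apache.spark.sql.util.CaseInsensitiveStringMap
>>
>>     class MyCatalogAwareSource extends SupportsCatalogOptions {
>>       // Schema and table resolution elided; connector-specific.
>>       override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???
>>       override def getTable(
>>           schema: StructType,
>>           partitioning: Array[Transform],
>>           properties: util.Map[String, String]): Table = ???
>>
>>       // Turn the reader/writer options into a catalog identifier so Spark
>>       // can plan a CreateTableAsSelect for ErrorIfExists instead of failing.
>>       override def extractIdentifier(options: CaseInsensitiveStringMap): Identifier =
>>         Identifier.of(Array(options.get("keyspace")), options.get("table"))
>>
>>       // Name the catalog Spark should route to; "cassandra" here is a
>>       // hypothetical catalog configured via spark.sql.catalog.cassandra.
>>       override def extractCatalog(options: CaseInsensitiveStringMap): String =
>>         options.getOrDefault("catalog", "cassandra")
>>     }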
>>
>> On Wed, May 20, 2020 at 2:50 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> While the ScalaDocs for DataFrameWriter say
>>>
>>>     /**
>>>      * Specifies the behavior when data or table already exists. Options include:
>>>      * <ul>
>>>      * <li>`SaveMode.Overwrite`: overwrite the existing data.</li>
>>>      * <li>`SaveMode.Append`: append the data.</li>
>>>      * <li>`SaveMode.Ignore`: ignore the operation (i.e. no-op).</li>
>>>      * <li>`SaveMode.ErrorIfExists`: throw an exception at runtime.</li>
>>>      * </ul>
>>>      * <p>
>>>      * When writing to data source v1, the default option is `ErrorIfExists`.
>>>      * When writing to data source v2, the default option is `Append`.
>>>      *
>>>      * @since 1.4.0
>>>      */
>>>
>>> as far as I can tell, using DataFrameWriter with a TableProvider-based
>>> DataSource V2 will still default to ErrorIfExists, which breaks existing
>>> code since DSV2 cannot support ErrorIfExists mode. I noticed in the
>>> history of DataFrameWriter there were versions that differentiated
>>> between DSV2 and DSV1 and set the mode accordingly, but this seems to no
>>> longer be the case. Was this intentional? I feel like if we could have
>>> the default be based on the source, then upgrading code from DSV1 ->
>>> DSV2 would be much easier for users.
>>>
>>> I'm currently testing this on RC2.
>>>
>>> Any thoughts?
>>>
>>> Thanks for your time as usual,
>>> Russ
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix