Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/22009#discussion_r210155543
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/BatchWriteSupportProvider.java ---
@@ -21,33 +21,39 @@
 import org.apache.spark.annotation.InterfaceStability;
 import org.apache.spark.sql.SaveMode;
-import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
+import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport;
 import org.apache.spark.sql.types.StructType;

 /**
  * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to
- * provide data writing ability and save the data to the data source.
+ * provide data writing ability for batch processing.
+ *
+ * This interface is used when end users want to use a data source implementation directly, e.g.
+ * {@code Dataset.write.format(...).option(...).save()}.
  */
 @InterfaceStability.Evolving
-public interface WriteSupport extends DataSourceV2 {
+public interface BatchWriteSupportProvider extends DataSourceV2 {

   /**
-   * Creates an optional {@link DataSourceWriter} to save the data to this data source. Data
+   * Creates an optional {@link BatchWriteSupport} to save the data to this data source. Data
    * sources can return None if there is no writing needed to be done according to the save mode.
    *
    * If this method fails (by throwing an exception), the action will fail and no Spark job will be
    * submitted.
    *
-   * @param writeUUID A unique string for the writing job. It's possible that there are many writing
-   *                  jobs running at the same time, and the returned {@link DataSourceWriter} can
-   *                  use this job id to distinguish itself from other jobs.
+   * @param queryId A unique string for the writing query. It's possible that there are many
+   *                writing queries running at the same time, and the returned
+   *                {@link BatchWriteSupport} can use this id to distinguish itself from others.
    * @param schema the schema of the data to be written.
    * @param mode the save mode which determines what to do when the data are already in this data
    *             source, please refer to {@link SaveMode} for more details.
    * @param options the options for the returned data source writer, which is an immutable
    *                case-insensitive string-to-string map.
-   * @return a writer to append data to this data source
+   * @return a write support to write data to this data source.
    */
-  Optional<DataSourceWriter> createWriter(
-      String writeUUID, StructType schema, SaveMode mode, DataSourceOptions options);
+  Optional<BatchWriteSupport> createBatchWriteSupport(
+      String queryId,
+      StructType schema,
+      SaveMode mode,
--- End diff --
I totally agree that `SaveMode` is a bad API that leads to undefined behavior. That's why we started a project to design new DDL logical plans and write APIs. However, I believe we agreed earlier that we can't remove existing APIs, so `DataFrameWriter` and `SaveMode` will still be there in Spark. If I were a data source developer, even after implementing the new write APIs (assuming they're finished), I would still support `SaveMode` to attract more users. `DataFrameWriter` is a very widely used API, and end users may need a long time to migrate to the new write APIs. BTW, the file source (without a catalog) does have clearly defined behavior regarding `SaveMode`, so we should make it possible to migrate the file source to data source v2.
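
To illustrate what keeping `SaveMode` support looks like under the new interface, here is a minimal sketch. The class name, the existence check, and the final `DataSourceOptions options` parameter are assumptions (the quoted diff is truncated after `SaveMode mode,`; the old signature suggests the options parameter comes next); the rest of the signature comes from the diff above:

```java
// Minimal sketch, not from the PR: class name, existence check, and the trailing
// DataSourceOptions parameter are assumptions.
import java.util.Optional;

import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.sources.v2.BatchWriteSupportProvider;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport;
import org.apache.spark.sql.types.StructType;

public class ExampleBatchSource implements BatchWriteSupportProvider {

  @Override
  public Optional<BatchWriteSupport> createBatchWriteSupport(
      String queryId, StructType schema, SaveMode mode, DataSourceOptions options) {
    if (targetExists(options)) {
      switch (mode) {
        case ErrorIfExists:
          // Throwing here fails the action before any Spark job is submitted.
          throw new IllegalStateException("Target already exists and mode is ErrorIfExists");
        case Ignore:
          // Returning None tells Spark no write is needed; save() becomes a no-op.
          return Optional.empty();
        default:
          break; // Append / Overwrite fall through to a real write.
      }
    }
    return Optional.of(buildWriteSupport(queryId, schema, mode, options));
  }

  // Source-specific existence check; hypothetical placeholder.
  private boolean targetExists(DataSourceOptions options) {
    return false;
  }

  // Constructing the actual BatchWriteSupport is source-specific and elided here.
  private BatchWriteSupport buildWriteSupport(
      String queryId, StructType schema, SaveMode mode, DataSourceOptions options) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```

An end user would then hit this path through the existing API, e.g. `dataset.write().format(...).mode(SaveMode.Ignore).save()`, which is exactly why keeping `SaveMode` support matters during the migration.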
---