GitHub user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22009#discussion_r209126618
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/BatchWriteSupportProvider.java ---
    @@ -21,33 +21,39 @@
     
     import org.apache.spark.annotation.InterfaceStability;
     import org.apache.spark.sql.SaveMode;
    -import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
    +import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport;
     import org.apache.spark.sql.types.StructType;
     
     /**
      * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to
    - * provide data writing ability and save the data to the data source.
    + * provide data writing ability for batch processing.
    + *
     + * This interface is used when end users want to use a data source implementation directly, e.g.
    + * {@code Dataset.write.format(...).option(...).save()}.
      */
     @InterfaceStability.Evolving
    -public interface WriteSupport extends DataSourceV2 {
    +public interface BatchWriteSupportProvider extends DataSourceV2 {
     
       /**
    -   * Creates an optional {@link DataSourceWriter} to save the data to this data source. Data
    +   * Creates an optional {@link BatchWriteSupport} to save the data to this data source. Data
        * sources can return None if there is no writing needed to be done according to the save mode.
        *
        * If this method fails (by throwing an exception), the action will fail and no Spark job will be
        * submitted.
        *
    -   * @param writeUUID A unique string for the writing job. It's possible that there are many writing
    -   *                  jobs running at the same time, and the returned {@link DataSourceWriter} can
    -   *                  use this job id to distinguish itself from other jobs.
    +   * @param queryId A unique string for the writing query. It's possible that there are many
    +   *                writing queries running at the same time, and the returned
    +   *                {@link BatchWriteSupport} can use this id to distinguish itself from others.
        * @param schema the schema of the data to be written.
        * @param mode the save mode which determines what to do when the data are already in this data
        *             source, please refer to {@link SaveMode} for more details.
        * @param options the options for the returned data source writer, which is an immutable
        *                case-insensitive string-to-string map.
    -   * @return a writer to append data to this data source
    +   * @return a write support to write data to this data source.
        */
    -  Optional<DataSourceWriter> createWriter(
    -      String writeUUID, StructType schema, SaveMode mode, DataSourceOptions options);
    +  Optional<BatchWriteSupport> createBatchWriteSupport(
    +      String queryId,
    +      StructType schema,
    +      SaveMode mode,
    --- End diff ---
    
    The problem here is that if we don't take `SaveMode`, the only end-user API that can write to a data source is `df.write.format(...).mode("append").save()`. That makes data source v2 totally unusable before we introduce the new write APIs.
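
    For context, the `SaveMode` a source receives is just a pass-through of whatever the end user sets via `mode(...)`. A minimal runnable sketch of that end-user side; the format name and output path are made-up placeholders:

    ```java
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class WriteExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("v2-write-example")
            .getOrCreate();

        Dataset<Row> df = spark.range(10).toDF();

        // The mode string set here is what Spark translates into the SaveMode
        // passed to createBatchWriteSupport ("append" -> SaveMode.Append).
        df.write()
            .format("com.example.mysource")  // hypothetical v2 source
            .option("path", "/tmp/v2-write-example")
            .mode("append")
            .save();

        spark.stop();
      }
    }
    ```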
    
    I hope we can get this in before Spark 2.4, so that some data source projects can start migrating and experimenting.
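
    For source authors who want to experiment early, here is a rough skeleton of the renamed hook, assuming the interfaces land as in the diff above; `MyWritableSource`, `targetExists`, and `newWriteSupport` are made-up names, not part of the PR:

    ```java
    import java.util.Optional;

    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.sources.v2.BatchWriteSupportProvider;
    import org.apache.spark.sql.sources.v2.DataSourceOptions;
    import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport;
    import org.apache.spark.sql.types.StructType;

    // Hypothetical skeleton of a v2 source on the renamed API. How the target is
    // probed and how the BatchWriteSupport is built are source-specific, so both
    // are left abstract here.
    public abstract class MyWritableSource implements BatchWriteSupportProvider {

      @Override
      public Optional<BatchWriteSupport> createBatchWriteSupport(
          String queryId,
          StructType schema,
          SaveMode mode,
          DataSourceOptions options) {
        // Per the Javadoc, return None when the save mode means there is nothing
        // to write, e.g. Ignore when the target already holds data.
        if (mode == SaveMode.Ignore && targetExists(options)) {
          return Optional.empty();
        }
        return Optional.of(newWriteSupport(queryId, schema, mode, options));
      }

      protected abstract boolean targetExists(DataSourceOptions options);

      protected abstract BatchWriteSupport newWriteSupport(
          String queryId, StructType schema, SaveMode mode, DataSourceOptions options);
    }
    ```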

