Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/22009#discussion_r209020054
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/BatchWriteSupportProvider.java ---
@@ -21,33 +21,39 @@
import org.apache.spark.annotation.InterfaceStability;
import org.apache.spark.sql.SaveMode;
-import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
+import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport;
import org.apache.spark.sql.types.StructType;
/**
* A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to
- * provide data writing ability and save the data to the data source.
+ * provide data writing ability for batch processing.
+ *
+ * This interface is used when end users want to use a data source implementation directly, e.g.
+ * {@code Dataset.write.format(...).option(...).save()}.
*/
@InterfaceStability.Evolving
-public interface WriteSupport extends DataSourceV2 {
+public interface BatchWriteSupportProvider extends DataSourceV2 {
/**
- * Creates an optional {@link DataSourceWriter} to save the data to this data source. Data
+ * Creates an optional {@link BatchWriteSupport} to save the data to this data source. Data
* sources can return None if there is no writing needed to be done according to the save mode.
*
* If this method fails (by throwing an exception), the action will fail and no Spark job will be
* submitted.
*
- * @param writeUUID A unique string for the writing job. It's possible that there are many writing
- *                  jobs running at the same time, and the returned {@link DataSourceWriter} can
- *                  use this job id to distinguish itself from other jobs.
+ * @param queryId A unique string for the writing query. It's possible that there are many
+ *                writing queries running at the same time, and the returned
+ *                {@link BatchWriteSupport} can use this id to distinguish itself from others.
* @param schema the schema of the data to be written.
* @param mode the save mode which determines what to do when the data are already in this data
*             source, please refer to {@link SaveMode} for more details.
* @param options the options for the returned data source writer, which is an immutable
*                case-insensitive string-to-string map.
- * @return a writer to append data to this data source
+ * @return a write support to write data to this data source.
*/
- Optional<DataSourceWriter> createWriter(
-     String writeUUID, StructType schema, SaveMode mode, DataSourceOptions options);
+ Optional<BatchWriteSupport> createBatchWriteSupport(
+ String queryId,
+ StructType schema,
+ SaveMode mode,
--- End diff --
I don't think this is a good idea. Why introduce a legacy API into a new
API? If we are moving old sources to the new API, then they should fully
implement the new API and should not continue to expose the unpredictable v1
behavior.
That said, as long as the `TableCatalog` makes it in, I don't care what
anonymous tables do because I don't intend for any of our sources to use this
path.
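For readers skimming the archive, here is a minimal, self-contained sketch of the provider shape the diff introduces: a `createBatchWriteSupport` method that returns an empty `Optional` when the save mode means no write is needed. The `BatchWriteSupport` and `SaveMode` types below are hypothetical stand-ins defined locally so the sketch compiles without Spark on the classpath, and `ExampleProvider` is an invented name, not part of the PR.

```java
import java.util.Optional;

// Hypothetical stand-ins for the Spark types named in the diff; real code
// would import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport
// and org.apache.spark.sql.SaveMode instead.
interface BatchWriteSupport {}

enum SaveMode { Append, Overwrite, ErrorIfExists, Ignore }

// A minimal provider in the shape of the new interface: data sources can
// return Optional.empty() if the save mode requires no writing, e.g.
// SaveMode.Ignore when the data already exists.
class ExampleProvider {
    private final boolean dataAlreadyExists;

    ExampleProvider(boolean dataAlreadyExists) {
        this.dataAlreadyExists = dataAlreadyExists;
    }

    Optional<BatchWriteSupport> createBatchWriteSupport(String queryId, SaveMode mode) {
        if (mode == SaveMode.Ignore && dataAlreadyExists) {
            // Nothing to do per the save mode: no Spark job should be submitted.
            return Optional.empty();
        }
        // Otherwise hand back a write support; queryId would let it
        // distinguish itself from other concurrent writing queries.
        return Optional.of(new BatchWriteSupport() {});
    }
}

public class Demo {
    public static void main(String[] args) {
        ExampleProvider provider = new ExampleProvider(true);
        System.out.println(provider.createBatchWriteSupport("q-1", SaveMode.Ignore).isPresent());
        System.out.println(provider.createBatchWriteSupport("q-2", SaveMode.Append).isPresent());
    }
}
```

The `Optional` return is the point under discussion: it lets a source skip the write entirely for modes like `Ignore`, which is exactly the v1-style conditional behavior the comment above argues new sources should not rely on.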
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]