Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/22009#discussion_r208642275
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/StreamingWriteSupportProvider.java
---
@@ -29,24 +28,24 @@
* provide data writing ability for structured streaming.
*/
@InterfaceStability.Evolving
-public interface StreamWriteSupport extends DataSourceV2, BaseStreamingSink {
+public interface StreamingWriteSupportProvider extends DataSourceV2, BaseStreamingSink {
-  /**
-   * Creates an optional {@link StreamWriter} to save the data to this data source. Data
-   * sources can return None if there is no writing needed to be done.
-   *
-   * @param queryId A unique string for the writing query. It's possible that there are many
-   *                writing queries running at the same time, and the returned
-   *                {@link DataSourceWriter} can use this id to distinguish itself from others.
-   * @param schema the schema of the data to be written.
-   * @param mode the output mode which determines what successive epoch output means to this
-   *             sink, please refer to {@link OutputMode} for more details.
-   * @param options the options for the returned data source writer, which is an immutable
-   *                case-insensitive string-to-string map.
-   */
-  StreamWriter createStreamWriter(
-      String queryId,
-      StructType schema,
-      OutputMode mode,
-      DataSourceOptions options);
+  /**
+   * Creates an optional {@link StreamingWriteSupport} to save the data to this data source. Data
+   * sources can return None if there is no writing needed to be done.
+   *
+   * @param queryId A unique string for the writing query. It's possible that there are many
+   *                writing queries running at the same time, and the returned
+   *                {@link StreamingWriteSupport} can use this id to distinguish itself from others.
+   * @param schema the schema of the data to be written.
+   * @param mode the output mode which determines what successive epoch output means to this
+   *             sink, please refer to {@link OutputMode} for more details.
+   * @param options the options for the returned data source writer, which is an immutable
+   *                case-insensitive string-to-string map.
+   */
+  StreamingWriteSupport createStreamingWriteSupport(
+      String queryId,
--- End diff ---
If it needs to be there for streaming, then let's make sure it is in both APIs. It can help when debugging writes in batch, too.
One more thing: isn't the abstraction that a `WriteSupport` is something that can be written to, just as a `ReadSupport` is something that can be scanned? A Table fits both, as do Streams.
If that's the case, then why pass the query ID when creating the `WriteSupport` or stream? The stream doesn't need a UUID; the actual write does. On the read side, there's `ScanConfig` that is used to hold the state for a scan, but on the write side there is no equivalent, and we end up with odd uses of the abstraction like this.
What about creating an equivalent of `ScanConfig` for the write side?
@jose-torres, it would be great to hear your opinion on this, too.
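For concreteness, here is a minimal sketch of what a write-side equivalent of `ScanConfig` might look like. None of these names (`WriteConfig`, `newWriteConfig`, `DemoWriteSupport`) are from the PR; they are illustrative assumptions showing the shape of the proposal, where the long-lived `WriteSupport` has no query ID and each individual write carries its own:

```java
import java.util.UUID;

// Hypothetical sketch of a write-side analogue of ScanConfig.
// The WriteSupport stays a long-lived "thing that can be written to";
// per-write state, including the write's unique id, lives in WriteConfig.
public class WriteConfigSketch {

    // Holds the state for one logical write, mirroring ScanConfig on the read side.
    interface WriteConfig {
        String writeId();
    }

    // The long-lived writable entity: note no query id is needed at creation time.
    interface WriteSupport {
        WriteConfig newWriteConfig();
    }

    // Minimal in-memory implementation to show the flow.
    static class DemoWriteSupport implements WriteSupport {
        @Override
        public WriteConfig newWriteConfig() {
            String id = UUID.randomUUID().toString();
            return () -> id;  // each write gets its own id
        }
    }

    public static void main(String[] args) {
        WriteSupport support = new DemoWriteSupport();
        WriteConfig a = support.newWriteConfig();
        WriteConfig b = support.newWriteConfig();
        // Two writes against the same support get distinct ids.
        System.out.println(!a.writeId().equals(b.writeId()));
    }
}
```

With a split like this, the UUID question answers itself: the id is per-write state, so it belongs on the config object, not on the support passed in at creation.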
---