[GitHub] [spark] cloud-fan commented on a change in pull request #25945: [SPARK-29248][SQL] Pass in number of partitions to WriteBuilder

GitBox Fri, 27 Sep 2019 01:00:26 -0700

cloud-fan commented on a change in pull request #25945: [SPARK-29248][SQL] Pass in number of partitions to WriteBuilder
URL: https://github.com/apache/spark/pull/25945#discussion_r328954607


 ##########
 File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/WriteBuilder.java
 ##########
 @@ -55,6 +55,16 @@ default WriteBuilder withInputDataSchema(StructType schema) {
     return this;
   }
 
+  /**
+   * Passes the number of partitions of the input data from Spark to data source.
+   *
+   * @return a new builder with the `schema`. By default it returns `this`, which means the given
+   *         `numPartitions` is ignored. Please override this method to take the `numPartitions`.
+   */
+  default WriteBuilder withNumPartitions(int numPartitions) {
 
 Review comment:
   I'm OK with the approach here, but I want to share a few thoughts on how to
   make the API better. The use case is: there is some additional information
   (input schema, number of partitions, etc.) that Spark should always provide,
   and implementations only need to write extra code if they need to access
   that information.
   
   With the current API:
   1. we can add more additional information in future versions without
   breaking backward compatibility.
   2. users only need to override `withNumPartitions` and the other methods if
   they need to access the additional information.
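
A minimal sketch of the two points above, in plain Java. `SketchWriteBuilder`, `withQueryId`, `describe`, and both implementing classes are hypothetical stand-ins made up for illustration and are not Spark's real interfaces; only the shape of `withNumPartitions` (a default method returning `this`) follows the diff.

```java
// Hypothetical, simplified stand-in for the builder in this PR (not Spark's real interface).
interface SketchWriteBuilder {
  // Default methods return `this`, so the passed value is ignored unless overridden.
  default SketchWriteBuilder withQueryId(String queryId) { return this; }
  default SketchWriteBuilder withNumPartitions(int numPartitions) { return this; }

  // Illustration-only method so the effect of each builder is observable.
  String describe();
}

// A source that needs no additional information: it overrides nothing.
final class SimpleBuilder implements SketchWriteBuilder {
  @Override public String describe() { return "simple"; }
}

// A source that needs the partition count: it overrides only that one method.
final class PartitionAwareBuilder implements SketchWriteBuilder {
  private int numPartitions = -1;

  @Override public SketchWriteBuilder withNumPartitions(int numPartitions) {
    this.numPartitions = numPartitions;
    return this;
  }

  @Override public String describe() { return "partitions=" + numPartitions; }
}

final class DefaultMethodDemo {
  public static void main(String[] args) {
    // The engine side calls every withXxx method on every builder.
    System.out.println(new SimpleBuilder().withQueryId("q1").withNumPartitions(8).describe());
    System.out.println(new PartitionAwareBuilder().withQueryId("q1").withNumPartitions(8).describe());
  }
}
```

Adding another default `withXxx` method to the interface later would leave both classes compiling unchanged, which is point 1.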
   
   But there is one drawback: Spark has to remember to call each of these
   methods, so we need to take extra effort to make sure the additional
   information is actually provided. It would be better to guarantee this at
   compile time.
   
   I think we can improve this API a little bit. For `Table#newWriteBuilder`, 
we can define it as
   ```
   WriteBuilder newWriteBuilder(CaseInsensitiveStringMap options, WriteInfo info);
   ```
   where `WriteInfo` is an interface that provides the additional information:
   ```
   interface WriteInfo {
     String queryId();
     StructType inputDataSchema();
     ...
   }
   ```
   `WriteInfo` is implemented by Spark and called by data source
   implementations, so we can add more methods in future versions without
   breaking backward compatibility. Because `WriteInfo` is a required parameter
   of `newWriteBuilder`, it also guarantees at compile time that Spark always
   provides the additional information.
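
A rough sketch of how this could look, under some loud assumptions: every type below is a hypothetical stand-in (`String` replaces `StructType`, `Map` replaces `CaseInsensitiveStringMap`), and the `numPartitions()` accessor is my guess at what would hide behind the `...` above, not something that exists yet.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for the proposed WriteInfo; implemented by the engine only.
interface SketchWriteInfo {
  String queryId();
  String inputDataSchema();  // stand-in for StructType
  int numPartitions();       // assumed accessor, not part of the proposal text
}

// Illustration-only result type so the effect is observable.
interface InfoAwareWriteBuilder {
  String describe();
}

interface SketchTable {
  // `info` is a mandatory argument, so a caller cannot compile without supplying it.
  InfoAwareWriteBuilder newWriteBuilder(Map<String, String> options, SketchWriteInfo info);
}

// Engine side: the single implementation of the info interface.
final class EngineWriteInfo implements SketchWriteInfo {
  private final String queryId;
  private final String schema;
  private final int numPartitions;

  EngineWriteInfo(String queryId, String schema, int numPartitions) {
    this.queryId = queryId;
    this.schema = schema;
    this.numPartitions = numPartitions;
  }

  @Override public String queryId() { return queryId; }
  @Override public String inputDataSchema() { return schema; }
  @Override public int numPartitions() { return numPartitions; }
}

// Source side: reads only the accessors it cares about and ignores the rest.
final class DemoTable implements SketchTable {
  @Override
  public InfoAwareWriteBuilder newWriteBuilder(Map<String, String> options, SketchWriteInfo info) {
    int n = info.numPartitions();
    return () -> "partitions=" + n;
  }
}

final class WriteInfoDemo {
  public static void main(String[] args) {
    SketchWriteInfo info = new EngineWriteInfo("q1", "id INT", 8);
    Map<String, String> options = Collections.emptyMap();
    System.out.println(new DemoTable().newWriteBuilder(options, info).describe());
  }
}
```

Since data sources only call the info interface and never implement it, adding an accessor later would require a change only in the engine-side class; `DemoTable` would keep compiling as is.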
   
   If you guys think it makes sense, we can do it in a followup.

