cloud-fan commented on a change in pull request #25945: [SPARK-29248][SQL] Pass
in number of partitions to WriteBuilder
URL: https://github.com/apache/spark/pull/25945#discussion_r328954607
##########
File path:
sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/WriteBuilder.java
##########
@@ -55,6 +55,16 @@ default WriteBuilder withInputDataSchema(StructType schema)
{
return this;
}
+ /**
+ * Passes the number of partitions of the input data from Spark to data source.
+ *
+ * @return a new builder with the `schema`. By default it returns `this`, which means the given
+ * `numPartitions` is ignored. Please override this method to take the `numPartitions`.
+ */
+ default WriteBuilder withNumPartitions(int numPartitions) {
Review comment:
I'm OK with the approach here, but I want to share a few thoughts about
how to make the API better. The use case is: there is some additional
information (input schema, `numPartitions`, etc.) that Spark should always
provide, and implementations should only need to write extra code if they
need to access that information.
With the current API, we can:
1. add more additional information in future versions without breaking
backward compatibility.
2. users only need to override `withNumPartitions` and other methods if
they need to access the additional information.
But there is one drawback: nothing in the API forces Spark to call these
methods, so we need to take extra effort to make sure the additional
information is actually provided. It's better to guarantee this at compile
time.
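To make the drawback concrete, here is a minimal sketch. It uses a simplified
stand-in for `WriteBuilder`, not the real Spark interface; `CountingBuilder`
and `describe` are hypothetical names used only for illustration. A caller
that forgets `withNumPartitions` still compiles:

```java
// Sketch only: simplified stand-ins, not the real Spark connector API.
final class DefaultMethodSketch {

    interface WriteBuilder {
        // As in the PR: the default implementation ignores the value.
        default WriteBuilder withNumPartitions(int numPartitions) {
            return this;
        }

        // Hypothetical helper so the example has something to observe.
        String describe();
    }

    // Hypothetical data source that opts in to the partition count.
    static final class CountingBuilder implements WriteBuilder {
        private int numPartitions = -1; // -1 means "never provided"

        @Override
        public WriteBuilder withNumPartitions(int numPartitions) {
            this.numPartitions = numPartitions;
            return this;
        }

        @Override
        public String describe() {
            return "numPartitions=" + numPartitions;
        }
    }

    public static void main(String[] args) {
        // A code path that remembers to pass the value.
        System.out.println(new CountingBuilder().withNumPartitions(8).describe());
        // A code path that forgets: it compiles, and the value is silently missing.
        System.out.println(new CountingBuilder().describe());
    }
}
```

The second call is the case that can only be caught by review or tests, not by
the compiler.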
I think we can improve this API a little bit. For `Table#newWriteBuilder`,
we can define it as
```
WriteBuilder newWriteBuilder(CaseInsensitiveStringMap options, WriteInfo info);
```
where `WriteInfo` is an interface providing the additional information:
```
interface WriteInfo {
String queryId();
StructType inputDataSchema();
...
}
```
`WriteInfo` is implemented by Spark and only called by data source
implementations, so we can add more methods in future versions without
breaking backward compatibility. It also guarantees at compile time that
Spark always provides the additional information, because Spark has to pass
a `WriteInfo` in order to create a builder.
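A rough sketch of how this could fit together, using stand-in types so it
runs on its own: a plain `Map<String, String>` replaces
`CaseInsensitiveStringMap`, a `String` replaces `StructType`, and
`EngineWriteInfo`, `SimpleTable`, `describe`, and the `numPartitions()`
accessor are hypothetical names, not part of any existing Spark API.

```java
import java.util.Map;

// Sketch only: simplified stand-ins, not the real Spark connector API.
final class WriteInfoSketch {

    // Implemented by the engine, only read by data sources.
    interface WriteInfo {
        String queryId();
        String inputDataSchema(); // stand-in for StructType
        // Hypothetical later addition: sources that never call it are unaffected.
        int numPartitions();
    }

    interface WriteBuilder {
        String describe(); // hypothetical helper for the example
    }

    interface Table {
        // The engine cannot call this without supplying a WriteInfo.
        WriteBuilder newWriteBuilder(Map<String, String> options, WriteInfo info);
    }

    // Engine-side implementation: the compiler rejects it if any method is missing.
    static final class EngineWriteInfo implements WriteInfo {
        private final String queryId;
        private final String schema;
        private final int numPartitions;

        EngineWriteInfo(String queryId, String schema, int numPartitions) {
            this.queryId = queryId;
            this.schema = schema;
            this.numPartitions = numPartitions;
        }

        @Override public String queryId() { return queryId; }
        @Override public String inputDataSchema() { return schema; }
        @Override public int numPartitions() { return numPartitions; }
    }

    // A data source that only cares about the query id and schema.
    static final class SimpleTable implements Table {
        @Override
        public WriteBuilder newWriteBuilder(Map<String, String> options, WriteInfo info) {
            return () -> "query " + info.queryId() + " with schema " + info.inputDataSchema();
        }
    }
}
```

In this shape a data source never implements `WriteInfo`, so adding an
accessor only requires a change on the Spark side.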
If you guys think it makes sense, we can do it in a followup.