[ https://issues.apache.org/jira/browse/SPARK-29248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938320#comment-16938320 ]
Ximo Guanter commented on SPARK-29248:
--------------------------------------
In general, the number of partitions can be used to provision whatever is
needed before starting the actual write. Examples:
* Creating a Kafka topic with the same number of partitions as the RDD (that's
my use case; see the sketch after this list)
* Appropriately scaling a microservice that will receive the data via TCP
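To illustrate the first bullet, here is a minimal sketch of the provisioning
step I have in mind, assuming the writer were handed the partition count. The
topic name, bootstrap servers and replication factor are placeholders, and
provisionTopic is a made-up helper, not part of any existing connector:

{code:scala}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

// Hypothetical provisioning step: create the target topic with as many
// partitions as the incoming data, before any write task starts.
// `numPartitions` is the value the WriteBuilder would need to receive from Spark.
def provisionTopic(bootstrapServers: String, topic: String, numPartitions: Int): Unit = {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
  val admin = AdminClient.create(props)
  try {
    // replication factor 1 is a placeholder for the sketch
    val newTopic = new NewTopic(topic, numPartitions, 1.toShort)
    admin.createTopics(Collections.singleton(newTopic)).all().get()
  } finally {
    admin.close()
  }
}
{code}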
The fundamental problem for me is the asymmetry between what we require from a
reader implementation and what we provide to a writer implementation. Say I
want to create a DataSource that connects two Spark clusters (for example,
because for legal reasons some pre-processing of the data has to happen at a
3rd party company, and they are also using Spark). Right now that DataSource
can't be built, because the reader has no way to know how many partitions the
writer has: the writer is never told its own partition count, so it can't
report it. A Flink->Spark connector can be built because Flink provides the
necessary info on the writer side, but a Spark->Flink connector can't be built
in DSv2.
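For contrast, this is roughly what we require from a reader (a sketch against
the DSv2 batch read interfaces as I understand them; numPartitions and
readerFactory are placeholder inputs). The implementor hands Spark the concrete
partitions, so the partition count is always explicit on the read path:

{code:scala}
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReaderFactory}

// Read-side contrast: the implementor returns the concrete partitions,
// so the partition count of a scan is known up front.
// numPartitions and readerFactory are hypothetical inputs for the sketch.
class ExampleBatch(numPartitions: Int, readerFactory: PartitionReaderFactory) extends Batch {

  // The length of this array *is* the number of partitions of the scan.
  override def planInputPartitions(): Array[InputPartition] =
    Array.tabulate[InputPartition](numPartitions)(_ => new InputPartition {})

  override def createReaderFactory(): PartitionReaderFactory = readerFactory
}
{code}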
Reporting the number of partitions to the writer (the same way the schema is
already being reported) is an easy change and makes for a nicer design imho.
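To make that concrete, here is a rough sketch of the kind of information the
write path could hand to a WriteBuilder (illustrative only: the trait name is
made up and this is not the actual API surface; numPartitions is the accessor
this ticket proposes to add):

{code:scala}
import org.apache.spark.sql.types.StructType

// Illustrative only: the information Spark would hand to a WriteBuilder.
// The schema is already reported to writers today; numPartitions is the
// piece this ticket proposes to add.
trait ProposedWriteInfo {
  def schema: StructType  // already reported to the writer
  def numPartitions: Int  // proposed addition: partition count of the data being written
}
{code}

With something like that in place, a Kafka sink could run the provisioning step
sketched above from its WriteBuilder before any write tasks start.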
> Pass in number of partitions to WriteBuilder
> ---------------------------------------------
>
> Key: SPARK-29248
> URL: https://issues.apache.org/jira/browse/SPARK-29248
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Ximo Guanter
> Priority: Major
>
> When implementing a ScanBuilder, we require the implementor to provide the
> schema of the data and the number of partitions.
> However, when someone is implementing a WriteBuilder, we only pass them the
> schema, not the number of partitions. This is an asymmetrical developer
> experience. Passing in the number of partitions to the WriteBuilder would
> enable data sources to provision their write targets before starting to
> write. For example, it could be used to provision a Kafka topic with a
> specific number of partitions.