[ https://issues.apache.org/jira/browse/SPARK-29248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938320#comment-16938320 ]
Ximo Guanter commented on SPARK-29248:
--------------------------------------
In general, the number of partitions can be used to provision whatever is
needed before starting the actual write. Examples:
* Creating a Kafka topic with the same number of partitions as the RDD (that's
my use case; see the sketch after this list)
* Appropriately scaling a microservice that will receive the data via TCP
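To illustrate the first bullet, here is a minimal sketch of the provisioning
step I have in mind, assuming the writer were handed the partition count. The
topic name, bootstrap servers and replication factor are placeholders, and
provisionTopic is a made-up helper, not part of any existing connector:

{code:scala}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

// Hypothetical provisioning step: create the target topic with as many
// partitions as the incoming data, before any write task starts.
// `numPartitions` is the value the WriteBuilder would need to receive from Spark.
def provisionTopic(bootstrapServers: String, topic: String, numPartitions: Int): Unit = {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers)
  val admin = AdminClient.create(props)
  try {
    // replication factor 1 is a placeholder for the sketch
    val newTopic = new NewTopic(topic, numPartitions, 1.toShort)
    admin.createTopics(Collections.singleton(newTopic)).all().get()
  } finally {
    admin.close()
  }
}
{code}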
The fundamental problem for me is the asymmetry between what we require from a
reader implementation and what we provide to a writer implementation. Say I
want to create a DataSource that connects two Spark clusters (for example,
because for legal reasons some pre-processing of the data has to happen at a
3rd party company, and they are also using Spark). Right now that DataSource
can't be built, because the reader has no way to know how many partitions the
writer has: the writer is never told its own partition count, so it can't
report it. A Flink->Spark connector can be built because Flink provides the
necessary info on the writer side, but a Spark->Flink connector can't be built
in DSv2.
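For contrast, this is roughly what we require from a reader (a sketch against
the DSv2 batch read interfaces as I understand them; numPartitions and
readerFactory are placeholder inputs). The implementor hands Spark the concrete
partitions, so the partition count is always explicit on the read path:

{code:scala}
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReaderFactory}

// Read-side contrast: the implementor returns the concrete partitions,
// so the partition count of a scan is known up front.
// numPartitions and readerFactory are hypothetical inputs for the sketch.
class ExampleBatch(numPartitions: Int, readerFactory: PartitionReaderFactory) extends Batch {

  // The length of this array *is* the number of partitions of the scan.
  override def planInputPartitions(): Array[InputPartition] =
    Array.tabulate[InputPartition](numPartitions)(_ => new InputPartition {})

  override def createReaderFactory(): PartitionReaderFactory = readerFactory
}
{code}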
Reporting the number of partitions to the writer (the same way the schema is
already being reported) is an easy change and makes for a nicer design imho.
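To make that concrete, here is a rough sketch of the kind of information the
write path could hand to a WriteBuilder (illustrative only: the trait name is
made up and this is not the actual API surface; numPartitions is the accessor
this ticket proposes to add):

{code:scala}
import org.apache.spark.sql.types.StructType

// Illustrative only: the information Spark would hand to a WriteBuilder.
// The schema is already reported to writers today; numPartitions is the
// piece this ticket proposes to add.
trait ProposedWriteInfo {
  def schema: StructType  // already reported to the writer
  def numPartitions: Int  // proposed addition: partition count of the data being written
}
{code}

With something like that in place, a Kafka sink could run the provisioning step
sketched above from its WriteBuilder before any write tasks start.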
> Pass in number of partitions to WriteBuilder
> ---------------------------------------------
>
> Key: SPARK-29248
> URL: https://issues.apache.org/jira/browse/SPARK-29248
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Ximo Guanter
> Priority: Major
>
> When implementing a ScanBuilder, we require the implementor to provide the
> schema of the data and the number of partitions.
> However, when someone is implementing a WriteBuilder, we only pass them the
> schema, not the number of partitions. This is an asymmetrical developer
> experience. Passing in the number of partitions to the WriteBuilder would
> enable data sources to provision their write targets before starting to
> write. For example, it could be used to provision a Kafka topic with a
> specific number of partitions.