Are you just looking for DataFrame.repartition()?

On Tue, Mar 14, 2023 at 10:57 AM Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Hello,
>
> I hope this email finds you well!
>
> I have a simple dataflow in which I read from a kafka topic, perform a map
> transformation and then I write the result to another topic. Based on your
> documentation here
> <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
> I need to work with Dataset data structures. Even though my solution works,
> I need to increase the parallelism. The spark documentation includes a lot
> of parameters that I can change based on specific data structures like
> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
> is the default number of partitions in RDDs returned by transformations
> like join and reduceByKey, while the latter is not recommended for
> structured streaming, as described in the documentation: "Note: For structured
> streaming, this configuration cannot be changed between query restarts from
> the same checkpoint location".
>
> So my question is how can I increase the parallelism for a simple dataflow
> based on datasets with a map transformation only?
>
> I am looking forward to hearing from you as soon as possible. Thanks in
> advance!
>
> Kind regards,
>
> ------------------------------------------------------------------
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> <https://sites.bu.edu/casp/people/ekritharakis/>
>
> Boston University
>
