[
https://issues.apache.org/jira/browse/BEAM-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ismaël Mejía reassigned BEAM-3022:
----------------------------------
Assignee: (was: Amit Sela)
> Enable the ability to grow partition count in the underlying Spark RDD
> ----------------------------------------------------------------------
>
> Key: BEAM-3022
> URL: https://issues.apache.org/jira/browse/BEAM-3022
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Reporter: Tim Robertson
> Priority: Major
>
> When using a {{HadoopInputFormatIO}}, the number of splits seems to be
> controlled by the underlying {{InputFormat}}, which in turn determines the
> number of partitions and therefore the parallelisation when running on Spark.
> It is possible to {{Reshuffle}} the data to compensate for data skew, but it
> _appears_ there is no way to grow the number of partitions.
> {{GroupCombineFunctions.reshuffle}} seems to be the only place that calls
> Spark's {{repartition}}, and it uses the partition count of the original
> RDD.
> Scenarios that would benefit from this:
> # Increasing parallelisation for computationally heavy stages
> # ETLs where the input partitions are dictated by the source while you wish
> to optimise the partitions for fast loading to the target sink
> # Zip files (my case), which are read in a single-threaded manner with a
> custom {{HadoopInputFormat}} and therefore get a single task for all stages
> (It would be nice if a user could supply a partitioner too, to help dictate
> data locality)
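To make the requested behaviour concrete: a minimal, self-contained sketch of what "growing the partition count" means, modelling partitions as plain lists and redistributing records round-robin the way Spark's {{repartition(n)}} conceptually does. The class and method names here are illustrative only; this is not Beam or Spark runner code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class RepartitionSketch {

    // Redistribute all records round-robin into `target` partitions.
    // A toy model of repartition(n): the record set is unchanged, only
    // the partition count (and hence available parallelism) grows.
    static <T> List<List<T>> repartition(List<List<T>> partitions, int target) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < target; i++) {
            out.add(new ArrayList<>());
        }
        int next = 0;
        for (List<T> partition : partitions) {
            for (T record : partition) {
                out.get(next++ % target).add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One input split, e.g. a zip file read single-threaded,
        // yields a single partition of 8 records.
        List<List<Integer>> single = List.of(
                IntStream.range(0, 8).boxed().collect(Collectors.toList()));

        // Growing to 4 partitions would allow 4 parallel tasks downstream.
        List<List<Integer>> grown = repartition(single, 4);
        System.out.println(grown.size()); // prints 4
        System.out.println(grown);
    }
}
```

The issue, as described above, is that the Spark runner's reshuffle passes the *original* RDD's partition count to {{repartition}}, so the equivalent of {{repartition(single.size())}} is always performed and the count can never grow.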
--
This message was sent by Atlassian Jira
(v8.3.4#803005)