[
https://issues.apache.org/jira/browse/BEAM-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253214#comment-16253214
]
Tim Robertson commented on BEAM-3192:
-------------------------------------
A use case for this is iterative algorithms that require merging RDDs. In some
cases you can gain significant performance by colocating the RDDs that will be
merged.
One implementation is the maps on GBIF.org (e.g.
[Animals|https://www.gbif.org/species/1],
[Birds|https://www.gbif.org/species/212],
[Sparrows|https://www.gbif.org/species/2492321]) which are recalculated every
few hours in Spark jobs coordinated by Oozie, and persisted in HBase. This
relies on using Spark partitioning to [merge zoom levels up to world
views|https://github.com/gbif/maps/blob/master/spark-process/src/main/scala/org/gbif/maps/spark/BackfillTiles.scala#L142]
efficiently.
Another use case is building HFiles offline in Spark for [efficient
loading into
HBase|http://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/],
which requires a {{repartitionAndSortWithinPartitions}} operation.
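To illustrate why that operation matters for HFile writing, here is a minimal plain-Java sketch (no Spark dependency) of the {{repartitionAndSortWithinPartitions}} semantics: records are routed to partitions by key hash, then each partition is sorted independently. The class and method names are illustrative, not part of any real API; a real job would use Spark's {{OrderedRDDFunctions.repartitionAndSortWithinPartitions}} with an {{org.apache.spark.Partitioner}}.

```java
import java.util.*;

public class RepartitionSortSketch {

    // Stand-in for Spark's HashPartitioner: maps a key to a partition index.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Sketch of repartitionAndSortWithinPartitions: group records into
    // partitions by key hash, then sort each partition's records by key.
    // The sort is per partition, not global -- which is exactly what an
    // HFile writer needs: each task sees its keys in ascending order.
    static Map<Integer, List<Map.Entry<String, Integer>>> repartitionAndSort(
            List<Map.Entry<String, Integer>> records, int numPartitions) {
        Map<Integer, List<Map.Entry<String, Integer>>> partitions = new TreeMap<>();
        for (Map.Entry<String, Integer> record : records) {
            partitions
                .computeIfAbsent(partitionFor(record.getKey(), numPartitions),
                                 p -> new ArrayList<>())
                .add(record);
        }
        partitions.values().forEach(p -> p.sort(Map.Entry.comparingByKey()));
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
            Map.entry("zebra", 1), Map.entry("ant", 2),
            Map.entry("moth", 3), Map.entry("bee", 4));
        // Each printed partition lists its records in key order.
        repartitionAndSort(records, 2)
            .forEach((p, rs) -> System.out.println(p + " -> " + rs));
    }
}
```

If the Spark runner let users supply a custom Partitioner via PipelineOptions, the hash function above is the piece they would replace, e.g. to keep adjacent tile keys in the same partition.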
> Be able to specify the Spark Partitioner via the pipeline options
> -----------------------------------------------------------------
>
> Key: BEAM-3192
> URL: https://issues.apache.org/jira/browse/BEAM-3192
> Project: Beam
> Issue Type: New Feature
> Components: runner-spark
> Reporter: Jean-Baptiste Onofré
> Assignee: Jean-Baptiste Onofré
>
> As we did for the StorageLevel, it would be great for a user to be able to
> provide the Spark partitioner via PipelineOptions.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)