[ https://issues.apache.org/jira/browse/BEAM-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253214#comment-16253214 ]

Tim Robertson edited comment on BEAM-3192 at 11/15/17 10:05 AM:
----------------------------------------------------------------

A use case for this is iterative algorithms that repeatedly merge RDDs: by 
colocating the RDDs that will be merged, you can often achieve significant 
performance improvements, since matching keys end up in the same partition and 
the merge avoids a shuffle.
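As a minimal sketch of the colocation idea (example data and names are mine, not GBIF's): when two pair RDDs share the same partitioner, a subsequent {{join}} is a narrow dependency and Spark does not reshuffle either side.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

object ColocateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "colocate-sketch")
    val partitioner = new HashPartitioner(8)

    // Partition both RDDs with the *same* partitioner and cache them;
    // records with equal keys now live in the same partition.
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(partitioner).cache()
    val right = sc.parallelize(Seq(1 -> "x", 2 -> "y")).partitionBy(partitioner).cache()

    // Because the partitioners match, this join needs no shuffle.
    left.join(right).collect().foreach(println)
    sc.stop()
  }
}
```

Requires a Spark runtime on the classpath; shown here only to illustrate why exposing the partitioner matters.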

One implementation is the maps on [GBIF.org|https://www.gbif.org] (e.g. 
[Animals|https://www.gbif.org/species/1], 
[Birds|https://www.gbif.org/species/212], 
[Sparrows|https://www.gbif.org/species/2492321]) which are recalculated every 
few hours in Spark jobs coordinated by Oozie, and persisted in HBase.  This 
relies on using Spark partitioning to [merge zoom levels up to world 
views|https://github.com/gbif/maps/blob/master/spark-process/src/main/scala/org/gbif/maps/spark/BackfillTiles.scala#L142]
 efficiently.  

Another use case might be building HFiles offline in Spark for [efficient 
loading into 
HBase|http://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/],
 which requires a {{repartitionAndSortWithinPartitions}} operation.
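A sketch of that operation (with made-up row keys, not an actual HFile writer): HBase bulk load expects cells sorted by row key within each region, and {{repartitionAndSortWithinPartitions}} produces exactly that layout in a single shuffle.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

object HFilePrepSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "hfile-prep-sketch")
    val cells = sc.parallelize(Seq("row3" -> "v3", "row1" -> "v1", "row2" -> "v2"))

    // In a real job the partitioner would map row keys to target HBase
    // regions; a HashPartitioner stands in here for illustration. Each
    // output partition is sorted by key, ready to stream into an HFile.
    val sorted = cells.repartitionAndSortWithinPartitions(new HashPartitioner(2))
    sorted.collect().foreach(println)
    sc.stop()
  }
}
```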



> Be able to specify the Spark Partitioner via the pipeline options
> -----------------------------------------------------------------
>
>                 Key: BEAM-3192
>                 URL: https://issues.apache.org/jira/browse/BEAM-3192
>             Project: Beam
>          Issue Type: New Feature
>          Components: runner-spark
>            Reporter: Jean-Baptiste Onofré
>            Assignee: Jean-Baptiste Onofré
>
> As we did for the StorageLevel, it would be great for a user to be able to 
> provide the Spark partitioner via PipelineOptions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
