[ 
https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419203#comment-16419203
 ] 

Vinoth Chandar edited comment on GOBBLIN-385 at 3/29/18 3:33 PM:
-----------------------------------------------------------------

At a high level, it seems like we need two new classes:

 - *CliSparkJobLauncher*: just a thin wrapper around 
ServiceBasedApplicationLauncher and SparkJobLauncher

 - *SparkJobLauncher*: I think we should be able to extend MrJobLauncher and 
override `runWorkUnits` alone (which, I believe, is where the parallel work 
happens). This class can create the SparkContext (there can be only one per 
JVM) and simply reuse the existing input/output formats via 
SparkContext.newAPIHadoopRDD and JavaPairRDD.saveAsNewAPIHadoopFile; see the 
sketch after the API links below.

 

[http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaPairRDD.html#saveAsNewAPIHadoopFile-java.lang.String-java.lang.Class-java.lang.Class-java.lang.Class-]

[https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/api/java/JavaSparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class)]
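 

To make this concrete, here is a minimal, self-contained sketch of that 
read-process-write pattern, using stock Hadoop text formats as stand-ins. The 
class name, the text formats, and the pass-through pipeline are illustrative 
assumptions only; the real SparkJobLauncher would plug in Gobblin's own 
input/output formats and work-unit processing instead:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative stand-in for the proposed SparkJobLauncher; not Gobblin code.
public class SparkJobLauncherSketch {

  public static void main(String[] args) {
    String inputDir = args[0];
    String outputDir = args[1];

    // There can be only one SparkContext per JVM, so the launcher would
    // create it once and reuse it. Local master is just for this sketch; a
    // real deployment would get it from spark-submit.
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("gobblin-spark-sketch").setMaster("local[*]"));

    Configuration hadoopConf = new Configuration();
    hadoopConf.set("mapreduce.input.fileinputformat.inputdir", inputDir);

    // Reuse an existing new-API Hadoop InputFormat to load records as an RDD;
    // in the real launcher this is where runWorkUnits would parallelize
    // Gobblin's work units.
    JavaPairRDD<LongWritable, Text> records = jsc.newAPIHadoopRDD(
        hadoopConf, TextInputFormat.class, LongWritable.class, Text.class);

    // Per-record extract/convert/write logic would go here; this sketch is a
    // pass-through.

    // Persist through an existing new-API Hadoop OutputFormat.
    records.saveAsNewAPIHadoopFile(outputDir, LongWritable.class, Text.class,
        TextOutputFormat.class, hadoopConf);

    jsc.stop();
  }
}
{code}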

 

The above seems simple enough; I'm sure that once we actually try it, we'll 
run into standard Spark issues like NotSerializableException or some 
MR-specific code paths.
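 

For reference, the usual shape of the NotSerializable fix: any per-record 
logic shipped to executors has to live in a small serializable function class 
rather than capture the (non-serializable) launcher in a closure. The class 
name below is hypothetical:

{code:java}
import org.apache.spark.api.java.function.Function;

// Spark's Function interface already extends java.io.Serializable. Keeping
// the logic in a self-contained class like this (instead of a lambda that
// captures the launcher) avoids NotSerializableException at task serialization.
class UpperCaseFn implements Function<String, String> {
  @Override
  public String call(String record) {
    return record.toUpperCase(); // pure, self-contained per-record logic
  }
}
{code}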

 

Do you have the Samza patch sitting in a diff somewhere? It could be useful 
to check out before I embark on trying this out for real.

> Add Spark execution mode for Gobblin
> ------------------------------------
>
>                 Key: GOBBLIN-385
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-385
>             Project: Apache Gobblin
>          Issue Type: New Feature
>          Components: gobblin-cluster
>            Reporter: Vinoth Chandar
>            Assignee: Hung Tran
>            Priority: Major
>
> If there is interest, I'm happy to contribute a Spark execution mode and 
> eventually add support for ingesting data into the 
> [https://github.com/uber/hudi] format. Please provide some guidance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)