[
https://issues.apache.org/jira/browse/PIG-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432343#comment-15432343
]
liyunzhang_intel commented on PIG-4970:
---------------------------------------
[~kexianda]:
Yes, the previous code set "jobConf.set("pig.cachedbag.type","default")" in
PackageConverter.java for the join/group case. But when I searched the MR and
Tez code, neither sets this item explicitly, and no unit test fails even when I
remove it from the job configuration. So is there any reason we need to set
this item? If we still need it, we should add it to the job configuration in
both PackageConverter and JoinGroupSparkConverter, because in the previous way
this item was set on only one spark operator, whereas in the current way it is
set for the whole spark plan.
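To illustrate the point above, here is a minimal sketch of setting that item on the job configuration. It uses java.util.Properties in place of the Hadoop JobConf (which is not assumed here), and the class and method names (CachedBagConfSketch, configureCachedBag) are hypothetical, not names from the Pig codebase:

```java
import java.util.Properties;

// Hypothetical sketch: Properties stands in for org.apache.hadoop.mapred.JobConf.
// In PackageConverter/JoinGroupSparkConverter the equivalent call would be
// jobConf.set("pig.cachedbag.type", "default").
public class CachedBagConfSketch {
    static void configureCachedBag(Properties jobConf) {
        // Only apply the default when the user has not overridden it,
        // so the same setting is seen consistently by the whole plan.
        if (jobConf.getProperty("pig.cachedbag.type") == null) {
            jobConf.setProperty("pig.cachedbag.type", "default");
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        configureCachedBag(conf);
        System.out.println(conf.getProperty("pig.cachedbag.type"));
    }
}
```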
> Remove the deserialize and serialization of JobConf in code for spark mode
> --------------------------------------------------------------------------
>
> Key: PIG-4970
> URL: https://issues.apache.org/jira/browse/PIG-4970
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4970.patch
>
>
> Now we use KryoSerializer to serialize the jobConf in
> [SparkLauncher|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java#L191].
> Then we deserialize it in
> [ForEachConverter|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ForEachConverter.java#L83] and
> [StreamConverter|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/StreamConverter.java#L70].
> We serialize and deserialize the jobConf in order to make it
> available in the spark executor thread.
> We can refactor this in the following ways:
> 1. Let spark broadcast the jobConf in
> [sparkContext.newAPIHadoopRDD|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LoadConverter.java#L102].
> Here we should not create a new jobConf and load properties from PigContext,
> but directly use the jobConf from SparkLauncher.
> 2. Get the jobConf in
> [org.apache.pig.backend.hadoop.executionengine.spark.running.PigInputFormatSpark#createRecordReader|https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/running/PigInputFormatSpark.java#L42]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)