[
https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631391#comment-15631391
]
liyunzhang_intel commented on PIG-5052:
---------------------------------------
[~szita]: thanks for the review.
{quote}
I think sparkContext.getConf().getAppId() will return the same value for the
same spark context. That means that (since we're not creating a new spark
context every time we run a job) that more jobs will get the same ID. Would
that still be fine for our use cases (etc. org.apache.pig.builtin.RANDOM#exec) ?
{quote}
Currently in Pig on Spark, in most cases one physical plan is converted to
one Spark job, except in multiquery cases like:
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = foreach A generate id,name,RANDOM();
C = foreach A generate name,n,RANDOM();
store B into './multiQ.1.out';
store C into './multiQ.2.out';
explain B;
{code}
{code}
Spark node scope-36
Split - scope-42
| |
| B: Store(hdfs://zly1.sh.intel.com:8020/user/root/multiQ.1.out:org.apache.pig.builtin.PigStorage) - scope-26
| |
| |---B: New For Each(false,false,false)[bag] - scope-25
| | |
| | Project[bytearray][0] - scope-20
| | |
| | Project[bytearray][1] - scope-22
| | |
| | POUserFunc(org.apache.pig.builtin.RANDOM)[double] - scope-24
| |
| C: Store(hdfs://zly1.sh.intel.com:8020/user/root/multiQ.2.out:org.apache.pig.builtin.PigStorage) - scope-35
| |
| |---C: New For Each(false,false,false)[bag] - scope-34
| | |
| | Project[bytearray][1] - scope-29
| | |
| | Project[bytearray][2] - scope-31
| | |
| | POUserFunc(org.apache.pig.builtin.RANDOM)[double] - scope-33
|
|---A: New For Each(false,false,false)[bag] - scope-16
| |
| Project[bytearray][0] - scope-10
| |
| Project[bytearray][1] - scope-12
| |
| Project[bytearray][2] - scope-14
|
|---A: Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput1.txt:org.apache.pig.builtin.PigStorage) - scope-9
{code}
This multiquery case generates two Spark jobs, but they share the same
application id. What you pointed out is really a good catch, but I think it
will *not* influence the output of RANDOM#exec: in the multiquery case the
Spark application id is actually the closer analogue of the MR jobId, because
the same script generates only one MR job in MR mode.
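To illustrate the difference between the two approaches in isolation, here is a minimal sketch (plain Java, with a `Map` standing in for a Hadoop `JobConf` and a made-up string standing in for `sc.applicationId` — not the actual Pig implementation): the current `UUID.randomUUID()` initialization yields a different `mapreduce.job.id` on every call, while deriving it from the application id gives every job of the same Spark application the same value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class JobIdSketch {
    // Stand-in for MRConfiguration.JOB_ID
    static final String JOB_ID = "mapreduce.job.id";

    // Mimics SparkUtil#newJobConf today: a fresh random string per conf
    static Map<String, String> newJobConfRandom() {
        Map<String, String> conf = new HashMap<>();
        conf.put(JOB_ID, UUID.randomUUID().toString());
        return conf;
    }

    // Proposed direction: reuse the application-level id
    static Map<String, String> newJobConfFromAppId(String appId) {
        Map<String, String> conf = new HashMap<>();
        conf.put(JOB_ID, appId);
        return conf;
    }

    public static void main(String[] args) {
        // Two confs built in the same "application" get unrelated ids today
        String a = newJobConfRandom().get(JOB_ID);
        String b = newJobConfRandom().get(JOB_ID);
        System.out.println(!a.equals(b)); // true

        // ...but a stable, shared id once derived from the application id
        String appId = "app-20161101-0001"; // hypothetical Spark app id
        String c = newJobConfFromAppId(appId).get(JOB_ID);
        String d = newJobConfFromAppId(appId).get(JOB_ID);
        System.out.println(c.equals(d)); // true
    }
}
```

With the second variant, both Spark jobs produced by the multiquery plan above would see the same `mapreduce.job.id`, which matches what a single MR job would have provided.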
> Initialize MRConfiguration.JOB_ID in spark mode correctly
> ---------------------------------------------------------
>
> Key: PIG-5052
> URL: https://issues.apache.org/jira/browse/PIG-5052
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5052.patch
>
>
> Currently, we initialize MRConfiguration.JOB_ID in SparkUtil#newJobConf;
> we just set the value to a random string.
> {code}
> jobConf.set(MRConfiguration.JOB_ID, UUID.randomUUID().toString());
> {code}
> We need to find a Spark API to initialize it correctly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)