[
https://issues.apache.org/jira/browse/PIG-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631391#comment-15631391
]
liyunzhang_intel commented on PIG-5052:
---------------------------------------
[~szita]: thanks for the review.
{quote}
I think sparkContext.getConf().getAppId() will return the same value for the
same spark context. That means that (since we're not creating a new spark
context every time we run a job) that more jobs will get the same ID. Would
that still be fine for our use cases (etc. org.apache.pig.builtin.RANDOM#exec) ?
{quote}
Currently in Pig on Spark, in most cases one physical plan is converted to
one Spark job, except in multiquery cases like:
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = foreach A generate id,name,RANDOM();
C = foreach A generate name,n,RANDOM();
store B into './multiQ.1.out';
store C into './multiQ.2.out';
explain B;
{code}
{code}
Spark node scope-36
Split - scope-42
| |
| B: Store(hdfs://zly1.sh.intel.com:8020/user/root/multiQ.1.out:org.apache.pig.builtin.PigStorage) - scope-26
| |
| |---B: New For Each(false,false,false)[bag] - scope-25
| | |
| | Project[bytearray][0] - scope-20
| | |
| | Project[bytearray][1] - scope-22
| | |
| | POUserFunc(org.apache.pig.builtin.RANDOM)[double] - scope-24
| |
| C: Store(hdfs://zly1.sh.intel.com:8020/user/root/multiQ.2.out:org.apache.pig.builtin.PigStorage) - scope-35
| |
| |---C: New For Each(false,false,false)[bag] - scope-34
| | |
| | Project[bytearray][1] - scope-29
| | |
| | Project[bytearray][2] - scope-31
| | |
| | POUserFunc(org.apache.pig.builtin.RANDOM)[double] - scope-33
|
|---A: New For Each(false,false,false)[bag] - scope-16
| |
| Project[bytearray][0] - scope-10
| |
| Project[bytearray][1] - scope-12
| |
| Project[bytearray][2] - scope-14
|
|---A: Load(hdfs://zly1.sh.intel.com:8020/user/root/SkewedJoinInput1.txt:org.apache.pig.builtin.PigStorage) - scope-9
{code}
This multiquery case generates two Spark jobs, but they share the same
application id. What you pointed out is really a good catch, but I think it
will *not* influence the output of RANDOM#exec: in the multiquery case the
Spark application id is actually the closer analogue of the MR jobId, because
the same script generates only one MR job in MR mode.
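To illustrate the difference between the two approaches in isolation, here is a minimal sketch (plain Java, with a `Map` standing in for a Hadoop `JobConf` and a made-up string standing in for `sc.applicationId` — not the actual Pig implementation): the current `UUID.randomUUID()` initialization yields a different `mapreduce.job.id` on every call, while deriving it from the application id gives every job of the same Spark application the same value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class JobIdSketch {
    // Stand-in for MRConfiguration.JOB_ID
    static final String JOB_ID = "mapreduce.job.id";

    // Mimics SparkUtil#newJobConf today: a fresh random string per conf
    static Map<String, String> newJobConfRandom() {
        Map<String, String> conf = new HashMap<>();
        conf.put(JOB_ID, UUID.randomUUID().toString());
        return conf;
    }

    // Proposed direction: reuse the application-level id
    static Map<String, String> newJobConfFromAppId(String appId) {
        Map<String, String> conf = new HashMap<>();
        conf.put(JOB_ID, appId);
        return conf;
    }

    public static void main(String[] args) {
        // Two confs built in the same "application" get unrelated ids today
        String a = newJobConfRandom().get(JOB_ID);
        String b = newJobConfRandom().get(JOB_ID);
        System.out.println(!a.equals(b)); // true

        // ...but a stable, shared id once derived from the application id
        String appId = "app-20161101-0001"; // hypothetical Spark app id
        String c = newJobConfFromAppId(appId).get(JOB_ID);
        String d = newJobConfFromAppId(appId).get(JOB_ID);
        System.out.println(c.equals(d)); // true
    }
}
```

With the second variant, both Spark jobs produced by the multiquery plan above would see the same `mapreduce.job.id`, which matches what a single MR job would have provided.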
> Initialize MRConfiguration.JOB_ID in spark mode correctly
> ---------------------------------------------------------
>
> Key: PIG-5052
> URL: https://issues.apache.org/jira/browse/PIG-5052
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5052.patch
>
>
> Currently, we initialize MRConfiguration.JOB_ID in SparkUtil#newJobConf;
> we just set the value to a random string.
> {code}
> jobConf.set(MRConfiguration.JOB_ID, UUID.randomUUID().toString());
> {code}
> We need to find a Spark API to initialize it correctly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)