[ https://issues.apache.org/jira/browse/PIG-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated PIG-5212:
----------------------------------
    Attachment: PIG-5212.patch

After applying PIG-5212.patch, the Spark plan changes to:
{code}
- scope-57--------
scope-51->scope-71
scope-56->scope-71
scope-71
#--------------------------------------------------
# Spark Plan                                 
#--------------------------------------------------

Spark node scope-51
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage)
 - scope-52
|
|---a: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage)
 - scope-36--------

Spark node scope-71
c: 
Store(hdfs://zly1.sh.intel.com:8020/user/root/skewed.out:org.apache.pig.builtin.PigStorage)
 - scope-50
|
|---c: SkewedJoin[tuple] - scope-49
    |   |
    |   Project[bytearray][0] - scope-47
    |   |
    |   Project[bytearray][0] - scope-48
    |
    
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage)
 - scope-36
    |
    |---b: Filter[bag] - scope-42
        |   |
        |   Greater Than[boolean] - scope-46
        |   |
        |   |---Cast[int] - scope-44
        |   |   |
        |   |   |---Project[bytearray][1] - scope-43
        |   |
        |   |---Constant(25) - scope-45
        |
        
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage)
 - scope-54--------

Spark node scope-56
BroadcastSpark - scope-70
|
|---New For Each(false)[tuple] - scope-69
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.PartitionSkewedKeys)[tuple] - 
scope-68
    |   |
    |   |---Project[tuple][*] - scope-67
    |
    |---New For Each(false,false)[tuple] - scope-66
        |   |
        |   Constant(7) - scope-65
        |   |
        |   Project[bag][1] - scope-64
        |
        |---POSparkSort[tuple]() - scope-49
            |   |
            |   Project[bytearray][0] - scope-47
            |
            |---New For Each(false,true)[tuple] - scope-63
                |   |
                |   Project[bytearray][0] - scope-47
                |   |
                |   
POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-61
                |   |
                |   |---Project[tuple][*] - scope-60
                |
                |---PoissonSampleSpark - scope-62
                    |
                    
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage)
 - scope-57--------
{code}

The difference between the current Spark plan and the previous one is that the predecessors 
of SkewedJoin (scope-49) are now Load (scope-36) and Filter (scope-42). The fix sets the 
operatorKey of the POLoad in SparkCompiler#startNew when SparkCompiler#visitSplit 
is called, so that the POLoad in Spark node scope-74 has the same OperatorKey as the POLoad 
in Spark node scope-51.
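To illustrate why reusing the OperatorKey matters, here is a minimal Python model (not the actual Pig code; class and variable names are invented for illustration). Pig identifies physical operators by an OperatorKey (scope, id), and predecessor lookups are keyed on it, so a freshly numbered POLoad in the split plan would not resolve to the load the join expects:

```python
# Illustrative stand-in for org.apache.pig.impl.plan.OperatorKey.
class OperatorKey:
    def __init__(self, scope, op_id):
        self.scope, self.op_id = scope, op_id

    def __eq__(self, other):
        return (self.scope, self.op_id) == (other.scope, other.op_id)

    def __hash__(self):
        return hash((self.scope, self.op_id))

# Predecessor lookup is keyed by OperatorKey: a load that keeps the original
# key resolves to the SkewedJoin's expected input, a renumbered one does not.
original_load = OperatorKey("scope", 36)
reused_load = OperatorKey("scope", 36)   # startNew reuses the key on visitSplit
fresh_load = OperatorKey("scope", 74)    # a newly assigned key would not match

predecessors = {original_load: "input of SkewedJoin scope-49"}
assert reused_load in predecessors       # same key -> predecessor found
assert fresh_load not in predecessors    # new key -> lookup fails
```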

> SkewedJoin_6 is failing on Spark
> --------------------------------
>
>                 Key: PIG-5212
>                 URL: https://issues.apache.org/jira/browse/PIG-5212
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Nandor Kollar
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-5212.patch
>
>
> results are different:
> {code}
> diff <(head -20 SkewedJoin_6_benchmark.out/out_sorted) <(head -20 
> SkewedJoin_6.out/out_sorted)
> < alice allen 19      1.930   alice allen     27      1.950
> < alice allen 19      1.930   alice allen     34      1.230
> < alice allen 19      1.930   alice allen     36      2.270
> < alice allen 19      1.930   alice allen     38      0.810
> < alice allen 19      1.930   alice allen     38      1.800
> < alice allen 19      1.930   alice allen     42      2.460
> < alice allen 19      1.930   alice allen     43      0.880
> < alice allen 19      1.930   alice allen     45      2.800
> < alice allen 19      1.930   alice allen     46      3.970
> < alice allen 19      1.930   alice allen     51      1.080
> < alice allen 19      1.930   alice allen     68      3.390
> < alice allen 19      1.930   alice allen     68      3.510
> < alice allen 19      1.930   alice allen     72      1.750
> < alice allen 19      1.930   alice allen     72      3.630
> < alice allen 19      1.930   alice allen     74      0.020
> < alice allen 19      1.930   alice allen     74      2.400
> < alice allen 19      1.930   alice allen     77      2.520
> < alice allen 20      2.470   alice allen     27      1.950
> < alice allen 20      2.470   alice allen     34      1.230
> < alice allen 20      2.470   alice allen     36      2.270
> ---
> > alice allen 27      1.950   alice allen     19      1.930
> > alice allen 27      1.950   alice allen     20      2.470
> > alice allen 27      1.950   alice allen     27      1.950
> > alice allen 27      1.950   alice allen     34      1.230
> > alice allen 27      1.950   alice allen     36      2.270
> > alice allen 27      1.950   alice allen     38      0.810
> > alice allen 27      1.950   alice allen     38      1.800
> > alice allen 27      1.950   alice allen     42      2.460
> > alice allen 27      1.950   alice allen     43      0.880
> > alice allen 27      1.950   alice allen     45      2.800
> > alice allen 27      1.950   alice allen     46      3.970
> > alice allen 27      1.950   alice allen     51      1.080
> > alice allen 27      1.950   alice allen     68      3.390
> > alice allen 27      1.950   alice allen     68      3.510
> > alice allen 27      1.950   alice allen     72      1.750
> > alice allen 27      1.950   alice allen     72      3.630
> > alice allen 27      1.950   alice allen     74      0.020
> > alice allen 27      1.950   alice allen     74      2.400
> > alice allen 27      1.950   alice allen     77      2.520
> > alice allen 34      1.230   alice allen     19      1.930
> {code}
> It looks like the two tables are in the wrong order: columns from 'a' should come 
> first, then columns from 'b'. In Spark mode this is inverted.
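The expected ordering can be sketched with a minimal Python model of an inner join (relation and column names are taken from the example above; the join function itself is an illustrative stand-in, not Pig's implementation):

```python
def inner_join(left, right, key=0):
    """Inner join on a key column; each output row is the left
    relation's columns followed by the right relation's columns."""
    return [l + r for l in left for r in right if l[key] == r[key]]

# One row from each input, mirroring the diff above.
a = [("alice allen", 19, 1.930)]
b = [("alice allen", 27, 1.950)]

# Columns from 'a' must come first, then columns from 'b'.
print(inner_join(a, b))
# → [('alice allen', 19, 1.93, 'alice allen', 27, 1.95)]
```

The bug reported here is that the Spark skewed-join output instead looks like `inner_join(b, a)`, which is why the rows match the benchmark only after swapping the two halves.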



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)