[ https://issues.apache.org/jira/browse/PIG-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liyunzhang_intel updated PIG-5212:
----------------------------------
    Attachment: PIG-5212.patch

After applying PIG-5212.patch, the Spark plan changes to:
{code}
- scope-57--------
scope-51->scope-71
scope-56->scope-71
scope-71
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------

Spark node scope-51
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage) - scope-52
|
|---a: Load(hdfs://zly1.sh.intel.com:8020/user/root/studenttab10k.mk:org.apache.pig.builtin.PigStorage) - scope-36--------

Spark node scope-71
c: Store(hdfs://zly1.sh.intel.com:8020/user/root/skewed.out:org.apache.pig.builtin.PigStorage) - scope-50
|
|---c: SkewedJoin[tuple] - scope-49
    |   |
    |   Project[bytearray][0] - scope-47
    |   |
    |   Project[bytearray][0] - scope-48
    |
    |---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage) - scope-36
    |
    |---b: Filter[bag] - scope-42
        |   |
        |   Greater Than[boolean] - scope-46
        |   |
        |   |---Cast[int] - scope-44
        |   |   |
        |   |   |---Project[bytearray][1] - scope-43
        |   |
        |   |---Constant(25) - scope-45
        |
        |---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage) - scope-54--------

Spark node scope-56
BroadcastSpark - scope-70
|
|---New For Each(false)[tuple] - scope-69
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.PartitionSkewedKeys)[tuple] - scope-68
    |   |
    |   |---Project[tuple][*] - scope-67
    |
    |---New For Each(false,false)[tuple] - scope-66
        |   |
        |   Constant(7) - scope-65
        |   |
        |   Project[bag][1] - scope-64
        |
        |---POSparkSort[tuple]() - scope-49
            |   |
            |   Project[bytearray][0] - scope-47
            |
            |---New For Each(false,true)[tuple] - scope-63
                |   |
                |   Project[bytearray][0] - scope-47
                |   |
                |   POUserFunc(org.apache.pig.impl.builtin.GetMemNumRows)[tuple] - scope-61
                |   |
                |   |---Project[tuple][*] - scope-60
                |
                |---PoissonSampleSpark - scope-62
                    |
                    |---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-2120872783/tmp-1165619913:org.apache.pig.impl.io.InterStorage) - scope-57--------
{code}
The difference between the current and the previous Spark plan is that the predecessors of SkewedJoin (scope-49) are Load (scope-36) and Filter (scope-42). The patch sets the operatorKey of the POLoad in SparkCompiler#startNew when SparkCompiler#visitSplit is called (the POLoad in Spark node scope-74 has the same OperatorKey as the POLoad in Spark node scope-51).

> SkewedJoin_6 is failing on Spark
> --------------------------------
>
>                 Key: PIG-5212
>                 URL: https://issues.apache.org/jira/browse/PIG-5212
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Nandor Kollar
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-5212.patch
>
>
> The results are different:
> {code}
> diff <(head -20 SkewedJoin_6_benchmark.out/out_sorted) <(head -20 SkewedJoin_6.out/out_sorted)
> < alice allen 19 1.930 alice allen 27 1.950
> < alice allen 19 1.930 alice allen 34 1.230
> < alice allen 19 1.930 alice allen 36 2.270
> < alice allen 19 1.930 alice allen 38 0.810
> < alice allen 19 1.930 alice allen 38 1.800
> < alice allen 19 1.930 alice allen 42 2.460
> < alice allen 19 1.930 alice allen 43 0.880
> < alice allen 19 1.930 alice allen 45 2.800
> < alice allen 19 1.930 alice allen 46 3.970
> < alice allen 19 1.930 alice allen 51 1.080
> < alice allen 19 1.930 alice allen 68 3.390
> < alice allen 19 1.930 alice allen 68 3.510
> < alice allen 19 1.930 alice allen 72 1.750
> < alice allen 19 1.930 alice allen 72 3.630
> < alice allen 19 1.930 alice allen 74 0.020
> < alice allen 19 1.930 alice allen 74 2.400
> < alice allen 19 1.930 alice allen 77 2.520
> < alice allen 20 2.470 alice allen 27 1.950
> < alice allen 20 2.470 alice allen 34 1.230
> < alice allen 20 2.470 alice allen 36 2.270
> ---
> > alice allen 27 1.950 alice allen 19 1.930
> > alice allen 27 1.950 alice allen 20 2.470
> > alice allen 27 1.950 alice allen 27 1.950
> > alice allen 27 1.950 alice allen 34 1.230
> > alice allen 27 1.950 alice allen 36 2.270
> > alice allen 27 1.950 alice allen 38 0.810
> > alice allen 27 1.950 alice allen 38 1.800
> > alice allen 27 1.950 alice allen 42 2.460
> > alice allen 27 1.950 alice allen 43 0.880
> > alice allen 27 1.950 alice allen 45 2.800
> > alice allen 27 1.950 alice allen 46 3.970
> > alice allen 27 1.950 alice allen 51 1.080
> > alice allen 27 1.950 alice allen 68 3.390
> > alice allen 27 1.950 alice allen 68 3.510
> > alice allen 27 1.950 alice allen 72 1.750
> > alice allen 27 1.950 alice allen 72 3.630
> > alice allen 27 1.950 alice allen 74 0.020
> > alice allen 27 1.950 alice allen 74 2.400
> > alice allen 27 1.950 alice allen 77 2.520
> > alice allen 34 1.230 alice allen 19 1.930
> {code}
> It looks like the two tables are in the wrong order: columns from 'a' should come first, then columns from 'b'. In Spark mode this is inverted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)