[ https://issues.apache.org/jira/browse/PIG-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288630#comment-15288630 ]
liyunzhang_intel commented on PIG-4898:
---------------------------------------

The following unit tests fail:
1. org.apache.pig.test.TestFRJoin.testDistinctFRJoin
2. org.apache.pig.test.TestPigRunner.simpleMultiQueryTest3

#1 fails because the following line is missing, which causes an NPE.
SparkCompiler#visitDistinct
{code}
public void visitDistinct(PODistinct op) throws VisitorException {
    try {
        addToPlan(op);
+       phyToSparkOpMap.put(op, curSparkOp);
    } catch (Exception e) {
        int errCode = 2034;
        ....
{code}

#2 fails because we no longer replace FRJoin with a regular join.
TestPigRunner#simpleMultiQueryTest3.pig
{code}
A = load '" + INPUT_FILE + "' as (a0:int, a1:int, a2:int);
A1 = load '" + INPUT_FILE_2 + "' as (a0:int, a1:int, a2:int);
B = filter A by a0 == 3;
C = filter A by a1 <=5;
D = join C by a0, B by a0, A1 by a0 using 'replicated';
store C into '" + OUTPUT_FILE;
store D into '" + OUTPUT_FILE_2
{code}

Before, when FRJoin was implemented with a regular join, the Spark plan was:
{noformat}
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-521
Store(hdfs://localhost:59787/tmp/temp1693227580/tmp480394712:org.apache.pig.impl.io.InterStorage) - scope-522
|
|---A: New For Each(false,false,false)[bag] - scope-478
    |   |
    |   Cast[int] - scope-470
    |   |
    |   |---Project[bytearray][0] - scope-469
    |   |
    |   Cast[int] - scope-473
    |   |
    |   |---Project[bytearray][1] - scope-472
    |   |
    |   Cast[int] - scope-476
    |   |
    |   |---Project[bytearray][2] - scope-475
    |
    |---A: Load(hdfs://localhost:59787/user/root/input:org.apache.pig.builtin.PigStorage) - scope-468--------

Spark node scope-524
Store(hdfs://localhost:59787/tmp/temp1693227580/tmp-2124870865:org.apache.pig.impl.io.InterStorage) - scope-525
|
|---C: Filter[bag] - scope-482
    |   |
    |   Less Than or Equal[boolean] - scope-485
    |   |
    |   |---Project[int][1] - scope-483
    |   |
    |   |---Constant(5) - scope-484
    |
    |---Load(hdfs://localhost:59787/tmp/temp1693227580/tmp480394712:org.apache.pig.impl.io.InterStorage) - scope-523--------

Spark node scope-527
C: Store(hdfs://localhost:59787/user/root/output:org.apache.pig.builtin.PigStorage) - scope-489
|
|---Load(hdfs://localhost:59787/tmp/temp1693227580/tmp-2124870865:org.apache.pig.impl.io.InterStorage) - scope-526--------

Spark node scope-533
D: Store(hdfs://localhost:59787/user/root/output2:org.apache.pig.builtin.PigStorage) - scope-520
|
|---D: FRJoin[tuple] - scope-512
    |   |
    |   Project[int][0] - scope-509
    |   |
    |   Project[int][0] - scope-510
    |   |
    |   Project[int][0] - scope-511
    |
    |---B: Filter[bag] - scope-494
    |   |   |
    |   |   Equal To[boolean] - scope-497
    |   |   |
    |   |   |---Project[int][0] - scope-495
    |   |   |
    |   |   |---Constant(3) - scope-496
    |   |
    |   |---Load(hdfs://localhost:59787/tmp/temp1693227580/tmp480394712:org.apache.pig.impl.io.InterStorage) - scope-530
    |
    |---A1: New For Each(false,false,false)[bag] - scope-508
    |   |   |
    |   |   Cast[int] - scope-500
    |   |   |
    |   |   |---Project[bytearray][0] - scope-499
    |   |   |
    |   |   Cast[int] - scope-503
    |   |   |
    |   |   |---Project[bytearray][1] - scope-502
    |   |   |
    |   |   Cast[int] - scope-506
    |   |   |
    |   |   |---Project[bytearray][2] - scope-505
    |   |
    |   |---A1: Load(hdfs://localhost:59787/user/root/input2:org.apache.pig.builtin.PigStorage) - scope-498
    |
    |---Load(hdfs://localhost:59787/tmp/temp1693227580/tmp-2124870865:org.apache.pig.impl.io.InterStorage) - scope-528--------
{noformat}

After PIG-4771, the Spark plan is:
{code}
#--------------------------------------------------
# Spark Plan
#--------------------------------------------------
Spark node scope-534
Split - scope-548
|   |
|   Store(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-538
|   |
|   |---C: Filter[bag] - scope-495
|       |   |
|       |   Less Than or Equal[boolean] - scope-498
|       |   |
|       |   |---Project[int][1] - scope-496
|       |   |
|       |   |---Constant(5) - scope-497
|   |
|   Store(hdfs://localhost:48350/tmp/temp649016960/tmp804709981:org.apache.pig.impl.io.InterStorage) - scope-546
|   |
|   |---B: Filter[bag] - scope-507
|       |   |
|       |   Equal To[boolean] - scope-510
|       |   |
|       |   |---Project[int][0] - scope-508
|       |   |
|       |   |---Constant(3) - scope-509
|
|---A: New For Each(false,false,false)[bag] - scope-491
    |   |
    |   Cast[int] - scope-483
    |   |
    |   |---Project[bytearray][0] - scope-482
    |   |
    |   Cast[int] - scope-486
    |   |
    |   |---Project[bytearray][1] - scope-485
    |   |
    |   Cast[int] - scope-489
    |   |
    |   |---Project[bytearray][2] - scope-488
    |
    |---A: Load(hdfs://localhost:48350/user/root/input:org.apache.pig.builtin.PigStorage) - scope-481--------

Spark node scope-540
C: Store(hdfs://localhost:48350/user/root/output:org.apache.pig.builtin.PigStorage) - scope-502
|
|---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-539--------

Spark node scope-542
D: Store(hdfs://localhost:48350/user/root/output2:org.apache.pig.builtin.PigStorage) - scope-533
|
|---D: FRJoin[tuple] - scope-525
    |   |
    |   Project[int][0] - scope-522
    |   |
    |   Project[int][0] - scope-523
    |   |
    |   Project[int][0] - scope-524
    |
    |---Load(hdfs://localhost:48350/tmp/temp649016960/tmp48836938:org.apache.pig.impl.io.InterStorage) - scope-541--------

Spark node scope-545
Store(hdfs://localhost:48350/tmp/temp649016960/tmp-2036144538:org.apache.pig.impl.io.InterStorage) - scope-547
|
|---A1: New For Each(false,false,false)[bag] - scope-521
    |   |
    |   Cast[int] - scope-513
    |   |
    |   |---Project[bytearray][0] - scope-512
    |   |
    |   Cast[int] - scope-516
    |   |
    |   |---Project[bytearray][1] - scope-515
    |   |
    |   Cast[int] - scope-519
    |   |
    |   |---Project[bytearray][2] - scope-518
    |
    |---A1: Load(hdfs://localhost:48350/user/root/input2:org.apache.pig.builtin.PigStorage) - scope-511--------
{code}

assertEquals(4, stats.getJobGraph().size()); ([code|https://github.com/apache/pig/blob/spark/test/org/apache/pig/test/TestPigRunner.java#L459]) fails because there are now 5 stores, not 4. Changing the expected value from 4 to 5 is not enough, though.
The test still fails at assertEquals(5, inputStats.get(0).getNumberRecords()); ([code|https://github.com/apache/pig/blob/spark/test/org/apache/pig/test/TestPigRunner.java#L498]). The number of records in the input file is calculated incorrectly in Spark mode in the multiquery case. I will file a new JIRA to track this, but for now the issue can be avoided by disabling multiquery, as I did in the PIG-4988.patch.

> Fix unit test failure after PIG-4771's patch was checked in
> -----------------------------------------------------------
>
>                 Key: PIG-4898
>                 URL: https://issues.apache.org/jira/browse/PIG-4898
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>
> The [latest jenkins|https://builds.apache.org/job/Pig-spark/#328] build shows that the following unit test cases fail:
> org.apache.pig.test.TestFRJoin.testDistinctFRJoin
> org.apache.pig.test.TestPigRunner.simpleMultiQueryTest3

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)