liyunzhang_intel created PIG-4886:
-------------------------------------

             Summary: Add PigSplit#getLocationInfo to fix the NPE found in log 
in spark mode
                 Key: PIG-4886
                 URL: https://issues.apache.org/jira/browse/PIG-4886
             Project: Pig
          Issue Type: Sub-task
            Reporter: liyunzhang_intel
            Assignee: liyunzhang_intel


Use branch code(119f313) to test following pig script in spark mode:
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name);
store D into './testFRJoin.out';
{code}

cat bin/SkewedJoinInput1.txt 
{noformat}
100     apple1  aaa
200     orange1 bbb
300     strawberry      ccc
{noformat}

cat bin/SkewedJoinInput2.txt 
{noformat}
100     apple1
100     apple2
100     apple2
200     orange1
200     orange2
300     strawberry
400     pear
{noformat}

following exception found in log:
{noformat}
java.lang.NullPointerException
        at 
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
        at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
        at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at 
org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
        at 
org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
        at 
org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
        at 
org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
        at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
        at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
        at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
        at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
{noformat}





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to