[GitHub] spark issue #15445: [SPARK-17817][PySpark][FOLLOWUP] PySpark RDD Repartition...

viirya Thu, 13 Oct 2016 07:05:11 -0700

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/15445
  
    @davies @felixcheung I ran another benchmark as follows:
    
        import time
        import random
    
        num_partitions = 20000
        a = sc.parallelize(map(lambda x: [random.randint(0,1000) for r in xrange(20)], range(20000)))
        start = time.time()
        l = a.repartition(num_partitions).glom().map(len).collect()
        end = time.time()
        print(end - start)
    
    _to_java_object_rdd(): 424.308749914
    decreasing the batch size: 425.877130032
    
    The difference is about 1.6 seconds (under 0.4%), which is negligible.
    
    However, when I ran another benchmark with NumPy arrays, I found that the
    `_to_java_object_rdd()` approach has another problem: it fails to unpickle
    custom Python objects on the Java side.
    
    When running the following code:
    
        import time
        import numpy as np
    
        num_partitions = 20000
        a = sc.parallelize(map(lambda x: np.random.rand(20), range(20000)), 2)
        start = time.time()
        l = a.repartition(num_partitions).glom().map(len).collect()
        end = time.time()
        print(end - start)
    
    `_to_java_object_rdd()` throws the following exception:
    
        : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
            at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
            at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
            at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
            at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
            at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
            at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152)
            at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151)
            at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
            at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
            at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
            at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
    
    
    Considering this problem with unpickling Python objects when converting to
    a Java RDD, I think this PR might be the better solution.
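
For illustration, a minimal standard-library sketch of the kind of pickle stream behind the `PickleException` above. It assumes `fractions.Fraction` as a stand-in for a NumPy array (NumPy is not needed to run it), since both are pickled as "call this named constructor with these arguments". The reading of the error message, namely that the Java-side unpickler has no constructor registered for `numpy.core.multiarray._reconstruct` and so cannot apply the arguments, is an interpretation of the trace and has not been checked against Pyrolite's source.

```python
import pickle
import pickletools
from fractions import Fraction

# Assumption: Fraction stands in for a NumPy array. Both define __reduce__,
# so the pickle stream names a callable and the arguments to call it with.
payload = pickle.dumps(Fraction(1, 3), protocol=2)

# Walk the pickle stream and record each opcode and any global it references.
opcodes = []
globals_referenced = []
for opcode, arg, _pos in pickletools.genops(payload):
    opcodes.append(opcode.name)
    if opcode.name == "GLOBAL":
        globals_referenced.append(arg)

# GLOBAL names the constructor; REDUCE means "call it with the args tuple".
print(globals_referenced)
print("REDUCE" in opcodes)

# Python can resolve the named global, so the round trip succeeds here.
# An unpickler that does not know the named global has nothing to call.
restored = pickle.loads(payload)
print(restored == Fraction(1, 3))
```

Pickling a NumPy array produces the same GLOBAL-then-REDUCE shape, with `numpy.core.multiarray._reconstruct` as the named global, which matches the name reported in the exception message.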



