inpefess commented on a change in pull request #20691: [SPARK-18161] [Python] Allow pickle to serialize >4 GB objects when possible (Python 3.4+)
URL: https://github.com/apache/spark/pull/20691#discussion_r247879812
 
 

 ##########
 File path: python/pyspark/tests/test_rdd.py
 ##########
 @@ -605,7 +605,7 @@ def test_distinct(self):
 
     def test_external_group_by_key(self):
         self.sc._conf.set("spark.python.worker.memory", "1m")
-        N = 200001
+        N = 2000001
 
 Review comment:
  Well, if the serialised object is too large to fit in memory (as here:
https://github.com/apache/spark/blob/05cf81e6de3d61ddb0af81cd179665693f23351f/python/pyspark/shuffle.py#L774),
then `result.data` will have type `shuffle.ExternalListOfList`; if the object
is small enough to fit, the type will be a plain `list`.
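  For concreteness, here is a minimal local sketch of that distinction (the app name, key layout, and partition count are made up for illustration; this mirrors the test's setup rather than quoting it):

```python
from pyspark import SparkContext

# Sketch only: illustrative names, same 1m worker memory limit as the test.
sc = SparkContext("local", "spill-demo")
sc._conf.set("spark.python.worker.memory", "1m")

N = 2000001  # large enough that a group's values no longer fit in 1m
kv = sc.parallelize([(i % 3, i) for i in range(N)], 2)
result = kv.groupByKey().first()[1]  # a ResultIterable
# shuffle.ExternalListOfList if the values spilled to disk, plain list otherwise
print(type(result.data))
```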
  When using protocol version 4 (the test fails only on Python 3.4, where that
protocol becomes available) we serialise more compactly, so more data is needed
before it no longer fits in memory.
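  You can see the size difference outside Spark with a quick check like this (exact byte counts will vary by Python version, and the data here is only illustrative):

```python
import pickle

# Protocol 4 (Python 3.4+) uses more compact opcodes, e.g. the one-byte
# MEMOIZE instead of a five-byte LONG_BINPUT per memoised container, so
# the same data pickles to noticeably fewer bytes than under protocol 2.
data = [(i, i) for i in range(200001)]
for proto in (2, 4):
    print("protocol %d: %d bytes" % (proto, len(pickle.dumps(data, protocol=proto))))
```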
   Is that enough to justify the change? @HyukjinKwon 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]