[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

mateiz Wed, 23 Jul 2014 10:51:14 -0700

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1460#discussion_r15303224
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1207,20 +1225,49 @@ def partitionBy(self, numPartitions, partitionFunc=portable_hash):
             if numPartitions is None:
                 numPartitions = self._defaultReducePartitions()
     
    -        # Transferring O(n) objects to Java is too expensive.  Instead, we'll
    -        # form the hash buckets in Python, transferring O(numPartitions) objects
    -        # to Java.  Each object is a (splitNumber, [objects]) pair.
    +        # Transferring O(n) objects to Java is too expensive.
    +        # Instead, we'll form the hash buckets in Python,
    +        # transferring O(numPartitions) objects to Java.
    +        # Each object is a (splitNumber, [objects]) pair.
    +        # In order to void too huge objects, the objects are
    +        # grouped into chunks.
             outputSerializer = self.ctx._unbatched_serializer
     
    +        limit = (_parse_memory(self.ctx._conf.get(
    +                    "spark.python.worker.memory", "512m") / 2)
    --- End diff --
    
    Could you hit the same problem here that you have in ExternalMerger, where the limit needs to be increased if Python doesn't free memory right away?
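
    To make the concern concrete, here is a minimal sketch, not the code from the PR. It forms hash buckets in Python and flushes them as (splitNumber, [objects]) chunks once memory use passes a limit. The names `chunked_hash_buckets`, `get_used_memory` and `check_every`, and the 1.05 headroom factor, are assumptions made for illustration. If the interpreter does not return memory right after a flush, usage stays above the limit and every later check would flush again; raising the limit after a flush is one way around that.

```python
from collections import defaultdict


def chunked_hash_buckets(records, num_partitions, limit_mb,
                         get_used_memory, check_every=1000):
    """Yield (split, [objects]) chunks, flushing when memory passes limit_mb.

    get_used_memory is a hypothetical callable returning current usage in MB.
    """
    buckets = defaultdict(list)
    for count, (key, value) in enumerate(records, 1):
        buckets[hash(key) % num_partitions].append((key, value))
        # Sampling memory on every record would be slow, so check periodically.
        if count % check_every == 0 and get_used_memory() > limit_mb:
            for split in sorted(buckets):
                yield split, buckets[split]
            buckets.clear()
            # Assumption: usage may still be high right after the flush.
            # Raise the limit so the next check does not flush tiny chunks.
            limit_mb = max(limit_mb, get_used_memory() * 1.05)
    # Emit whatever is left once the input is exhausted.
    for split in sorted(buckets):
        yield split, buckets[split]
```

    Without the `max(...)` line, a process whose reported usage never drops would flush at every check, producing many small chunks instead of O(numPartitions) objects.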



