[GitHub] spark pull request: [SPARK-10731] [PYSPARK] [SQL] use 1 partition ...

rxin Mon, 21 Sep 2015 23:05:49 -0700

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8852#discussion_r40055049
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -300,7 +300,7 @@ def take(self, num):
             >>> df.take(2)
             [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
             """
    -        return self.limit(num).collect()
    +        return self.coalesce(1).limit(num).collect()
    --- End diff --
    
    Is it possible to just call Scala's DataFrame.take and return its result? 
You lose the socket-based transfer, which only matters when the number of rows 
is enormous, but that doesn't seem like a big deal for take.
    
    Note that this makes take behave differently in Python than in Scala. In 
Scala, you still get parallelism, whereas in Python the data is coalesced into 
a single partition, so the degree of parallelism is now just 1.
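
    To make the parallelism point concrete, here is a toy model in plain 
Python. It is not Spark's implementation: it assumes, for illustration only, 
that limit(num) truncates each partition in its own task before a final limit 
is applied, while coalesce(1) leaves a single task to scan everything. The 
names take_parallel and take_coalesced are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain, islice


def take_parallel(partitions, num):
    """Toy model of limit(num).collect(): one task per partition, then a final limit."""
    if not partitions:
        return []
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # Each partition is truncated to at most num rows by its own task.
        local = list(pool.map(lambda part: list(islice(part, num)), partitions))
    # Merge the per-partition results in partition order and truncate again.
    return list(islice(chain.from_iterable(local), num))


def take_coalesced(partitions, num):
    """Toy model of coalesce(1).limit(num).collect(): a single task scans all rows."""
    merged = chain.from_iterable(partitions)  # one logical partition
    return list(islice(merged, num))


# Hypothetical data, split across three partitions.
partitions = [
    [("Alice", 2)],
    [("Bob", 5), ("Carol", 7)],
    [("Dave", 9)],
]

# Both variants return the same rows; only the number of tasks differs.
assert take_parallel(partitions, 2) == take_coalesced(partitions, 2)
```

    In this model the result is identical either way; what changes is that 
the coalesced variant does all of its work in one task.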




