[GitHub] spark pull request #21654: [SPARK-24671][PySpark] DataFrame length using a d...

holdenk Fri, 14 Sep 2018 11:47:23 -0700

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21654#discussion_r217808692
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -375,6 +375,9 @@ def _truncate(self):
             return int(self.sql_ctx.getConf(
                 "spark.sql.repl.eagerEval.truncate", "20"))
     
    +    def __len__(self):
    --- End diff --
    
    Well, those are a bit harder to say. I _think_ `iter` might be reasonable 
(my main concern is folks trying to use `map(lambda x, df)`), but those aren't 
the parts of the API we're talking about right now, and this is starting to 
border on a broader design decision we should consider taking to the list. 
Given the timeline of Spark 3, this seems like a good time to have these 
discussions anyway. Maybe we can look at Dask for some inspiration on how to 
provide a more Python-friendly API while still encouraging good design on the 
part of our users.
    
    That being said, I think the potential confusion around `iter` or indexing 
into a DF shouldn't block adding other, more reasonable helpers.
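
    To make the `iter` concern concrete, here is a minimal sketch. It uses a 
hypothetical `ToyFrame` class as a stand-in for a DataFrame; it is not PySpark 
code and not what this PR implements. The `jobs_run` counter is an assumption 
added purely for illustration. The method names `count` and `toLocalIterator` 
mirror existing DataFrame methods, but the bodies here are simulated.

```python
class ToyFrame:
    """Hypothetical stand-in for a distributed DataFrame (not PySpark)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.jobs_run = 0  # illustrative counter of simulated cluster jobs

    def count(self):
        # A real DataFrame would launch a distributed job here.
        self.jobs_run += 1
        return len(self._rows)

    def toLocalIterator(self):
        # A real DataFrame would stream every row back to the driver.
        self.jobs_run += 1
        return iter(self._rows)

    def __len__(self):
        # The helper under discussion: len(df) delegates to count().
        return self.count()

    def __iter__(self):
        # The more contentious helper: enables for-loops and map() over df.
        return self.toLocalIterator()


df = ToyFrame([1, 2, 3])
n = len(df)                               # one simulated job
doubled = list(map(lambda x: x * 2, df))  # runs locally, row by row
```

    With `__iter__` defined, the builtin `map` silently processes rows on the 
driver instead of using a distributed transformation, which is the kind of 
accidental misuse mentioned above. `len(df)` carries less of that risk, since 
it only hides a single `count()` call.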


