Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/21654#discussion_r216121454
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -375,6 +375,9 @@ def _truncate(self):
return int(self.sql_ctx.getConf(
"spark.sql.repl.eagerEval.truncate", "20"))
+ def __len__(self):
--- End diff --
Can we just not define this? RDD doesn't have this one either. IMHO, allowing such things bit by bit wouldn't be ideal. For example, `column.py`
ended up with a weird limitation:
```python
>>> iter(spark.range(1).id)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
>>> isinstance(spark.range(1).id, collections.Iterable)
True
```
It makes sense in general, though.
This `__iter__` can't be removed, BTW, because we implement `__getitem__` and
`__getattr__` to access columns in DataFrames, IIRC.
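To illustrate the quirk (a standalone sketch with made-up class names, not Spark code): defining `__getitem__` alone makes an object iterable through the legacy sequence protocol, so `__iter__` has to be defined just to raise, which in turn makes the `Iterable` ABC check report it as iterable:

```python
import collections.abc

class WithGetItem:
    """Defines only __getitem__; iter() still works via the legacy protocol."""
    def __getitem__(self, i):
        return i

iter(WithGetItem())  # no error: Python falls back to the sequence protocol
print(isinstance(WithGetItem(), collections.abc.Iterable))  # False: no __iter__

class Blocked:
    """Mimics Column: __iter__ raises just to block the __getitem__ fallback."""
    def __getitem__(self, i):
        return i
    def __iter__(self):
        raise TypeError("Blocked is not iterable")

print(isinstance(Blocked(), collections.abc.Iterable))  # True, oddly
# iter(Blocked()) now raises TypeError, the same way Column does
```

This is exactly the inconsistency shown in the REPL session above: the object claims to be `Iterable` yet refuses to iterate.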
`__repr__` was added because it's commonly used and had a strong use case
for notebooks, etc. However, I wouldn't add `len()` for now. Think about
`if len(df): ...` and how it would trigger an eager evaluation ..
---