landlord-matt commented on PR #43787:
URL: https://github.com/apache/spark/pull/43787#issuecomment-1816472850
@zhengruifeng: Interesting, because the problem I had was that Python has
the built-in types `list` and `set`, and they have different properties. If I
declare a variable as a set, it will ignore duplicates going forward, but if I
declare it as a list, it will accept them. Java/Scala differentiate between
arrays, lists and sets.
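For reference, this is the built-in behaviour I mean (plain Python, no Spark
involved):

```python
# A Python set silently drops duplicates, while a list keeps them.
values = [1, 2, 2, 3]

as_list = list(values)  # duplicates retained
as_set = set(values)    # duplicates ignored

print(as_list)          # [1, 2, 2, 3]
print(sorted(as_set))   # [1, 2, 3]
```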
I think the naming refers to the fact that the collection step behaves like a
list or a set, but it returns a standard array and behaves like a standard
array in subsequent steps. If you have memorized the available types in Spark,
it is perhaps obvious that the result will be an ArrayType, but I think the
documentation should state the specific return type clearly for beginners as
well.
How about if we use the official terminology, ArrayType? Then it cannot be
confused with other array-like types in Python.
```
def collect_list(col: "ColumnOrName") -> Column:
    """
    Aggregate function: Collects the values from a column into a list,
    maintaining duplicates, and returns this list of objects as an ArrayType.

    [...]

    Returns
    -------
    :class:`~pyspark.sql.Column`
        A new Column object representing a list of collected values as an
        ArrayType, duplicates included.
    """
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]