landlord-matt commented on PR #43787:
URL: https://github.com/apache/spark/pull/43787#issuecomment-1816472850
@zhengruifeng: Interesting, because the problem I had was that Python has
the built-in types `list` and `set`, and they have different properties. If I
declare a variable as a set, it will ignore duplicates going forward, but if I
declare it as a list, it will accept them. Java/Scala differentiate between
arrays, lists and sets.
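For reference, this is the built-in behaviour I mean (plain Python, no Spark
involved):

```python
# A Python set silently drops duplicates, while a list keeps them.
values = [1, 2, 2, 3]

as_list = list(values)  # duplicates retained
as_set = set(values)    # duplicates ignored

print(as_list)          # [1, 2, 2, 3]
print(sorted(as_set))   # [1, 2, 3]
```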
I think the naming refers to the fact that the collection step behaves like a
list or a set, but it returns a standard array and behaves like a standard
array in subsequent steps. If you have memorized the available types in Spark,
it is perhaps obvious that the result will be an ArrayType, but I think the
documentation should state the specific return type clearly for beginners as
well.
How about if we use the official terminology, ArrayType? Then it cannot be
confused with other array-like types in Python.
```
def collect_list(col: "ColumnOrName") -> Column:
    """
    Aggregate function: Collects the values from a column into a list,
    maintaining duplicates, and returns this list of objects as an ArrayType.

    [...]

    Returns
    -------
    :class:`~pyspark.sql.Column`
        A new Column object representing a list of collected values as an
        ArrayType, duplicates included.
    """
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]