Github user olarayej commented on the pull request:

    https://github.com/apache/spark/pull/11336#issuecomment-203672948
  
    Thanks @sun-rui @rxin @shivaram for your input. To reduce the 
confusion about which columns can and cannot be collected, I propose the 
following (code already pushed):
    
    Currently there are 15 SparkR functions that return an ‘orphan’ Column 
with no parent DataFrame:
    ```
    rand, randn, unix_timestamp,
    struct, expr, column, lag, lead, lit, cume_dist, dense_rank,
    ntile, percent_rank, rank, row_number
    ```
    The first three (i.e., rand, randn, and unix_timestamp) can be nicely 
collected as single elements. For example:
    ```
    > rand()
    [1] 0.01483325
    ```
    The remaining ones don’t make sense unless there’s an associated 
DataFrame. Therefore, an empty vector will be returned:
    ```
    > column("Species")
    Species
    <Empty column>
    
    > collect(column("Species"))
    character(0)
    ```
    
    I think this makes sense: if you don’t associate a Column with a DataFrame, 
there’s nothing to be collected. For Columns that do belong to a 
DataFrame, collecting columns SIGNIFICANTLY improves usability in 138 
functions/operators (besides other issues in the design document). For example:
    
    > irisDF$Sepal_Length * 100
     [1] 510 490 470 460 500 540 460 500 440 490 540 480 480 430 580 570 540 
510 570 510…
    
    versus:
    
    > head(select(irisDF, irisDF$Sepal_Length * 100), 20)[, 1]
     [1] 510 490 470 460 500 540 460 500 440 490 540 480 480 430 580 570 540 
510 570 510
    
    @shivaram has a very valid point: this introduces discrepancies in the 
Spark APIs across multiple languages. I believe this is not necessarily bad, 
as R in particular is a slightly different animal that already has a specific 
behavior for columns (i.e., vectors).
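    
    To make the comparison with base R concrete, here is a minimal sketch 
using the local `iris` data frame. It is an illustration only, not code from 
the PR, and it assumes nothing beyond base R. Note that the base R column is 
named `Sepal.Length`, whereas the SparkR DataFrame above uses `Sepal_Length`. 
Arithmetic on a data frame column yields a plain vector right away, which is 
the behavior the proposal mirrors:
    ```r
    # Base R only: `iris` ships with the datasets package, no Spark needed.
    scaled <- iris$Sepal.Length * 100   # numeric vector, one value per row

    # First 20 values; these should match the SparkR output shown above,
    # with no select()/head()/indexing round trip.
    first20 <- head(scaled, 20)
    print(first20)

    # The result is an ordinary atomic vector, not a data frame or Column.
    stopifnot(is.numeric(scaled), is.vector(scaled),
              length(scaled) == nrow(iris))
    ```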

