Github user olarayej commented on the pull request:
https://github.com/apache/spark/pull/11336#issuecomment-203672948
Thanks @sun-rui @rxin @shivaram for your inputs. To alleviate the
confusion on which columns can/cannot be collected, I propose the following
(already pushed the code):
Currently there are 15 SparkR functions that return an "orphan" Column
with no parent DataFrame:
```
rand, randn, unix_timestamp,
struct, expr, column, lag, lead, lit, cume_dist, dense_rank,
ntile, percent_rank, rank, row_number
```
The first three (i.e., rand, randn, and unix_timestamp) can be nicely
collected as single elements. For example:
```
> rand()
[1] 0.01483325
```
The remaining ones don't make sense unless there's an associated
DataFrame. Therefore, an empty vector will be returned:
```
> column("Species")
Species
<Empty column>
> collect(column("Species"))
character(0)
```
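The proposed rule can be sketched with a hypothetical helper (plain R lists and a local `data.frame` stand in for SparkR's actual Column/DataFrame internals): collecting a Column with no parent yields an empty vector, while a parented Column yields its values.

```r
# Hypothetical sketch only -- not SparkR's real implementation. A "column" here
# is a list carrying its name and (possibly NULL) parent data.
collect_column <- function(col) {
  if (is.null(col$parent)) {
    character(0)               # orphan column: nothing to collect
  } else {
    col$parent[[col$name]]     # parented column: pull values from the parent
  }
}

orphan   <- list(name = "Species", parent = NULL)   # no DataFrame attached
parented <- list(name = "Species", parent = iris)   # backed by a data.frame

collect_column(orphan)          # character(0)
head(collect_column(parented))  # first few Species values
```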
I think it makes sense: if you don't associate a Column with a DataFrame,
there's nothing to be collected. Now, for Columns that do belong to a
DataFrame, collecting columns SIGNIFICANTLY improves usability in 138
functions/operators (besides other issues in the design document). For example:
```
> irisDF$Sepal_Length * 100
[1] 510 490 470 460 500 540 460 500 440 490 540 480 480 430 580 570 540
510 570 510 …
```
versus:
```
> head(select(irisDF, irisDF$Sepal_Length * 100), 20)[, 1]
[1] 510 490 470 460 500 540 460 500 440 490 540 480 480 430 580 570 540
510 570 510
```
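The shorthand behavior can be imitated with a toy S4 class (hypothetical, not SparkR's real Column class): the object holds its values and overloads `*` so that `col * 100` returns a plain R vector directly.

```r
# Toy sketch, not SparkR's actual Column implementation.
library(methods)

setClass("SketchColumn", representation(values = "numeric"))

# Overload `*` to collect-and-compute locally, returning a base R vector.
setMethod("*", signature("SketchColumn", "numeric"), function(e1, e2) {
  e1@values * e2
})

col <- new("SketchColumn", values = iris$Sepal.Length)
head(col * 100)   # 510 490 470 460 500 540
```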
@shivaram has a very valid point: this introduces discrepancies in the
Spark APIs across multiple languages. I believe this is not necessarily bad,
as R in particular is a slightly different animal that already has specific
behavior for columns (i.e., vectors).