Github user davies commented on the pull request:
https://github.com/apache/spark/pull/11347#issuecomment-189510320
In Scala, it's clear that DataFrame is Dataset[Row]: some functions work with
DataFrame, some may not, and the compiler can check the types. But in Python,
it's confusing to me: sometimes the record is a Row object, sometimes the
record is just an arbitrary object (for example, an int). Especially when we
create a new DataFrame, for example with `range()` or `text()`, will these
return a DataFrame of Row or a DataFrame of int/string?
Before this PR, it was clear that a Python DataFrame always carried Rows with a
known schema. `df.rdd` or `df.map` returns an RDD, which can hold arbitrary
objects. Would it make sense for Dataset to replace RDD for DataFrame, rather
than to replace DataFrame itself?
For example:
df.rdd returns an RDD
df.ds returns a Dataset
df.map() returns a Dataset
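The typing proposed above could be sketched in plain Python with stand-in classes; this is a hypothetical illustration of the suggested API shape, not the real PySpark implementation (the `Row`, `RDD`, `Dataset`, and `DataFrame` classes here are minimal stand-ins):

```python
# Hypothetical sketch of the proposal above -- NOT the real PySpark API.
# It illustrates how df.rdd, df.ds, and df.map() could relate to each other.
from collections import namedtuple

Row = namedtuple("Row", ["id"])  # stand-in for pyspark.sql.Row


class RDD:
    """Untyped collection: records may be arbitrary Python objects."""
    def __init__(self, records):
        self.records = list(records)


class Dataset(RDD):
    """Typed collection: records share one element type (e.g. int, Row)."""


class DataFrame(Dataset):
    """A Dataset whose records are always Rows with a known schema."""

    @property
    def rdd(self):
        return RDD(self.records)      # drops the element-type guarantee

    @property
    def ds(self):
        return Dataset(self.records)  # keeps the element type

    def map(self, f):
        # The result of an arbitrary function is no longer guaranteed
        # to be Rows, so map() yields a Dataset, not a DataFrame.
        return Dataset(f(r) for r in self.records)


df = DataFrame(Row(i) for i in range(3))
ints = df.map(lambda r: r.id)  # a Dataset of int, no longer a DataFrame
```

Under this sketch, `df.map()` keeps "Dataset of something" as its static shape while only a DataFrame promises Rows, which mirrors the Scala distinction between Dataset[T] and Dataset[Row].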