Github user davies commented on the pull request:
https://github.com/apache/spark/pull/11347#issuecomment-189510320
In Scala, it's clear that DataFrame is Dataset[Row]: some functions work with
DataFrame, some may not, and the compiler can check the types. But in Python,
it's confusing to me: sometimes the record is a Row object, sometimes the
record is just an arbitrary object (for example, an int). Especially when we
create a new DataFrame, for example with `range()` or `text()`, will these
return a DataFrame of Row or a DataFrame of int/string?
Before this PR, it was clear that a Python DataFrame always carried Rows with a
known schema. `df.rdd` or `df.map` returns an RDD, which can hold arbitrary
objects. Would it make sense for Dataset to replace RDD for DataFrame, rather
than to replace DataFrame itself?
For example:
df.rdd returns an RDD
df.ds returns a Dataset
df.map() returns a Dataset
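The typing proposed above could be sketched in plain Python with stand-in classes; this is a hypothetical illustration of the suggested API shape, not the real PySpark implementation (the `Row`, `RDD`, `Dataset`, and `DataFrame` classes here are minimal stand-ins):

```python
# Hypothetical sketch of the proposal above -- NOT the real PySpark API.
# It illustrates how df.rdd, df.ds, and df.map() could relate to each other.
from collections import namedtuple

Row = namedtuple("Row", ["id"])  # stand-in for pyspark.sql.Row


class RDD:
    """Untyped collection: records may be arbitrary Python objects."""
    def __init__(self, records):
        self.records = list(records)


class Dataset(RDD):
    """Typed collection: records share one element type (e.g. int, Row)."""


class DataFrame(Dataset):
    """A Dataset whose records are always Rows with a known schema."""

    @property
    def rdd(self):
        return RDD(self.records)      # drops the element-type guarantee

    @property
    def ds(self):
        return Dataset(self.records)  # keeps the element type

    def map(self, f):
        # The result of an arbitrary function is no longer guaranteed
        # to be Rows, so map() yields a Dataset, not a DataFrame.
        return Dataset(f(r) for r in self.records)


df = DataFrame(Row(i) for i in range(3))
ints = df.map(lambda r: r.id)  # a Dataset of int, no longer a DataFrame
```

Under this sketch, `df.map()` keeps "Dataset of something" as its static shape while only a DataFrame promises Rows, which mirrors the Scala distinction between Dataset[T] and Dataset[Row].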