[jira] [Commented] (SPARK-19214) Inconsistencies between DataFrame and Dataset APIs

Alexander Alexandrov (JIRA) Mon, 23 Jan 2017 03:11:46 -0800

    [ https://issues.apache.org/jira/browse/SPARK-19214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834278#comment-15834278 ]


Alexander Alexandrov commented on SPARK-19214:
----------------------------------------------

I just realized that the discussion so far has focused on Problem (1). 
To be clear, my main concern is Problem (2), because it is more confusing 
to programmers.

> Inconsistencies between DataFrame and Dataset APIs
> --------------------------------------------------
>
>                 Key: SPARK-19214
>                 URL: https://issues.apache.org/jira/browse/SPARK-19214
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>            Reporter: Alexander Alexandrov
>            Priority: Trivial
>
> I am not sure whether this has already been reported, but there are some 
> confusing and annoying inconsistencies when writing the same expression with 
> the Dataset and DataFrame APIs.
> Consider the following minimal example executed in a Spark Shell:
> {code}
> case class Point(x: Int, y: Int, z: Int)
> val ps = spark.createDataset(for {
>   x <- 1 to 10; 
>   y <- 1 to 10; 
>   z <- 1 to 10
> } yield Point(x, y, z))
> // Problem 1:
> // count produces different fields in the Dataset / DataFrame variants
> // count() on grouped DataFrame: field name is `count`
> ps.groupBy($"x").count().printSchema
> // root
> //  |-- x: integer (nullable = false)
> //  |-- count: long (nullable = false)
> // count() on grouped Dataset: field name is `count(1)`
> ps.groupByKey(_.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // Problem 2:
> // groupByKey produces different `key` field name depending
> // on the result type
> // this is especially confusing in the first case below (simple key types)
> // where the key field is actually named `value`
> // simple key types
> ps.groupByKey(p => p.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // complex key types
> ps.groupByKey(p => (p.x, p.y)).count().printSchema
> // root
> //  |-- key: struct (nullable = false)
> //  |    |-- _1: integer (nullable = true)
> //  |    |-- _2: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> {code}
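
Until the naming is made consistent, one possible workaround is to rename the 
result columns explicitly after the aggregation. The following is only a 
sketch and is not something proposed in the issue itself: the object name 
ConsistentGroupedSchemas, the local master setting, and the target column 
names "x", "key" and "count" are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession

case class Point(x: Int, y: Int, z: Int)

// Hypothetical driver object, used only for this illustration.
object ConsistentGroupedSchemas {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in the Spark Shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-19214-sketch")
      .getOrCreate()
    import spark.implicits._

    val ps = spark.createDataset(for {
      x <- 1 to 10
      y <- 1 to 10
      z <- 1 to 10
    } yield Point(x, y, z))

    // Simple key type: rename the (key, count) columns positionally so the
    // result uses the same names as the DataFrame variant.
    val bySimpleKey = ps.groupByKey(p => p.x).count().toDF("x", "count")
    bySimpleKey.printSchema()

    // Complex key type: the key column stays a struct, only the top-level
    // column names are replaced.
    val byComplexKey = ps.groupByKey(p => (p.x, p.y)).count().toDF("key", "count")
    byComplexKey.printSchema()

    spark.stop()
  }
}
```

Note that toDF returns an untyped DataFrame, so this trades the typed 
Dataset[(K, Long)] result for predictable column names.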



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
