Don Drake created SPARK-19477:
---------------------------------

             Summary: [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
                 Key: SPARK-19477
                 URL: https://issues.apache.org/jira/browse/SPARK-19477
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Don Drake


In Spark 1.6, when you created a Dataset from a DataFrame that had extra columns, the columns not in the case class were dropped from the Dataset.

For example in 1.6, the column c4 is gone:

{code}
scala> case class F(f1: String, f2: String, f3:String)
defined class F

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]

scala> ds.show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+

{code}

This behavior changed in Spark 2.0 and remains in 2.1: the extra column is retained after {{.as[F]}}.

Spark 2.1.0:

{code}
scala> case class F(f1: String, f2: String, f3:String)
defined class F

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4")
df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]

scala> val ds = df.as[F]
ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]

scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders

scala> val fEncoder = Encoders.product[F]
fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]

scala> fEncoder.schema == ds.schema
res2: Boolean = false

scala> ds.schema
res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))

scala> fEncoder.schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))


{code}
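Until this is resolved, a possible workaround (an untested sketch using only the standard Dataset API; {{ds2}}/{{ds3}} are illustrative names) is to project the DataFrame down to the case class columns before converting, or to force rows through the {{F}} encoder with an identity map:

{code}
scala> // Explicitly select only the case class fields before converting,
scala> // so the resulting Dataset schema matches Encoders.product[F]
scala> val ds2 = df.select("f1", "f2", "f3").as[F]

scala> // Alternatively, an identity map serializes each row through the F
scala> // encoder, which should drop columns not present in the case class
scala> val ds3 = df.as[F].map(identity)
{code}

Either form should restore the 1.6 behavior of {{ds.schema}} containing only {{f1}}, {{f2}}, and {{f3}}.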





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
