Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20402#discussion_r164338267
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -62,7 +62,11 @@ import org.apache.spark.util.Utils
     
     private[sql] object Dataset {
       def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
    -    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
    +    val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
    +    // Eagerly bind the encoder so we verify that the encoder matches the underlying
    +    // schema. The user will get an error if this is not the case.
    --- End diff ---
    
    This is needed because we made `Dataset.deserializer` lazy. The idea there is that we only need to create a `deserializer` (which is expensive) when we materialize data. However, for Datasets we also need to make sure that we can actually deserialize an `InternalRow` into the Dataset's type. This does not apply to DataFrames and the `ofRows` method because we are guaranteed that an `InternalRow` can be deserialized into a `Row`.
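
    The trade-off can be sketched in plain Scala (hypothetical `TypedDataset` and `Row` classes for illustration only, not Spark's actual API): the `lazy val` defers the expensive deserializer build until data is materialized, but it also defers the encoder/schema check, so touching it once at construction time makes a mismatch fail fast.

    ```scala
    // Sketch only: stand-ins for Spark's InternalRow/Dataset, not the real classes.
    final case class Row(values: Map[String, String])

    class TypedDataset(schemaFields: Set[String], encoderFields: Seq[String]) {
      // Lazy because building the deserializer is expensive; laziness alone
      // would also delay the schema-compatibility check until first use.
      lazy val deserializer: Row => Seq[String] = {
        val missing = encoderFields.filterNot(schemaFields.contains)
        require(missing.isEmpty,
          s"Encoder fields $missing not found in schema $schemaFields")
        row => encoderFields.map(row.values)
      }
      // Eager bind: force the lazy val now so a mismatch surfaces at
      // construction time instead of when the data is materialized.
      deserializer
    }
    ```

    With this eager touch, `new TypedDataset(Set("id"), Seq("id", "age"))` throws immediately, rather than succeeding and failing much later on the first action that materializes rows.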

