Github user hvanhovell commented on a diff in the pull request:
https://github.com/apache/spark/pull/20402#discussion_r164338267
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -62,7 +62,11 @@ import org.apache.spark.util.Utils
private[sql] object Dataset {
def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
-    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    // Eagerly bind the encoder so we verify that the encoder matches the underlying
+    // schema. The user will get an error if this is not the case.
--- End diff --
This is needed because we made `Dataset.deserializer` lazy. The idea there
is that we only need to create a `deserializer` (which is expensive) when we
materialize data. However, for Datasets we also need to make sure that we can
actually deserialize an `InternalRow` into the Dataset's type. This does not
apply to DataFrames and the `ofRows` method, because we are guaranteed that an
`InternalRow` can be deserialized into a `Row`.
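To illustrate the idea, here is a toy sketch of the pattern being discussed: a cheap schema-compatibility check performed eagerly at construction time, while the expensive deserializer is built lazily on first use. This is not Spark's actual implementation; `ToyDataset`, `Schema`, and the field names are all hypothetical.

```scala
// Toy model: validate eagerly, defer the expensive deserializer.
// All names here are hypothetical; this is not Spark's actual code.
case class Schema(fields: Seq[String])

class ToyDataset[T](schema: Schema, expectedFields: Seq[String]) {
  // Cheap eager check: fail fast if the encoder's fields cannot
  // possibly be resolved against the underlying schema.
  require(
    expectedFields.forall(schema.fields.contains),
    s"Encoder fields $expectedFields do not match schema ${schema.fields}")

  // Expensive step, deferred until data is actually materialized.
  lazy val deserializer: Map[String, Int] =
    schema.fields.zipWithIndex.toMap
}
```

In this sketch, a mismatched encoder fails immediately in the constructor (as the eager bind in the diff does), while a well-formed one pays the deserializer cost only on first access, mirroring why `ofRows` can skip the check entirely.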
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]