EnricoMi commented on a change in pull request #26969: [SPARK-30319][SQL] Add a stricter version of `as[T]`
URL: https://github.com/apache/spark/pull/26969#discussion_r367849005
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -495,6 +495,25 @@ class Dataset[T] private[sql](
     select(newCols : _*)
   }
 
+  /**
+   * Returns a new Dataset where each record has been mapped on to the specified type.
+   * This only supports `U` being a class. Fields for the class will be mapped to columns of the
+   * same name (case sensitivity is determined by `spark.sql.caseSensitive`).
+   *
+   * If the schema of the Dataset does not match the desired `U` type, you can use `select`
+   * along with `alias` or `as` to rearrange or rename as required.
+   *
+   * This method eagerly projects away any columns that are not present in the specified class.
+   * It further guarantees the order of columns as well as data types to match `U`.
+   *
+   * @group basic
+   * @since 3.0.0
+   */
+  def toDS[U : Encoder]: Dataset[U] = {
+    val columns = implicitly[Encoder[U]].schema.fields.map(f => col(f.name).cast(f.dataType))
 
 Review comment:
   Do you agree that the laziness of `as[T]` is an inconsistency in the API, given that all other transformations' schemas instantly reflect the result? With `as[T]` you need a `map(identity)` for the result to really become a `Dataset[T]`.
   
   Which properties of `toDS[T]` would you require?
   1. case classes, tuples and inner fields are supported
   2. arbitrary implementations of `Encoder` are supported
   3. the data (not just the schema) of the resulting `Dataset[T]` are identical to `as[T].map(identity)`, which is guaranteed when they follow the schema `encoder.schema`
   4. uses a projection where possible (efficient) and falls back to `map(identity)` where needed (expensive)
   
   The idea is to turn the dataset into the same shape it would have if the encoder had produced it (`map(identity)`). In the case of `ExpressionEncoder`, the encoder can tell you which columns it serialises / deserialises. If the schema of those columns does not change (e.g. unchanged inner fields), then a simple projection suffices: select those fields in the right order. The data then comply with `encoder.schema`.
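
   A minimal sketch of that projection path, under the assumption that the encoder's schema is available up front (the helper name `strictAs` is purely illustrative, and the `map(identity)` fallback is omitted):

   ```scala
   import org.apache.spark.sql.{Dataset, Encoder}
   import org.apache.spark.sql.functions.col

   // Illustrative helper, not the PR's final code: project (and cast) the input
   // to exactly the columns of the target encoder's schema, in encoder order,
   // so the resulting Dataset[U] already complies with `encoder.schema`.
   def strictAs[U: Encoder](ds: Dataset[_]): Dataset[U] = {
     val encoder = implicitly[Encoder[U]]
     val columns = encoder.schema.fields.map(f => col(f.name).cast(f.dataType))
     ds.select(columns: _*).as[U]
   }
   ```

   When the top-level names and types already line up, the `select` alone reorders and narrows the schema eagerly; nested fields or custom encoders would still need the `map(identity)` fallback.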
