[GitHub] [spark] cdegroc commented on a change in pull request #35139: [SPARK-37829][SQL] DataFrame.joinWith should return null rows for missing values

GitBox Tue, 11 Jan 2022 09:10:43 -0800


cdegroc commented on a change in pull request #35139:
URL: https://github.com/apache/spark/pull/35139#discussion_r782360648




##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
##########
@@ -110,23 +110,28 @@ object ExpressionEncoder {
     }
     val newSerializer = CreateStruct(serializers)
 
+    def nullSafe(input: Expression, result: Expression): Expression = {
+      If(IsNull(input), Literal.create(null, result.dataType), result)
+    }
+
     val newDeserializerInput = GetColumnByOrdinal(0, newSerializer.dataType)
     val deserializers = encoders.zipWithIndex.map { case (enc, index) =>
       val getColExprs = enc.objDeserializer.collect { case c: GetColumnByOrdinal => c }.distinct
       assert(getColExprs.size == 1, "object deserializer should have only one " +
         s"`GetColumnByOrdinal`, but there are ${getColExprs.size}")
 
       val input = GetStructField(newDeserializerInput, index)
-      enc.objDeserializer.transformUp {
+      val newDeserializer = enc.objDeserializer.transformUp {

Review comment:
   The difference comes from the `RowEncoder` deserializer:
   - When using a `Dataset[T]`, the `ExpressionEncoder` is used and calls [`ScalaReflection.deserializerForType`](https://github.com/apache/spark/blob/edc52857ca3ac031481a910f7871ff8d5e030e18/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L56) to get a deserializer for class `T`, which automatically [wraps the expression in a null-safe expression](https://github.com/apache/spark/blob/58d3f1516ed812b692709991e551829aa0090578/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L390-L394).
   - When using a `DataFrame`, the `RowEncoder` is used and returns a [`CreateExternalRow` (not wrapped in a null-safe expression)](https://github.com/apache/spark/blob/58d3f1516ed812b692709991e551829aa0090578/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala#L254-L259).
   
   I'm not sure there's an easy way to solve this, since (as far as I know) the `RowEncoder` is supposed to guarantee that top-level `Row`s are never `null`.
   
   Actually, I think everything is already summarized in your initial PR, which patched the tuple encoder to wrap deserializers in a null-safe way: https://github.com/apache/spark/pull/13425
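
   To make the difference concrete, here is a minimal, Spark-free sketch of what the `nullSafe` wrapper in this diff changes. Everything in it (`Expr`, `Col`, `Lit`, `MakeRow`, `NullSafeSketch`) is a hypothetical stand-in invented for illustration, not Catalyst's real classes. `MakeRow` only mimics the relevant behaviour of `CreateExternalRow`, under the assumption that it builds a row object even when its input struct is `null`.

```scala
// Toy model of Catalyst-style expressions. These are NOT Spark classes;
// all names are hypothetical and exist only for this illustration.
sealed trait Expr { def eval(row: Seq[Any]): Any }

// Reads one top-level column of the joined row (one side of the join).
final case class Col(ordinal: Int) extends Expr {
  def eval(row: Seq[Any]): Any = row(ordinal)
}

final case class Lit(value: Any) extends Expr {
  def eval(row: Seq[Any]): Any = value
}

final case class IsNull(child: Expr) extends Expr {
  def eval(row: Seq[Any]): Any = child.eval(row) == null
}

final case class If(pred: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr {
  def eval(row: Seq[Any]): Any =
    if (pred.eval(row) == true) whenTrue.eval(row) else whenFalse.eval(row)
}

// Assumed stand-in for CreateExternalRow: it always produces a row,
// so a null input struct becomes a row whose fields are all null.
final case class MakeRow(child: Expr, arity: Int) extends Expr {
  def eval(row: Seq[Any]): Any = child.eval(row) match {
    case null           => Seq.fill(arity)(null)
    case fields: Seq[_] => fields
    case other          => sys.error(s"unexpected struct value: $other")
  }
}

object NullSafeSketch {
  // Same shape as the helper added in the diff: short-circuit to null
  // when the input struct itself is null.
  def nullSafe(input: Expr, result: Expr): Expr =
    If(IsNull(input), Lit(null), result)

  // Outer-join result where the right side had no match.
  val joined: Seq[Any] = Seq(Seq(1, "a"), null)

  val right: Expr = Col(1)
  val plain: Expr = MakeRow(right, 2)      // unwrapped deserializer
  val safe: Expr  = nullSafe(right, plain) // wrapped deserializer

  def main(args: Array[String]): Unit = {
    println(plain.eval(joined)) // a non-null row whose fields are all null
    println(safe.eval(joined))  // null, i.e. the whole side is missing
  }
}
```

   In this model the wrapped deserializer behaves identically to the unwrapped one whenever the input struct is present. The two only diverge for the missing side of an outer join.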



