cloud-fan commented on a change in pull request #35139:
URL: https://github.com/apache/spark/pull/35139#discussion_r782198958
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
##########
@@ -110,23 +110,28 @@ object ExpressionEncoder {
}
val newSerializer = CreateStruct(serializers)
+ def nullSafe(input: Expression, result: Expression): Expression = {
+ If(IsNull(input), Literal.create(null, result.dataType), result)
+ }
+
val newDeserializerInput = GetColumnByOrdinal(0, newSerializer.dataType)
val deserializers = encoders.zipWithIndex.map { case (enc, index) =>
val getColExprs = enc.objDeserializer.collect { case c: GetColumnByOrdinal => c }.distinct
assert(getColExprs.size == 1, "object deserializer should have only one " +
  s"`GetColumnByOrdinal`, but there are ${getColExprs.size}")
val input = GetStructField(newDeserializerInput, index)
- enc.objDeserializer.transformUp {
+ val newDeserializer = enc.objDeserializer.transformUp {
Review comment:
This bug only occurs for `RowEncoder`, as
`Dataset[T].joinWith(Dataset[U])` works fine:
https://github.com/apache/spark/pull/13425/files#diff-b98c99535d2b28cb47774860d500030e732c244c55b1ac05aead5d1cf1e7a602R772
Can you look into it and figure out the difference? This may help us to
understand the bug better and guide us to the proper fix.
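For context, the `nullSafe` helper added in the diff wraps a deserializer so that a null input struct short-circuits to null instead of being deserialized. A rough standalone sketch of that pattern (not Spark code; every name below is invented for illustration, loosely mirroring Catalyst's `If`/`IsNull` expressions):

```scala
// Simplified expression tree modeling the null-safe wrapping from the diff.
sealed trait Expr { def eval(row: Seq[Any]): Any }

// Reads one field out of the input row, like GetStructField on the join struct.
case class InputRef(index: Int) extends Expr {
  def eval(row: Seq[Any]): Any = row(index)
}

// A stand-in deserializer that would fail on a null input value.
case class Deserialize(child: Expr) extends Expr {
  def eval(row: Seq[Any]): Any = child.eval(row).toString.toUpperCase
}

// Mirrors `If(IsNull(input), Literal.create(null, result.dataType), result)`:
// when the underlying struct is null, return null rather than running
// the deserializer on it.
case class NullSafe(input: Expr, result: Expr) extends Expr {
  def eval(row: Seq[Any]): Any =
    if (input.eval(row) == null) null else result.eval(row)
}

object NullSafeDemo extends App {
  val field = InputRef(0)
  val deser = NullSafe(field, Deserialize(field))
  println(deser.eval(Seq("abc", 1)))  // ABC
  println(deser.eval(Seq(null, 1)))   // null, instead of a NullPointerException
}
```

This is only meant to show why the wrapping matters for `RowEncoder` in an outer join: without `NullSafe`, evaluating the deserializer on the null side throws.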
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]