[GitHub] [spark] peter-toth commented on a change in pull request #31955: [SPARK-34829][SQL] Fix typed ScalaUDF result conversion

GitBox Thu, 25 Mar 2021 06:43:16 -0700


peter-toth commented on a change in pull request #31955:
URL: https://github.com/apache/spark/pull/31955#discussion_r601496353




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
##########
@@ -124,10 +124,10 @@ case class ScalaUDF(
     val toRow = enc.createSerializer().asInstanceOf[Any => Any]
     if (enc.isSerializedAsStructForTopLevel) {
       value: Any =>
-        if (value == null) null else toRow(value).asInstanceOf[InternalRow]
+        if (value == null) null else 
toRow(value).asInstanceOf[InternalRow].copy()
     } else {
       value: Any =>
-        if (value == null) null else 
toRow(value).asInstanceOf[InternalRow].get(0, dataType)
+        if (value == null) null else 
toRow(value).asInstanceOf[InternalRow].copy().get(0, dataType)

Review comment:
       > I think the current fix can cause performance regression in case of 
ScalaUDF + non higher order functions.
   Well, yes, in case of *typed* `ScalaUDF`s my change can cause regression. 
But your suggested change can cause regression in case of other `UDF`s + 
`higher order functions` which is admittedly rarer than typed `ScalaUDF` + `non 
higher order functions`.
   So I'm not against fixing this in all affected higher order functions, but I 
wouldn't expect `ScalaUDF.eval()` to return with something that needs to be 
copied first to make sure another invocation doesn't change it.

##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala
##########
@@ -124,10 +124,10 @@ case class ScalaUDF(
     val toRow = enc.createSerializer().asInstanceOf[Any => Any]
     if (enc.isSerializedAsStructForTopLevel) {
       value: Any =>
-        if (value == null) null else toRow(value).asInstanceOf[InternalRow]
+        if (value == null) null else 
toRow(value).asInstanceOf[InternalRow].copy()
     } else {
       value: Any =>
-        if (value == null) null else 
toRow(value).asInstanceOf[InternalRow].get(0, dataType)
+        if (value == null) null else 
toRow(value).asInstanceOf[InternalRow].copy().get(0, dataType)

Review comment:
       > I think the current fix can cause performance regression in case of 
ScalaUDF + non higher order functions.
   
   Well, yes, in case of *typed* `ScalaUDF`s my change can cause regression. 
But your suggested change can cause regression in case of other `UDF`s + 
`higher order functions` which is admittedly rarer than typed `ScalaUDF` + `non 
higher order functions`.
   So I'm not against fixing this in all affected higher order functions, but I 
wouldn't expect `ScalaUDF.eval()` to return with something that needs to be 
copied first to make sure another invocation doesn't change it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] peter-toth commented on a change in pull request #31955: [SPARK-34829][SQL] Fix typed ScalaUDF result conversion

Reply via email to