AngersZhuuuu commented on a change in pull request #29421:
URL: https://github.com/apache/spark/pull/29421#discussion_r511688786



##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
##########
@@ -111,15 +111,14 @@ trait BaseScriptTransformationExec extends UnaryExecNode {
               .zip(outputFieldWriters)
               .map { case (data, writer) => writer(data) })
       } else {
-        // In schema less mode, hive default serde will choose first two 
output column as output
-        // if output column size less then 2, it will throw 
ArrayIndexOutOfBoundsException.
-        // Here we change spark's behavior same as hive's default serde.
-        // But in hive, TRANSFORM with schema less behavior like origin spark, 
we will fix this
-        // to keep spark and hive behavior same in SPARK-32388
+        // In schema less mode, hive will choose first two output column as 
output.
+        // If output column size less then 2, it will return NULL for columns 
with missing values.
+        // Here we split row string and choose first 2 values, if values's 
size less then 2,
+        // we pad NULL value until 2 to make behavior same with hive.
         val kvWriter = 
CatalystTypeConverters.createToCatalystConverter(StringType)
         prevLine: String =>
           new GenericInternalRow(
-            prevLine.split(outputRowFormat).slice(0, 2)
+            prevLine.split(outputRowFormat).slice(0, 2).padTo(2, null)

Review comment:
       > @AngersZhuuuu, can you add a configuration to control the legacy 
behaviour, and add a note to the migration guide since it's arguable to say 
this is a bug? With that, I think I can leave my sign-off and merge.
   
   This suggestion looks good to me, but since we have fix a similar issue.  
https://github.com/apache/spark/commit/c75a82794fc6a0f35697f8e1258562d43e860f68#diff-1bdf8d74f6b4f84d5acb1d1490503cde337fe64fe5d21339d7fedb06bd7cabc0
   
   How about merge this and  I will follow a pr to add a configuration to 
control the legacy behavior.  Change these two point together seems to be 
better?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to