JeelRajodiya commented on PR #21331:
URL: https://github.com/apache/datafusion/pull/21331#issuecomment-4275897689

   Hey @andygrove, I realized that I shouldn't be using the `enable_ansi_mode` 
flag inside the encode function. Spark's definition of `encode` does not tie 
its behavior to ANSI mode. 
   
   Moreover, we should target Spark 3.5, which is more permissive: it doesn't 
return an error when the input contains characters that can't be mapped to the 
target charset, it simply replaces them with `?`. I've added a TODO in the doc 
comment pointing at the two real Spark 4.1 configs so a follow-up PR can wire 
them up properly.
   
   **Below are the references to the Spark definitions:**
   Spark 3.5's 
[`Encode.scala`](https://github.com/apache/spark/blob/2a56312aeb1665b72c608e14926f5d69fd3a17bc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2698-L2741):
   ``` scala
   protected override def nullSafeEval(input1: Any, input2: Any): Any = {
     input1.asInstanceOf[UTF8String].toString.getBytes(toCharset)
   }
   ```
   This just calls Java's `String.getBytes`, which replaces unmappable 
characters with the charset's default replacement byte (`?`). No 
`legacyErrorAction`, no config, no exception.
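
For illustration, here is a minimal plain-Java sketch of that lenient behavior. It is not Spark or DataFusion code: the class name `LenientEncodeDemo` and the helper `encodeLenient` are made up for this example, and it uses the `getBytes(Charset)` overload, whose Javadoc documents the replacement behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative and not taken from Spark.
class LenientEncodeDemo {
    // Encodes with String.getBytes(Charset), which always substitutes the
    // charset's default replacement bytes for unmappable characters.
    static byte[] encodeLenient(String input, String charsetName) {
        return input.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) has no US-ASCII mapping.
        byte[] bytes = encodeLenient("caf\u00e9", "US-ASCII");
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints "caf?"
    }
}
```

No exception is raised for the unmappable character, which is the behavior the PR should match when targeting Spark 3.5.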
   
   Spark 4.1's 
[`Encode.scala`](https://github.com/apache/spark/blob/acfae3372874631728243ba13728f6abbf7ee07b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L3170-L3228) 
added two new configs for the strict behavior:
   ``` scala
   case class Encode(str, charset, legacyCharsets: Boolean, legacyErrorAction: Boolean)
     def this(value, charset) =
       this(value, charset, SQLConf.get.legacyJavaCharsets, SQLConf.get.legacyCodingErrorAction)
   ```
   > Setting `legacyErrorAction=true` restores the Spark 3.5 `?` behavior.
   
   The `spark.sql.legacy.javaCharsets` and `spark.sql.legacy.codingErrorAction` 
configs are supported in Spark 4.1 and can be left for a future PR. The current 
PR targets Spark 3.5; I've mentioned this in the doc comment as well. 
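
To make the difference concrete for a follow-up, here is a rough JDK-level sketch of strict versus legacy handling using `CharsetEncoder` and `CodingErrorAction`. This is an assumption about how such a flag could map onto the JDK, not Spark's actual implementation; the class `EncodeErrorActionDemo` and the `encode` helper are hypothetical, and the `legacyErrorAction` parameter only borrows its name from the snippet above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from Spark or DataFusion.
class EncodeErrorActionDemo {
    // legacyErrorAction = true  -> replace bad input (Spark 3.5-style `?`)
    // legacyErrorAction = false -> report bad input by throwing
    static byte[] encode(String input, String charsetName, boolean legacyErrorAction)
            throws CharacterCodingException {
        CodingErrorAction action =
                legacyErrorAction ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(input));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        try {
            byte[] lenient = encode("caf\u00e9", "US-ASCII", true);
            System.out.println(new String(lenient, StandardCharsets.US_ASCII)); // prints "caf?"
            encode("caf\u00e9", "US-ASCII", false); // throws: U+00E9 is unmappable
        } catch (CharacterCodingException e) {
            System.out.println("strict mode rejected the input");
        }
    }
}
```

A Rust port would need the same two branches: substitute `?` when the legacy flag is set, and return an error otherwise.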
   
   Let me know if we need to iterate on this further. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


