[ 
https://issues.apache.org/jira/browse/SPARK-47307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824115#comment-17824115
 ] 

Willi Raschkowski commented on SPARK-47307:
-------------------------------------------

The behavior change is as follows:
 * Spark 3.2, 
[here|https://github.com/apache/spark/blob/e428fe902bb1f12cea973de7fe4b885ae69fd6ca/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2206],
 was using Apache's encoder like this: 
{{CommonsBase64.encodeBase64(bytes.asInstanceOf[Array[Byte]])}}.
 * That {{encodeBase64}} call does _not_ chunk [its 
output|https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html#encodeBase64(byte%5B%5D,boolean,boolean,int)].
 * Incorrectly assuming that Apache's encoder would follow the RFC 2045 / MIME spec, 
Spark 3.3 started using [Java's MIME 
encoder|https://github.com/apache/spark/blob/f74867bddfbcdd4d08076db36851e88b15e66556/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2431].
 The MIME encoder [follows the RFC2045 spec and _does 
chunk_|https://datatracker.ietf.org/doc/html/rfc2045#section-6.8:~:text=76%0A%20%20%20%20%20%20%20%20%20%20characters%20long.].
 * That chunking is what introduced those {{\r\n}} separators.
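
To make the difference concrete, here is a minimal Python sketch that imitates the two behaviors. It is not Spark's code: the helper names ({{encode_unchunked}}, {{encode_mime_chunked}}, {{decodes_strictly}}) are made up for illustration, and the chunked variant only emulates the RFC 2045 line rule (at most 76 characters per line, lines joined by {{\r\n}}, no trailing separator) on top of Python's standard {{base64}} module.

```python
import base64
import binascii


def encode_unchunked(data: bytes) -> str:
    """Plain base64 on a single line, like the Spark 3.2 output in the report."""
    return base64.b64encode(data).decode("ascii")


def encode_mime_chunked(data: bytes, line_length: int = 76) -> str:
    """Emulate RFC 2045 chunking: lines of at most 76 chars joined by CRLF."""
    encoded = encode_unchunked(data)
    lines = [encoded[i:i + line_length] for i in range(0, len(encoded), line_length)]
    return "\r\n".join(lines)


def decodes_strictly(text: str) -> bool:
    """True if a decoder that rejects non-alphabet characters accepts the text."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except binascii.Error:
        return False


payload = b"a" * 58  # same input as in the issue description
unchunked = encode_unchunked(payload)
chunked = encode_mime_chunked(payload)

print(repr(unchunked))
print(repr(chunked))
print(decodes_strictly(unchunked), decodes_strictly(chunked))
```

58 input bytes encode to 80 base64 characters, so the chunked form breaks once after character 76, which matches the position of the {{\r\n}} in the Spark 3.3 output quoted below. Removing the separators before decoding (for example {{chunked.replace("\r\n", "")}}) recovers the original bytes, which may serve as a stopgap for strict consumers.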
 

> Spark 3.3 produces invalid base64
> ---------------------------------
>
>                 Key: SPARK-47307
>                 URL: https://issues.apache.org/jira/browse/SPARK-47307
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Willi Raschkowski
>            Priority: Blocker
>              Labels: correctness
>
> SPARK-37820 was introduced in Spark 3.3 and breaks the behavior of {{base64}} 
> (which is fine in itself but shouldn't happen between minor versions).
> {code:title=Spark 3.2}
> >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0]
> 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYQ=='
> {code}
> Note the different output in Spark 3.3 (the addition of {{\r\n}} newlines).
> {code:title=Spark 3.3}
> >>> spark.sql(f"""SELECT base64('{'a' * 58}') AS base64""").collect()[0][0]
> 'YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFh\r\nYQ=='
> {code}
> The former decodes fine with the {{base64}} on my machine but the latter does 
> not:
> {code}
> $ pbpaste | base64 --decode
> aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa%
> $ pbpaste | base64 --decode
> base64: stdin: (null): error decoding base64 input stream
> {code}


