bersprockets opened a new pull request, #47982:
URL: https://github.com/apache/spark/pull/47982
### What changes were proposed in this pull request?
In `Dataset#toJSON`, use the schema from `exprEnc` rather than the schema from the logical plan. The encoder's schema reflects any changes (e.g., decimal precision, column ordering) that `exprEnc` makes to the input rows.
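A self-contained way to see the two schemas disagree, using only the public API (my own illustration, not part of the patch; the `App` wrapper and session setup are just scaffolding):
```
import org.apache.spark.sql.{Encoders, SparkSession}

case class Data(a: BigDecimal)

object SchemaMismatch extends App {
  val spark = SparkSession.builder().master("local[1]").getOrCreate()

  val ds = spark.sql("select 123.456bd as a").as(Encoders.product[Data])

  // Plan schema: what toJSON previously handed to JacksonGenerator.
  println(ds.schema.simpleString)                     // struct<a:decimal(6,3)>
  // Encoder schema: the layout of the rows exprEnc actually produces.
  println(Encoders.product[Data].schema.simpleString) // struct<a:decimal(38,18)>

  spark.stop()
}
```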
### Why are the changes needed?
`Dataset#toJSON` currently uses the schema from the logical plan, but that schema does not necessarily describe the rows passed to `JacksonGenerator`: the function passed to `mapPartitions` uses `exprEnc` to serialize the input, which can change the precision of decimals or rearrange columns.
Here's an example that tricks `UnsafeRow#getDecimal` (called from `JacksonGenerator`) into mistakenly assuming the decimal is stored as a long:
```
scala> case class Data(a: BigDecimal)
class Data
scala> sql("select 123.456bd as a").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res0: Array[String] = Array({"a":68719476.745})
scala>
```
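The garbage value is consistent with `UnsafeRow#getDecimal` misreading the field's offset-and-size word as an unscaled long (my reading of the unsafe row layout, not something spelled out in the patch): the encoder widens the value to decimal(38,18), whose 9-byte unscaled value is stored out of line, while the plan's decimal(6,3) makes `getDecimal` read the fixed-width slot directly.
```
// With an 8-byte null bitmap plus one 8-byte field slot, the variable-length
// data starts at byte offset 16, and the slot holds (offset << 32) | size.
val offsetAndSize = (16L << 32) | 9L                      // 68719476745
// Reading that word as an unscaled long with the plan schema's scale of 3:
println(java.math.BigDecimal.valueOf(offsetAndSize, 3))   // 68719476.745
```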
Here's an example that tricks `JacksonGenerator` into asking for a string from an array and an array from a string: the plan schema keeps the SELECT order (`y`, `x`), but `exprEnc` serializes the fields in the case class's declaration order (`x`, `y`). This case actually crashes the JVM:
```
scala> case class Data(x: Array[Int], y: String)
class Data
scala> sql("select repeat('Hey there', 17) as y, array_repeat(22, 17) as
x").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting
-deprecation` or `:replay -deprecation`
Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5(JacksonGenerator.scala:129) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5$adapted(JacksonGenerator.scala:128) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeArrayData(JacksonGenerator.scala:258) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$23(JacksonGenerator.scala:201) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeArray(JacksonGenerator.scala:249) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    ...
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
bash-3.2$
```
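The reordering that triggers the crash is visible from the public API (again, my own illustration; expected values shown as comments):
```
// In spark-shell:
case class Data(x: Array[Int], y: String)

val ds = sql("select repeat('Hey there', 17) as y, array_repeat(22, 17) as x").as[Data]

// Plan schema keeps the SELECT order; the old code drove JacksonGenerator with it.
println(ds.schema.fieldNames.mkString(", "))   // expected: y, x
// Encoder schema follows the case class declaration order, matching the rows
// that exprEnc serializes.
println(org.apache.spark.sql.Encoders.product[Data].schema.fieldNames.mkString(", "))   // expected: x, y
```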
Both these cases work correctly without `toJSON`.
### Does this PR introduce _any_ user-facing change?
Yes. Before the PR, converting the DataFrame to a Dataset of a tuple type would preserve the plan's column names in the JSON strings:
```
scala> sql("select 123.456d as a, 12 as b").as[(Double, Int)].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res0: Array[String] = Array({"a":123.456,"b":12})
scala>
```
After the PR, the JSON strings use the field names from the tuple class:
```
scala> sql("select 123.456d as a, 12 as b").as[(Double, Int)].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res1: Array[String] = Array({"_1":123.456,"_2":12})
scala>
```
### How was this patch tested?
New tests.
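For reference, a regression test for the decimal case might look like the following sketch (the test name, suite placement, and `testImplicits` usage are my guesses, not the actual tests in this PR):
```
// Hypothetical placement: a suite extending QueryTest with SharedSparkSession,
// with the case class defined at the top level of the test file.
case class DecimalData(a: BigDecimal)

test("toJSON uses the encoder schema rather than the plan schema") {
  import testImplicits._
  val Array(json) = sql("select 123.456bd as a").as[DecimalData].toJSON.collect()
  // Before the fix this misread the row layout, yielding e.g. {"a":68719476.745}.
  // Scala BigDecimal equality ignores scale, so decimal(38,18)'s trailing
  // zeros don't affect the comparison.
  assert(BigDecimal(json.stripPrefix("""{"a":""").stripSuffix("}")) == BigDecimal("123.456"))
}
```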
### Was this patch authored or co-authored using generative AI tooling?
No.