bersprockets opened a new pull request, #47982:
URL: https://github.com/apache/spark/pull/47982
### What changes were proposed in this pull request?
In `Dataset#toJSON`, use the schema from `exprEnc` rather than the schema from the logical plan. The encoder's schema reflects any changes (e.g., decimal precision, column ordering) that `exprEnc` makes to the input rows.
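A self-contained way to see the two schemas disagree, using only the public API (my own illustration, not part of the patch; the `App` wrapper and session setup are just scaffolding):
```
import org.apache.spark.sql.{Encoders, SparkSession}

case class Data(a: BigDecimal)

object SchemaMismatch extends App {
  val spark = SparkSession.builder().master("local[1]").getOrCreate()

  val ds = spark.sql("select 123.456bd as a").as(Encoders.product[Data])

  // Plan schema: what toJSON previously handed to JacksonGenerator.
  println(ds.schema.simpleString)                     // struct<a:decimal(6,3)>
  // Encoder schema: the layout of the rows exprEnc actually produces.
  println(Encoders.product[Data].schema.simpleString) // struct<a:decimal(38,18)>

  spark.stop()
}
```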
### Why are the changes needed?
`Dataset#toJSON` currently uses the schema from the logical plan, but that schema does not necessarily describe the rows passed to `JacksonGenerator`: the function passed to `mapPartitions` uses `exprEnc` to serialize the input, which can change the precision of decimals or rearrange columns.
Here's an example that tricks `UnsafeRow#getDecimal` (called from `JacksonGenerator`) into mistakenly assuming the decimal is stored as a long:
```
scala> case class Data(a: BigDecimal)
class Data
scala> sql("select 123.456bd as a").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res0: Array[String] = Array({"a":68719476.745})
scala>
```
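The garbage value is consistent with `UnsafeRow#getDecimal` misreading the field's offset-and-size word as an unscaled long (my reading of the unsafe row layout, not something spelled out in the patch): the encoder widens the value to decimal(38,18), whose 9-byte unscaled value is stored out of line, while the plan's decimal(6,3) makes `getDecimal` read the fixed-width slot directly.
```
// With an 8-byte null bitmap plus one 8-byte field slot, the variable-length
// data starts at byte offset 16, and the slot holds (offset << 32) | size.
val offsetAndSize = (16L << 32) | 9L                      // 68719476745
// Reading that word as an unscaled long with the plan schema's scale of 3:
println(java.math.BigDecimal.valueOf(offsetAndSize, 3))   // 68719476.745
```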
Here's an example that tricks `JacksonGenerator` into asking for a string from an array and an array from a string: the plan schema keeps the SELECT order (`y`, `x`), but `exprEnc` serializes the fields in the case class's declaration order (`x`, `y`). This case actually crashes the JVM:
```
scala> case class Data(x: Array[Int], y: String)
class Data
scala> sql("select repeat('Hey there', 17) as y, array_repeat(22, 17) as
x").as[Data].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting
-deprecation` or `:replay -deprecation`
Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5(JacksonGenerator.scala:129) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$5$adapted(JacksonGenerator.scala:128) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeArrayData(JacksonGenerator.scala:258) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$23(JacksonGenerator.scala:201) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeArray(JacksonGenerator.scala:249) ~[spark-catalyst_2.13-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    ...
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
bash-3.2$
```
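The reordering that triggers the crash is visible from the public API (again, my own illustration; expected values shown as comments):
```
// In spark-shell:
case class Data(x: Array[Int], y: String)

val ds = sql("select repeat('Hey there', 17) as y, array_repeat(22, 17) as x").as[Data]

// Plan schema keeps the SELECT order; the old code drove JacksonGenerator with it.
println(ds.schema.fieldNames.mkString(", "))   // expected: y, x
// Encoder schema follows the case class declaration order, matching the rows
// that exprEnc serializes.
println(org.apache.spark.sql.Encoders.product[Data].schema.fieldNames.mkString(", "))   // expected: x, y
```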
Both these cases work correctly without `toJSON`.
### Does this PR introduce _any_ user-facing change?
Yes. Before the PR, converting the DataFrame to a Dataset of a tuple type would preserve the plan's column names in the JSON strings:
```
scala> sql("select 123.456d as a, 12 as b").as[(Double, Int)].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res0: Array[String] = Array({"a":123.456,"b":12})
scala>
```
After the PR, the JSON strings use the field names from the tuple class:
```
scala> sql("select 123.456d as a, 12 as b").as[(Double, Int)].toJSON.collect
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
val res1: Array[String] = Array({"_1":123.456,"_2":12})
scala>
```
### How was this patch tested?
New tests.
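For reference, a regression test for the decimal case might look like the following sketch (the test name, suite placement, and `testImplicits` usage are my guesses, not the actual tests in this PR):
```
// Hypothetical placement: a suite extending QueryTest with SharedSparkSession,
// with the case class defined at the top level of the test file.
case class DecimalData(a: BigDecimal)

test("toJSON uses the encoder schema rather than the plan schema") {
  import testImplicits._
  val Array(json) = sql("select 123.456bd as a").as[DecimalData].toJSON.collect()
  // Before the fix this misread the row layout, yielding e.g. {"a":68719476.745}.
  // Scala BigDecimal equality ignores scale, so decimal(38,18)'s trailing
  // zeros don't affect the comparison.
  assert(BigDecimal(json.stripPrefix("""{"a":""").stripSuffix("}")) == BigDecimal("123.456"))
}
```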
### Was this patch authored or co-authored using generative AI tooling?
No.