[ https://issues.apache.org/jira/browse/SPARK-48965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun closed SPARK-48965.
---------------------------------
> toJSON produces wrong values if DecimalType information is lost in as[Product]
> ------------------------------------------------------------------------------
>
> Key: SPARK-48965
> URL: https://issues.apache.org/jira/browse/SPARK-48965
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1, 3.5.1
> Reporter: Dmitry Lapshin
> Assignee: Bruce Robbins
> Priority: Major
> Labels: correctness, pull-request-available
> Fix For: 3.4.4, 3.5.3, 4.0.0
>
>
> Consider this example:
> {code:scala}
> package com.jetbrains.jetstat.etl
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.DecimalType
>
> object A {
>   case class Example(x: BigDecimal)
>
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .master("local[1]")
>       .getOrCreate()
>     import spark.implicits._
>
>     val originalRaw = BigDecimal("123.456")
>     val original = Example(originalRaw)
>
>     val ds1 = spark.createDataset(Seq(original))
>     val ds2 = ds1.withColumn("x", $"x" cast DecimalType(12, 6))
>     val ds3 = ds2.as[Example]
>
>     println(s"DS1: schema=${ds1.schema}, encoder.schema=${ds1.encoder.schema}")
>     println(s"DS2: schema=${ds2.schema}, encoder.schema=${ds2.encoder.schema}")
>     println(s"DS3: schema=${ds3.schema}, encoder.schema=${ds3.encoder.schema}")
>
>     val json1 = ds1.toJSON.collect().head
>     val json2 = ds2.toJSON.collect().head
>     val json3 = ds3.toJSON.collect().head
>
>     val collect1 = ds1.collect().head
>     val collect2Row = ds2.collect().head
>     val collect2 = collect2Row.getDecimal(collect2Row.fieldIndex("x"))
>     val collect3 = ds3.collect().head
>
>     println(s"Original: $original (scale = ${original.x.scale}, precision = ${original.x.precision})")
>     println(s"Collect1: $collect1 (scale = ${collect1.x.scale}, precision = ${collect1.x.precision})")
>     println(s"Collect2: $collect2 (scale = ${collect2.scale}, precision = ${collect2.precision})")
>     println(s"Collect3: $collect3 (scale = ${collect3.x.scale}, precision = ${collect3.x.precision})")
>     println(s"json1: $json1")
>     println(s"json2: $json2")
>     println(s"json3: $json3")
>   }
> }
> {code}
> Running it, you'll see that json3 contains badly wrong data. After a bit of debugging (apologies, I'm not well versed in Spark internals), I found the following:
> * The in-memory representation of the data in this example is {{UnsafeRow}}, which stores small decimal values compactly as longs; its {{.getDecimal}} therefore has to be told the precision and scale, because they are not recorded with the value (see the sketch after this list).
> * However, there are at least two sources for the precision and scale to pass to that method: {{Dataset.schema}} (which is based on the query execution) and {{Dataset.encoder.schema}} (which is updated to 12,6 in {{ds2}} but then reset back to 38,18 in {{ds3}}). There is also a {{Dataset.deserializer}} that seems to combine the two non-trivially.
> * This doesn't seem to affect the {{Dataset.collect()}} methods, since they use the {{deserializer}}, but {{Dataset.toJSON}} appears to use only one of the two schemas, and for {{ds3}} it is the wrong one.
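> To make the first point concrete, here is a minimal sketch of that mechanism as I understand it. It uses Spark's internal {{UnsafeProjection}}, {{InternalRow}} and {{Decimal}} APIs directly, and deliberately picks a mismatched scale that stays within the compact long range, so it illustrates the root cause rather than the exact {{toJSON}} code path; the object name and the chosen scales are just for illustration:
> {code:scala}
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
> import org.apache.spark.sql.types.{Decimal, DecimalType, StructField, StructType}
>
> object DecimalSketch {
>   def main(args: Array[String]): Unit = {
>     // UnsafeRow stores a decimal with precision <= 18 as its bare unscaled
>     // long; the precision and scale are NOT stored alongside the value.
>     val schema = StructType(Seq(StructField("x", DecimalType(12, 6))))
>     val project = UnsafeProjection.create(schema)
>     val row = project(InternalRow(Decimal(BigDecimal("123.456"), 12, 6)))
>
>     // getDecimal trusts whatever precision and scale the caller passes in,
>     // so the same stored long decodes to different values:
>     println(row.getDecimal(0, 12, 6)) // 123.456000 -- matches what was written
>     println(row.getDecimal(0, 18, 9)) // 0.123456000 -- silently wrong
>   }
> }
> {code}
> So whatever code reads the row must take the precision and scale from the schema the row was written with; reading with the wrong schema produces garbage without any error.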
> It seems to me that either {{.toJSON}} should be more aware of what's going on, or {{.as[]}} should be doing something else.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]