GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/12030
[SPARK-14231][SQL] JSON data source infers floating-point values as a
double when they do not fit in a decimal
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14231
Currently, the JSON data source supports inferring `DecimalType` for big numbers and provides a `floatAsBigDecimal` option that reads floating-point values as `DecimalType`.
However, Spark's `DecimalType` has the following restrictions:
1. The precision cannot be greater than 38.
2. The scale cannot be greater than the precision.
Currently, neither restriction is handled.
This PR handles these cases by inferring such values as `DoubleType` instead. Also, the option was renamed from `floatAsBigDecimal` to `prefersDecimal`, as suggested
[here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
```scala
def doubleRecords: RDD[String] =
  sqlContext.sparkContext.parallelize(
    s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
    s"""{"a": 0.${"0" * 38}1, "b": 0.02}""" :: Nil)

val jsonDF = sqlContext.read
  .option("prefersDecimal", "true")
  .json(doubleRecords)
jsonDF.printSchema()
```
- **Before**
```scala
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
  at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
  ...
```
- **After**
```scala
root
|-- a: double (nullable = true)
|-- b: double (nullable = true)
```
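The fallback rule can be sketched in plain Scala, independent of Spark: a value is kept as a decimal only when both restrictions hold, and otherwise falls back to double. The helper name `inferredType` and the string return values are illustrative, not Spark's actual API; the precision/scale checks mirror the restrictions described above.

```scala
// Illustrative sketch (not Spark's actual code): decide whether a parsed
// BigDecimal can be represented as a Spark DecimalType, assuming the
// documented limits (precision <= 38, scale <= precision).
def inferredType(v: BigDecimal): String = {
  val maxPrecision = 38
  // Widening precision to the scale keeps scale <= precision for values
  // like 0.01 (precision 1, scale 2); anything still over the limit
  // cannot be a DecimalType and is inferred as a double instead.
  val precision = math.max(v.precision, v.scale)
  if (precision <= maxPrecision) s"decimal($precision,${v.scale})"
  else "double"
}

// 1 followed by 38 zeros: 39 digits of precision, does not fit.
println(inferredType(BigDecimal("1" + "0" * 38)))     // double
// 38 leading zeros after the point: scale 39, does not fit.
println(inferredType(BigDecimal("0." + "0" * 38 + "1"))) // double
// An ordinary small value still fits.
println(inferredType(BigDecimal("0.01")))             // decimal(2,2)
```

Both sample records from the example above trip one of the two limits, which is why the inferred schema falls back to `double` for both columns.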
## How was this patch tested?
Unit tests were added, and `./dev/run_tests` was run for coding style checks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-14231
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12030.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12030
----
commit f5a60dc3898d671c903f45a6a5c65334711761d3
Author: hyukjinkwon <[email protected]>
Date: 2016-03-29T07:52:55Z
Infer floating-point values as a double when they do not fit in a decimal
commit 7e999aeb878a24794203d0065f310c9c2b21a1c1
Author: hyukjinkwon <[email protected]>
Date: 2016-03-29T08:15:00Z
Change option name
commit f5613cbbf00beabac3b9fd2f49960296d3aee38e
Author: hyukjinkwon <[email protected]>
Date: 2016-03-29T08:23:40Z
Separate tests for big integer and double
----