GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/12030
[SPARK-14231][SQL] JSON data source infers floating-point values as a
double when they do not fit in a decimal
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14231
Currently, the JSON data source supports inferring `DecimalType` for big numbers and provides a `floatAsBigDecimal` option that reads floating-point values as `DecimalType`.
However, Spark's `DecimalType` has the following restrictions:
1. The precision cannot be greater than 38.
2. The scale cannot be greater than the precision.
Currently, neither restriction is handled.
This PR handles these cases by inferring such values as `DoubleType` instead. Also, the option was renamed from `floatAsBigDecimal` to `prefersDecimal`, as suggested
[here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
```scala
def doubleRecords: RDD[String] =
  sqlContext.sparkContext.parallelize(
    s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
    s"""{"a": 0.${"0" * 38}1, "b": 0.02}""" :: Nil)

val jsonDF = sqlContext.read
  .option("prefersDecimal", "true")
  .json(doubleRecords)
jsonDF.printSchema()
```
- **Before**
```scala
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
  at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
  at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
  ...
```
- **After**
```scala
root
|-- a: double (nullable = true)
|-- b: double (nullable = true)
```
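The fallback rule can be sketched in plain Scala, independent of Spark: a value is kept as a decimal only when both restrictions hold, and otherwise falls back to double. The helper name `inferredType` and the string return values are illustrative, not Spark's actual API; the precision/scale checks mirror the restrictions described above.

```scala
// Illustrative sketch (not Spark's actual code): decide whether a parsed
// BigDecimal can be represented as a Spark DecimalType, assuming the
// documented limits (precision <= 38, scale <= precision).
def inferredType(v: BigDecimal): String = {
  val maxPrecision = 38
  // Widening precision to the scale keeps scale <= precision for values
  // like 0.01 (precision 1, scale 2); anything still over the limit
  // cannot be a DecimalType and is inferred as a double instead.
  val precision = math.max(v.precision, v.scale)
  if (precision <= maxPrecision) s"decimal($precision,${v.scale})"
  else "double"
}

// 1 followed by 38 zeros: 39 digits of precision, does not fit.
println(inferredType(BigDecimal("1" + "0" * 38)))     // double
// 38 leading zeros after the point: scale 39, does not fit.
println(inferredType(BigDecimal("0." + "0" * 38 + "1"))) // double
// An ordinary small value still fits.
println(inferredType(BigDecimal("0.01")))             // decimal(2,2)
```

Both sample records from the example above trip one of the two limits, which is why the inferred schema falls back to `double` for both columns.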
## How was this patch tested?
Unit tests were added, and `./dev/run_tests` was run for coding style checks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-14231
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12030.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12030
----
commit f5a60dc3898d671c903f45a6a5c65334711761d3
Author: hyukjinkwon <[email protected]>
Date: 2016-03-29T07:52:55Z
Infer floating-point values as a double when they do not fit in a decimal
commit 7e999aeb878a24794203d0065f310c9c2b21a1c1
Author: hyukjinkwon <[email protected]>
Date: 2016-03-29T08:15:00Z
Change option name
commit f5613cbbf00beabac3b9fd2f49960296d3aee38e
Author: hyukjinkwon <[email protected]>
Date: 2016-03-29T08:23:40Z
Separate tests for big integer and double
----