Tzach Zohar created SPARK-9936:
----------------------------------

             Summary: decimal precision lost when loading DataFrame from RDD
                 Key: SPARK-9936
                 URL: https://issues.apache.org/jira/browse/SPARK-9936
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.0
            Reporter: Tzach Zohar


It seems that when converting an RDD that contains BigDecimals into a DataFrame 
(using SQLContext.createDataFrame without specifying a schema), the precision info is 
lost, which means saving it as a Parquet file will fail (Parquet tries to verify 
precision < 18, so it fails when precision is unset).

This seems to be similar to 
[SPARK-7196|https://issues.apache.org/jira/browse/SPARK-7196], which fixed the 
same issue for DataFrames created via JDBC.

To reproduce:
{code:none}
scala> val rdd: RDD[(String, BigDecimal)] = sc.parallelize(Seq(("a", BigDecimal.valueOf(0.234))))
rdd: org.apache.spark.rdd.RDD[(String, BigDecimal)] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> val df: DataFrame = new SQLContext(rdd.context).createDataFrame(rdd)
df: org.apache.spark.sql.DataFrame = [_1: string, _2: decimal(10,0)]

scala> df.write.parquet("/data/parquet-file")
15/08/13 10:30:07 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Unsupported datatype DecimalType()
{code}

To verify this is indeed caused by the lost precision, I've tried manually 
changing the schema to include precision (by traversing the StructFields and 
replacing each DecimalType with one that has explicit precision and scale) and 
creating a new DataFrame with this updated schema - and indeed that fixes the 
problem.
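
Roughly, the workaround looks like this (a sketch only; the (10, 3) 
precision/scale is just an assumption that fits the 0.234 value above, pick 
whatever fits your data):
{code:scala}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val sqlContext = new SQLContext(rdd.context)
val df = sqlContext.createDataFrame(rdd)   // schema has a DecimalType with no precision

// Rebuild the schema, replacing every DecimalType with one that carries an
// explicit precision/scale (assumed (10, 3) here).
val patchedSchema = StructType(df.schema.map {
  case StructField(name, _: DecimalType, nullable, metadata) =>
    StructField(name, DecimalType(10, 3), nullable, metadata)
  case other => other
})

// Re-create the DataFrame from the same rows using the patched schema;
// writing this one to Parquet succeeds.
val fixed = sqlContext.createDataFrame(df.rdd, patchedSchema)
fixed.write.parquet("/data/parquet-file")
{code}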

I'm using Spark 1.4.0.


