Tzach Zohar created SPARK-9936:
----------------------------------
Summary: decimal precision lost when loading DataFrame from RDD
Key: SPARK-9936
URL: https://issues.apache.org/jira/browse/SPARK-9936
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Tzach Zohar
It seems that when converting an RDD that contains BigDecimals into a DataFrame
(using SQLContext.createDataFrame without specifying a schema), the precision
information is lost, which means saving as a Parquet file will fail (the
Parquet writer verifies that precision is set and at most 18, so it fails when
precision is unset).
This seems to be similar to
[SPARK-7196|https://issues.apache.org/jira/browse/SPARK-7196], which fixed the
same issue for DataFrames created via JDBC.
To reproduce:
{code:none}
scala> val rdd: RDD[(String, BigDecimal)] = sc.parallelize(Seq(("a", BigDecimal.valueOf(0.234))))
rdd: org.apache.spark.rdd.RDD[(String, BigDecimal)] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> val df: DataFrame = new SQLContext(rdd.context).createDataFrame(rdd)
df: org.apache.spark.sql.DataFrame = [_1: string, _2: decimal(10,0)]

scala> df.write.parquet("/data/parquet-file")
15/08/13 10:30:07 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Unsupported datatype DecimalType()
{code}
To verify that the lost precision is indeed the cause, I manually changed the
schema to include precision (by traversing the StructFields and replacing each
DecimalType with one carrying explicit precision and scale), then created a new
DataFrame using the updated schema - and indeed this fixes the problem (see the
sketch below).
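For reference, a minimal sketch of that workaround. The helper name and the
DecimalType(18, 8) precision/scale are illustrative assumptions, not part of
this report; Spark 1.4's Parquet writer only supports precision up to 18, so
pick values that fit your data:
{code:none}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

// Hypothetical helper: rebuild the DataFrame's schema with explicit decimal
// precision/scale so the Parquet writer accepts it.
def withExplicitDecimals(df: DataFrame, sqlContext: SQLContext): DataFrame = {
  val fixedSchema = StructType(df.schema.fields.map {
    case StructField(name, _: DecimalType, nullable, metadata) =>
      // Replace the precision-less DecimalType with an explicit one
      StructField(name, DecimalType(18, 8), nullable, metadata)
    case other => other
  })
  // Reuse the underlying RDD[Row]; only the schema changes
  sqlContext.createDataFrame(df.rdd, fixedSchema)
}
{code}
With the rewritten schema, the subsequent df.write.parquet call succeeds.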
I'm using Spark 1.4.0.