Mostafa Mokhtar created SPARK-20297:
---------------------------------------
Summary: Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala
Key: SPARK-20297
URL: https://issues.apache.org/jira/browse/SPARK-20297
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0
Reporter: Mostafa Mokhtar
Priority: Critical
While loading data with Spark 2.1, I found that decimal(12,2) columns in
Parquet files written by Spark cannot be read by Hive or Impala.
Repro
{code}
CREATE TABLE customer_acctbal(
c_acctbal decimal(12,2))
STORED AS Parquet;
insert into customer_acctbal values (7539.95);
{code}
Error from Hive
{code}
Failed with exception java.io.IOException:parquet.io.ParquetDecodingException:
Can not read value at 1 in block 0 in file
hdfs://server1:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-00000-03d6e3bb-fe5e-4f20-87a4-88dec955dfcd.snappy.parquet
Time taken: 0.122 seconds
{code}
Error from Impala
{code}
File
'hdfs://server:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal/part-00000-32db4c61-fe67-4be2-9c16-b55c75c517a4.snappy.parquet'
has an incompatible Parquet schema for column
'tpch_nested_3000_parquet.customer_acctbal.c_acctbal'. Column type:
DECIMAL(12,2), Parquet schema:
optional int64 c_acctbal [i:0 d:1 r:0] (1 of 2 similar)
{code}
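The `int64` in the Impala message is consistent with a writer that picks the Parquet physical type from the decimal precision, while the reader expects a fixed-length byte array. Below is a minimal Python sketch of that selection rule. It assumes the 9-digit and 18-digit thresholds and the `spark.sql.parquet.writeLegacyFormat` switch behave as described in the Spark SQL documentation. The function names are hypothetical and this is not Spark's own code.

```python
from decimal import Decimal

MAX_INT_DIGITS = 9    # largest precision whose unscaled value fits in int32
MAX_LONG_DIGITS = 18  # largest precision whose unscaled value fits in int64


def min_bytes_for_precision(precision: int) -> int:
    """Smallest signed byte width able to hold any value of this precision."""
    num_bytes = 1
    while 2 ** (8 * num_bytes - 1) < 10 ** precision:
        num_bytes += 1
    return num_bytes


def parquet_physical_type(precision: int, legacy_format: bool) -> str:
    """Hypothetical helper: physical type chosen for DECIMAL(precision, scale).

    legacy_format stands in for spark.sql.parquet.writeLegacyFormat (assumption).
    """
    if legacy_format:
        # Legacy layout: always a fixed-length byte array.
        return f"fixed_len_byte_array({min_bytes_for_precision(precision)})"
    if precision <= MAX_INT_DIGITS:
        return "int32"
    if precision <= MAX_LONG_DIGITS:
        return "int64"
    return f"fixed_len_byte_array({min_bytes_for_precision(precision)})"


def unscaled_value(value: str, scale: int) -> int:
    """Integer stored in the column: value * 10**scale."""
    return int(Decimal(value).scaleb(scale))


if __name__ == "__main__":
    # DECIMAL(12,2) from the repro, with and without the legacy layout.
    print(parquet_physical_type(12, legacy_format=False))
    print(parquet_physical_type(12, legacy_format=True))
    print(unscaled_value("7539.95", 2))
```

Under these assumptions, DECIMAL(12,2) maps to `int64` in the default layout, which matches the schema Impala reports, and to a fixed-length byte array in the legacy layout. Whether switching the writer to the legacy layout makes this particular table readable has not been verified here.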
Table info
{code}
hive> describe formatted customer_acctbal;
OK
# col_name data_type comment
c_acctbal decimal(12,2)
# Detailed Table Information
Database: tpch_nested_3000_parquet
Owner: mmokhtar
CreateTime: Mon Apr 10 17:47:24 PDT 2017
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://server1.com:8020/user/hive/warehouse/tpch_nested_3000_parquet.db/customer_acctbal
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
numFiles 1
numRows 0
rawDataSize 0
totalSize 120
transient_lastDdlTime 1491871644
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.032 seconds, Fetched: 31 row(s)
{code}