Hi, Spark Users:
I have a problem where Spark cannot recognize the string type in a
Parquet schema generated by Hive.
Versions of all components:
Spark 1.3.1
Hive 0.12.0
Parquet 1.3.2
I generated a detailed low-level table in Parquet format using MapReduce Java
code. This table can be read in both Hive and Spark without any issue.
Now I create a Hive aggregation table like the following:
create external table T (
    column1 bigint,
    column2 string,
    ..............
)
partitioned by (dt string)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
  OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
location '/hdfs_location'
Then the table is populated in Hive by:
set hive.exec.compress.output=true;
set parquet.compression=snappy;

insert into table T partition(dt='2015-09-23')
select
    .............
from Detail_Table
group by
After this, we can query table T in Hive without any issue.
But if I try to use it in Spark 1.3.1 like the following:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val v_event_cnt = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
scala> v_event_cnt.printSchema
root
 |-- column1: long (nullable = true)
 |-- column2: binary (nullable = true)
 |-- ............
 |-- dt: string (nullable = true)
Spark recognizes column2 as binary type instead of string type in this case,
but in Hive it works fine. This causes a problem in Spark: the data gets dumped
as "[B@e353d68". To use the column in Spark, I have to cast it to string to get
the correct value out of it.
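
Here is roughly the workaround I use today (a minimal sketch against the
DataFrame from above; the column names are the ones from the schema, everything
else is just illustrative):

// Cast the binary column back to string so the values print correctly.
// selectExpr is the Spark 1.3 DataFrame API; column names match the schema above.
val v_event_cnt_str = v_event_cnt.selectExpr(
  "column1", "cast(column2 as string) as column2", "dt")
v_event_cnt_str.printSchema  // column2 should now show up as string
v_event_cnt_str.show()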
I wonder which part is responsible for this type mismatch in the Parquet file.
Does Hive not generate the Parquet file with the correct schema, or can Spark
simply not recognize it due to a problem on its side?
Is there a way, in either Hive or Spark, to make this Parquet schema come out
correctly on both ends?
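
For example, would the spark.sql.parquet.binaryAsString setting from the Spark
SQL guide be the right fix here? A minimal sketch of what I am thinking
(assuming the flag also applies to Parquet files written through the Hive SerDe):

// Ask Spark SQL to interpret un-annotated binary columns as strings when reading Parquet.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val v_event_cnt_fixed = sqlContext.parquetFile("/hdfs_location/dt=2015-09-23")
v_event_cnt_fixed.printSchema  // hoping column2 now comes back as string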
Thanks
Yong                                      
