For the second question, we do plan to support Hive 0.14, possibly in
Spark 1.4.0.
For the first question:
1. In Spark 1.2.0, the Parquet support code doesn’t handle the timestamp
type, so you can’t.
2. In Spark 1.3.0, timestamp support was added. Also, Spark SQL uses its
own Parquet support to handle both the read path and the write path
when dealing with Parquet tables declared in the Hive metastore, as
long as you’re not writing to a partitioned table. So yes, you can.
The Parquet version bundled with Spark 1.3.0 is 1.6.0rc3, which supports
timestamp type natively. However, the Parquet versions bundled with Hive
0.13.1 and Hive 0.14.0 are 1.3.2 and 1.5.0 respectively. Neither of them
supports timestamp type. Hive 0.14.0 “supports” reading and writing
timestamps from/to Parquet by converting them to/from Parquet binaries.
Similarly, Impala converts timestamp into Parquet int96. This can be
annoying for Spark SQL, because we must interpret Parquet files in
different ways according to the original writer of the file. As Parquet
matures, recent Parquet versions support more and more standard data
types. Mappings from complex nested types to Parquet types are also
being standardized
<https://github.com/apache/incubator-parquet-mr/pull/83>.
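For reference, the Impala INT96 layout mentioned above is 12 bytes:
8 little-endian bytes of nanoseconds since midnight followed by 4 bytes
of Julian day number. A minimal Python sketch of that conversion (the
function names are mine for illustration, not from any Spark or Parquet
API):

```python
import struct
from datetime import datetime, timedelta

# Julian day number of the Unix epoch (1970-01-01)
JULIAN_EPOCH_DAY = 2440588

def timestamp_to_int96(ts):
    # Pack as 8 bytes little-endian nanos-of-day + 4 bytes Julian day.
    julian_day = JULIAN_EPOCH_DAY + (ts.date() - datetime(1970, 1, 1).date()).days
    delta = ts - ts.replace(hour=0, minute=0, second=0, microsecond=0)
    nanos = delta.seconds * 10**9 + delta.microseconds * 1000
    return struct.pack('<qi', nanos, julian_day)

def int96_to_timestamp(raw):
    # Reverse of the above: unpack nanos-of-day and Julian day.
    nanos, julian_day = struct.unpack('<qi', raw)
    return datetime(1970, 1, 1) + timedelta(
        days=julian_day - JULIAN_EPOCH_DAY, microseconds=nanos // 1000)
```

This is only a sketch of the on-disk convention (truncated to microsecond
resolution on decode, since Python datetimes don’t carry nanoseconds); it
illustrates why a reader has to special-case files written by Impala/Hive.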
On 2/20/15 6:50 AM, The Watcher wrote:
Still trying to get my head around Spark SQL & Hive.
1) Let's assume I *only* use Spark SQL to create and insert data into Hive
tables, declared in a Hive metastore.
Does it matter at all if Hive supports the data types I need with Parquet,
or is all that matters what Catalyst & Spark's Parquet relation support?
Case in point : timestamps & Parquet
* Parquet now supports them as per
https://github.com/Parquet/parquet-mr/issues/218
* Hive only supports them in 0.14
So would I be able to read/write timestamps natively in Spark 1.2? Spark
1.3?
I have found this thread
http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-td15414.html
which seems to indicate that the data types supported by Hive would matter
to Spark SQL.
If so, why is that ? Doesn't the read path go through Spark SQL to read the
parquet file ?
2) Is there planned support for Hive 0.14 ?
Thanks