For the second question, we do plan to support Hive 0.14, possibly in Spark 1.4.0.

For the first question:

1. In Spark 1.2.0, the Parquet support code doesn’t handle the timestamp
   type at all, so no, you can’t.
2. In Spark 1.3.0, timestamp support was added. Moreover, Spark SQL now
   uses its own Parquet support for both the read path and the write path
   when dealing with Parquet tables declared in the Hive metastore, as
   long as you’re not writing to a partitioned table. So yes, you can;
   see the sketch right after this list.
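
Here is a minimal sketch of what that looks like against the Spark 1.3.0
DataFrame API. The table name "events" and the sample data are made up;
the point is that both the write and the read below go through Spark
SQL’s own Parquet code path, with the table registered in the Hive
metastore:

    import java.sql.Timestamp

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("ts-parquet"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Build a DataFrame with a timestamp column.
    val df = sc.parallelize(Seq(
      (1, Timestamp.valueOf("2015-02-20 06:50:00")),
      (2, Timestamp.valueOf("2015-02-20 07:00:00")))).toDF("id", "ts")

    // Writes through Spark SQL's own Parquet write path and registers
    // the table in the Hive metastore (non-partitioned only).
    df.saveAsTable("events")

    // Reads also go through Spark SQL's Parquet support.
    sqlContext.table("events")
      .filter($"ts" > Timestamp.valueOf("2015-02-20 06:55:00"))
      .show()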

The Parquet version bundled with Spark 1.3.0 is 1.6.0rc3, which supports the timestamp type natively. However, the Parquet versions bundled with Hive 0.13.1 and Hive 0.14.0 are 1.3.2 and 1.5.0 respectively, and neither of them supports the timestamp type. Hive 0.14.0 “supports” reading and writing timestamps from/to Parquet by converting them to and from Parquet binary values. Similarly, Impala stores timestamps as Parquet int96 values. This is annoying for Spark SQL, because we must interpret Parquet files differently depending on which system originally wrote them.

As Parquet matures, recent versions support more and more standard data types, and mappings from complex nested types to Parquet types are being standardized as well <https://github.com/apache/incubator-parquet-mr/pull/83>.
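
As for the int96 representation: it is 12 bytes, namely 8 little-endian
bytes of nanoseconds within the day followed by 4 little-endian bytes of
Julian day number. A rough Scala illustration of the conversion (the
names here are made up, it assumes a post-epoch timestamp, and it ignores
the timezone subtleties that make this conversion extra annoying):

    import java.nio.{ByteBuffer, ByteOrder}
    import java.sql.Timestamp

    // Julian day number of the Unix epoch (1970-01-01).
    val JulianDayOfEpoch = 2440588L
    val NanosPerDay = 24L * 60 * 60 * 1000 * 1000 * 1000

    def toInt96(ts: Timestamp): Array[Byte] = {
      // getTime is millis since the epoch; getNanos carries the whole
      // sub-second part, so drop the millis before adding it back.
      val nanosSinceEpoch = ts.getTime / 1000 * 1000000000L + ts.getNanos
      val julianDay = (JulianDayOfEpoch + nanosSinceEpoch / NanosPerDay).toInt
      val nanosOfDay = nanosSinceEpoch % NanosPerDay
      ByteBuffer.allocate(12)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(nanosOfDay)
        .putInt(julianDay)
        .array()
    }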

On 2/20/15 6:50 AM, The Watcher wrote:

Still trying to get my head around Spark SQL & Hive.

1) Let's assume I *only* use Spark SQL to create and insert data into Hive
tables, declared in a Hive metastore.

Does it matter at all if Hive supports the data types I need with Parquet,
or is the only thing that matters what Catalyst & Spark's Parquet relation
support?

Case in point: timestamps & Parquet
* Parquet now supports them as per
https://github.com/Parquet/parquet-mr/issues/218
* Hive only supports them in 0.14
So would I be able to read/write timestamps natively in Spark 1.2? Spark
1.3?

I have found this thread
http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-td15414.html
which seems to indicate that the data types supported by Hive would matter
to Spark SQL.
If so, why is that? Doesn't the read path go through Spark SQL to read the
Parquet file?

2) Is there planned support for Hive 0.14 ?

Thanks
