For the second question, we do plan to support Hive 0.14, possibly in
Spark 1.4.0.
For the first question:
1. In Spark 1.2.0, the Parquet support code doesn’t handle the timestamp
type, so you can’t.
2. In Spark 1.3.0, timestamp support was added. Also, Spark SQL uses its
own Parquet support to handle both the read path and the write path
when dealing with Parquet tables declared in the Hive metastore, as
long as you’re not writing to a partitioned table. So yes, you can.
The Parquet version bundled with Spark 1.3.0 is 1.6.0rc3, which supports
timestamp type natively. However, the Parquet versions bundled with Hive
0.13.1 and Hive 0.14.0 are 1.3.2 and 1.5.0 respectively. Neither of them
supports timestamp type. Hive 0.14.0 “supports” reading and writing
timestamps from/to Parquet by converting them to/from Parquet binaries.
Similarly, Impala converts timestamp into Parquet int96. This can be
annoying for Spark SQL, because we must interpret Parquet files in
different ways according to the original writer of the file. As Parquet
matures, recent Parquet versions support more and more standard data
types. Mappings from complex nested types to Parquet types are also
being standardized
<https://github.com/apache/incubator-parquet-mr/pull/83>.
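For reference, the Impala INT96 layout mentioned above is 12 bytes:
8 little-endian bytes of nanoseconds since midnight followed by 4 bytes
of Julian day number. A minimal Python sketch of that conversion (the
function names are mine for illustration, not from any Spark or Parquet
API):

```python
import struct
from datetime import datetime, timedelta

# Julian day number of the Unix epoch (1970-01-01)
JULIAN_EPOCH_DAY = 2440588

def timestamp_to_int96(ts):
    # Pack as 8 bytes little-endian nanos-of-day + 4 bytes Julian day.
    julian_day = JULIAN_EPOCH_DAY + (ts.date() - datetime(1970, 1, 1).date()).days
    delta = ts - ts.replace(hour=0, minute=0, second=0, microsecond=0)
    nanos = delta.seconds * 10**9 + delta.microseconds * 1000
    return struct.pack('<qi', nanos, julian_day)

def int96_to_timestamp(raw):
    # Reverse of the above: unpack nanos-of-day and Julian day.
    nanos, julian_day = struct.unpack('<qi', raw)
    return datetime(1970, 1, 1) + timedelta(
        days=julian_day - JULIAN_EPOCH_DAY, microseconds=nanos // 1000)
```

This is only a sketch of the on-disk convention (truncated to microsecond
resolution on decode, since Python datetimes don’t carry nanoseconds); it
illustrates why a reader has to special-case files written by Impala/Hive.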
On 2/20/15 6:50 AM, The Watcher wrote:
Still trying to get my head around Spark SQL & Hive.
1) Let's assume I *only* use Spark SQL to create and insert data into Hive
tables, declared in a Hive metastore.
Does it matter at all if Hive supports the data types I need with Parquet,
or is all that matters what Catalyst & Spark's Parquet relation support?
Case in point : timestamps & Parquet
* Parquet now supports them as per
https://github.com/Parquet/parquet-mr/issues/218
* Hive only supports them in 0.14
So would I be able to read/write timestamps natively in Spark 1.2? Spark
1.3?
I have found this thread
http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-td15414.html
which seems to indicate that the data types supported by Hive would matter
to Spark SQL.
If so, why is that ? Doesn't the read path go through Spark SQL to read the
parquet file ?
2) Is there planned support for Hive 0.14 ?
Thanks