Ah, sorry for not being clear enough.

So now in Spark 1.3.0, we have two Parquet support implementations: the old one is tightly coupled with the Spark SQL framework, while the new one is based on the data sources API. In both versions, we try to intercept operations over Parquet tables registered in the metastore when possible for better performance (mainly filter push-down optimization and extra metadata for more accurate schema inference). The distinctions are as follows (a small sketch showing how to toggle the relevant flags comes right after the list):

1. For the old version (set |spark.sql.parquet.useDataSourceApi| to
   |false|):

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” the read path. Namely, whenever you query a Parquet table
   registered in the metastore, we use our own Parquet implementation.

   For the write path, we fall back to the default Hive SerDe
   implementation (namely Spark SQL’s |InsertIntoHiveTable| operator).

2. For the new data source version (set
   |spark.sql.parquet.useDataSourceApi| to |true|, which is the default
   value in master and branch-1.3):

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” both the read and write paths, but if you’re writing to a
   partitioned table, we still fall back to the default Hive SerDe
   implementation.
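
For reference, here’s a minimal sketch of how the two flags above can be
toggled from a |HiveContext|. The surrounding session setup (an existing
|sc|, e.g. from spark-shell) is assumed; only the flag names come from the
description above:

    import org.apache.spark.sql.hive.HiveContext

    // Assumes an existing SparkContext named `sc` (e.g. in spark-shell).
    val hiveContext = new HiveContext(sc)

    // Old, SQL-core-coupled Parquet path (case 1 above) ...
    hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
    // ... or the new data sources API based path (case 2, default in 1.3).
    hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "true")

    // Whether metastore Parquet tables are “hijacked” by Spark SQL’s own
    // Parquet support instead of going through Hive SerDes.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")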

For Spark 1.2.0, only 1 applies. Spark 1.2.0 also has a Parquet data source, but it isn’t used unless you create the table with the data sources API specific DDL (|CREATE TEMPORARY TABLE <table-name> USING <data-source>|).
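
Roughly, that DDL looks like the following for Parquet (the table name and
path are made up for illustration; assumes a |SQLContext| or |HiveContext|
named |sqlContext|):

    // 1.2.0-style data source DDL; only tables declared this way go
    // through the Parquet data source in 1.2.
    sqlContext.sql("""
      CREATE TEMPORARY TABLE parquet_events
      USING org.apache.spark.sql.parquet
      OPTIONS (path '/tmp/events.parquet')
    """)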

Cheng

On 2/23/15 10:05 PM, The Watcher wrote:

Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
own Parquet support to read partitioned Parquet tables declared in the Hive
metastore. Only writing to partitioned tables is not covered yet. These
improvements will be included in Spark 1.3.0.

Just created SPARK-5948 to track writing to partitioned Parquet tables.

Ok, this is still a little confusing.

Since in 1.2.0 I am able to write to a partitioned Hive table by registering
my SchemaRDD and calling INSERT INTO "the partitioned Hive table" SELECT
"the registered SchemaRDD", what is the write path in this case? Full Hive
with a Spark SQL <-> Hive bridge?
If that were the case, why wouldn't SKEWED ON be honored (see another
thread I opened)?

Thanks
