Ah, sorry for not being clear enough.

So now in Spark 1.3.0, we have two Parquet support implementations: the old one is tightly coupled with the Spark SQL framework, while the new one is based on the data sources API. In both versions, we try to intercept operations over Parquet tables registered in the metastore when possible for better performance (mainly filter push-down optimization and extra metadata for more accurate schema inference). The distinctions are as follows (a small sketch showing how to toggle the relevant flags comes right after the list):

1. For the old version (set |spark.sql.parquet.useDataSourceApi| to
   |false|):

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” the read path. Namely, whenever you query a Parquet table
   registered in the metastore, we use our own Parquet implementation.

   For the write path, we fall back to the default Hive SerDe
   implementation (namely Spark SQL’s |InsertIntoHiveTable| operator).

2. For the new data source version (set
   |spark.sql.parquet.useDataSourceApi| to |true|, which is the default
   value in master and branch-1.3):

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” both the read and write paths, but if you’re writing to a
   partitioned table, we still fall back to the default Hive SerDe
   implementation.
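
For reference, here’s a minimal sketch of how the two flags above can be
toggled from a |HiveContext|. The surrounding session setup (an existing
|sc|, e.g. from spark-shell) is assumed; only the flag names come from the
description above:

    import org.apache.spark.sql.hive.HiveContext

    // Assumes an existing SparkContext named `sc` (e.g. in spark-shell).
    val hiveContext = new HiveContext(sc)

    // Old, SQL-core-coupled Parquet path (case 1 above) ...
    hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
    // ... or the new data sources API based path (case 2, default in 1.3).
    hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "true")

    // Whether metastore Parquet tables are “hijacked” by Spark SQL’s own
    // Parquet support instead of going through Hive SerDes.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")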

For Spark 1.2.0, only 1 applies. Spark 1.2.0 also has a Parquet data source, but it isn’t used unless you create the table with the data sources API specific DDL (|CREATE TEMPORARY TABLE <table-name> USING <data-source>|).
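
Roughly, that DDL looks like the following for Parquet (the table name and
path are made up for illustration; assumes a |SQLContext| or |HiveContext|
named |sqlContext|):

    // 1.2.0-style data source DDL; only tables declared this way go
    // through the Parquet data source in 1.2.
    sqlContext.sql("""
      CREATE TEMPORARY TABLE parquet_events
      USING org.apache.spark.sql.parquet
      OPTIONS (path '/tmp/events.parquet')
    """)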

Cheng

On 2/23/15 10:05 PM, The Watcher wrote:

Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
own Parquet support to read partitioned Parquet tables declared in the Hive
metastore. Only writing to partitioned tables is not covered yet. These
improvements will be included in Spark 1.3.0.

Just created SPARK-5948 to track writing to partitioned Parquet tables.

Ok, this is still a little confusing.

Since in 1.2.0 I am able to write to a partitioned Hive table by registering
my SchemaRDD and calling INSERT INTO "the partitioned Hive table" SELECT
"the registered SchemaRDD", what is the write path in this case? Full Hive
with a Spark SQL <-> Hive bridge?
If that were the case, why wouldn't SKEWED ON be honored (see another
thread I opened)?

Thanks
