[
https://issues.apache.org/jira/browse/SPARK-18059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-18059:
---------------------------------
Labels: bulk-closed (was: )
> Spark not respecting partitions in partitioned Hive views
> ---------------------------------------------------------
>
> Key: SPARK-18059
> URL: https://issues.apache.org/jira/browse/SPARK-18059
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.5.1
> Environment: ResourceManager version: 2.6.0-cdh5.4.2
> Hadoop version: 2.6.0-cdh5.4.2
> Reporter: Sunil Srivatsa
> Priority: Major
> Labels: bulk-closed
>
> For Hive partitioned views
> (https://cwiki.apache.org/confluence/display/Hive/PartitionedViews), when you
> specify a partition, Spark reads from all partitions rather than only the
> specified ones. For example:
> CREATE TABLE srcpart_1 (key INT, value STRING) PARTITIONED BY (ds STRING);
> CREATE TABLE srcpart_2 (key INT, value STRING) PARTITIONED BY (ds STRING);
> CREATE VIEW vp2
> PARTITIONED ON (ds)
> AS
> SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds
> FROM srcpart_1
> JOIN srcpart_2
> ON srcpart_1.key = srcpart_2.key
> AND srcpart_1.ds = srcpart_2.ds;
> ALTER VIEW vp2 ADD PARTITION (ds='2016-01-01');
> When a query executes against the view, e.g.:
> SELECT key, value FROM vp2 WHERE ds = '2016-01-01';
> Spark reads the Parquet files for all partitions, not just ds='2016-01-01'.
> This is easy to observe in a job by printing df.inputFiles. The EXPLAIN
> output Hive produces for that query against the view is the same as for the
> equivalent direct join, which suggests Hive itself is behaving correctly:
> SELECT srcpart_1.key, srcpart_2.value, srcpart_1.ds FROM srcpart_1
> JOIN srcpart_2 ON srcpart_1.key = srcpart_2.key WHERE srcpart_1.ds =
> '2016-01-01' AND srcpart_2.ds = '2016-01-01';
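> A minimal sketch of the df.inputFiles check described above. This is an
> illustration, not code from the report: it uses the modern SparkSession API
> rather than the 1.5-era HiveContext, and it assumes the tables and view from
> the repro already exist in the Hive metastore.
>
> import org.apache.spark.sql.SparkSession
>
> object PartitionPruningCheck {
>   def main(args: Array[String]): Unit = {
>     // Hive support is required so Spark can resolve the partitioned view.
>     val spark = SparkSession.builder()
>       .appName("SPARK-18059-repro")
>       .enableHiveSupport()
>       .getOrCreate()
>
>     val df = spark.sql("SELECT key, value FROM vp2 WHERE ds = '2016-01-01'")
>
>     // With correct partition pruning, only paths under .../ds=2016-01-01/
>     // should be listed; the reported bug is that files from every
>     // partition of the underlying tables appear.
>     df.inputFiles.foreach(println)
>   }
> }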
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]