[jira] [Commented] (SPARK-5049) ParquetTableScan always prepends the values of partition columns in output rows irrespective of the order of the partition columns in the original SELECT query

Rahul Aggarwal (JIRA) Thu, 01 Jan 2015 05:10:24 -0800

    [ https://issues.apache.org/jira/browse/SPARK-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262547#comment-14262547 ]


Rahul Aggarwal commented on SPARK-5049:
---------------------------------------

https://github.com/apache/spark/pull/3870

> ParquetTableScan always prepends the values of partition columns in output 
> rows irrespective of the order of the partition columns in the original 
> SELECT query
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5049
>                 URL: https://issues.apache.org/jira/browse/SPARK-5049
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Rahul Aggarwal
>
> This happens when ParquetTableScan is used, which is the case when 
> spark.sql.hive.convertMetastoreParquet is turned on.
> For example:
> spark-sql> set spark.sql.hive.convertMetastoreParquet=true;
> spark-sql> create table table1(a int , b int) partitioned by (p1 string, p2 
> int) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS  
> INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 
> 'parquet.hive.DeprecatedParquetOutputFormat';
> spark-sql> insert into table table1 partition(p1='January',p2=1) select key, 
> 10  from src;    
> spark-sql> select a, b, p1, p2 from table1 limit 10;
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> The correct output should be:
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
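> This also leads to a schema mismatch if the query is run using HiveContext 
> and the result is a SchemaRDD: the collected rows carry the partition values 
> first, while the reported schema lists the columns as a, b, p1, p2.
> For example: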
> scala> import org.apache.spark.sql.hive._
> scala> val hc = new HiveContext(sc)
> scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
> scala> val res = hc.sql("select a, b, p1, p2 from table1 limit 10")
> scala> res.collect
> res2: Array[org.apache.spark.sql.Row] = Array([January,1,238,10], 
> [January,1,86,10], [January,1,311,10], [January,1,27,10], [January,1,165,10], 
> [January,1,409,10], [January,1,255,10], [January,1,278,10], 
> [January,1,98,10], [January,1,484,10])
> scala> res.schema
> res5: org.apache.spark.sql.StructType = 
> StructType(ArrayBuffer(StructField(a,IntegerType,true), 
> StructField(b,IntegerType,true), StructField(p1,StringType,true), 
> StructField(p2,IntegerType,true)))
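
The reordering that the description asks for can be sketched in plain Scala. This is only an illustration of the expected behaviour, not the change made in the pull request linked above. The object name ColumnOrderSketch and the helper reorder are hypothetical. No Spark classes are used, and a row is modelled as a Seq[Any].

```scala
// Illustrative sketch only: plain Scala collections, no Spark classes involved.
// ColumnOrderSketch and reorder are hypothetical names, not Spark APIs.
object ColumnOrderSketch {
  // Column order requested by the SELECT list in the report.
  val requested: Seq[String] = Seq("a", "b", "p1", "p2")

  // Column order the rows actually came back in: partition columns first.
  val produced: Seq[String] = Seq("p1", "p2", "a", "b")

  // Rearranges one row from `produced` order into `requested` order.
  def reorder(row: Seq[Any], produced: Seq[String], requested: Seq[String]): Seq[Any] = {
    require(row.length == produced.length, "row width must match the produced column list")
    // Map each column name to its position in the produced row.
    val position: Map[String, Int] = produced.zipWithIndex.toMap
    // Pick the values in the order the query asked for.
    requested.map(name => row(position(name)))
  }

  def main(args: Array[String]): Unit = {
    // One of the rows from the spark-sql output in the report.
    val row: Seq[Any] = Seq("January", 1, 484, 10)
    // Prints the values tab-separated in the order a, b, p1, p2.
    println(reorder(row, produced, requested).mkString("\t"))
  }
}
```

With the rows rearranged this way, the values line up with the a, b, p1, p2 order that res.schema reports.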



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
