[ 
https://issues.apache.org/jira/browse/SPARK-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262547#comment-14262547
 ] 

Rahul Aggarwal commented on SPARK-5049:
---------------------------------------

https://github.com/apache/spark/pull/3870

> ParquetTableScan always prepends the values of partition columns in output 
> rows irrespective of the order of the partition columns in the original 
> SELECT query
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5049
>                 URL: https://issues.apache.org/jira/browse/SPARK-5049
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Rahul Aggarwal
>
> This happens when ParquetTableScan is used, which is the case when 
> spark.sql.hive.convertMetastoreParquet is turned on.
> For example:
> spark-sql> set spark.sql.hive.convertMetastoreParquet=true;
> spark-sql> create table table1(a int , b int) partitioned by (p1 string, p2 
> int) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS  
> INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 
> 'parquet.hive.DeprecatedParquetOutputFormat';
> spark-sql> insert into table table1 partition(p1='January',p2=1) select key, 
> 10  from src;    
> spark-sql> select a, b, p1, p2 from table1 limit 10;
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> January       1       484     10
> The correct output should be:
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
> 484   10      January 1
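> This also leads to a schema mismatch if the query is run using HiveContext and 
> the result is a SchemaRDD: the collected rows start with the partition values, 
> while the reported schema lists the columns as a, b, p1, p2.
> For example: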
> scala> import org.apache.spark.sql.hive._
> scala> val hc = new HiveContext(sc)
> scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
> scala> val res = hc.sql("select a, b, p1, p2 from table1 limit 10")
> scala> res.collect
> res2: Array[org.apache.spark.sql.Row] = Array([January,1,238,10], 
> [January,1,86,10], [January,1,311,10], [January,1,27,10], [January,1,165,10], 
> [January,1,409,10], [January,1,255,10], [January,1,278,10], 
> [January,1,98,10], [January,1,484,10])
> scala> res.schema
> res5: org.apache.spark.sql.StructType = 
> StructType(ArrayBuffer(StructField(a,IntegerType,true), 
> StructField(b,IntegerType,true), StructField(p1,StringType,true), 
> StructField(p2,IntegerType,true)))
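
The reordering that the description asks for can be illustrated with a small stand-alone sketch. This is not the change in the pull request and it does not use any Spark API. The object and method names (ReorderSketch, ordinals, reorder) are hypothetical, and the scan's emit order (partition columns first, then data columns) is assumed from the output shown above.

```scala
// Hypothetical, Spark-free sketch: map the column order emitted by a scan
// onto the column order requested by the SELECT list.
object ReorderSketch {
  // Assumed emit order of the scan: partition columns first, then data columns.
  val scanOrder: Seq[String] = Seq("p1", "p2", "a", "b")

  // For each requested column, find its position in the scan's output.
  def ordinals(requested: Seq[String], emitted: Seq[String]): Seq[Int] =
    requested.map { name =>
      val i = emitted.indexOf(name)
      require(i >= 0, s"column $name is not produced by the scan")
      i
    }

  // Rearrange one emitted row into the requested column order.
  def reorder(row: Seq[Any], ords: Seq[Int]): Seq[Any] = ords.map(row)

  def main(args: Array[String]): Unit = {
    val ords = ordinals(Seq("a", "b", "p1", "p2"), scanOrder)
    val emitted: Seq[Any] = Seq("January", 1, 484, 10)
    // Values are printed in the order a, b, p1, p2.
    println(reorder(emitted, ords).mkString("\t"))
  }
}
```

Computing the ordinals once per query and applying them to every row keeps the per-row work to a single indexed lookup per column.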



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
