GitHub user saucam opened a pull request:

    https://github.com/apache/spark/pull/4469

    SPARK-5684: Pass in the partition name along with the location information, as the location can be different (that is, it may not contain the partition keys)

    While parsing the partition keys from the partition locations in parquetRelations, it is assumed that the location path string always contains the partition keys. This is not true: a different location can be specified when adding a partition to the table, which results in a "key not found" exception when reading from such partitions:
    
    Create a partitioned parquet table:
    create table test_table (dummy string) partitioned by (timestamp bigint) stored as parquet;

    Add a partition to the table, specifying a different location:
    alter table test_table add partition (timestamp=9) location '/data/pth/different';

    Run a simple select * query, and we get an exception:
    15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from db4_mi2mi_binsrc1_default limit 5]
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 21, localhost): java.util.NoSuchElementException: key not found: timestamp
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:58)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
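    The failure mode can be illustrated with a small sketch (Python here for brevity; the actual logic lives in Spark's Scala parquet code, and the function name below is hypothetical, not Spark's API). Partition values are recovered by scanning the location path for key=value segments; a custom location with no such segments yields an empty map, and the subsequent key lookup throws, which mirrors the NoSuchElementException above:

    ```python
    def parse_partition_values(path, partition_keys):
        # Simplified illustration of inferring partition values from a
        # location path by scanning for key=value segments.
        values = {}
        for segment in path.strip("/").split("/"):
            if "=" in segment:
                key, value = segment.split("=", 1)
                if key in partition_keys:
                    values[key] = value
        return values

    # Default partition location: the key is embedded in the path, so the
    # lookup succeeds.
    ok = parse_partition_values("/warehouse/test_table/timestamp=9", {"timestamp"})
    assert ok["timestamp"] == "9"

    # Custom location from ALTER TABLE ... LOCATION: no key=value segment,
    # so the map is empty and the lookup raises KeyError (the Python
    # analogue of Scala's NoSuchElementException: key not found).
    bad = parse_partition_values("/data/pth/different", {"timestamp"})
    assert "timestamp" not in bad
    ```

    Passing the partition name alongside the location, as this patch does, removes the need to recover the keys from the path at all.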


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/saucam/spark partition_bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4469.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4469
    
----
commit 5aeeb6db8a3651b7b13d641ec0ed0dea21025438
Author: Yash Datta <[email protected]>
Date:   2015-02-09T08:53:40Z

    SPARK-5684: Pass in the partition name along with the location information, as the location can be different (that is, it may not contain the partition keys)

----


