Yash Datta created SPARK-5684:
---------------------------------
Summary: "key not found" exception is thrown when the location of a
partition added to a Parquet table differs from a path containing the
partition values
Key: SPARK-5684
URL: https://issues.apache.org/jira/browse/SPARK-5684
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0, 1.1.1, 1.1.0
Reporter: Yash Datta
Priority: Critical
Fix For: 1.3.0
Create a partitioned Parquet table:

create table test_table (dummy string) partitioned by (timestamp bigint) stored as parquet;

Add a partition to the table, specifying a different location:

alter table test_table add partition (timestamp=9) location '/data/pth/different';

Run a simple select * query; the following exception is thrown:
15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from db4_mi2mi_binsrc1_default limit 5]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 21, localhost): java.util.NoSuchElementException: key not found: timestamp
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:58)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:141)
        at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4.apply(ParquetTableOperations.scala:128)
        at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:247)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
This happens because the Parquet code path assumes that (key=value) patterns
are present in the partition location path, which is not always the case.
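The failure mode can be illustrated with a minimal sketch (this is not Spark's actual implementation; the function name and layout are hypothetical) of what parsing partition values out of a key=value-style path looks like, and why a custom ALTER TABLE location breaks it:

```python
def parse_partition_values(path):
    """Collect key=value segments from a partition path into a dict."""
    values = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values

# Default layout: the partition column is recoverable from the path.
ok = parse_partition_values("/warehouse/test_table/timestamp=9")
assert ok["timestamp"] == "9"

# Custom location set via ALTER TABLE ... LOCATION: the path carries no
# key=value segment, so the dict is empty and a later lookup of the
# partition column fails, analogous to "key not found: timestamp".
custom = parse_partition_values("/data/pth/different")
assert "timestamp" not in custom  # lookup would raise KeyError
```

A fix therefore cannot rely on the path alone for such partitions; the partition values have to come from the metastore entry instead.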
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)