[ https://issues.apache.org/jira/browse/SPARK-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046556#comment-15046556 ]
Hyukjin Kwon edited comment on SPARK-6982 at 12/8/15 8:15 AM:
--------------------------------------------------------------

Would this be handled now by [partition discovery](http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery) on the partition key?

was (Author: hyukjin.kwon):
Would this be done now by partition key [partition discovery](http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery)?

> Data Frame and Spark SQL should allow filtering on key portion of incremental parquet files
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6982
>                 URL: https://issues.apache.org/jira/browse/SPARK-6982
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 1.3.0
>            Reporter: Brad Willard
>              Labels: dataframe, sql
>
> I'm working with a 2.4 billion row dataset. I just converted it to use the incremental schema features of Parquet added in 1.3.0, where incremental files are saved with key=X.
> I'm currently saving files where the key is a timestamp, e.g. key=2015-01-01. If I run a query, the key comes back as an attribute in each row. It would be amazing to be able to do comparisons and filters on the key attribute, so that queries between points in time are efficient and simply skip the partitions of data outside the key range:
>
> df = sql_context.parquetFile('/super_large_dataset_over time')
> df.filter(df.key >= '2015-3-24').filter(df.key < '2015-04-01').count()
>
> That job could then skip large portions of the dataset very quickly, even if the entire Parquet dataset contains years of data.
> Currently that throws an error because key is not part of the Parquet schema, even though it is returned in the rows.
> Strangely, it does work with an IN clause, which is my current workaround:
>
> df.where('key in ("2015-04-02", "2015-04-03")').count()
>
> Job aborted due to stage failure: Task 122 in stage 6.0 failed 100 times, most recent failure: Lost task 122.99 in stage 6.0 (TID 39498, ip-XXXXXXXXXXXX): java.lang.IllegalArgumentException: Column [key] was not found in schema!
> 	at parquet.Preconditions.checkArgument(Preconditions.java:47)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> 	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
> 	at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> 	at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> 	at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> 	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> 	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> 	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> 	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:64)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
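
A minimal PySpark sketch (not part of the original report) of the partition-discovery approach the comment above asks about. The directory layout /data/events and the name events_df are hypothetical, and the write call shown is the Spark 1.4+ DataFrameWriter API; in a layout like .../key=2015-04-01/, partition discovery exposes key as a column that can be filtered for partition pruning instead of raising "Column [key] was not found in schema":

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partition-pruning-sketch")
sql_context = SQLContext(sc)

# Hypothetical write step: lay the data out as /data/events/key=2015-04-01/part-*.parquet
# so that partition discovery picks up `key` as a virtual column on read.
# events_df.write.partitionBy("key").parquet("/data/events")

# Reading the root directory discovers the key=... subdirectories automatically.
df = sql_context.read.parquet("/data/events")

# Filters on the partition column translate to partition pruning, so only the
# directories whose key falls inside the date range should be scanned.
count = df.filter(df.key >= "2015-03-24").filter(df.key < "2015-04-01").count()
print(count)
```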