[
https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brad Willard updated SPARK-8128:
--------------------------------
Comment: was deleted
(was: This is only broken on queries. I can load the dataframe with the
unquery-able parquet files, and re-save it as a new parquet file and the schema
merging will then make a query-able parquet file.
So my current workaround is to generate the partial parquet files and use the
schema merging to then make a new final parquet file which sucks, but at least
I can query it.)
> Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
> --------------------------------------------------------------------
>
> Key: SPARK-8128
> URL: https://issues.apache.org/jira/browse/SPARK-8128
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 1.3.0, 1.3.1, 1.4.0
> Reporter: Brad Willard
>
> I'm loading a folder of parquet files with about 600 parquet files and
> loading it into one dataframe so schema merging is involved. There is some
> bug with the schema merging that you print the schema and it shows all the
> attributes. However when you run a query and filter on that attribute is
> errors saying it's not in the schema. The query is incorrectly going to one
> of the parquet files that does not have that attribute.
> sdf = sql_context.parquet('/parquet/big_data_folder')
> sdf.printSchema()
> root
> \|-- _id: string (nullable = true)
> \|-- addedOn: string (nullable = true)
> \|-- attachment: string (nullable = true)
> .......
> \|-- items: array (nullable = true)
> \| |-- element: struct (containsNull = true)
> \| | |-- _id: string (nullable = true)
> \| | |-- addedOn: string (nullable = true)
> \| | |-- authorId: string (nullable = true)
> \| | |-- mediaProcessingState: long (nullable = true)
> \|-- mediaProcessingState: long (nullable = true)
> \|-- title: string (nullable = true)
> \|-- key: string (nullable = true)
> sdf.filter(sdf.mediaProcessingState == 3).count()
> causes this exception
> Py4JJavaError: An error occurred while calling o67.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in
> stage 4.0 (TID 70565, XXXXXXXXXXXXXXX): java.lang.IllegalArgumentException:
> Column [mediaProcessingState] was not found in schema!
> at parquet.Preconditions.checkArgument(Preconditions.java:47)
> at
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> at
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> at
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> at
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> at
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> at
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
> at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
> at
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
> at
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
> at
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
> at
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> You also get the same error if you register it as a temp table and try to
> execute the same sql query.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]