[jira] [Issue Comment Deleted] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema

Brad Willard (JIRA) Tue, 30 Jun 2015 14:30:04 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-8128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Brad Willard updated SPARK-8128:
--------------------------------
    Comment: was deleted

(was: This is only broken on queries. I can load the dataframe with the 
unquery-able parquet files, and re-save it as a new parquet file and the schema 
merging will then make a query-able parquet file.

So my current workaround is to generate the partial parquet files and use the 
schema merging to then make a new final parquet file which sucks, but at least 
I can query it.)

> Schema Merging Broken: Dataframe Fails to Recognize Column in Schema
> --------------------------------------------------------------------
>
>                 Key: SPARK-8128
>                 URL: https://issues.apache.org/jira/browse/SPARK-8128
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0
>            Reporter: Brad Willard
>
> I'm loading a folder of parquet files with about 600 parquet files and 
> loading it into one dataframe so schema merging is involved. There is some 
> bug with the schema merging that you print the schema and it shows all the 
> attributes. However when you run a query and filter on that attribute is 
> errors saying it's not in the schema. The query is incorrectly going to one 
> of the parquet files that does not have that attribute.
> sdf = sql_context.parquet('/parquet/big_data_folder')
> sdf.printSchema()
> root
>  \|-- _id: string (nullable = true)
>  \|-- addedOn: string (nullable = true)
>  \|-- attachment: string (nullable = true)
>  .......
> \|-- items: array (nullable = true)
>  \|    |-- element: struct (containsNull = true)
>  \|    |    |-- _id: string (nullable = true)
>  \|    |    |-- addedOn: string (nullable = true)
>  \|    |    |-- authorId: string (nullable = true)
>  \|    |    |-- mediaProcessingState: long (nullable = true)
>  \|-- mediaProcessingState: long (nullable = true)
>  \|-- title: string (nullable = true)
>  \|-- key: string (nullable = true)
> sdf.filter(sdf.mediaProcessingState == 3).count()
> causes this exception
> Py4JJavaError: An error occurred while calling o67.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in 
> stage 4.0 (TID 70565, XXXXXXXXXXXXXXX): java.lang.IllegalArgumentException: 
> Column [mediaProcessingState] was not found in schema!
>     at parquet.Preconditions.checkArgument(Preconditions.java:47)
>     at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>     at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>     at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>     at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>     at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>     at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
>     at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
>     at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
>     at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
>     at 
> parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
>     at 
> parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
>     at 
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
>     at 
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>     at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> You also get the same error if you register it as a temp table and try to 
> execute the same sql query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Issue Comment Deleted] (SPARK-8128) Schema Merging Broken: Dataframe Fails to Recognize Column in Schema

Reply via email to