GitHub user rtreffer commented on the pull request:
https://github.com/apache/spark/pull/6796#issuecomment-113628860
OK, it looks like I can't open Hive-generated Parquet files, but the failure looks more like a type error than anything else:
```
scala> val hive = sqlContext.load("/home/rtreffer/work/hadoop/hive-parquet")
warning: there was one deprecation warning; re-run with -deprecation for details
15/06/19 22:03:26 INFO ParquetFileReader: Initiating action with parallelism: 5
hive: org.apache.spark.sql.DataFrame = [id: int, value: decimal(30,0)]

scala> hive.collect.foreach(println)
15/06/19 22:03:35 INFO BlockManagerInfo: Removed broadcast_8_piece0 on localhost:42189 in memory (size: 2.4 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO BlockManagerInfo: Removed broadcast_9_piece0 on localhost:42189 in memory (size: 2.4 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(130208) called with curMem=0, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 127.2 KB, free 265.0 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(14082) called with curMem=130208, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 13.8 KB, free 265.0 MB)
15/06/19 22:03:35 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:42189 (size: 13.8 KB, free: 265.1 MB)
15/06/19 22:03:35 INFO SparkContext: Created broadcast 10 from collect at <console>:38
15/06/19 22:03:35 INFO SparkContext: Starting job: collect at <console>:38
15/06/19 22:03:35 INFO DAGScheduler: Got job 9 (collect at <console>:38) with 1 output partitions (allowLocal=false)
15/06/19 22:03:35 INFO DAGScheduler: Final stage: ResultStage 11(collect at <console>:38)
15/06/19 22:03:35 INFO DAGScheduler: Parents of final stage: List()
15/06/19 22:03:35 INFO DAGScheduler: Missing parents: List()
15/06/19 22:03:35 INFO DAGScheduler: Submitting ResultStage 11 (MapPartitionsRDD[45] at collect at <console>:38), which has no missing parents
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(5568) called with curMem=144290, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 5.4 KB, free 265.0 MB)
15/06/19 22:03:35 INFO MemoryStore: ensureFreeSpace(2964) called with curMem=149858, maxMem=278019440
15/06/19 22:03:35 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 2.9 KB, free 265.0 MB)
15/06/19 22:03:36 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on localhost:42189 (size: 2.9 KB, free: 265.1 MB)
15/06/19 22:03:36 INFO SparkContext: Created broadcast 11 from broadcast at DAGScheduler.scala:893
15/06/19 22:03:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 11 (MapPartitionsRDD[45] at collect at <console>:38)
15/06/19 22:03:36 INFO TaskSchedulerImpl: Adding task set 11.0 with 1 tasks
15/06/19 22:03:36 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 36, localhost, PROCESS_LOCAL, 1471 bytes)
15/06/19 22:03:36 INFO Executor: Running task 0.0 in stage 11.0 (TID 36)
15/06/19 22:03:36 INFO ParquetRelation2$$anonfun$buildScan$1$$anon$1: Input split: ParquetInputSplit{part: file:/home/rtreffer/work/hadoop/hive-parquet/000000_0 start: 0 end: 874 length: 874 hosts: []}
15/06/19 22:03:36 INFO InternalParquetRecordReader: RecordReader initialized will read a total of 33 records.
15/06/19 22:03:36 INFO InternalParquetRecordReader: at row 0. reading next block
15/06/19 22:03:36 INFO InternalParquetRecordReader: block read in memory in 0 ms. row count = 33
15/06/19 22:03:36 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 36)
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/home/rtreffer/work/hadoop/hive-parquet/000000_0
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$class.foreach(Iterator.scala:750)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)
    at scala.collection.AbstractIterator.to(Iterator.scala:1202)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: optional binary value (DECIMAL(30,0)) != optional fixed_len_byte_array(13) value (DECIMAL(30,0))
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:106)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:98)
    at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:389)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:88)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:62)
    at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
    at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:149)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
    ... 28 more
15/06/19 22:03:36 WARN TaskSetManager: Lost task 0.0 in stage 11.0 (TID 36, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/home/rtreffer/work/hadoop/hive-parquet/000000_0
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$class.foreach(Iterator.scala:750)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)
    at scala.collection.AbstractIterator.to(Iterator.scala:1202)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: optional binary value (DECIMAL(30,0)) != optional fixed_len_byte_array(13) value (DECIMAL(30,0))
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:106)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:98)
    at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:389)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:88)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:62)
    at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
    at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:149)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
    ... 28 more
15/06/19 22:03:36 ERROR TaskSetManager: Task 0 in stage 11.0 failed 1 times; aborting job
15/06/19 22:03:36 INFO TaskSchedulerImpl: Removed TaskSet 11.0, whose tasks have all completed, from pool
15/06/19 22:03:36 INFO TaskSchedulerImpl: Cancelling stage 11
15/06/19 22:03:36 INFO DAGScheduler: ResultStage 11 (collect at <console>:38) failed in 0.021 s
15/06/19 22:03:36 INFO DAGScheduler: Job 9 failed: collect at <console>:38, took 0.033099 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 36, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/home/rtreffer/work/hadoop/hive-parquet/000000_0
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at scala.collection.Iterator$class.foreach(Iterator.scala:750)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:295)
    at scala.collection.AbstractIterator.to(Iterator.scala:1202)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:287)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1202)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:274)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1202)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: optional binary value (DECIMAL(30,0)) != optional fixed_len_byte_array(13) value (DECIMAL(30,0))
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:106)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:98)
    at org.apache.parquet.schema.PrimitiveType.accept(PrimitiveType.java:389)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:88)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:62)
    at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
    at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:149)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
    ... 28 more
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1285)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1276)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1275)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1275)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:749)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:749)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1484)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1445)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
```
Hm, it could be that the Spark decoder is too strict. There are various ways to encode DECIMAL(30,0) in Parquet, and it looks like Hive chooses fixed-length byte arrays, while I prefer variable-length (binary) arrays. I have to double-check that; see the sketch below.
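For reference, here is a minimal sketch of the two schemas that collide in the error above, written with parquet-mr's `MessageTypeParser` (the message names and the single-column layout are just illustrative, not what either engine actually emits):
```
import org.apache.parquet.schema.MessageTypeParser

// What the file contains, per the exception above (illustrative message name):
// DECIMAL(30,0) stored as a 13-byte fixed_len_byte_array.
val fileSchema = MessageTypeParser.parseMessageType(
  """message hive_schema {
    |  optional fixed_len_byte_array(13) value (DECIMAL(30,0));
    |}""".stripMargin)

// What the reader requests for the same column (illustrative message name):
// DECIMAL(30,0) stored as variable-length binary.
val requestedSchema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  optional binary value (DECIMAL(30,0));
    |}""".stripMargin)

// Both are legal Parquet encodings of DECIMAL(30,0), but the reader's
// ColumnIOFactory compares the primitive types and rejects the mismatch.
println(fileSchema)
println(requestedSchema)
```
(13 bytes is the smallest fixed width that can hold a signed 30-digit decimal, which is presumably why Hive picks it.)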