[ https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
DB Tsai updated SPARK-9442:
---------------------------
Description:
I am counting how many records are in my nested Parquet data, which has this schema:
{code}
scala> u1aTesting.printSchema
root
|-- profileId: long (nullable = true)
|-- country: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- videoId: long (nullable = true)
| | |-- date: long (nullable = true)
| | |-- label: double (nullable = true)
| | |-- weight: double (nullable = true)
| | |-- features: vector (nullable = true)
{code}
The number of records in the nested data array is around 10k, each Parquet file is around 600 MB, and the total size is around 120 GB.
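(The load step is not shown in the report; as a point of reference, a minimal sketch of how such a DataFrame would typically be obtained in the 1.4.1 spark-shell is below. The directory path is inferred from the part-file path in the stack trace, and reading the whole directory is an assumption.)
{code}
// Assumed reproduction sketch (not from the original report): load the nested
// Parquet dataset from the directory that contains the part file named in the
// stack trace, then inspect its schema.
val u1aTesting = sqlContext.read.parquet(
  "hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/")
u1aTesting.printSchema()
{code}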
I am doing a simple count:
{code}
scala> u1aTesting.count
parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in file hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
    at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
    at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
    ... 21 more
{code}
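Not part of the original report, but one way to narrow this down: since the ParquetDecodingException names a single part file, counting each part file on its own can show whether one specific (possibly empty or truncated) file triggers the divide-by-zero. A sketch using only the spark-shell and the Hadoop FileSystem API, assuming the directory path taken from the stack trace:
{code}
// Diagnostic sketch (an assumption, not from the original report): count each
// part file individually and report which ones fail with the decoding error.
import org.apache.hadoop.fs.Path

val dir = "hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/"
val fs = new Path(dir).getFileSystem(sc.hadoopConfiguration)
val parts = fs.listStatus(new Path(dir)).map(_.getPath.toString).filter(_.endsWith(".parquet"))

parts.foreach { p =>
  try {
    println(s"$p -> ${sqlContext.read.parquet(p).count()} rows")
  } catch {
    case e: Exception => println(s"$p -> FAILED: ${e.getMessage}")
  }
}
{code}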
> java.lang.ArithmeticException: / by zero when reading Parquet
> -------------------------------------------------------------
>
> Key: SPARK-9442
> URL: https://issues.apache.org/jira/browse/SPARK-9442
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1
> Reporter: DB Tsai
>