Lorenzo Martini created SPARK-36565:
---------------------------------------
Summary: "Unscaled value too large for precision" while reading
simple parquet file readable with parquet-tools and pandas
Key: SPARK-36565
URL: https://issues.apache.org/jira/browse/SPARK-36565
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.1.2, 2.4.0
Reporter: Lorenzo Martini
Attachments: broken_parquet_file.parquet
I have a simple Parquet file (attached to the ticket) with two columns
(array<string>, decimal) that can be read and displayed correctly with pandas or
parquet-tools. Reading the file in Spark (and PySpark) appears to work,
but calling `.show()` throws exception (1), whose root cause is:
{code:java}
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
{code}
Interestingly, after reading the Parquet file, selecting either column
individually lets `show()` succeed without throwing.
{code:java}
>>> repro = spark.read.parquet(".../broken_file.parquet")
>>> repro.printSchema()
root
|-- column_a: array (nullable = true)
| |-- element: string (containsNull = true)
|-- column_b: decimal(4,0) (nullable = true)
>>> repro.select("column_a").show()
+--------+
|column_a|
+--------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+--------+
>>> repro.select("column_b").show()
+--------+
|column_b|
+--------+
| 11590|
| 11590|
| 11590|
| 11590|
| 11590|
| 11590|
| 11590|
| 11590|
| 11590|
| 11590|
+--------+
>>> repro.show()  # THIS ONE THROWS EXCEPTION (1)
{code}
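One detail in the session above may be relevant: `printSchema()` reports `column_b` as `decimal(4,0)`, yet every value displayed is 11590, which has five digits. The sketch below is not Spark's code. It is a plain-Python illustration, under the assumption that the error message means the stored integer needs more digits than the declared precision allows. The helper name `fits_precision` is hypothetical.

```python
def fits_precision(unscaled: int, precision: int) -> bool:
    """Return True if `unscaled` has at most `precision` decimal digits."""
    return abs(unscaled) < 10 ** precision


# Numbers taken from the session above: printSchema() reports decimal(4,0),
# while select("column_b").show() prints 11590.
declared_precision = 4
stored_value = 11590

print(fits_precision(stored_value, declared_precision))  # False
print(len(str(abs(stored_value))))                       # 5
```

If that assumption holds, the file's declared precision and its stored values disagree, and readers that do not enforce the declared precision would still display the value.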
`parquet-tools` displays the dataset correctly:
{code:java}
>>> parquet-tools show broken_file.parquet
+------------+------------+
| column_a | column_b |
|------------+------------|
| | 11590 |
| | 11590 |
| | 11590 |
| | 11590 |
| | 11590 |
| | 11590 |
| | 11590 |
| | 11590 |
| | 11590 |
| | 11590 |
+------------+------------+
{code}
The same is true of `pandas`:
{code:java}
>>> import pandas as pd
>>> pd.read_parquet(".../broken_file.parquet")
column_a column_b
0 None 11590
1 None 11590
2 None 11590
3 None 11590
4 None 11590
5 None 11590
6 None 11590
7 None 11590
8 None 11590
9 None 11590
{code}
I have also verified that this affects all versions of Spark between 2.4.0 and 3.1.2.
Here is exception (1) as thrown:
{code:java}
>>> spark.version
'3.1.2'
>>> df = spark.read.parquet(".../broken_parquet_file.parquet")
>>> df
DataFrame[column_a: array<string>, column_b: decimal(4,0)]
>>> df.show()
21/08/23 18:39:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details:
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
    at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/lmartini/Downloads/broken_parquet_file.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
    ... 19 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
    at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
    at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetDecimalConverter.decimalFromLong(ParquetRowConverter.scala:475)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.$anonfun$setDictionary$2(ParquetRowConverter.scala:496)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.$anonfun$setDictionary$2$adapted(ParquetRowConverter.scala:495)
    at scala.Array$.tabulate(Array.scala:334)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.setDictionary(ParquetRowConverter.scala:495)
    at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:341)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:80)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:75)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
    ... 24 more
21/08/23 18:39:36 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (192.168.0.39 executor driver): org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details:
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
    at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/lmartini/Downloads/broken_parquet_file.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
    ... 19 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
    at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
    at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetDecimalConverter.decimalFromLong(ParquetRowConverter.scala:475)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.$anonfun$setDictionary$2(ParquetRowConverter.scala:496)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.$anonfun$setDictionary$2$adapted(ParquetRowConverter.scala:495)
    at scala.Array$.tabulate(Array.scala:334)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.setDictionary(ParquetRowConverter.scala:495)
    at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:341)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:80)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:75)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
    ... 24 more
21/08/23 18:39:36 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 484, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/usr/local/lib/python3.9/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (192.168.0.39 executor driver):
org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details:
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
    at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/lmartini/Downloads/broken_parquet_file.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
    ... 19 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
    at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
    at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetDecimalConverter.decimalFromLong(ParquetRowConverter.scala:475)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.$anonfun$setDictionary$2(ParquetRowConverter.scala:496)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.$anonfun$setDictionary$2$adapted(ParquetRowConverter.scala:495)
    at scala.Array$.tabulate(Array.scala:334)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.setDictionary(ParquetRowConverter.scala:495)
    at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:341)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:80)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:75)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
    ... 24 more
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:472)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
    at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2722)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2722)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2929)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details:
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
    ... 1 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/lmartini/Downloads/broken_parquet_file.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
    ... 19 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
    at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
    at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:574)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetDecimalConverter.decimalFromLong(ParquetRowConverter.scala:475)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.$anonfun$setDictionary$2(ParquetRowConverter.scala:496)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.$anonfun$setDictionary$2$adapted(ParquetRowConverter.scala:495)
    at scala.Array$.tabulate(Array.scala:334)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetIntDictionaryAwareDecimalConverter.setDictionary(ParquetRowConverter.scala:495)
    at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:341)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:80)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:75)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
    ... 24 more
{code}
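The log above nests several exceptions. As a reading aid, the following sketch lists the `Caused by:` classes found in pasted log text, outermost first, so the root cause is the last entry. The helper `cause_chain` is hypothetical and is not part of Spark or py4j. The sample string is copied from fragments of the log above.

```python
import re
from typing import List

# Matches the fully qualified class name that follows each "Caused by:" marker,
# even when a line break separates the marker from the class name.
CAUSE_RE = re.compile(r"Caused by:\s+([\w.$]+)")


def cause_chain(trace: str) -> List[str]:
    """Return the exception classes named after 'Caused by:', in order of appearance."""
    return CAUSE_RE.findall(trace)


# Fragments copied from the log above.
sample = (
    "at java.base/java.lang.Thread.run(Thread.java:832) Caused by:\n"
    "org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in\n"
    "... 19 more Caused by: java.lang.ArithmeticException: Unscaled value too large\n"
    "for precision at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83) at\n"
)

print(cause_chain(sample))
```

For the fragments shown, the last entry is `java.lang.ArithmeticException`, which matches the root cause quoted at the top of this report.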