[
https://issues.apache.org/jira/browse/SPARK-23225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marco Gaido resolved SPARK-23225.
---------------------------------
Resolution: Duplicate
> Spark is inferring decimal values with wrong precision
> ------------------------------------------------------
>
> Key: SPARK-23225
> URL: https://issues.apache.org/jira/browse/SPARK-23225
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Nacho García Fernández
> Priority: Major
> Fix For: 2.1.1
>
>
> Hi there.
> I'm reading a CSV file with data exported by DB2. The file contains about
> 1.6M records. Most of the values are actually Decimal(1,0), but about 200 of
> them are Decimal(2,0).
> The file looks like this:
>
> {code:java}
> +-------------+
> |MY_COLUMN|
> +-------------+
> | +0001.|
> | +0010.|
> | +0011.|
> | +0002.|
> .........
> {code}
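
To make the precision claim concrete, here is a minimal sketch that parses the sample values shown above with plain `java.math.BigDecimal` and reports the precision of each one. It assumes the CSV reader ends up building a decimal from each raw field, which the `Decimal.set` frames in the trace below suggest but which this message does not confirm. The object and method names (`DecimalPrecisionSketch`, `precisionOf`, `requiredPrecision`) are hypothetical and exist only for illustration.

```scala
import java.math.BigDecimal

// Hypothetical helper, not part of Spark: it only inspects the raw strings.
object DecimalPrecisionSketch {
  // Sample values copied from the file excerpt in the report.
  val samples: Seq[String] = Seq("+0001.", "+0010.", "+0011.", "+0002.")

  // Number of significant digits BigDecimal assigns to one raw field.
  // Leading zeros and the trailing dot do not count towards the precision.
  def precisionOf(raw: String): Int = new BigDecimal(raw.trim).precision()

  // Narrowest decimal precision that can hold every value in the input.
  def requiredPrecision(values: Seq[String]): Int =
    values.map(precisionOf).max

  def main(args: Array[String]): Unit = {
    samples.foreach(s => println(s"$s -> precision ${precisionOf(s)}"))
    println(s"required precision: ${requiredPrecision(samples)}")
  }
}
```

A column typed from rows like `+0001.` alone would get precision 1, while rows like `+0010.` need precision 2, which matches the "precision 2 exceeds max precision 1" message in the exception.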
>
> Everything is OK when I read the input file with the following line (no
> Spark action is called at this point):
> {code:java}
> val test = spark.read
>   .option("delimiter", ";")
>   .option("inferSchema", "true")
>   .option("header", "true")
>   .csv("testfile")
> {code}
>
> After calling a simple action like *test.distinct* or *test.count*, Spark
> throws the following exception:
>
> {code:java}
> [Stage 58:> (0 + 4) / 65][Stage 60:> (0 + 0) / 3][Stage 61:> (0 + 0) /
> 7]2018-01-24 11:01:27 ERROR org.apache.spark.executor.Executor:91 - Exception
> in task 1.0 in stage 58.0 (TID 6614) java.lang.IllegalArgumentException:
> requirement failed: Decimal precision 2 exceeds max precision 1 at
> scala.Predef$.require(Predef.scala:224) at
> org.apache.spark.sql.types.Decimal.set(Decimal.scala:113) at
> org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:426) at
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:273)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
> Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source) at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at
> org.apache.spark.scheduler.Task.run(Task.scala:99) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748) 2018-01-24 11:01:27 WARN
> org.apache.spark.scheduler.TaskSetManager:66 - Lost task 1.0 in stage 58.0
> (TID 6614, localhost, executor driver): java.lang.IllegalArgumentException:
> requirement failed: Decimal precision 2 exceeds max precision 1 at
> scala.Predef$.require(Predef.scala:224) at
> org.apache.spark.sql.types.Decimal.set(Decimal.scala:113) at
> org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:426) at
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:273)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
> Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source) at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at
> org.apache.spark.scheduler.Task.run(Task.scala:99) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748) 2018-01-24 11:01:27 ERROR
> org.apache.spark.executor.Executor:91 - Exception in task 2.0 in stage 58.0
> (TID 6615) java.lang.IllegalArgumentException: requirement failed: Decimal
> precision 3 exceeds max precision 2 at
> scala.Predef$.require(Predef.scala:224) at
> org.apache.spark.sql.types.Decimal.set(Decimal.scala:113) at
> org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:426) at
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:273)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
> Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source) at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at
> org.apache.spark.scheduler.Task.run(Task.scala:99) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748) 2018-01-24 11:01:27 ERROR
> org.apache.spark.executor.Executor:91 - Exception in task 3.0 in stage 58.0
> (TID 6616) java.lang.IllegalArgumentException: requirement failed: Decimal
> precision 3 exceeds max precision 2 at
> scala.Predef$.require(Predef.scala:224) at
> org.apache.spark.sql.types.Decimal.set(Decimal.scala:113) at
> org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:426) at
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:273)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
> {code}
>
>
> The issue is that Spark is inferring this column as *DecimalType(1)* (I
> checked it with printSchema), so it fails whenever I try to filter
> or run any Spark action.
> I cannot understand why Spark fails to infer this column correctly (maybe
> Spark relies on data sampling when inferring schemas? It makes no sense if
> Spark actually reads the input file twice when inferSchema is true).
> Any help is welcome.
> Thanks in advance.
>
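
One way to sidestep inference entirely is to pass an explicit schema to the reader. The following is a sketch only, not the fix applied in the duplicate issue. It assumes the file has a single column named `MY_COLUMN`, as in the excerpt, and it picks `DecimalType(10, 0)` as a deliberately wide, hypothetical precision because the trace also reports "precision 3 exceeds max precision 2". The object name and the local `SparkSession` setup are illustrative as well.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

// Hypothetical driver; requires Spark SQL on the classpath.
object ExplicitSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explicit-schema-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    // Assumption: one column; precision 10 is chosen to be wider than
    // anything seen in the report, not derived from the real data.
    val schema = StructType(Seq(
      StructField("MY_COLUMN", DecimalType(10, 0), nullable = true)
    ))

    // Same options as in the report, but with schema(...) instead of
    // inferSchema, so no type is guessed from the data.
    val test = spark.read
      .option("delimiter", ";")
      .option("header", "true")
      .schema(schema)
      .csv("testfile")

    test.printSchema()
    println(test.count())

    spark.stop()
  }
}
```

With a fixed schema the declared precision no longer depends on which rows the reader happens to look at, at the cost of having to know the widest value in advance.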