[
https://issues.apache.org/jira/browse/SPARK-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317768#comment-14317768
]
Yi Tian commented on SPARK-3365:
--------------------------------
The reason is that Spark generates the wrong schema for the {{List}} type in
{{ScalaReflection.scala}}.
For example, the generated schema for type {{Seq\[String\]}} is:
{code}
{"name":"x","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}}
{code}
The generated schema for type {{List\[String\]}} is:
{code}
{"name":"x","type":{"type":"struct","fields":[]},"nullable":true,"metadata":{}}
{code}
The related code is
[here|https://github.com/apache/spark/blob/500dc2b4b3136029457e708859fe27da93b1f9e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L110].
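To see why {{List}} falls into the wrong branch, here is a small standalone check (not Spark code; it uses the same {{scala-reflect}} subtype test that {{ScalaReflection.scala}} relies on). {{List}} mixes in {{Product}} ({{Nil}} is a case object and {{::}} is a case class), so it conforms to both the {{Product}} and {{Seq\[\_\]}} patterns, and whichever is checked first wins:

```scala
import scala.reflect.runtime.universe._

object ListConformance {
  def main(args: Array[String]): Unit = {
    // List[String] conforms to Product, so the Product case (checked
    // before Seq) captures it and produces an empty struct schema.
    assert(typeOf[List[String]] <:< typeOf[Product])
    // It also conforms to Seq[_], the pattern it should match.
    assert(typeOf[List[String]] <:< typeOf[Seq[_]])
    // The Seq trait itself does not extend Product, which is why
    // Seq[String] is unaffected by the ordering.
    assert(!(typeOf[Seq[String]] <:< typeOf[Product]))
    println("List[String] conforms to both Product and Seq[_]")
  }
}
```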
The order of resolution is:
# UserCustomType
# Option\[\_\]
# Product
# Array\[Byte\]
# Array\[\_\]
# Seq\[\_\]
# Map\[\_, _\]
# String
# Timestamp
# java.sql.Date
# BigDecimal
# java.math.BigDecimal
# Decimal
# java.lang.Integer
# ...
I think the {{List}} type should match the {{Seq\[\_\]}} pattern, so we should
move the {{Product}} case after {{Seq\[\_\]}}.
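A minimal, self-contained sketch of the ordering problem and of the proposed reordering (hypothetical function names; it matches on runtime values for brevity, whereas the real code matches on reflected types):

```scala
object MatchOrder {
  // Current order: Product is tried before Seq. Because List mixes in
  // Product, List("a") is captured here and never reaches the Seq case,
  // yielding a struct schema with no fields.
  def buggySchemaFor(value: Any): String = value match {
    case _: Product => "struct"
    case _: Seq[_]  => "array"
    case _          => "other"
  }

  // Proposed order: try Seq before Product, so List maps to an array.
  def fixedSchemaFor(value: Any): String = value match {
    case _: Seq[_]  => "array"
    case _: Product => "struct"
    case _          => "other"
  }

  def main(args: Array[String]): Unit = {
    println(buggySchemaFor(List("a"))) // struct (wrong)
    println(fixedSchemaFor(List("a"))) // array (correct)
  }
}
```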
May I open a PR for this issue?
> Failure to save Lists to Parquet
> --------------------------------
>
> Key: SPARK-3365
> URL: https://issues.apache.org/jira/browse/SPARK-3365
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Michael Armbrust
> Assignee: Cheng Lian
> Priority: Blocker
>
> Reproduction; the same code works if the type is {{Seq}} (props to
> [~chrisgrier] for finding this):
> {code}
> scala> case class Test(x: List[String])
> defined class Test
> scala> sparkContext.parallelize(Test(List()) :: Nil).saveAsParquetFile("bug")
> 23:09:51.807 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.ArithmeticException: / by zero
> at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:99)
> at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:92)
> at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)