CodingCat opened a new pull request #30972:
URL: https://github.com/apache/spark/pull/30972
### What changes were proposed in this pull request?
The CSV writer has an implicit limit on column name length that comes from univocity-parsers. When we initialize a writer
(https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211),
it calls `toIdentifierGroupArray`, which eventually calls `valueOf` in NormalizedString.java
(https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209).
Inside that `stringCache.get` call there is a `maxStringLength` cap
(https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104),
which defaults to 1024. We do not expose this cap as a configurable option, so a column name longer than 1024 characters leads to a NullPointerException:
```
[info] Cause: java.lang.NullPointerException:
[info] at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
[info] at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
[info] at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
[info] at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
[info] at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
[info] at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
[info] at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
[info] at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
[info] at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
[info] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
[info] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
```
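The null comes from the cache itself: inputs longer than the cap are never cached, and `get` returns null, which `AbstractWriter.submitRow` later dereferences. A minimal illustrative sketch of that behavior (assumed semantics for illustration, not the univocity source):
```
import scala.collection.mutable

// Sketch of a length-capped string cache: anything longer than
// maxStringLength falls through as null instead of being cached.
class CappedStringCache(maxStringLength: Int = 1024) {
  private val cache = mutable.HashMap.empty[String, String]

  def get(input: String): String =
    if (input == null || input.length > maxStringLength) null
    else cache.getOrElseUpdate(input, input)
}

// A 1025-character column name exceeds the default cap:
// new CappedStringCache().get("c" * 1025) == null
```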
This can be reproduced with a simple unit test:
```
// a single-row DataFrame whose only column name is 1025 characters long,
// one past the default 1024-character cap
val superLongHeader = (0 until 1025).map(_ => "c").mkString
val df = Seq("a").toDF(superLongHeader)
df.repartition(1)
  .write
  .option("header", "true")
  .option("maxColumnNameLength", 1025) // the option introduced by this PR
  .csv(dataPath) // dataPath: any temporary output directory
```
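Since the PR surfaces the cap as a regular CSV write option, the option parsing itself is small. A minimal sketch (hypothetical helper name and placement; the actual wiring in CSVOptions may differ):
```
// Hypothetical helper: read maxColumnNameLength from the write options,
// keeping univocity's current 1024 default when it is not set.
def maxColumnNameLength(parameters: Map[String, String]): Int =
  parameters.get("maxColumnNameLength").map(_.toInt).getOrElse(1024)
```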
### Why are the changes needed?
Without this change, writing a DataFrame with a column name longer than 1024 characters unexpectedly fails with a NullPointerException.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Added a unit test.