CodingCat opened a new pull request #30972:
URL: https://github.com/apache/spark/pull/30972
### What changes were proposed in this pull request?
The CSV writer has an implicit limit on column name length that comes from univocity-parsers. When we initialize a writer
(https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211),
it calls `toIdentifierGroupArray`, which eventually calls `valueOf` in NormalizedString.java
(https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209).
Inside that `stringCache.get` call there is a `maxStringLength` cap
(https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104),
which defaults to 1024. We do not expose this cap as a configurable option, so a column name longer than 1024 characters leads to a NullPointerException:
```
[info] Cause: java.lang.NullPointerException:
[info] at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
[info] at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
[info] at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
[info] at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
[info] at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
[info] at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
[info] at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
[info] at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
[info] at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
[info] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
[info] at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
```
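The null comes from the cache itself: inputs longer than the cap are never cached, and `get` returns null, which `AbstractWriter.submitRow` later dereferences. A minimal illustrative sketch of that behavior (assumed semantics for illustration, not the univocity source):
```
import scala.collection.mutable

// Sketch of a length-capped string cache: anything longer than
// maxStringLength falls through as null instead of being cached.
class CappedStringCache(maxStringLength: Int = 1024) {
  private val cache = mutable.HashMap.empty[String, String]

  def get(input: String): String =
    if (input == null || input.length > maxStringLength) null
    else cache.getOrElseUpdate(input, input)
}

// A 1025-character column name exceeds the default cap:
// new CappedStringCache().get("c" * 1025) == null
```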
This can be reproduced with a simple unit test:
```
// a single-row DataFrame whose only column name is 1025 characters long,
// one past the default 1024-character cap
val superLongHeader = (0 until 1025).map(_ => "c").mkString
val df = Seq("a").toDF(superLongHeader)
df.repartition(1)
  .write
  .option("header", "true")
  .option("maxColumnNameLength", 1025) // the option introduced by this PR
  .csv(dataPath) // dataPath: any temporary output directory
```
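Since the PR surfaces the cap as a regular CSV write option, the option parsing itself is small. A minimal sketch (hypothetical helper name and placement; the actual wiring in CSVOptions may differ):
```
// Hypothetical helper: read maxColumnNameLength from the write options,
// keeping univocity's current 1024 default when it is not set.
def maxColumnNameLength(parameters: Map[String, String]): Int =
  parameters.get("maxColumnNameLength").map(_.toInt).getOrElse(1024)
```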
### Why are the changes needed?
Without this change, writing a DataFrame with a column name longer than 1024 characters unexpectedly fails with a NullPointerException.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Added a unit test.