WeichenXu123 opened a new pull request #25184: [SPARK-28431] Fix CSV datasource throwing com.univocity.parsers.common.TextParsingException with an oversized message
URL: https://github.com/apache/spark/pull/25184

## What changes were proposed in this pull request?

Fix the CSV datasource throwing `com.univocity.parsers.common.TextParsingException` with an oversized message, which makes the log output consume a large amount of disk space. This is troublesome when we need to parse CSV files containing very large columns.

The fix wraps the `CsvParser` methods, catches `TextParsingException`, limits the size of the exception message, and re-throws the exception.

## How was this patch tested?

Manually:

```scala
val s = "a" * 40 * 1000000
Seq(s).toDF.write.mode("overwrite").csv("/tmp/bogdan/es4196.csv")
spark.read
  .option("maxCharsPerColumn", 30000000)
  .csv("/tmp/bogdan/es4196.csv").count
```

Before: the thrown message includes about 30 MB of error content (the column exceeds the 30 MB `maxCharsPerColumn` limit, so the message embeds the whole parsed content). After: the thrown message includes error content like "...aaa...aa", with the content truncated to 100 characters.

Please review https://spark.apache.org/contributing.html before opening a pull request.
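The wrap-catch-truncate-rethrow approach described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual patch: the helper name `withTruncatedMessage`, the `Callable`-based signature, and the use of a plain `RuntimeException` (standing in for univocity's `TextParsingException`) are all assumptions for the example; only the 100-character limit comes from the PR description.

```java
import java.util.concurrent.Callable;

public class TruncatedParseError {
    // Limit from the PR description: cap the echoed error content at 100 chars.
    static final int MAX_ERROR_CONTENT = 100;

    // Hypothetical wrapper: run a parsing action; if it throws, re-throw with
    // the (possibly huge) message truncated to a bounded size, keeping the
    // original exception as the cause.
    static <T> T withTruncatedMessage(Callable<T> action) throws Exception {
        try {
            return action.call();
        } catch (RuntimeException e) {  // stand-in for TextParsingException
            String msg = e.getMessage() == null ? "" : e.getMessage();
            String truncated = msg.length() > MAX_ERROR_CONTENT
                ? msg.substring(0, MAX_ERROR_CONTENT) + "... (truncated)"
                : msg;
            throw new RuntimeException(truncated, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a parser failure whose message embeds ~1 MB of content.
        String huge = "a".repeat(1_000_000);
        try {
            withTruncatedMessage(() -> { throw new RuntimeException(huge); });
        } catch (RuntimeException e) {
            // Message is now bounded (100 chars + short suffix), not ~1 MB.
            System.out.println(e.getMessage().length());
        }
    }
}
```

Chaining the original exception as the cause preserves the full stack trace for debugging while keeping the logged message, which is what ends up consuming disk space, at a bounded size.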
