WeichenXu123 opened a new pull request #25184: [SPARK-28431]Fix CSV datasource 
throw com.univocity.parsers.common.TextParsingException with large size message
URL: https://github.com/apache/spark/pull/25184
 
 
   ## What changes were proposed in this pull request?
   
   Fix the CSV datasource throwing com.univocity.parsers.common.TextParsingException 
with a very large message, which makes the log output consume large amounts of disk space.
   This issue is troublesome when we need to parse a CSV with a very large 
column.
   
   I add a wrapper around the CSVParser methods that catches TextParsingException, 
truncates the exception message to a bounded size, and re-throws the exception.
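   The wrapper described above can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `SafeParse`, `maxLen`, `truncate`, and `withTruncatedError` are hypothetical names, and a plain `RuntimeException` stands in for univocity's `TextParsingException` so the sketch stays self-contained.
   ```scala
   // Illustrative sketch only: catch an exception whose message may embed the
   // entire parsed column, and re-throw it with a truncated message.
   object SafeParse {
     val maxLen = 100 // illustrative cap on the error-message length
   
     // Keep at most `maxLen` characters of a potentially huge message.
     def truncate(msg: String): String =
       if (msg == null || msg.length <= maxLen) msg
       else msg.substring(0, maxLen) + "..."
   
     // Run a parsing action; on failure, re-throw with a bounded message.
     def withTruncatedError[T](action: => T): T =
       try action
       catch {
         case e: RuntimeException =>
           throw new RuntimeException(truncate(e.getMessage), e.getCause)
       }
   }
   ```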
   
   ## How was this patch tested?
   
   Manually.
   ```
   val s = "a" * 40 * 1000000
   Seq(s).toDF.write.mode("overwrite").csv("/tmp/bogdan/es4196.csv")
   
   spark.read.option("maxCharsPerColumn", 30000000)
     .csv("/tmp/bogdan/es4196.csv").count
   ```
   Before:
   The thrown message includes error content of about 30MB (the column 
size exceeds the 30MB maxCharsPerColumn limit, so the whole parsed 
content is embedded in the error message).
   
   After:
   The thrown message includes truncated error content like "...aaa...aa", 
i.e. the content is limited to 100 characters.
   
   
