[GitHub] [spark] TonyDoen opened a new pull request #35990: [SPARK-38639] Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly

GitBox Mon, 28 Mar 2022 04:02:54 -0700


TonyDoen opened a new pull request #35990:
URL: https://github.com/apache/spark/pull/35990



   
   ### What changes were proposed in this pull request?
   This PR adds a new config, "spark.sql.hive.ignoreCorruptRecord", so that 
queries can succeed on tables that contain dirty data (for example, mixed 
schemas in one table).
   
   
   ### Why are the changes needed?
   The existing flags "spark.sql.files.ignoreCorruptFiles" and 
"spark.sql.files.ignoreMissingFiles" quietly skip reads from files that are 
corrupted or missing, but a query can still fail on sequence files that 
contain corrupt records.
   
   Being able to ignore corrupt records is useful when users want to query 
tables that contain dirty data (mixed schemas in one table) without the query 
failing.
   
   We would like to add "spark.sql.hive.ignoreCorruptRecord" to fill this 
gap.
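
As a rough illustration of the intended semantics only (this is not the code in this PR, and it does not use Spark), the sketch below shows a reader that either fails on the first corrupt record or skips it, depending on a flag. The names `read_records`, `parse_two_ints`, `CorruptRecordError`, and `ignore_corrupt_record` are hypothetical and exist only for this example.

```python
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


class CorruptRecordError(Exception):
    """Hypothetical error raised when a record cannot be deserialized."""


def read_records(
    raw_records: Iterable[bytes],
    deserialize: Callable[[bytes], T],
    ignore_corrupt_record: bool = False,
) -> Iterator[T]:
    """Yield deserialized records, optionally skipping the ones that fail."""
    for raw in raw_records:
        try:
            record = deserialize(raw)
        except CorruptRecordError:
            if not ignore_corrupt_record:
                # Flag disabled: one bad record fails the whole read.
                raise
            # Flag enabled: drop the bad record and keep reading.
            continue
        yield record


def parse_two_ints(raw: bytes) -> Tuple[int, int]:
    """Toy deserializer: expects exactly two comma-separated integers."""
    parts = raw.decode("utf-8").split(",")
    if len(parts) != 2:
        raise CorruptRecordError(f"unexpected field count: {len(parts)}")
    try:
        return int(parts[0]), int(parts[1])
    except ValueError as exc:
        raise CorruptRecordError(str(exc)) from exc
```

With the flag enabled, records that do not match the expected two-integer layout are dropped and the remaining rows are still returned; with it disabled, the first such record aborts the read, which mirrors the failure described above.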
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it adds a new config: "spark.sql.hive.ignoreCorruptRecord".
   
   
   ### How was this patch tested?
   Manually tested locally and with the existing unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



