Sreenath Chothar created SPARK-22455:
----------------------------------------
Summary: Provide an option to store the exception records/files
and reasons in log files when reading data from a file-based data source.
Key: SPARK-22455
URL: https://issues.apache.org/jira/browse/SPARK-22455
Project: Spark
Issue Type: Improvement
Components: Input/Output
Affects Versions: 2.2.0
Reporter: Sreenath Chothar
Provide an option to store the exception/bad records and the failure reasons in
log files when reading data from a file-based data source into a PySpark
DataFrame. Currently, only the following three parse modes are available (a
usage sketch follows the list):
1. PERMISSIVE: sets other fields to null when it meets a corrupted record and
puts the malformed string into a field configured by columnNameOfCorruptRecord.
2. DROPMALFORMED: ignores whole corrupted records.
3. FAILFAST: throws an exception when it meets corrupted records.
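For reference, here is a minimal PySpark sketch of how each mode is set; the
input path and schema are assumptions for illustration only:
{code:python}
# Minimal sketch of the three existing modes (Spark 2.2, PySpark).
# The input path and schema below are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("corrupt-record-modes").getOrCreate()

# PERMISSIVE needs an explicit schema that includes the corrupt-record column;
# this is the limitation discussed below.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),  # receives the raw malformed line
])

permissive_df = (spark.read
                 .option("mode", "PERMISSIVE")
                 .option("columnNameOfCorruptRecord", "_corrupt_record")
                 .schema(schema)
                 .csv("/data/input.csv"))

dropped_df = spark.read.option("mode", "DROPMALFORMED").csv("/data/input.csv")
failing_df = spark.read.option("mode", "FAILFAST").csv("/data/input.csv")
{code}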
We could use the first option to accumulate the corrupted records and write
them out to a log file, but this option is unusable when the input schema is
inferred automatically. And when the number of columns to read is large,
providing the complete schema plus an additional column for storing corrupted
data is impractical. A sketch of this workaround follows.
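A minimal sketch of that workaround, continuing the PERMISSIVE read above;
the output path is an assumption:
{code:python}
# Continues the PERMISSIVE read sketched above; "/logs/bad_records" is an
# assumed output path. Rows whose _corrupt_record column is non-null are the
# malformed inputs; they are appended to a plain-text log directory.
from pyspark.sql.functions import col

bad_rows = permissive_df.filter(col("_corrupt_record").isNotNull())
bad_rows.select("_corrupt_record").write.mode("append").text("/logs/bad_records")
{code}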
Instead "pyspark.sql.DataFrameReader.csv" reader functions could provide an
option to redirect the bad records to configured log file path with exception
details.
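One possible shape for the requested option, purely illustrative; the option
name used here does not exist in Spark:
{code:python}
# Hypothetical API sketch only: "badRecordsLogPath" is NOT an existing Spark
# option. It shows one possible shape for the improvement requested here,
# where the reader itself logs bad records and exception details.
df = (spark.read
      .option("badRecordsLogPath", "/logs/bad_records")  # proposed, hypothetical
      .csv("/data/input.csv"))
{code}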