[
https://issues.apache.org/jira/browse/SPARK-22455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-22455:
---------------------------------
Labels: bulk-closed (was: )
> Provide an option to store the exception records/files and reasons in log
> files when reading data from a file-based data source.
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-22455
> URL: https://issues.apache.org/jira/browse/SPARK-22455
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 2.2.0
> Reporter: Sreenath Chothar
> Priority: Minor
> Labels: bulk-closed
>
> Provide an option to store the exception/bad records and the reasons in log files
> when reading data from a file-based data source into a PySpark dataframe.
> Currently, only the following three options are available:
> 1. PERMISSIVE : sets other fields to null when it meets a corrupted record,
> and puts the malformed string into a field configured by
> columnNameOfCorruptRecord.
> 2. DROPMALFORMED : drops the whole corrupted record.
> 3. FAILFAST : throws an exception when it meets corrupted records.
> We could use the first option to accumulate the corrupted records and output
> them to a log file, but this option cannot be used when the input schema is
> inferred automatically. If the number of columns to read is too large,
> providing the complete schema with an additional column for storing corrupted
> data is difficult. Instead, the "pyspark.sql.DataFrameReader.csv" reader
> functions could provide an option to redirect the bad records, along with the
> exception details, to a configured log file path.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]