Sreenath Chothar created SPARK-22455:
----------------------------------------
Summary: Provide an option to store the exception records/files
and reasons in log files when reading data from a file-based data source.
Key: SPARK-22455
URL: https://issues.apache.org/jira/browse/SPARK-22455
Project: Spark
Issue Type: Improvement
Components: Input/Output
Affects Versions: 2.2.0
Reporter: Sreenath Chothar
Provide an option to store the exception/bad records and the failure reasons in
log files when reading data from a file-based data source into a PySpark
DataFrame. Currently, only the following three parse modes are available (a
usage sketch follows the list):
1. PERMISSIVE: sets other fields to null when it meets a corrupted record and
puts the malformed string into a field configured by columnNameOfCorruptRecord.
2. DROPMALFORMED: ignores whole corrupted records.
3. FAILFAST: throws an exception when it meets corrupted records.
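For reference, here is a minimal PySpark sketch of how each mode is set; the
input path and schema are assumptions for illustration only:
{code:python}
# Minimal sketch of the three existing modes (Spark 2.2, PySpark).
# The input path and schema below are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("corrupt-record-modes").getOrCreate()

# PERMISSIVE needs an explicit schema that includes the corrupt-record column;
# this is the limitation discussed below.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),  # receives the raw malformed line
])

permissive_df = (spark.read
                 .option("mode", "PERMISSIVE")
                 .option("columnNameOfCorruptRecord", "_corrupt_record")
                 .schema(schema)
                 .csv("/data/input.csv"))

dropped_df = spark.read.option("mode", "DROPMALFORMED").csv("/data/input.csv")
failing_df = spark.read.option("mode", "FAILFAST").csv("/data/input.csv")
{code}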
We could use the first option to accumulate the corrupted records and write
them out to a log file, but this option is unusable when the input schema is
inferred automatically. And when the number of columns to read is large,
providing the complete schema plus an additional column for storing corrupted
data is impractical. A sketch of this workaround follows.
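A minimal sketch of that workaround, continuing the PERMISSIVE read above;
the output path is an assumption:
{code:python}
# Continues the PERMISSIVE read sketched above; "/logs/bad_records" is an
# assumed output path. Rows whose _corrupt_record column is non-null are the
# malformed inputs; they are appended to a plain-text log directory.
from pyspark.sql.functions import col

bad_rows = permissive_df.filter(col("_corrupt_record").isNotNull())
bad_rows.select("_corrupt_record").write.mode("append").text("/logs/bad_records")
{code}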
Instead "pyspark.sql.DataFrameReader.csv" reader functions could provide an
option to redirect the bad records to configured log file path with exception
details.
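One possible shape for the requested option, purely illustrative; the option
name used here does not exist in Spark:
{code:python}
# Hypothetical API sketch only: "badRecordsLogPath" is NOT an existing Spark
# option. It shows one possible shape for the improvement requested here,
# where the reader itself logs bad records and exception details.
df = (spark.read
      .option("badRecordsLogPath", "/logs/bad_records")  # proposed, hypothetical
      .csv("/data/input.csv"))
{code}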