[ https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830596#comment-16830596 ]

Ladislav Jech commented on SPARK-27593:
---------------------------------------

[~hyukjin.kwon] - Then optionally return just an array of the line numbers that 
are malformed. Such an array can be logged alongside the log of the processed 
file. Otherwise one must load a DF in permissive mode, load another DF in 
dropmalformed mode, and at a minimum compare row counts to get the number of 
malformed rows. That still only gives a count, whereas the parser has the line 
index available while handling malformed rows, so why not expose it as an 
optional variable?
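
A rough sketch of the count-only comparison described above (assuming an 
existing sqlContext; the options and the S3 path are placeholders, not the 
proposed API):

// Read the same file twice: once keeping every row, once dropping malformed ones.
long total = sqlContext.read().format("csv")
        .option("mode", "PERMISSIVE").load("s3://...").count();
long kept = sqlContext.read().format("csv")
        .option("mode", "DROPMALFORMED").load("s3://...").count();
long malformedCount = total - kept;  // only a count; the line numbers are lost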

It could be returned via a variable passed in, for example:

List<Integer> malformedRecords = new ArrayList<>();

sqlContext.read().format("csv")
    .option("mode", "DROPMALFORMED")
    .option("malformedRecords", malformedRecords)  // proposed new option
    .load("s3://...");

The malformedRecords object would then be populated with the row indexes of the malformed records...
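
For comparison, the closest existing workaround I am aware of is PERMISSIVE 
mode combined with columnNameOfCorruptRecord, which keeps the raw text of bad 
rows but still gives no line numbers. A minimal sketch (the schema, column 
names and path are assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

// The corrupt-record column must be declared in the explicit schema.
StructType schema = new StructType()
        .add("id", "int")
        .add("name", "string")
        .add("_corrupt_record", "string");

Dataset<Row> all = sqlContext.read()
        .format("csv")
        .schema(schema)
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .load("s3://...")
        .cache();  // cache before filtering on the corrupt-record column

Dataset<Row> valid = all.filter("_corrupt_record IS NULL").drop("_corrupt_record");
Dataset<Row> invalid = all.filter("_corrupt_record IS NOT NULL");

Even with this, the invalid frame only carries the raw row text, not the 
original line index, which is the gap this proposal is about.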

> CSV Parser returns 2 DataFrame - Valid and Malformed DFs
> --------------------------------------------------------
>
>                 Key: SPARK-27593
>                 URL: https://issues.apache.org/jira/browse/SPARK-27593
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.4.2
>            Reporter: Ladislav Jech
>            Priority: Major
>
> When we process CSV in any kind of data warehouse, it is common procedure to 
> report corrupted records for audit purposes and as feedback to the vendor, so 
> they can improve their process. CSV is no different from XSD in the sense that 
> it defines a schema, although in a very limited way (in some cases only as a 
> number of columns, without headers and without types). But when I check an XML 
> document against an XSD file, I get an exact report of whether the file is 
> completely valid and, if not, exactly which records do not follow the schema. 
> Such a feature would have big value in Spark for CSV: get the malformed 
> records into some dataframe, with the line number (a pointer within the data 
> object), so I can log both the pointer and the real data (line/row) and 
> trigger an action on this unfortunate event.
> The load() method could return an array of DFs (valid, invalid).
> PERMISSIVE mode isn't enough, because it fills missing fields with nulls, 
> which makes it even harder to detect what is really wrong. Another approach at 
> the moment is to read the file in both permissive and dropmalformed modes into 
> 2 dataframes and compare them against each other.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
