[jira] [Commented] (SPARK-27593) CSV Parser returns 2 DataFrame - Valid and Malformed DFs

Hyukjin Kwon (JIRA) Tue, 30 Apr 2019 16:21:22 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830747#comment-16830747
 ]


Hyukjin Kwon commented on SPARK-27593:
--------------------------------------

The malformed-record column is just optional additional information that shows
which records are malformed. You can identify them by checking whether that
column is null or not.
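
A minimal PySpark sketch of that null check, under a few assumptions: `spark` is
an existing SparkSession, the schema is passed as a DDL string, and the function
name `read_csv_with_malformed` is hypothetical. `mode`,
`columnNameOfCorruptRecord`, and the default column name `_corrupt_record` are
existing CSV reader options. The `cache()` call is there because Spark 2.3 and
later reject queries on raw CSV files that reference only the corrupt-record
column.

```python
def read_csv_with_malformed(spark, path, schema_ddl,
                            corrupt_col="_corrupt_record"):
    """Return (valid_df, malformed_df) for one CSV path.

    `spark` is an existing SparkSession and `schema_ddl` is a DDL string
    such as "id INT, name STRING". All names are illustrative.
    """
    # Spark only fills the corrupt-record column if the user-supplied
    # schema contains it as a string field.
    full_schema = f"{schema_ddl}, {corrupt_col} STRING"

    df = (
        spark.read
        .schema(full_schema)
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", corrupt_col)
        .csv(path)
        # Materialize the parsed rows before filtering on the
        # corrupt-record column alone.
        .cache()
    )

    # Null means the line parsed cleanly; non-null holds the raw line.
    valid_df = df.filter(df[corrupt_col].isNull()).drop(corrupt_col)
    malformed_df = df.filter(df[corrupt_col].isNotNull())
    return valid_df, malformed_df
```

For example, `valid, bad = read_csv_with_malformed(spark, "people.csv", "id INT,
name STRING")` would be expected to put a line such as `abc,Bob` into `bad`, with
the raw text in `_corrupt_record`. Exactly which rows are flagged as malformed
(for instance, rows with too few or too many fields) may vary between Spark
versions.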

> CSV Parser returns 2 DataFrame - Valid and Malformed DFs
> --------------------------------------------------------
>
>                 Key: SPARK-27593
>                 URL: https://issues.apache.org/jira/browse/SPARK-27593
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.4.2
>            Reporter: Ladislav Jech
>            Priority: Major
>
> When we process CSV in any kind of data warehouse, it is common procedure to 
> report corrupted records for audit purposes and as feedback to the vendor, so 
> they can improve their process. CSV is no different from XSD in that it 
> defines a schema, although in a very limited way (in some cases only as a 
> number of columns, without even headers, and without types). But when I check 
> an XML document against an XSD file, I get an exact report of whether the file 
> is completely valid and, if not, an exact report of which records do not 
> follow the schema. 
> Such a feature would have big value in Spark for CSV: get malformed records 
> into a dataframe, with a line count (a pointer within the data object), so I 
> can log both the pointer and the real data (line/row) and trigger an action on 
> this unfortunate event.
> The load() method could return an Array of DFs (Valid, Invalid).
> PERMISSIVE mode isn't enough, because it fills missing fields with nulls, so 
> it is even harder to detect what is really wrong. Another approach at the 
> moment is to read the file in both PERMISSIVE and DROPMALFORMED modes into 2 
> dataframes and compare them against each other.
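
The two-read workaround described in the report could look roughly like the
sketch below. It assumes an existing SparkSession `spark` and a DDL schema
string; the function name `malformed_by_difference` is hypothetical, and
`exceptAll` (available from Spark 2.4) is only one possible way to do the
comparison, since the report does not say how the two dataframes are compared.

```python
def malformed_by_difference(spark, path, schema_ddl):
    """Rows that PERMISSIVE mode keeps but DROPMALFORMED mode discards.

    `spark` is an existing SparkSession; `schema_ddl` is a DDL string
    such as "id INT, name STRING". Names are illustrative.
    """
    def read(mode):
        # Same path and schema for both reads; only the parse mode differs.
        return spark.read.schema(schema_ddl).option("mode", mode).csv(path)

    permissive_df = read("PERMISSIVE")
    clean_df = read("DROPMALFORMED")

    # exceptAll keeps duplicates, so a malformed row that happens to equal
    # a valid row after null-filling is still reported.
    return permissive_df.exceptAll(clean_df)
```

The result should hold the null-filled rows rather than the original text of
each line, which is the limitation the report points out, and the file is read
twice.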



