[
https://issues.apache.org/jira/browse/SPARK-27593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830747#comment-16830747
]
Hyukjin Kwon commented on SPARK-27593:
--------------------------------------
The malformed-record column is just optional, additional information that shows
which record is malformed. You can do that by checking whether the malformed
column is null.
> CSV Parser returns 2 DataFrame - Valid and Malformed DFs
> --------------------------------------------------------
>
> Key: SPARK-27593
> URL: https://issues.apache.org/jira/browse/SPARK-27593
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.4.2
> Reporter: Ladislav Jech
> Priority: Major
>
> When we process CSV in any kind of data warehouse, it is common procedure to
> report corrupted records for audit purposes and to feed them back to the
> vendor, so they can improve their process. CSV is no different from XSD in
> that it defines a schema, although in a very limited way (in some cases only
> the number of columns, without headers or types). When I check an XML
> document against an XSD file, I get an exact report of whether the file is
> completely valid and, if not, which records do not follow the schema.
> Such a feature would have big value in Spark for CSV: get malformed records
> into a dataframe, with the line number (a pointer within the data object), so
> I can log both the pointer and the real data (line/row) and trigger an action
> on this unfortunate event.
> The load() method could return an Array of DFs (Valid, Invalid).
> PERMISSIVE mode isn't enough, because it fills missing fields with nulls,
> which makes it even harder to detect what is really wrong. Another approach
> at the moment is to read the file in both PERMISSIVE and DROPMALFORMED modes
> into 2 dataframes and compare them against each other.
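
The report the description asks for (valid rows on one side, invalid rows with a
line pointer on the other) can be sketched outside Spark with the Python
standard library. This is not a Spark API; the function name `split_csv` and the
rule "a row is valid when it has the expected number of columns" are assumptions
made for the illustration.

```python
import csv
import io


def split_csv(text, expected_columns):
    """Split CSV text into (valid, invalid) by column count.

    Each invalid entry is (line_number, row). The line number comes from
    csv.reader.line_num, which counts source lines read so far; for a record
    spanning several lines it points at the record's last line.
    """
    valid, invalid = [], []
    reader = csv.reader(io.StringIO(text))
    for row in reader:
        if len(row) == expected_columns:
            valid.append(row)
        else:
            invalid.append((reader.line_num, row))
    return valid, invalid


# Hypothetical input: lines 2 and 3 do not have exactly two columns.
valid, invalid = split_csv("1,alice\n2\n3,carol,extra\n4,dave\n", 2)
```

The invalid list keeps both the pointer and the raw row, which is the pair the
description wants to log.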
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)