[ https://issues.apache.org/jira/browse/SPARK-34422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-34422.
----------------------------------
    Resolution: Not A Problem

Hm, no, I am not sure this is a problem in Spark. The semantics are different. The partial result row in Spark does not already include the column for the corrupted record, whereas in the spark-xml representation it does (hence the bug there). Closing this because when I 'fixed' it, the change caused test failures, which convinced me after debugging that it's not the same situation.

> CSV(/JSON?) files with corrupt row + Permissive mode can yield wrong partial result row
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-34422
>                 URL: https://issues.apache.org/jira/browse/SPARK-34422
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.7, 3.0.1, 3.1.1
>            Reporter: Sean R. Owen
>            Assignee: Sean R. Owen
>            Priority: Major
>
> (This was actually found and fixed in spark-xml, which copied some Spark code
> for handling bad records. See
> https://github.com/databricks/spark-xml/issues/517 )
> When CSV parsing (or, I think, JSON?) encounters a bad record in Permissive
> mode, it can return a partial result of the values that were successfully
> parsed, along with the problem input in a new 'corrupt record' column.
> However, the logic in FailureSafeParser that copies the partial results to the
> resulting Row has an off-by-one error that arises when the catalyst
> projection puts the 'corrupt record' column anywhere but the last column,
> which can readily happen. This could mean that the resulting partial results
> are wrong, or that processing the bad record in permissive mode fails
> entirely, if the resulting elements don't happen to match the schema of the
> result.
> The partial results are usually not that useful, so being wrong isn't a huge
> deal, but failing entirely in permissive mode is a problem.
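For readers following the distinction drawn in the resolution comment, here is a minimal Python sketch of the indexing issue. It is not Spark's or spark-xml's actual code: the function `to_result_row`, the sample schema, and the sample values are all hypothetical, and the column name `_corrupt_record` is used only for illustration. The copy loop assumes the partial row has no slot for the corrupt-record column. That assumption holds for the Spark-style input and breaks for a partial row that already contains the slot, unless the corrupt column happens to be last.

```python
from typing import Any, List, Optional, Sequence


def to_result_row(
    schema: Sequence[str],
    corrupt_col: str,
    partial: Optional[Sequence[Any]],
    bad_record: str,
) -> List[Any]:
    """Hypothetical copy step: place parsed values and the raw bad record
    into a row shaped like `schema`.

    Assumption built into this loop: `partial` holds one value per data
    column and has NO slot for `corrupt_col`.
    """
    data_cols = [name for name in schema if name != corrupt_col]
    result: List[Any] = [None] * len(schema)
    for i, name in enumerate(data_cols):
        # The i-th data column is read from the i-th partial value.
        result[schema.index(name)] = partial[i] if partial is not None else None
    result[schema.index(corrupt_col)] = bad_record
    return result


# Corrupt-record column in the middle of the schema.
schema = ["a", "_corrupt_record", "b"]

# Partial row without a corrupt-record slot: the assumption holds.
ok = to_result_row(schema, "_corrupt_record", [1, 2], "1,2,oops")

# Partial row that already includes the slot: values after it shift by one,
# so column "b" receives the empty slot instead of 2.
shifted = to_result_row(schema, "_corrupt_record", [1, None, 2], "1,2,oops")

# With the corrupt-record column last, the extra slot is never read.
last = to_result_row(
    ["a", "b", "_corrupt_record"], "_corrupt_record", [1, 2, None], "1,2,oops"
)

print(ok)
print(shifted)
print(last)
```

In this sketch the shifted case silently yields a wrong value. With typed columns, a shifted value could instead fail to match the target field's type, which would correspond to the "fails entirely" outcome described in the report.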