[
https://issues.apache.org/jira/browse/SPARK-26859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-26859:
----------------------------------
Fix Version/s: 2.3.4
> Fix field writer index bug in non-vectorized ORC deserializer
> -------------------------------------------------------------
>
> Key: SPARK-26859
> URL: https://issues.apache.org/jira/browse/SPARK-26859
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Ivan Vergiliev
> Assignee: Ivan Vergiliev
> Priority: Major
> Labels: correctness
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> There is a bug in the ORC deserialization code that, when triggered, results
> in completely wrong data being read. I've marked this as a Blocker per the
> guidelines at https://spark.apache.org/contributing.html, since it's a data
> correctness issue.
> The bug is triggered when all of the following conditions are met:
> - the non-vectorized ORC reader is being used;
> - a schema is explicitly specified when reading the ORC file;
> - the provided schema has columns that are not present in the ORC file, and
> these columns are in the middle of the schema;
> - the ORC file being read contains null values in the columns that come after
> the ones added by the schema.
> When all of these are met:
> - the internal state of the ORC deserializer gets corrupted, and, as a result,
> - the null values from the ORC file end up being set on the wrong columns,
> not the ones they belong to, and
> - stale values from the previous record are not cleared from the columns that
> should be null.
> Here's a concrete example. Let's consider the following DataFrame:
> {code:scala}
> val rdd = sparkContext.parallelize(Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null)))
> val df = rdd.toDF("col1", "col2", "col3")
> {code}
> and the following schema:
> {code}
> col1 int, col4 int, col2 int, col3 string
> {code}
> Notice the `col4 int` added in the middle, which doesn't exist in the DataFrame.
> Saving this DataFrame to ORC and then reading it back with the specified
> schema should result in the same values being read, with nulls for `col4`.
> Instead, we get the following back:
> {code:java}
> [1,null,2,abc]
> [4,null,5,def]
> [8,null,null,def]
> {code}
> Notice how the `def` from the second record doesn't get cleared and leaks into
> the third record as well; also, instead of the expected `col2 = 9` in the last
> record, we get the null that should have landed in `col3`.
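> For completeness, here is a minimal end-to-end sketch of the save-and-read-back
> steps described above. It assumes a spark-shell-style environment with a
> SparkSession named `spark` in scope; the output path and the two config
> settings (forcing the native, non-vectorized ORC reader) are my assumptions,
> not part of the report:
> {code:scala}
> // Reproduction sketch; assumes a SparkSession named `spark` (e.g. spark-shell).
> import spark.implicits._
>
> val df = Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null: String))
>   .toDF("col1", "col2", "col3")
> df.write.mode("overwrite").orc("/tmp/spark-26859-repro")
>
> // Force the non-vectorized code path of the native ORC reader.
> spark.conf.set("spark.sql.orc.impl", "native")
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
>
> spark.read
>   .schema("col1 int, col4 int, col2 int, col3 string") // col4 absent from the file
>   .orc("/tmp/spark-26859-repro")
>   .collect()
>   .foreach(println)
> {code}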
> *Impact*
> When this issue is triggered, completely wrong data is read from the ORC
> file. The set of conditions under which it gets triggered is fairly narrow,
> so the set of affected users is probably limited. Some users may also be
> affected without realizing it, because the triggering conditions are so
> obscure.
> *Bug details*
> The issue is caused by calling `setNullAt` with the wrong index in
> `OrcDeserializer.scala:deserialize()`. I have a fix that I'll send out for
> review shortly.
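> To illustrate the kind of index mismatch involved, here is a simplified,
> self-contained sketch (not Spark's actual `OrcDeserializer` code; all names
> in it are hypothetical). When the requested schema contains columns missing
> from the file, the ordinal of a field in the ORC file no longer matches the
> index of the output column it should be written to, so writing with the
> file-side index puts values (and nulls) into the wrong slots:
> {code:scala}
> // Hypothetical sketch of a writer-index mismatch; not the actual Spark code.
> object WriterIndexSketch extends App {
>   val requestedSchema = Seq("col1", "col4", "col2", "col3") // col4 not in file
>   val fileSchema      = Seq("col1", "col2", "col3")
>   val fileRow: Seq[Any] = Seq(8, 9, null)                   // col3 is null
>
>   // Buggy variant: writes each value at its file-side index, which points at
>   // the wrong output column once col4 shifts everything after it.
>   val buggy = Array.fill[Any](requestedSchema.length)(null)
>   for ((_, fileIdx) <- fileSchema.zipWithIndex)
>     buggy(fileIdx) = fileRow(fileIdx)
>
>   // Fixed variant: writes each value at the column's index in the
>   // *requested* schema.
>   val fixed = Array.fill[Any](requestedSchema.length)(null)
>   for ((name, fileIdx) <- fileSchema.zipWithIndex)
>     fixed(requestedSchema.indexOf(name)) = fileRow(fileIdx)
>
>   println(buggy.mkString("[", ",", "]")) // [8,9,null,null] -- 9 lands in col4's slot
>   println(fixed.mkString("[", ",", "]")) // [8,null,9,null]
> }
> {code}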
> *Workaround*
> This bug is only triggered when new columns are added in the middle of the
> schema, so it can be worked around by only adding new columns at the end, as
> sketched below.
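> For the example above, that would mean reading with a schema like this
> (a sketch of the workaround, not from the original report):
> {code}
> col1 int, col2 int, col3 string, col4 int
> {code}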