[
https://issues.apache.org/jira/browse/SPARK-26859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-26859:
----------------------------------
Fix Version/s: 2.3.4
> Fix field writer index bug in non-vectorized ORC deserializer
> -------------------------------------------------------------
>
> Key: SPARK-26859
> URL: https://issues.apache.org/jira/browse/SPARK-26859
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Ivan Vergiliev
> Assignee: Ivan Vergiliev
> Priority: Major
> Labels: correctness
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> There is a bug in the ORC deserialization code that, when triggered, results
> in completely wrong data being read. I've marked this as a Blocker per the
> guidelines at https://spark.apache.org/contributing.html, since it's a data
> correctness issue.
> The bug is triggered when all of the following conditions are met:
> - the non-vectorized ORC reader is being used;
> - a schema is explicitly specified when reading the ORC file;
> - the provided schema has columns that are not present in the ORC file, and
> these columns are in the middle of the schema;
> - the ORC file being read contains null values in the columns that come after
> the ones added by the schema.
> When all of these are met:
> - the internal state of the ORC deserializer gets corrupted, and, as a result,
> - the null values from the ORC file end up being set on the wrong columns,
> not the ones they belong to, and
> - stale values from the previous record are not cleared from the columns that
> should be null.
> Here's a concrete example. Let's consider the following DataFrame:
> {code:scala}
> val rdd = sparkContext.parallelize(Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null)))
> val df = rdd.toDF("col1", "col2", "col3")
> {code}
> and the following schema:
> {code}
> col1 int, col4 int, col2 int, col3 string
> {code}
> Notice the `col4 int` added in the middle, which doesn't exist in the DataFrame.
> Saving this DataFrame to ORC and then reading it back with the specified
> schema should result in the same values being read, with nulls for `col4`.
> Instead, we get the following back:
> {code:java}
> [1,null,2,abc]
> [4,null,5,def]
> [8,null,null,def]
> {code}
> Notice how the `def` from the second record doesn't get cleared and leaks into
> the third record as well; also, instead of the expected `col2 = 9` in the last
> record, we get the null that should have landed in `col3`.
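> For completeness, here is a minimal end-to-end sketch of the save-and-read-back
> steps described above. It assumes a spark-shell-style environment with a
> SparkSession named `spark` in scope; the output path and the two config
> settings (forcing the native, non-vectorized ORC reader) are my assumptions,
> not part of the report:
> {code:scala}
> // Reproduction sketch; assumes a SparkSession named `spark` (e.g. spark-shell).
> import spark.implicits._
>
> val df = Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null: String))
>   .toDF("col1", "col2", "col3")
> df.write.mode("overwrite").orc("/tmp/spark-26859-repro")
>
> // Force the non-vectorized code path of the native ORC reader.
> spark.conf.set("spark.sql.orc.impl", "native")
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
>
> spark.read
>   .schema("col1 int, col4 int, col2 int, col3 string") // col4 absent from the file
>   .orc("/tmp/spark-26859-repro")
>   .collect()
>   .foreach(println)
> {code}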
> *Impact*
> When this issue is triggered, completely wrong data is read from the ORC
> file. The set of conditions under which it gets triggered is fairly narrow,
> so the set of affected users is probably limited. Some users may also be
> affected without realizing it, because the triggering conditions are so
> obscure.
> *Bug details*
> The issue is caused by calling `setNullAt` with the wrong index in
> `OrcDeserializer.scala:deserialize()`. I have a fix that I'll send out for
> review shortly.
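> To illustrate the kind of index mismatch involved, here is a simplified,
> self-contained sketch (not Spark's actual `OrcDeserializer` code; all names
> in it are hypothetical). When the requested schema contains columns missing
> from the file, the ordinal of a field in the ORC file no longer matches the
> index of the output column it should be written to, so writing with the
> file-side index puts values (and nulls) into the wrong slots:
> {code:scala}
> // Hypothetical sketch of a writer-index mismatch; not the actual Spark code.
> object WriterIndexSketch extends App {
>   val requestedSchema = Seq("col1", "col4", "col2", "col3") // col4 not in file
>   val fileSchema      = Seq("col1", "col2", "col3")
>   val fileRow: Seq[Any] = Seq(8, 9, null)                   // col3 is null
>
>   // Buggy variant: writes each value at its file-side index, which points at
>   // the wrong output column once col4 shifts everything after it.
>   val buggy = Array.fill[Any](requestedSchema.length)(null)
>   for ((_, fileIdx) <- fileSchema.zipWithIndex)
>     buggy(fileIdx) = fileRow(fileIdx)
>
>   // Fixed variant: writes each value at the column's index in the
>   // *requested* schema.
>   val fixed = Array.fill[Any](requestedSchema.length)(null)
>   for ((name, fileIdx) <- fileSchema.zipWithIndex)
>     fixed(requestedSchema.indexOf(name)) = fileRow(fileIdx)
>
>   println(buggy.mkString("[", ",", "]")) // [8,9,null,null] -- 9 lands in col4's slot
>   println(fixed.mkString("[", ",", "]")) // [8,null,9,null]
> }
> {code}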
> *Workaround*
> This bug is only triggered when new columns are added in the middle of the
> schema, so it can be worked around by only adding new columns at the end, as
> sketched below.
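> For the example above, that would mean reading with a schema like this
> (a sketch of the workaround, not from the original report):
> {code}
> col1 int, col2 int, col3 string, col4 int
> {code}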