Thomas Newton created SPARK-48950:
-------------------------------------
Summary: Corrupt data from parquet scans
Key: SPARK-48950
URL: https://issues.apache.org/jira/browse/SPARK-48950
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 3.5.1, 3.5.0, 4.0.0
Environment: Spark 3.5.0
Running on kubernetes
Using Azure Blob storage with hierarchical namespace enabled
Reporter: Thomas Newton
Attachments: example_task_errors.txt
It's very rare and non-deterministic, but since Spark 3.5.0 we have started
seeing a correctness bug in Parquet scans when using the vectorized reader.
We've noticed it on double-type columns where occasionally small groups of rows
(typically tens to hundreds) are replaced with implausible values like
`-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045,
-7.60562076e+240, -3.18088886e-064, 2.89435993e-116`. I think this is the
result of interpreting uniform random bits as doubles. Most of my testing
has been on an array-of-double column, but we have also seen it on plain,
un-nested double columns.
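For reference, a minimal sketch (Scala, run in spark-shell where `spark` is predefined) of how one might turn the vectorized reader off to check whether the corruption is tied to that code path. The config key is the standard Spark SQL option; the abfss path is a placeholder, not our real dataset:

{code:scala}
// Standard Spark SQL config, true by default. Setting it to false falls
// back to the non-vectorized parquet-mr reader, which lets us compare
// results against the vectorized path.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Placeholder path; the real data lives on Azure Blob storage with
// hierarchical namespace enabled.
val df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/path/to/table")
{code}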
I've been testing this by adding a filter that should return zero rows but
returns a non-zero count when the Parquet scan has problems (a sketch of the
kind of check is below).
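For illustration only, a sketch of that kind of check, assuming a hypothetical plain double column named `value` whose legitimate values are known to stay within some bound; the column name, path, and threshold are placeholders, not the real query:

{code:scala}
import org.apache.spark.sql.functions.{abs, col, isnan}

// Hypothetical column and bound: real data in "value" is assumed to stay
// within +/- 1e6, so anything outside that range (or NaN) is treated as
// evidence of corruption.
val suspect = spark.read
  .parquet("abfss://container@account.dfs.core.windows.net/path/to/table")
  .filter(abs(col("value")) > 1e6 || isnan(col("value")))

// Expected to print 0 on every run; a non-zero count means the scan
// returned corrupt values.
println(s"suspect rows: ${suspect.count()}")
{code}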
Query plan that reproduces:
!image-2024-07-19-22-31-35-210.png|width=260,height=493!
!image-2024-07-19-22-32-10-822.png!
I did a `git bisect` and found that the problem starts with
[https://github.com/apache/spark/pull/39950], but I haven't yet understood why.
It's possible that this change is fine and it just reveals a problem elsewhere.
I also noticed [https://github.com/apache/spark/pull/44853], which appears to
be a different implementation of the same thing, so maybe that could help.
A less serious but related symptom is that Parquet scan tasks fail at a rate
of approximately 0.03% with errors like
[https://drive.google.com/file/d/1saIlabCNpw56vknV7U09YSSYZMWz_WJZ/view?usp=sharing].
If I revert [https://github.com/apache/spark/pull/39950] I get exactly zero
task failures on the same test.
The problem seems to be somewhat dependent on how the Parquet files happen to
be organised on blob storage, so I don't yet have a reproduction I can share
that doesn't depend on private data.
I tested on a pre-release 4.0.0 build and the problem was still present.