[ https://issues.apache.org/jira/browse/SPARK-48950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882998#comment-17882998 ]
Thomas Newton edited comment on SPARK-48950 at 9/19/24 12:08 PM:
-----------------------------------------------------------------
Sorry for the delay. I tried `3.5.2` and I can definitely still reproduce the
problem. It's quite long-winded, but here is my reproduction:
First, generate some test data on Azure Blob Storage (I'm using a storage
account with hierarchical namespace, a.k.a. ADLS Gen2).
[^generate_data_to_reproduce_spark-48950.ipynb]
Now run the reproduction. I'm running on Kubernetes using a cluster with five
32-core nodes and dynamic allocation disabled. [^reproduce_spark-48950.py]
Because the problem is so rare, the script runs the same job repeatedly until
the problem is detected. When I run it, it reliably reproduces within the 100
attempts.
I probably should have mentioned this before, but I'm using Hadoop 3.3.4, as per
[https://github.com/apache/spark/blob/bb7846dd487f259994fdc69e18e03382e3f64f42/pom.xml#L125C21-L125C26]
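The "run on repeat until the problem is detected" approach can be sketched in plain Python. This is only an illustration of the loop structure, not the contents of the attached `reproduce_spark-48950.py`: `repeat_until_detected` and `run_once` are hypothetical names, and `run_once` is assumed to be a callable that runs the scan once and returns the number of rows matched by a filter that should match nothing.

```python
from typing import Callable, Optional, Tuple


def repeat_until_detected(
    run_once: Callable[[], int], max_attempts: int = 100
) -> Optional[Tuple[int, int]]:
    """Call run_once until it reports corrupt rows.

    run_once is a hypothetical stand-in for one execution of the scan; it
    should return the number of rows that matched a "should be empty" filter.

    Returns (attempt_number, corrupt_row_count) for the first attempt that
    found corruption, or None if every attempt came back clean.
    """
    for attempt in range(1, max_attempts + 1):
        corrupt_rows = run_once()
        if corrupt_rows > 0:
            return attempt, corrupt_rows
    return None


if __name__ == "__main__":
    # Stand-in for the real scan: pretends corruption shows up on the 7th run.
    fake_results = iter([0, 0, 0, 0, 0, 0, 42])
    print(repeat_until_detected(lambda: next(fake_results)))  # prints (7, 42)
```

Stopping at the first non-zero count keeps the failing run's Spark UI and logs easy to find, which matters for a failure this rare.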
> Corrupt data from parquet scans
> -------------------------------
>
> Key: SPARK-48950
> URL: https://issues.apache.org/jira/browse/SPARK-48950
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.5.0, 4.0.0, 3.5.1, 3.5.2
> Environment: Spark 3.5.0
> Running on kubernetes
> Using Azure Blob storage with hierarchical namespace enabled
> Reporter: Thomas Newton
> Priority: Major
> Labels: correctness
> Attachments: example_task_errors.txt,
> generate_data_to_reproduce_spark-48950.ipynb, job_dag.png,
> reproduce_spark-48950.py, sql_query_plan.png
>
>
> It's very rare and non-deterministic, but since Spark 3.5.0 we have started
> seeing a correctness bug in parquet scans when using the vectorized reader.
> We've noticed this on double type columns where occasionally small groups
> (typically 10s to 100s) of rows are replaced with crazy values like
> `-1.29996470e+029, 3.56717569e-184, 7.23323243e+307, -1.05929677e+045,
> -7.60562076e+240, -3.18088886e-064, 2.89435993e-116`. I think this is the
> result of interpreting uniform random bits as a double type. Most of my
> testing has been on an array-of-double column, but we have also seen it
> on un-nested plain double columns.
> I've been testing this by adding a filter that should return zero results but
> will return non-zero if the parquet scan has problems. I've attached
> screenshots of this from the Spark UI.
> I did a `git bisect` and found that the problem starts with
> [https://github.com/apache/spark/pull/39950], but I haven't yet understood
> why. It's possible that this change is fine but it reveals a problem
> elsewhere? I did also notice [https://github.com/apache/spark/pull/44853]
> which appears to be a different implementation of the same thing so maybe
> that could help.
> It's not a major problem by itself, but another symptom appears to be that
> Parquet scan tasks fail at a rate of approximately 0.03% with errors like
> those in the attached `example_task_errors.txt`. If I revert
> [https://github.com/apache/spark/pull/39950] I get exactly 0 task failures on
> the same test.
>
> The problem seems to be a bit dependent on how the parquet files happen to be
> organised on blob storage, so I don't yet have a reproduction that I can share
> that doesn't depend on private data.
> I tested on a pre-release 4.0.0 and the problem was still present.
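
The description's hypothesis that the bad values come from "interpreting uniform random bits as a double type" can be illustrated with a minimal Python sketch. It is not part of the attached reproduction and does not touch Spark or Parquet; the function names are made up for illustration. It simply reinterprets random 64-bit patterns as IEEE 754 doubles and looks at their decimal exponents.

```python
import math
import random
import struct


def random_bits_as_doubles(count, seed=0):
    """Reinterpret `count` uniformly random 64-bit patterns as IEEE 754 doubles."""
    rng = random.Random(seed)
    values = []
    for _ in range(count):
        bits = rng.getrandbits(64)
        # Pack as an unsigned 64-bit integer, then read the same 8 bytes as a double.
        (value,) = struct.unpack("<d", struct.pack("<Q", bits))
        values.append(value)
    return values


def decimal_exponent(value):
    """Base-10 exponent of a finite, non-zero double."""
    return math.floor(math.log10(abs(value)))


if __name__ == "__main__":
    # Drop the rare NaN/inf/zero patterns before taking logarithms.
    samples = [
        v for v in random_bits_as_doubles(1000) if math.isfinite(v) and v != 0.0
    ]
    exponents = [decimal_exponent(v) for v in samples]
    print(samples[:5])
    print("decimal exponent range:", min(exponents), "to", max(exponents))
```

Because the 11 exponent bits are uniformly random, the resulting magnitudes are spread across nearly the whole double range (roughly 1e-308 to 1e+308). That is consistent with the examples quoted in the description, whose exponents range from -184 to +307.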