Correct, to me it looks like a Spark bug 
(https://issues.apache.org/jira/browse/SPARK-51821) that may be hard to trigger 
but is reproducible using the test case provided in 
https://github.com/apache/spark/pull/50594:

1. The Spark UninterruptibleThread “task” thread is interrupted by the “test” 
thread while the “task” thread is blocked in an NIO operation.
2. The NIO operation is interruptible (the channel is an InterruptibleChannel). 
In the case of Parquet, it is a WritableByteChannel.
3. As part of handling the InterruptedException, the channel interrupts the 
“task” thread again 
(https://github.com/apache/hadoop/blob/5770647dc73d552819963ba33f50be518058ee03/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1029); 
see the sketch below.
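
For illustration, here is a minimal, self-contained sketch of steps 1-3. It is 
not the test case from the PR and does not involve Spark's 
UninterruptibleThread; InterruptDemo is hypothetical, and a Pipe sink merely 
stands in for Parquet's WritableByteChannel:

import java.nio.ByteBuffer
import java.nio.channels.{ClosedByInterruptException, Pipe}

object InterruptDemo {
  def main(args: Array[String]): Unit = {
    val pipe = Pipe.open() // sink() is a WritableByteChannel, hence interruptible
    val task = new Thread(() => {
      try {
        val buf = ByteBuffer.allocate(64 * 1024)
        // Nothing drains the source end, so this write eventually blocks (step 1).
        while (true) { pipe.sink().write(buf); buf.rewind() }
      } catch {
        case _: ClosedByInterruptException =>
          // The interruptible channel was closed by the interrupt (step 2).
          // Mirroring the DataStreamer handler linked above, the handler
          // re-asserts the interrupt flag, re-interrupting the "task"
          // thread (step 3).
          Thread.currentThread().interrupt()
      }
    }, "task")
    task.start()
    Thread.sleep(500)   // give the write time to block
    task.interrupt()    // the "test" thread interrupts the blocked NIO write
    task.join()
  }
}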

Thank you,

Vlad


On Apr 22, 2025, at 1:53 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

Correct me if I'm wrong: this is a long-standing Spark bug that is very hard to 
trigger, but the new Parquet version happens to hit the trigger condition and 
exposes the bug. If this is the case, I'm +1 to fix the Spark bug instead of 
downgrading the Parquet version.

Let's move the technical discussions to 
https://github.com/apache/spark/pull/50594.

On Tue, Apr 22, 2025 at 11:20 AM Manu Zhang 
<owenzhang1...@gmail.com> wrote:
I don't think PARQUET-2432 has any issue itself. It looks to have triggered a 
pre-existing deadlock, as shown in https://github.com/apache/spark/pull/50594.
I'd suggest that we fix forward if possible.

Thanks,
Manu

On Mon, Apr 21, 2025 at 11:19 PM Rozov, Vlad <vro...@amazon.com.invalid> wrote:
The deadlock is reproducible without Parquet. Please see 
https://github.com/apache/spark/pull/50594.

Thank you,

Vlad

On Apr 21, 2025, at 1:59 AM, Cheng Pan 
<pan3...@gmail.com> wrote:

The deadlock was introduced by PARQUET-2432 (1.14.0); if we decide to 
downgrade, the latest workable version is Parquet 1.13.1.

Thanks,
Cheng Pan



On Apr 21, 2025, at 16:53, Wenchen Fan 
<cloud0...@gmail.com> wrote:

+1 to downgrade to Parquet 1.15.0 for Spark 4.0. According to 
https://github.com/apache/spark/pull/50583#issuecomment-2815243571, the 
Parquet CVE does not affect Spark.

On Mon, Apr 21, 2025 at 2:45 PM Hyukjin Kwon 
<gurwls...@apache.org> wrote:
That's nice, but we would need to wait for them to release and then upgrade, 
right? Let's revert the Parquet upgrade from the 4.0 branch since we're not 
directly affected by the CVE anyway.

On Mon, 21 Apr 2025 at 15:42, Yuming Wang 
<yumw...@apache.org> wrote:
It seems this patch (https://github.com/apache/parquet-java/pull/3196) can 
avoid the deadlock issue if using Parquet 1.15.1.

On Wed, Apr 16, 2025 at 5:39 PM Niranjan Jayakar <n...@databricks.com.invalid> 
wrote:
I found another bug introduced in 4.0 that breaks Spark Connect client/server 
compatibility: https://github.com/apache/spark/pull/50604.

Once merged, this should be included in the next RC.

On Thu, Apr 10, 2025 at 5:21 PM Wenchen Fan 
<cloud0...@gmail.com<mailto:cloud0...@gmail.com>> wrote:
Please vote on releasing the following candidate as Apache Spark version 4.0.0.

The vote is open until April 15 (PST) and passes if a majority of +1 PMC votes 
are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v4.0.0-rc4 (commit 
e0801d9d8e33cd8835f3e3beed99a3588c16b776)
https://github.com/apache/spark/tree/v4.0.0-rc4

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc4-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1480/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc4-docs/

The list of bug fixes going into 4.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

This release uses the release script from the tag v4.0.0-rc4.

FAQ

=========================
How can I help test this release?
=========================

If you are a Spark user, you can help us test this release by taking
an existing Spark workload, running it on this release candidate, and
then reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
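
For example, a minimal build.sbt sketch for testing against the staging 
repository (the URL is the orgapachespark-1480 repository listed above; 
spark-sql is just one example module, so pick whichever modules your project 
actually uses):

// Hypothetical build.sbt snippet: resolve the 4.0.0 RC from the staging repo.
resolvers += "Spark 4.0.0 RC4 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1480/"

// spark-sql is one example artifact published for this RC.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0"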


