Re: [discuss] Pinning PySpark dependencies?

Holden Karau Sun, 17 May 2026 09:52:53 -0700

I am at PyCon US today, and the head of PyPI just put out a call to audit
and pin dependencies because supply chain attacks are increasing
hockey-stick style.


I think we don’t need to pin just yet, but let’s start publishing the
package versions we built with during CI.
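
As a rough sketch of what "publishing the versions we built with" could look like, the snippet below produces a `pip freeze`-style listing from the running interpreter using only the standard library. The function names are made up for illustration, and this is not how Spark's CI is wired today; it only shows the shape of the output.

```python
from importlib import metadata


def freeze_lines(pairs):
    """Render (name, version) pairs as sorted, de-duplicated `name==version` lines."""
    # Normalize names the way most tools display them: lowercase, dashes.
    unique = {(name.lower().replace("_", "-"), version) for name, version in pairs}
    return [f"{name}=={version}" for name, version in sorted(unique)]


def installed_pairs():
    """Collect (name, version) for every distribution visible to this interpreter."""
    pairs = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and dist.version:  # skip broken installs with missing metadata
            pairs.append((name, dist.version))
    return pairs


if __name__ == "__main__":
    # A CI job could redirect this output into a file stored with the artifacts.
    print("\n".join(freeze_lines(installed_pairs())))
```

Running this inside the test venv gives one pinned line per installed package, which users could then pass to pip as a constraints file if they want to.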


Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev <[email protected]>
wrote:

> I think we should do something in response to the growing supply chain
> attacks rather than just leaving the problem to users. One alternative we
> could consider for Python specifically is an install target with upper
> bounded dependencies: `pip install "pyspark[deps-upper-bounded]"`. This
> wouldn't impact regular use, and seems like it would solve the other
> problems with publishing lock files, etc. As others have mentioned, this
> wouldn't *guarantee* security, but it would provide meaningful protection
> against the worst offenders we've recently seen.
>
> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:
>
>> > How about as a compromise, we publish (but don’t lock to) the pip
>> freeze outputs of the venvs we use for testing?
>>
>> > Where do you propose to publish? Spark website? Maybe in our github
>> repo somewhere?
>>
>> > I was thinking just in the publisher artifacts directory we already do.
>>
>> +1, I'm fine with any approach, as long as it provides sufficient info to
>> let users know exactly which dependency versions were used for testing.
>>
>> For Java/Scala, we have a script[1] that generates a dependency list in
>> the code repo, at [2]
>>
>> [1]
>> https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>> [2]
>> https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>
>> Thanks,
>> Cheng Pan
>>
>>
>>
>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]> wrote:
>>
>> I was thinking just in the publisher artifacts directory we already do.
>>
>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]>
>> wrote:
>>
>>> Where do you propose to publish? Spark website? Maybe in our github repo
>>> somewhere? For Python packages, users rarely look for artifacts (and
>>> they're difficult to find).
>>>
>>> Tian
>>>
>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> I hear that. How about as a compromise, we publish (but don’t lock to)
>>>> the pip freeze outputs of the venvs we use for testing?
>>>>
>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <
>>>> [email protected]> wrote:
>>>>
>>>>> I think supply chain attacks are a problem, but I don’t think we want
>>>>> to be on the hook for a solution here, even if it’s meant just for our
>>>>> project.
>>>>>
>>>>> There are “good enough” approaches available today for Python that
>>>>> mitigate most of the risk by excluding recent releases when resolving what
>>>>> package versions to install.
>>>>>
>>>>> uv offers exclude-newer
>>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip
>>>>> offers uploaded-prior-to
>>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>>> Poetry has an issue open
>>>>> <https://github.com/python-poetry/poetry/issues/10646> for a similar
>>>>> feature, plus at least one open PR to close it.
>>>>>
>>>>> Users concerned about supply chain attacks would probably get better
>>>>> results from using these options as compared to installing pinned
>>>>> dependencies provided by the projects they use.
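
To make the "exclude recent releases" idea above concrete, here is a small sketch of the underlying logic: pick the newest version whose upload time is older than a cutoff. This is not how uv or pip implement their options; the package `foo`, its upload dates, and the seven-day window are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def newest_before(releases, cutoff):
    """Return the highest version uploaded strictly before `cutoff`, or None.

    `releases` maps a version tuple, e.g. (2, 0, 1), to its upload time.
    """
    eligible = [version for version, uploaded in releases.items() if uploaded < cutoff]
    return max(eligible, default=None)


# Hypothetical upload history for an imaginary package `foo`.
releases = {
    (2, 0, 0): datetime(2026, 1, 10, tzinfo=timezone.utc),
    (2, 0, 1): datetime(2026, 2, 20, tzinfo=timezone.utc),
    (2, 1, 0): datetime(2026, 3, 29, tzinfo=timezone.utc),  # very recent upload
}

now = datetime(2026, 3, 30, tzinfo=timezone.utc)
cutoff = now - timedelta(days=7)  # ignore anything uploaded in the last week
chosen = newest_before(releases, cutoff)
```

Here the day-old `2.1.0` is skipped in favor of `2.0.1`, which gives the ecosystem time to notice and yank a compromised release before it is ever resolved.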
>>>>>
>>>>> Nick
>>>>>
>>>>>
>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]>
>>>>> wrote:
>>>>>
>>>>> So I think we can ship it as an optional distribution element (it's
>>>>> literally just another file folks can choose to download/use if they 
>>>>> want).
>>>>>
>>>>> Asking users is an idea too, I could put together a survey if we want?
>>>>>
>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1,
>>>>>> foo==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit
>>>>>> and we don't need to focus on the syntax.
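
For anyone who wants to check the `~=` equivalence by hand, the sketch below expands a compatible-release specifier into explicit bounds. It only handles plain dotted numeric versions (no pre-, post-, or dev-release segments), and the helper names are invented for this example; real tools should rely on a proper version-parsing library.

```python
def _parse(version):
    """Turn '2.0.1' into (2, 0, 1). Numeric release segments only."""
    return tuple(int(part) for part in version.split("."))


def _pad(parts, width):
    """Right-pad with zeros so that 2.1 and 2.1.0 compare as equal."""
    return parts + (0,) * (width - len(parts))


def compatible_release_bounds(version):
    """Expand `~=version` into (inclusive lower bound, exclusive upper bound)."""
    parts = _parse(version)
    if len(parts) < 2:
        raise ValueError("~= needs at least two release segments")
    prefix = parts[:-1]  # drop the last segment ...
    upper = prefix[:-1] + (prefix[-1] + 1,)  # ... and bump the new last one
    return parts, upper


def allows(spec_version, candidate):
    """True if `candidate` satisfies `~=spec_version`."""
    lower, upper = compatible_release_bounds(spec_version)
    cand = _parse(candidate)
    width = max(len(lower), len(upper), len(cand))
    return _pad(lower, width) <= _pad(cand, width) < _pad(upper, width)
```

For instance, `compatible_release_bounds("2.0.1")` yields a lower bound of 2.0.1 and an exclusive upper bound of 2.1, while `compatible_release_bounds("2.0")` yields 2.0 and 3.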
>>>>>>
>>>>>> I don't believe we can ship pyspark with an env lock file. That's what
>>>>>> users do in their own projects. It's not part of the Python packaging
>>>>>> system. What users normally do is install packages, test them out, then
>>>>>> lock them with either pip or uv - generate a lock file for all
>>>>>> dependencies and use it across their systems. It's not common for
>>>>>> packages to list out a "known working dependency list" for users.
>>>>>>
>>>>>> However, if we really want to try it out, we can do something like
>>>>>> `pip install pyspark[full-pinned]` and install every dependency pyspark
>>>>>> requires with a pinned version. If our users need an out-of-the-box
>>>>>> solution they can do that. We can also collect feedback and see the
>>>>>> sentiment from users.
>>>>>>
>>>>>> Tian
>>>>>>
>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>>>>>
>>>>>>> > If we consider PySpark the dominant package - meaning that if a
>>>>>>> user employs it, it must be the most important element in their project
>>>>>>> and everything else must comply with it - pinning versions might be
>>>>>>> viable.
>>>>>>>
>>>>>>> This is not always true, but definitely a major case.
>>>>>>>
>>>>>>> > I'm not familiar with Java dependency solutions or how users use
>>>>>>> spark with Java
>>>>>>>
>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>>> management. A product declares transitive dependencies with pinned
>>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>>>>>>> most reasonable version based on resolution rules. The rules are a
>>>>>>> little different in Maven, SBT and Gradle; the Maven docs[1] explain
>>>>>>> how it works.
>>>>>>>
>>>>>>> In short, in Java/Scala dependency management, the pinned version is
>>>>>>> more like a suggested version; users can easily override it.
>>>>>>>
>>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>>> world. Neither a pinned version nor the latest version seems ideal.
>>>>>>> Of these options:
>>>>>>>
>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>
>>>>>>> 2 or 3 might be an acceptable solution? And, I still believe we
>>>>>>> should add a disclaimer that this compatibility only holds under the
>>>>>>> assumption that 3rd-party packages strictly adhere to semantic
>>>>>>> versioning.
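
The four options in the list can be compared side by side by asking which releases each one would admit. The sketch below does that for an imaginary package `foo`; the release history is hypothetical and every version is assumed to have exactly three numeric segments.

```python
def parse(version):
    """Turn '2.0.3' into (2, 0, 3). Assumes three numeric segments."""
    return tuple(int(part) for part in version.split("."))


# The four options from the list, written as predicates over parsed versions.
POLICIES = {
    "pinned (foo==2.0.0)": lambda v: v == (2, 0, 0),
    "maintenance (foo~=2.0.0)": lambda v: (2, 0, 0) <= v < (2, 1, 0),
    "minor (foo>=2.0.0,<3.0.0)": lambda v: (2, 0, 0) <= v < (3, 0, 0),
    "latest (foo>=2.0.0)": lambda v: v >= (2, 0, 0),
}


def admitted(candidates):
    """Map each policy name to the candidate versions it accepts, in input order."""
    parsed = [(text, parse(text)) for text in candidates]
    return {name: [text for text, v in parsed if ok(v)] for name, ok in POLICIES.items()}


# Hypothetical release history of `foo`.
candidates = ["1.9.0", "2.0.0", "2.0.3", "2.4.1", "3.0.0"]
result = admitted(candidates)
```

Options 2 and 3 differ only in whether `2.4.1` is admitted, which is exactly the part that relies on `foo` following semantic versioning.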
>>>>>>>
>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>> are for. It's what env lock files are for.
>>>>>>>
>>>>>>> We definitely need such a dependency list in PySpark releases. It's
>>>>>>> really important for users who need to set up a reproducible
>>>>>>> environment several years after the release, and it is also a good
>>>>>>> reference for users who encounter bugs in 3rd-party packages, or battle
>>>>>>> with dependency conflicts when they install lots of packages in a
>>>>>>> single environment.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>
>>>>>>> TL;DR Tian is more correct, and == pinning versions is not achieving
>>>>>>> the desired outcome. There are other ways to do it; I can't think of any
>>>>>>> other Python package that works that way. This thread is conflating
>>>>>>> different things.
>>>>>>>
>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an
>>>>>>> overly-broad claim -- do you really think it works with 5.x in 10
>>>>>>> years? -- expressing "foo==2.0.0" is very likely overly narrow. That
>>>>>>> says "does not work with any other version at all" which is likely more
>>>>>>> incorrect and more problematic for users.
>>>>>>>
>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>> are for. It's what env lock files are for.
>>>>>>>
>>>>>>> To be sure there is an art to figuring out the right dependency
>>>>>>> bounds. A reasonable compromise is to allow maintenance releases, as a
>>>>>>> default when there is nothing more specific known. That is, write
>>>>>>> "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>>
>>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>>>> because Maven resolution is just pretty different, but mostly because
>>>>>>> the core Spark distribution is the 'server side' and is necessarily a
>>>>>>> 'fat jar', a sort of statically-compiled artifact that simply has some
>>>>>>> specific versions in it and can never have different versions because
>>>>>>> of runtime resolution differences.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I agree that a product must be usable first. Pinning the version
>>>>>>>> (to a specific number with `==`) will make pyspark unusable.
>>>>>>>>
>>>>>>>> First of all, I think we can agree that many users use PySpark with
>>>>>>>> other Python packages. If we conflict with other packages, `pip install
>>>>>>>> -r requirements.txt` won't work. It will complain that the dependencies
>>>>>>>> can't be resolved, which completely breaks our users' workflow. Even if
>>>>>>>> the user locks the dependency version, it won't work. So the user would
>>>>>>>> have to install PySpark first, then the other packages, to override
>>>>>>>> PySpark's dependency. They can't put their dependency list in a single
>>>>>>>> file - that is a horrible user experience.
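
The conflict described above can be illustrated with a toy resolver for a single shared dependency. Real resolvers backtrack across many packages at once; this sketch only intersects the requirements placed on one imaginary package `foo`, and both the release list and the second package's requirement are hypothetical.

```python
def resolve(candidates, *requirements):
    """Return the candidate versions that satisfy every requirement."""
    return [v for v in candidates if all(req(v) for req in requirements)]


# Hypothetical releases of `foo`, as version tuples.
candidates = [(2, 0, 0), (2, 1, 0), (2, 2, 0)]


def other_needs(v):
    """Hypothetical requirement from another package: foo>=2.1.0."""
    return v >= (2, 1, 0)


def pyspark_pinned(v):
    """What an exact pin would declare: foo==2.0.0."""
    return v == (2, 0, 0)


def pyspark_ranged(v):
    """What a bounded range would declare: foo>=2.0.0,<3.0.0."""
    return (2, 0, 0) <= v < (3, 0, 0)


with_pin = resolve(candidates, pyspark_pinned, other_needs)
with_range = resolve(candidates, pyspark_ranged, other_needs)
```

With the exact pin, no version of `foo` satisfies both packages, so the install fails; with the range, `2.1.0` and `2.2.0` both remain available.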
>>>>>>>>
>>>>>>>> When I look at controversial topics, I always have a strong belief
>>>>>>>> that I can't be the only smart person in the world. If an idea is good,
>>>>>>>> others must already be doing it. Can we find any recognized package in
>>>>>>>> the market that pins its dependencies to a specific version? The only
>>>>>>>> case where it works is when this package is *all* the user needs. That's
>>>>>>>> why we pin versions for docker images, HTTP services, or standalone
>>>>>>>> tools - users just need something that works out of the box. If we
>>>>>>>> consider PySpark the dominant package - meaning that if a user employs
>>>>>>>> it, it must be the most important element in their project and
>>>>>>>> everything else must comply with it - pinning versions might be viable.
>>>>>>>>
>>>>>>>> I'm not familiar with Java dependency solutions or how users use
>>>>>>>> Spark with Java, but I'm familiar with the Python ecosystem and
>>>>>>>> community. If we pin to a specific version, we will face significant
>>>>>>>> criticism. If we must do it, at least don't make it default. Like I
>>>>>>>> said above, I don't have a strong opinion about having a
>>>>>>>> `pyspark[pinned]` - if users only need pyspark and no other packages
>>>>>>>> they could use that. But that's extra effort for maintenance, and we
>>>>>>>> need to think about what's pinned. We have a lot of pyspark install
>>>>>>>> versions.
>>>>>>>>
>>>>>>>> Tian Gao
>>>>>>>>
>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think the community has already reached consensus to freeze
>>>>>>>>> dependencies in minor releases.
>>>>>>>>>
>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>>>
>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>> > - Dependencies are frozen and behavioral changes are minimized
>>>>>>>>> in minor releases.
>>>>>>>>>
>>>>>>>>> I would interpret the proposed dependency policy as applying to both
>>>>>>>>> Java/Scala and Python dependency management for Spark. If so, that
>>>>>>>>> means PySpark will always use pinned dependency versions starting with
>>>>>>>>> 4.3.0. But if the intention is to only apply such a dependency policy
>>>>>>>>> to Java/Scala, then it creates a very strange situation - an extremely
>>>>>>>>> conservative dependency management strategy for Java/Scala, and an
>>>>>>>>> extremely liberal one for Python.
>>>>>>>>>
>>>>>>>>> To Tian Gao,
>>>>>>>>>
>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always make
>>>>>>>>> us more secure - that's my major point.
>>>>>>>>>
>>>>>>>>> A product must be usable first, then come security, performance, etc.
>>>>>>>>> If it claims to require `foo>=2.0.0`, how do you ensure it is
>>>>>>>>> compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such
>>>>>>>>> incompatibility failures have occurred many times, e.g., [2]. On the
>>>>>>>>> contrary, if it claims to require `foo==2.0.0`, that means it was
>>>>>>>>> thoroughly tested with `foo==2.0.0`, and users take their own risk
>>>>>>>>> when using it with other `foo` versions. For example, if `foo` strictly
>>>>>>>>> follows semantic versioning, it should work with `foo<3.0.0`, but this
>>>>>>>>> is not Spark's responsibility; users should assess and assume the risk
>>>>>>>>> of incompatibility themselves.
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Cheng Pan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Response inline
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> One possibility would be to make the pinned version optional (eg
>>>>>>>>>> pyspark[pinned]) or publish a separate constraints file for people to
>>>>>>>>>> optionally use with -c?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>>>>>>> possible today for people using modern Python packaging workflows
>>>>>>>>>> that use lock files. In fact, it happens automatically; all
>>>>>>>>>> transitive dependencies are pinned in the lock file, and this is by
>>>>>>>>>> design.
>>>>>>>>>>
>>>>>>>>> So for someone installing a fresh venv with uv, pip, or conda, where
>>>>>>>>> does this come from?
>>>>>>>>>
>>>>>>>>> The idea here is we provide the versions we used during the release
>>>>>>>>> stage so if folks want a “known safe” initial starting point for a
>>>>>>>>> new env they’ve got one.
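
One way a user might consume such a published list without installing from it directly is to compare it against their own environment. The sketch below parses `name==version` lines and reports where an environment has drifted; the package versions shown are hypothetical placeholders, not the versions Spark actually tests with.

```python
def parse_pins(text):
    """Parse `name==version` lines, ignoring blank lines and `#` comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"not an exact pin: {raw!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def drift(published, installed):
    """Map package -> (published version, installed version) where they differ.

    Packages missing from `installed` are reported with None.
    """
    return {
        name: (want, installed.get(name))
        for name, want in published.items()
        if installed.get(name) != want
    }


# Hypothetical published list and local environment.
published_text = "numpy==1.26.4\npyarrow==15.0.0  # tested in CI\npandas==2.2.0\n"
installed = {"numpy": "1.26.4", "pyarrow": "16.1.0"}
report = drift(parse_pins(published_text), installed)
```

The report flags `pyarrow` as differing from the tested version and `pandas` as absent, which is useful context when debugging a third-party incompatibility.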
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Furthermore, it is straightforward to add additional restrictions
>>>>>>>>>> to your project spec (i.e. pyproject.toml) so that when the
>>>>>>>>>> packaging tool builds the lock file, it does it with whatever
>>>>>>>>>> restrictions you want that are specific to your project. That could
>>>>>>>>>> include specific versions or version ranges of libraries to exclude,
>>>>>>>>>> for example.
>>>>>>>>>>
>>>>>>>>> Yes, but as it stands we leave it to the end user to start from
>>>>>>>>> scratch picking these versions. We can make their lives simpler by
>>>>>>>>> providing the versions we tested against with a lock file they can
>>>>>>>>> choose to use, ignore, or update to their desired versions and include.
>>>>>>>>>
>>>>>>>>> Also for interactive workloads I more often see a bare
>>>>>>>>> requirements file or even pip installs in nb cells (but this could be
>>>>>>>>> sample bias).
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I had to do this, for example, on a personal project that used
>>>>>>>>>> PySpark Connect but which was pulling in a version of grpc that
>>>>>>>>>> was generating a lot of log noise
>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>> I pinned the version of grpc in my project file and let the
>>>>>>>>>> packaging tool resolve all the requirements across PySpark Connect
>>>>>>>>>> and my custom restrictions.
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>
