Re: [discuss] Pinning PySpark dependencies?

Tian Gao via dev Mon, 18 May 2026 16:30:21 -0700

We can do either a list of packages from `pip freeze` on our website, or a
`pyspark[pinned]` that has `==`. I'm okay with either (or both).
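
As a rough illustration of the first option, the sketch below filters
`pip freeze` output down to the plain `name==version` lines that could be
published as a list. This is only a sketch under assumptions, not something
Spark does today: the function name `pinned_requirements`, the sample input,
and the `foo`/`Bar` packages and their versions are all made up.

```python
from __future__ import annotations


def pinned_requirements(freeze_output: str) -> list[str]:
    """Keep only the plain `name==version` lines from `pip freeze` output."""
    pins = []
    for raw in freeze_output.splitlines():
        line = raw.strip()
        # Skip blanks, comments, editable installs (-e ...) and direct
        # references (name @ url); none of these are plain == pins.
        if not line or line.startswith(("#", "-")) or " @ " in line:
            continue
        name, sep, version = line.partition("==")
        if sep and name and version:
            # Lowercase the name so the published list sorts consistently.
            pins.append(f"{name.strip().lower()}=={version.strip()}")
    return sorted(pins)


# Hypothetical sample input; package names and versions are invented.
SAMPLE_FREEZE = """\
foo==2.0.0
# a comment line
-e git+https://example.invalid/repo.git#egg=localpkg
somepkg @ file:///tmp/somepkg
Bar==1.4.2
"""

if __name__ == "__main__":
    for requirement in pinned_requirements(SAMPLE_FREEZE):
        print(requirement)
```

The same list could in principle feed a `pyspark[pinned]` extra, which is
why the two options do not have to be exclusive.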


If we want to do that, we probably want to pin our package versions on our
stable spark versions. We only partially pin our dependencies for our CI
for maintenance branches, so we do not even have the list now (we may have
it for a certain date, but the list could change any time in the future).

I think we should come up with a more official CI system so we always test
the released versions (4.0, 4.1 ...) with pinned versions of packages
(which are the "known working dependencies"), and be more relaxed for dev
branches (4.x, master) because we need to test against new releases for our
dependencies.

More importantly, it would be really nice to have a single source of
truth. We have too many places to pin the Python dependency versions.
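
To make the "single source of truth" idea concrete, here is a minimal
sketch that starts from one tested `==` pin and derives the looser
specifiers discussed in this thread from it. Everything here is
hypothetical: `derive_specifiers` and the dictionary keys are invented
names, `foo` is a placeholder package, and only plain numeric versions
such as `2.0.1` are handled.

```python
from __future__ import annotations


def derive_specifiers(name: str, version: str) -> dict[str, str]:
    """Derive looser requirement strings from a single tested pin."""
    parts = [int(part) for part in version.split(".")]
    if len(parts) < 2:
        # The compatible-release operator needs at least two segments.
        raise ValueError("need a version with at least two segments")
    # Everything except the last segment, e.g. "2.0" for "2.0.1".
    prefix = ".".join(str(part) for part in parts[:-1])
    next_major = parts[0] + 1
    return {
        # The exact version that was tested.
        "pinned": f"{name}=={version}",
        # Allow maintenance releases only.
        "maintenance": f"{name}~={version}",
        # What the ~= form above is shorthand for.
        "maintenance_expanded": f"{name}>={version},=={prefix}.*",
        # Allow anything below the next major version.
        "upper_bounded": f"{name}>={version},<{next_major}.0.0",
    }


if __name__ == "__main__":
    for label, spec in derive_specifiers("foo", "2.0.1").items():
        print(f"{label}: {spec}")
```

With one list of tested pins as input, the CI constraints, a published
list, and any optional install targets could all be generated rather than
maintained by hand in separate places.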

Tian

On Sun, May 17, 2026 at 9:52 AM Holden Karau <[email protected]> wrote:

> I am at PyCon USA today and the PyPI head just did a call-out to audit and
> pin dependencies because supply chain attacks are increasing hockey-stick
> style.
>
> I think we don’t need to pin just yet but let’s start publishing the package
> versions we built with during CI.
>
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
> On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev <
> [email protected]> wrote:
>
>> I think we should do something in response to the growing supply chain
>> attacks rather than just leaving the problem to users. One alternative we
>> could consider for Python specifically is an install target with upper
>> bounded dependencies: `pip install "pyspark[deps-upper-bounded]"`. This
>> wouldn't impact regular use, and seems like it would solve the other
>> problems with publishing lock files, etc. As others have mentioned, this
>> wouldn't *guarantee* security, but it would provide meaningful protection
>> against the worst offenders we've recently seen.
>>
>> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:
>>
>>> > How about as a compromise, we publish (but don’t lock to) the pip
>>> freeze outputs of the venvs we use for testing?
>>>
>>> > Where do you propose to publish? Spark website? Maybe in our github
>>> repo somewhere?
>>>
>>> > I was thinking just in the publisher artifacts directory we already do.
>>>
>>> +1, I'm fine with any approach, as long as it provides sufficient info
>>> to let users know exactly which versions of dependencies were used for
>>> testing.
>>>
>>> For Java/Scala, we have a dependency list in the code repo, at [2],
>>> generated by a script[1].
>>>
>>> [1]
>>> https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>>> [2]
>>> https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>>
>>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]> wrote:
>>>
>>> I was thinking just in the publisher artifacts directory we already do.
>>>
>>>
>>>
>>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]>
>>> wrote:
>>>
>>>> Where do you propose to publish? Spark website? Maybe in our github
>>>> repo somewhere? For Python packages, users rarely look for artifacts (and
>>>> they're difficult to find).
>>>>
>>>> Tian
>>>>
>>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]>
>>>> wrote:
>>>>
>>>>> I hear that. How about as a compromise, we publish (but don’t lock to)
>>>>> the pip freeze outputs of the venvs we use for testing?
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I think supply chain attacks are a problem, but I don’t think we want
>>>>>> to be on the hook for a solution here, even if it’s meant just for our
>>>>>> project.
>>>>>>
>>>>>> There are “good enough” approaches available today for Python that
>>>>>> mitigate most of the risk by excluding recent releases when resolving
>>>>>> what package versions to install.
>>>>>>
>>>>>> uv offers exclude-newer
>>>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip
>>>>>> offers uploaded-prior-to
>>>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>>>> Poetry has an issue open
>>>>>> <https://github.com/python-poetry/poetry/issues/10646> for a similar
>>>>>> feature, plus at least one open PR to close it.
>>>>>>
>>>>>> Users concerned about supply chain attacks would probably get better
>>>>>> results from using these options as compared to installing pinned
>>>>>> dependencies provided by the projects they use.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>>
>>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> So I think we can ship it as an optional distribution element (it's
>>>>>> literally just another file folks can choose to download/use if they
>>>>>> want).
>>>>>>
>>>>>> Asking users is an idea too, I could put together a survey if we want?
>>>>>>
>>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1,
>>>>>>> foo==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a
>>>>>>> nit and we don't need to focus on the syntax.
>>>>>>>
>>>>>>> I don't believe we can ship pyspark with an env lock file. That's
>>>>>>> what users do in their own projects. It's not part of the Python package
>>>>>>> system. What users normally do is install packages, test them out, then
>>>>>>> lock them with either pip or uv: generate a lock file for all
>>>>>>> dependencies and use it across their systems. It's not common for
>>>>>>> packages to list out a "known working dependency list" for users.
>>>>>>>
>>>>>>> However, if we really want to try it out, we can do something like
>>>>>>> `pip install pyspark[full-pinned]` and install every dependency pyspark
>>>>>>> requires with a pinned version. If our users need an out-of-the-box
>>>>>>> solution they can do that. We can also collect feedback and see the
>>>>>>> sentiment from users.
>>>>>>>
>>>>>>> Tian
>>>>>>>
>>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> > If we consider PySpark the dominant package - meaning that if a
>>>>>>>> user employs it, it must be the most important element in their
>>>>>>>> project and everything else must comply with it - pinning versions
>>>>>>>> might be viable.
>>>>>>>>
>>>>>>>> This is not always true, but definitely a major case.
>>>>>>>>
>>>>>>>> > I'm not familiar with Java dependency solutions or how users use
>>>>>>>> spark with Java
>>>>>>>>
>>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>>>> management. A product declares its transitive dependencies with pinned
>>>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>>>>>>>> most reasonable version based on resolution rules. The rules are a
>>>>>>>> little different in Maven, SBT and Gradle; the Maven docs[1] explain
>>>>>>>> how they work.
>>>>>>>>
>>>>>>>> In short, in Java/Scala dependency management, the pinned version
>>>>>>>> is more like a suggested version; it's easy for users to override.
>>>>>>>>
>>>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>>>> world: neither a pinned version nor the latest version seems ideal.
>>>>>>>> Of these options:
>>>>>>>>
>>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>>
>>>>>>>> 2 or 3 might be an acceptable solution? And I still believe
>>>>>>>> we should add a disclaimer that this compatibility only holds under the
>>>>>>>> assumption that 3rd-party packages strictly adhere to semantic
>>>>>>>> versioning.
>>>>>>>>
>>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>>> are for. It's what env lock files are for.
>>>>>>>>
>>>>>>>> We definitely need such a dependency list in each PySpark release. It's
>>>>>>>> really important for users who need to set up a reproducible environment
>>>>>>>> several years after the release, and it is also a good reference for
>>>>>>>> users who encounter bugs in 3rd-party packages, or battle with
>>>>>>>> dependency conflicts when they install lots of packages in a single
>>>>>>>> environment.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Cheng Pan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>>
>>>>>>>> TL;DR Tian is more correct, and == pinning versions is not
>>>>>>>> achieving the desired outcome. There are other ways to do it; I can't
>>>>>>>> think of any other Python package that works that way. This thread is
>>>>>>>> conflating different things.
>>>>>>>>
>>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an
>>>>>>>> overly-broad claim -- do you really think it works with 5.x in 10
>>>>>>>> years? -- expressing "foo==2.0.0" is very likely overly narrow. That
>>>>>>>> says "does not work with any other version at all" which is likely more
>>>>>>>> incorrect and more problematic for users.
>>>>>>>>
>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>>> are for. It's what env lock files are for.
>>>>>>>>
>>>>>>>> To be sure there is an art to figuring out the right dependency
>>>>>>>> bounds. A reasonable compromise is to allow maintenance releases, as a
>>>>>>>> default when there is nothing more specific known. That is, write
>>>>>>>> "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>>>
>>>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>>>>> because Maven resolution is just pretty different, but mostly because
>>>>>>>> the core Spark distribution is the 'server side' and is necessarily a
>>>>>>>> 'fat jar', a sort of statically-compiled artifact that simply has some
>>>>>>>> specific versions in it and can never have different versions because
>>>>>>>> of runtime resolution differences.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I agree that a product must be usable first. Pinning the version
>>>>>>>>> (to a specific number with `==`) will make pyspark unusable.
>>>>>>>>>
>>>>>>>>> First of all, I think we can agree that many users use PySpark
>>>>>>>>> with other Python packages. If we conflict with other packages,
>>>>>>>>> `pip install -r requirements.txt` won't work. It will complain that
>>>>>>>>> the dependencies can't be resolved, which completely breaks our users'
>>>>>>>>> workflow. Even if the user locks the dependency version, it won't
>>>>>>>>> work. So the user has to install PySpark first, then the other
>>>>>>>>> packages, to override PySpark's dependency. They can't put their
>>>>>>>>> dependency list in a single file - that is a horrible user experience.
>>>>>>>>>
>>>>>>>>> When I look at controversial topics, I always have a strong
>>>>>>>>> belief that I can't be the only smart person in the world. If an
>>>>>>>>> idea is good, others must already be doing it. Can we find any
>>>>>>>>> recognized package in the market that pins its dependencies to a
>>>>>>>>> specific version? The only case where it works is when this package
>>>>>>>>> is *all* the user needs. That's why we pin versions for docker images,
>>>>>>>>> HTTP services, or standalone tools - users just need something that
>>>>>>>>> works out of the box. If we consider PySpark the dominant package -
>>>>>>>>> meaning that if a user employs it, it must be the most important
>>>>>>>>> element in their project and everything else must comply with it -
>>>>>>>>> pinning versions might be viable.
>>>>>>>>>
>>>>>>>>> I'm not familiar with Java dependency solutions or how users use
>>>>>>>>> spark with Java, but I'm familiar with the Python ecosystem and
>>>>>>>>> community. If we pin to a specific version, we will face significant
>>>>>>>>> criticism. If we must do it, at least don't make it the default. Like
>>>>>>>>> I said above, I don't have a strong opinion about having a
>>>>>>>>> `pyspark[pinned]` - if users only need pyspark and no other packages
>>>>>>>>> they could use that. But that's extra effort for maintenance, and we
>>>>>>>>> need to think about what's pinned. We have a lot of pyspark install
>>>>>>>>> versions.
>>>>>>>>>
>>>>>>>>> Tian Gao
>>>>>>>>>
>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I think the community has already reached consensus to freeze
>>>>>>>>>> dependencies in minor releases.
>>>>>>>>>>
>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>>>>
>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>> > - Dependencies are frozen and behavioral changes are minimized
>>>>>>>>>> in minor releases.
>>>>>>>>>>
>>>>>>>>>> I would interpret the proposed dependency policy as applying to both
>>>>>>>>>> Java/Scala and Python dependency management for Spark. If so, that
>>>>>>>>>> means PySpark will always use pinned dependency versions from 4.3.0
>>>>>>>>>> onward. But if the intention is to apply such a dependency policy only
>>>>>>>>>> to Java/Scala, then it creates a very strange situation - an extremely
>>>>>>>>>> conservative dependency management strategy for Java/Scala, and an
>>>>>>>>>> extremely liberal one for Python.
>>>>>>>>>>
>>>>>>>>>> To Tian Gao,
>>>>>>>>>>
>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always
>>>>>>>>>> make us more secure - that's my major point.
>>>>>>>>>>
>>>>>>>>>> A product must be usable first, then come security, performance, etc.
>>>>>>>>>> If it claims to require `foo>=2.0.0`, how do you ensure it is
>>>>>>>>>> compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such
>>>>>>>>>> incompatibility failures have occurred many times, e.g., [2]. On the
>>>>>>>>>> contrary, if it claims to require `foo==2.0.0`, that means it was
>>>>>>>>>> thoroughly tested with `foo==2.0.0`, and users take their own risk
>>>>>>>>>> when using it with other `foo` versions. For example, if `foo`
>>>>>>>>>> strictly follows semantic versioning, it should work with
>>>>>>>>>> `foo<3.0.0`, but this is not Spark's responsibility; users should
>>>>>>>>>> assess and assume the risk of incompatibility themselves.
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Cheng Pan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Response inline
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> One possibility would be to make the pinned version optional (eg
>>>>>>>>>>> pyspark[pinned]) or publish a separate constraints file for people
>>>>>>>>>>> to optionally use with -c?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>>>>>>>> possible today for people using modern Python packaging workflows
>>>>>>>>>>> that use lock files. In fact, it happens automatically; all
>>>>>>>>>>> transitive dependencies are pinned in the lock file, and this is by
>>>>>>>>>>> design.
>>>>>>>>>>>
>>>>>>>>>> So for someone installing a fresh venv with uv, pip, or conda, where
>>>>>>>>>> does this come from?
>>>>>>>>>>
>>>>>>>>>> The idea here is we provide the versions we used during the
>>>>>>>>>> release stage so if folks want a “known safe” initial starting point
>>>>>>>>>> for a new env they’ve got one.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Furthermore, it is straightforward to add additional
>>>>>>>>>>> restrictions to your project spec (i.e. pyproject.toml) so that
>>>>>>>>>>> when the packaging tool builds the lock file, it does it with
>>>>>>>>>>> whatever restrictions you want that are specific to your project.
>>>>>>>>>>> That could include specific versions or version ranges of libraries
>>>>>>>>>>> to exclude, for example.
>>>>>>>>>>>
>>>>>>>>>> Yes, but as it stands we leave it to the end user to start from
>>>>>>>>>> scratch picking these versions. We can make their lives simpler by
>>>>>>>>>> providing the versions we tested against with a lock file they can
>>>>>>>>>> choose to use, ignore, or update to their desired versions and
>>>>>>>>>> include.
>>>>>>>>>>
>>>>>>>>>> Also for interactive workloads I more often see a bare
>>>>>>>>>> requirements file or even pip installs in nb cells (but this could be
>>>>>>>>>> sample bias).
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I had to do this, for example, on a personal project that used
>>>>>>>>>>> PySpark Connect but which was pulling in a version of grpc that
>>>>>>>>>>> was generating a lot of log noise
>>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>>> I pinned the version of grpc in my project file and let the
>>>>>>>>>>> packaging tool resolve all the requirements across PySpark Connect
>>>>>>>>>>> and my custom restrictions.
>>>>>>>>>>>
>>>>>>>>>>> Nick
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
