I hear that. How about as a compromise, we publish (but don’t lock to) the
pip freeze outputs of the venvs we use for testing?
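For concreteness, a sketch of what consuming such a published freeze output could look like (the pins below are illustrative placeholders, not real PySpark dependencies):

```python
# Hypothetical sketch: a published `pip freeze` output is just
# "name==version" lines, which pip can consume as a constraints file, e.g.
#
#   pip install pyspark -c <published-constraints-file>
#
# The pins below are illustrative placeholders, not real PySpark pins.
freeze_output = """\
pandas==2.2.0
pyarrow==15.0.0
grpcio==1.62.0
"""

# Constraints don't force installation; they only cap the versions of
# packages that do get installed, so publishing them locks nobody in.
pins = dict(line.split("==") for line in freeze_output.splitlines())
assert pins["pyarrow"] == "15.0.0"
```

Since constraints only cap versions rather than install anything, publishing the file stays purely opt-in.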

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her


On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <[email protected]>
wrote:

> I think supply chain attacks are a problem, but I don’t think we want to
> be on the hook for a solution here, even if it’s meant just for our project.
>
> There are “good enough” approaches available today for Python that
> mitigate most of the risk by excluding recent releases when resolving what
> package versions to install.
>
> uv offers exclude-newer
> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip offers
> uploaded-prior-to
> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
> Poetry has an issue open
> <https://github.com/python-poetry/poetry/issues/10646> for a similar
> feature, plus at least one open PR to close it.
>
> Users concerned about supply chain attacks would probably get better
> results from using these options as compared to installing pinned
> dependencies provided by the projects they use.
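The date-cutoff idea behind these flags can be sketched as follows (the release metadata below is made up for illustration; real resolvers consult upload dates from the package index):

```python
from datetime import date

# Hypothetical release metadata, like the upload dates that uv's
# exclude-newer or pip's uploaded-prior-to consult. All versions and
# dates here are made up for illustration.
releases = [
    ("2.0.0", date(2025, 6, 1)),
    ("2.0.1", date(2025, 9, 15)),
    ("2.1.0", date(2026, 3, 28)),  # published days ago: higher supply-chain risk
]

def eligible(releases, cutoff):
    """Keep only releases uploaded strictly before the cutoff date."""
    return [version for version, uploaded in releases if uploaded < cutoff]

# With a cutoff, the freshly published 2.1.0 is excluded from resolution:
assert eligible(releases, date(2026, 1, 1)) == ["2.0.0", "2.0.1"]
```

The appeal is that the cutoff is one setting, not a curated pin list, and it automatically excludes whatever was published after the date you trust.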
>
> Nick
>
>
> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>
> So I think we can ship it as an optional distribution element (it's
> literally just another file folks can choose to download/use if they want).
>
> Asking users is an idea too; I could put together a survey if we want?
>
> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <[email protected]>
> wrote:
>
>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1, ==2.0.*".
>> Similarly, "foo>=2.0.0, <3.0.0" is roughly "foo~=2.0". This is a nit and
>> we don't need to focus on the syntax.
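To make the `~=` semantics concrete, a pure-stdlib sketch (real resolvers use the `packaging` library; this simplified version assumes plain X.Y.Z versions):

```python
# Sketch of what PEP 440's "compatible release" operator (~=) expands to.
# Simplified: assumes plain X.Y.Z versions; real tools use `packaging`.

def parse(v):
    """Parse a simple X.Y.Z version string into a tuple of ints."""
    return tuple(int(p) for p in v.split("."))

def compatible(spec, candidate):
    """True if `candidate` satisfies `~=spec` per PEP 440:
    ~=X.Y.Z  means  >=X.Y.Z and ==X.Y.*  (patch releases allowed)
    ~=X.Y    means  >=X.Y   and ==X.*    (minor releases allowed)
    """
    s, c = parse(spec), parse(candidate)
    return c >= s and c[: len(s) - 1] == s[: len(s) - 1]

# "foo~=2.0.1" accepts 2.0.5 but not 2.1.0:
assert compatible("2.0.1", "2.0.5")
assert not compatible("2.0.1", "2.1.0")
# "foo~=2.0" accepts 2.1.0 (like >=2.0,<3) but not 3.0.0:
assert compatible("2.0", "2.1.0")
assert not compatible("2.0", "3.0.0")
```

Note the asymmetry: how many trailing components the spec carries determines how much room later releases get.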
>>
>> I don't believe we can ship pyspark with an env lock file. That's what
>> users do in their own projects; it's not part of the Python packaging
>> system. What users normally do is install packages, test them out, then
>> lock with either pip or uv - generate a lock file for all dependencies and
>> use it across their systems. It's not common for packages to publish a
>> "known working dependency list" for users.
>>
>> However, if we really want to try it out, we can do something like `pip
>> install pyspark[full-pinned]` and install every dependency pyspark
>> requires with a pinned version. If our users need an out-of-the-box
>> solution they can do that. We can also collect feedback and gauge the
>> sentiment from users.
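A sketch of what the extra's pin list could look like (all package names and versions below are illustrative placeholders, not actual PySpark pins):

```python
# Hypothetical sketch of the pin list a `pyspark[full-pinned]` extra could
# carry. All package names and versions here are illustrative placeholders.
FULL_PINNED = [
    "pandas==2.2.0",
    "pyarrow==15.0.0",
    "grpcio==1.62.0",
]

# In setup.py this would be wired up via extras_require, roughly:
#   setup(..., extras_require={"full-pinned": FULL_PINNED})
# so `pip install pyspark[full-pinned]` pulls exact, release-tested versions.

# Every entry is an exact pin, unlike the flexible default requirements:
assert all(req.count("==") == 1 for req in FULL_PINNED)
```

Because it lives in an extra rather than `install_requires`, the default install keeps its flexible constraints and only users who opt in get the exact pins.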
>>
>> Tian
>>
>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>
>>> > If we consider PySpark the dominant package - meaning that if a user
>>> employs it, it must be the most important element in their project and
>>> everything else must comply with it - pinning versions might be viable.
>>>
>>> This is not always true, but definitely a major case.
>>>
>>> > I'm not familiar with Java dependency solutions or how users use spark
>>> with Java
>>>
>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>> management. Projects declare transitive dependencies with pinned
>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>>> most reasonable version based on resolution rules. The rules differ
>>> slightly between Maven, SBT, and Gradle; the Maven docs[1] explain how it
>>> works.
>>>
>>> In short, in Java/Scala dependency management, a pinned version is more
>>> like a suggested version; it's easy for users to override.
>>>
>>> As Owen pointed out, things are completely different in the Python
>>> world, where neither a pinned version nor the latest version seems
>>> ideal. Given the options
>>>
>>> 1. pinned version (foo==2.0.0)
>>> 2. allow maintenance releases (foo~=2.0.0)
>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>> 4. latest version (foo>=2.0.0, or foo)
>>>
>>> it seems 2 or 3 might be an acceptable solution? And I still believe we
>>> should add a disclaimer that this compatibility only holds under the
>>> assumption that 3rd-party packages strictly adhere to semantic
>>> versioning.
>>>
>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>> requirements.txt -- expressing a known good / recommended specific resolved
>>> environment. That is _not_ what Python dependency constraints are for. It's
>>> what env lock files are for.
>>>
>>> We definitely need such a dependency list in the PySpark release. It's
>>> really important for users to be able to set up a reproducible
>>> environment several years after a release, and it's also a good
>>> reference for users who encounter 3rd-party package bugs, or who battle
>>> dependency conflicts when they install lots of packages in a single
>>> environment.
>>>
>>> [1]
>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>>
>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>
>>> TL;DR Tian is more correct, and pinning versions with == does not
>>> achieve the desired outcome. There are other ways to do it; I can't
>>> think of any other Python package that works that way. This thread is
>>> conflating different things.
>>>
>>> While expressing dependence on "foo>=2.0.0" indeed can be an
>>> overly-broad claim -- do you really think it works with 5.x in 10 years? --
>>> expressing "foo==2.0.0" is very likely overly narrow. That says "does not
>>> work with any other version at all" which is likely more incorrect and more
>>> problematic for users.
>>>
>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>> requirements.txt -- expressing a known good / recommended specific resolved
>>> environment. That is _not_ what Python dependency constraints are for. It's
>>> what env lock files are for.
>>>
>>> To be sure, there is an art to figuring out the right dependency bounds.
>>> A reasonable compromise is to allow maintenance releases as a default
>>> when nothing more specific is known. That is, write "foo~=2.0.2" to mean
>>> ">=2.0.2 and <2.1".
>>>
>>> The analogy to Scala/Java/Maven land does not quite work, partly because
>>> Maven resolution is just pretty different, but mostly because the core
>>> Spark distribution is the 'server side' and is necessarily a 'fat jar':
>>> a sort of statically-compiled artifact that simply bundles specific
>>> versions and can never end up with different ones due to runtime
>>> resolution differences.
>>>
>>>
>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]>
>>> wrote:
>>>
>>>> I agree that a product must be usable first. Pinning the version (to a
>>>> specific number with `==`) will make pyspark unusable.
>>>>
>>>> First of all, I think we can agree that many users use PySpark
>>>> alongside other Python packages. If we conflict with other packages,
>>>> `pip install -r requirements.txt` won't work: it will complain that the
>>>> dependencies can't be resolved, which completely breaks our users'
>>>> workflow. Even if the user locks the dependency versions, it won't
>>>> work. So the user has to install PySpark first, then the other
>>>> packages, to override PySpark's dependencies. They can't put their
>>>> dependency list in a single file - that is a horrible user experience.
>>>>
>>>> When I look at controversial topics, I always hold a strong belief
>>>> that I can't be the only smart person in the world. If an idea is good,
>>>> others must already be doing it. Can we find any recognized package on
>>>> the market that pins its dependencies to specific versions? The only
>>>> case where it works is when that package is *all* the user needs.
>>>> That's why we pin versions for docker images, HTTP services, or
>>>> standalone tools - users just need something that works out of the box.
>>>> If we consider PySpark the dominant package - meaning that if a user
>>>> employs it, it must be the most important element in their project and
>>>> everything else must comply with it - pinning versions might be viable.
>>>>
>>>> I'm not familiar with Java dependency solutions or how users use Spark
>>>> with Java, but I'm familiar with the Python ecosystem and community. If
>>>> we pin to specific versions, we will face significant criticism. If we
>>>> must do it, at least don't make it the default. Like I said above, I
>>>> don't have a strong opinion about having a `pyspark[pinned]` - if users
>>>> only need pyspark and no other packages, they could use that. But
>>>> that's extra maintenance effort, and we need to think about what gets
>>>> pinned - we already have a lot of pyspark install variants.
>>>>
>>>> Tian Gao
>>>>
>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>
>>>>> I think the community has already reached consensus to freeze
>>>>> dependencies in minor releases.
>>>>>
>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>
>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>> > - Dependencies are frozen and behavioral changes are minimized in
>>>>> minor releases.
>>>>>
>>>>> I would interpret the proposed dependency policy as applying to both
>>>>> Java/Scala and Python dependency management in Spark. If so, that
>>>>> means PySpark will always use pinned dependency versions starting with
>>>>> 4.3.0. But if the intention is to apply such a dependency policy only
>>>>> to Java/Scala, then it creates a very strange situation - an extremely
>>>>> conservative dependency management strategy for Java/Scala, and an
>>>>> extremely liberal one for Python.
>>>>>
>>>>> To Tian Gao,
>>>>>
>>>>> > Pinning versions is a double-edged sword, it doesn't always make us
>>>>> more secure - that's my major point.
>>>>>
>>>>> A product must be usable first, then secure, performant, etc. If it
>>>>> claims to require `foo>=2.0.0`, how do you ensure it is compatible
>>>>> with foo `2.3.4`, `3.x.x`, `4.x.x`? Such incompatibility failures have
>>>>> actually occurred many times, e.g. [2]. On the contrary, if it claims
>>>>> to require `foo==2.0.0`, that means it was thoroughly tested with
>>>>> `foo==2.0.0`, and users take on the risk themselves when using it with
>>>>> other `foo` versions. For example, if `foo` strictly follows semantic
>>>>> versioning, it should work with `foo<3.0.0`, but that is not Spark's
>>>>> responsibility; users should assess and assume the risk of
>>>>> incompatibility themselves.
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>>
>>>>>
>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Response inline
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> One possibility would be to make the pinned version optional (eg
>>>>>> pyspark[pinned]) or publish a separate constraints file for people to
>>>>>> optionally use with -c?
>>>>>>
>>>>>>
>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>>> possible today for people using modern Python packaging workflows that 
>>>>>> use
>>>>>> lock files. In fact, it happens automatically; all transitive 
>>>>>> dependencies
>>>>>> are pinned in the lock file, and this is by design.
>>>>>>
>>>>> So for someone installing into a fresh venv with uv, pip, or conda,
>>>>> where does this come from?
>>>>>
>>>>> The idea here is we provide the versions we used during the release
>>>>> stage so if folks want a “known safe” initial starting point for a new env
>>>>> they’ve got one.
>>>>>
>>>>>>
>>>>>> Furthermore, it is straightforward to add additional restrictions to
>>>>>> your project spec (i.e. pyproject.toml) so that when the packaging tool
>>>>>> builds the lock file, it does it with whatever restrictions you want that
>>>>>> are specific to your project. That could include specific versions or
>>>>>> version ranges of libraries to exclude, for example.
>>>>>>
>>>>> Yes, but as it stands we leave it to the end user to pick these
>>>>> versions from scratch. We can make their lives simpler by providing
>>>>> the versions we tested against in a lock file they can choose to use,
>>>>> ignore, or update to their desired versions.
>>>>>
>>>>> Also, for interactive workloads I more often see a bare requirements
>>>>> file or even pip installs in notebook cells (but this could be sample
>>>>> bias).
>>>>>
>>>>>>
>>>>>> I had to do this, for example, on a personal project that used
>>>>>> PySpark Connect but which was pulling in a version of grpc that was
>>>>>> generating a lot of log noise
>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>> I pinned the version of grpc in my project file and let the packaging 
>>>>>> tool
>>>>>> resolve all the requirements across PySpark Connect and my custom
>>>>>> restrictions.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>>
>>>>>
>>>
>
>
>
>
