So I think we can ship it as an optional distribution element (it's literally just another file folks can choose to download/use if they want).
Asking users is an idea too, I could put together a survey if we want?

On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <[email protected]> wrote:

> I believe "foo~=2.0.1" is syntax sugar for "foo>=2.0.1, foo==2.0.*".
> Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit and we don't
> need to focus on the syntax.
>
> I don't believe we can ship pyspark with an env lock file. That's what
> users do in their own projects. It's not part of the Python package system.
> What users normally do is install packages, test them out, then lock them
> with either pip or uv - generate a lock file for all dependencies and use
> it across their systems. It's not common for packages to list out a "known
> working dependency list" for users.
>
> However, if we really want to try it out, we can do something like `pip
> install pyspark[full-pinned]` and install every dependency pyspark requires
> with a pinned version. If our users need an out-of-the-box solution they
> can do that. We can also collect feedback and see the sentiment from users.
>
> Tian
>
> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>
>> > If we consider PySpark the dominant package - meaning that if a user
>> > employs it, it must be the most important element in their project and
>> > everything else must comply with it - pinning versions might be viable.
>>
>> This is not always true, but it is definitely a major case.
>>
>> > I'm not familiar with Java dependency solutions or how users use spark
>> > with Java
>>
>> In Java/Scala, it's rare to use dynamic versions for dependency
>> management. A product declares transitive dependencies with pinned
>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>> most reasonable version based on resolution rules. The rules are a little
>> different in Maven, SBT and Gradle; the Maven docs[1] explain how it works.
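Tian's expansion of the `~=` ("compatible release") operator can be sketched in plain Python. This is a minimal illustration of the PEP 440 rule only, not a replacement for a real specifier parser (it ignores pre-releases, epochs, and other edge cases):

```python
# Minimal sketch of PEP 440 "compatible release" (~=) semantics,
# without relying on the third-party `packaging` library.

def expand_compatible(spec: str) -> tuple[str, str]:
    """Expand "~=X.Y.Z" into the equivalent (>=, ==prefix.*) pair."""
    assert spec.startswith("~="), "only handles the ~= operator"
    version = spec[2:]
    parts = version.split(".")
    assert len(parts) >= 2, "~= requires at least two version components"
    prefix = ".".join(parts[:-1])  # drop the final release segment
    return (f">={version}", f"=={prefix}.*")

# "foo~=2.0.1" means ">=2.0.1, ==2.0.*" (i.e. >=2.0.1 and <2.1)
print(expand_compatible("~=2.0.1"))
# "foo~=2.0" means ">=2.0, ==2.*" (i.e. >=2.0 and <3.0)
print(expand_compatible("~=2.0"))
```

Note the asymmetry this illustrates: `~=2.0.1` allows only maintenance releases of the 2.0 series, while `~=2.0` allows any 2.x release.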
>>
>> In short, in Java/Scala dependency management, the pinned version is more
>> like a suggested version; it's easy for users to override.
>>
>> As Owen pointed out, things are completely different in the Python world.
>> Neither pinned versions nor latest versions seem ideal, so among
>>
>> 1. pinned version (foo==2.0.0)
>> 2. allow maintenance releases (foo~=2.0.0)
>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>> 4. latest version (foo>=2.0.0, or foo)
>>
>> it seems 2 or 3 might be an acceptable solution? And, I still believe we
>> should add a disclaimer that this compatibility only holds under the
>> assumption that 3rd-party packages strictly adhere to semantic versioning.
>>
>> > You can totally produce a sort of 'lock' file -- uv.lock,
>> > requirements.txt -- expressing a known good / recommended specific
>> > resolved environment. That is _not_ what Python dependency constraints
>> > are for. It's what env lock files are for.
>>
>> We definitely need such a dependency list in the PySpark release. It's
>> really important for users to set up a reproducible environment several
>> years after the release, and it is also a good reference for users who
>> encounter 3rd-party package bugs, or battle with dependency conflicts when
>> they install lots of packages in a single environment.
>>
>> [1]
>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>
>> TL;DR Tian is more correct, and pinning versions with == does not achieve
>> the desired outcome. There are other ways to do it; I can't think of any
>> other Python package that works that way. This thread is conflating
>> different things.
>>
>> While expressing dependence on "foo>=2.0.0" can indeed be an overly broad
>> claim -- do you really think it works with 5.x in 10 years? -- expressing
>> "foo==2.0.0" is very likely overly narrow.
>> That says "does not work with
>> any other version at all", which is likely more incorrect and more
>> problematic for users.
>>
>> You can totally produce a sort of 'lock' file -- uv.lock,
>> requirements.txt -- expressing a known good / recommended specific
>> resolved environment. That is _not_ what Python dependency constraints are
>> for. It's what env lock files are for.
>>
>> To be sure, there is an art to figuring out the right dependency bounds.
>> A reasonable compromise is to allow maintenance releases as a default when
>> nothing more specific is known. That is, write "foo~=2.0.2" to mean
>> ">=2.0.2 and <2.1".
>>
>> The analogy to Scala/Java/Maven land does not quite work, partly because
>> Maven resolution is just pretty different, but mostly because the core
>> Spark distribution is the 'server side' and is necessarily a 'fat jar', a
>> sort of statically-compiled artifact that simply ships specific versions
>> and can never have different versions because of runtime resolution
>> differences.
>>
>>
>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]>
>> wrote:
>>
>>> I agree that a product must be usable first. Pinning versions (to a
>>> specific number with `==`) will make pyspark unusable.
>>>
>>> First of all, I think we can agree that many users use PySpark with
>>> other Python packages. If we conflict with other packages, `pip install
>>> -r requirements.txt` won't work. It will complain that the dependencies
>>> can't be resolved, which completely breaks our users' workflow. Even if
>>> the user locks the dependency version, it won't work. So the user has to
>>> install PySpark first, then the other packages, to override PySpark's
>>> dependencies. They can't put their dependency list in a single file -
>>> that is a horrible user experience.
>>>
>>> When I look at controversial topics, I always hold a strong belief that
>>> I can't be the only smart person in the world.
>>> If an idea is good, others
>>> must already be doing it. Can we find any recognized package in the
>>> market that pins its dependencies to specific versions? The only case
>>> where it works is when this package is *all* the user needs. That's why
>>> we pin versions for docker images, HTTP services, or standalone tools -
>>> users just need something that works out of the box. If we consider
>>> PySpark the dominant package - meaning that if a user employs it, it must
>>> be the most important element in their project and everything else must
>>> comply with it - pinning versions might be viable.
>>>
>>> I'm not familiar with Java dependency solutions or how users use spark
>>> with Java, but I'm familiar with the Python ecosystem and community. If
>>> we pin to specific versions, we will face significant criticism. If we
>>> must do it, at least don't make it the default. Like I said above, I
>>> don't have a strong opinion about having a `pyspark[pinned]` - if users
>>> only need pyspark and no other packages they could use that. But that's
>>> extra effort for maintenance, and we need to think about what's pinned.
>>> We have a lot of pyspark install variants.
>>>
>>> Tian Gao
>>>
>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>
>>>> I think the community has already reached consensus to freeze
>>>> dependencies in minor releases.
>>>>
>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>
>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>> > - Dependencies are frozen and behavioral changes are minimized in
>>>> > minor releases.
>>>>
>>>> I would interpret the proposed dependency policy as applying to both
>>>> Java/Scala and Python dependency management for Spark. If so, that means
>>>> PySpark will always use pinned dependency versions since 4.3.0.
>>>> But if the
>>>> intention is to only apply such a dependency policy to Java/Scala, then
>>>> it creates a very strange situation - an extremely conservative
>>>> dependency management strategy for Java/Scala, and an extremely liberal
>>>> one for Python.
>>>>
>>>> To Tian Gao,
>>>>
>>>> > Pinning versions is a double-edged sword, it doesn't always make us
>>>> > more secure - that's my major point.
>>>>
>>>> A product must be usable first, then comes security, performance, etc.
>>>> If it claims to require `foo>=2.0.0`, how do you ensure it is compatible
>>>> with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility
>>>> failures have occurred many times, e.g. [2]. On the contrary, if it
>>>> claims to require `foo==2.0.0`, that means it was thoroughly tested with
>>>> `foo==2.0.0`, and users take their own risk to use it with other `foo`
>>>> versions. For example, if `foo` strictly follows semantic versioning, it
>>>> should work with `foo<3.0.0`, but this is not Spark's responsibility;
>>>> users should assess and assume the risk of incompatibility themselves.
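To make the trade-off in this thread concrete, here is a small sketch (plain Python, three-component versions only, ignoring PEP 440 subtleties such as pre-releases) of which hypothetical `foo` versions each of the four strategies listed earlier would admit:

```python
# Sketch: which foo versions each dependency strategy admits.
# Versions are modeled as simple integer tuples; this is an
# illustration, not a real PEP 440 resolver.

def v(s: str) -> tuple[int, ...]:
    """Parse "X.Y.Z" into a comparable tuple like (X, Y, Z)."""
    return tuple(int(x) for x in s.split("."))

strategies = {
    "1. foo==2.0.0":        lambda x: v(x) == v("2.0.0"),
    "2. foo~=2.0.0":        lambda x: v("2.0.0") <= v(x) < v("2.1.0"),
    "3. foo>=2.0.0,<3.0.0": lambda x: v("2.0.0") <= v(x) < v("3.0.0"),
    "4. foo>=2.0.0":        lambda x: v(x) >= v("2.0.0"),
}

# Hypothetical released versions of foo.
candidates = ["2.0.0", "2.0.5", "2.3.4", "3.1.0"]

for name, accepts in strategies.items():
    admitted = [c for c in candidates if accepts(c)]
    print(f"{name}: admits {admitted}")
```

The output makes the spectrum visible: strategy 1 admits only the exact tested version, 2 adds maintenance releases, 3 adds minor releases, and 4 admits everything including a future major release that may break compatibility.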
>>>>
>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>> [2] https://github.com/apache/spark/pull/52633
>>>>
>>>> Thanks,
>>>> Cheng Pan
>>>>
>>>>
>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>>>
>>>> Response inline
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>>
>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <[email protected]> wrote:
>>>>
>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]> wrote:
>>>>>
>>>>> One possibility would be to make the pinned version optional (e.g.
>>>>> pyspark[pinned]) or publish a separate constraints file for people to
>>>>> optionally use with -c?
>>>>>
>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>> possible today for people using modern Python packaging workflows that
>>>>> use lock files. In fact, it happens automatically; all transitive
>>>>> dependencies are pinned in the lock file, and this is by design.
>>>>>
>>>> So for someone installing a fresh venv with uv/pip/conda, where does
>>>> this come from?
>>>>
>>>> The idea here is that we provide the versions we used during the release
>>>> stage, so if folks want a "known safe" initial starting point for a new
>>>> env, they've got one.
>>>>
>>>>> Furthermore, it is straightforward to add additional restrictions to
>>>>> your project spec (i.e. pyproject.toml) so that when the packaging tool
>>>>> builds the lock file, it does so with whatever restrictions you want
>>>>> that are specific to your project. That could include specific versions
>>>>> or version ranges of libraries to exclude, for example.
>>>>>
>>>> Yes, but as it stands we leave it to the end user to start from scratch
>>>> picking these versions. We can make their lives simpler by providing the
>>>> versions we tested against in a lock file they can choose to use,
>>>> ignore, or update to their desired versions.
>>>>
>>>> Also, for interactive workloads I more often see a bare requirements
>>>> file or even pip installs in notebook cells (but this could be sample
>>>> bias).
>>>>
>>>>> I had to do this, for example, on a personal project that used PySpark
>>>>> Connect but which was pulling in a version of grpc that was generating
>>>>> a lot of log noise
>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>> I pinned the version of grpc in my project file and let the packaging
>>>>> tool resolve all the requirements across PySpark Connect and my custom
>>>>> restrictions.
>>>>>
>>>>> Nick
>>>>>
>>
--
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
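As a reference for readers, the approach Nicholas describes (pin one troublesome transitive dependency in your own project spec and let the resolver handle everything else) can be sketched as a pyproject.toml fragment. The project name and every version number below are illustrative assumptions, not values taken from any real project or recommendation:

```toml
# Hypothetical end-user project spec. The user depends on PySpark Connect
# but pins grpcio themselves; the lock-file tool resolves the rest.
# All version numbers are illustrative only.
[project]
name = "my-spark-app"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = [
    "pyspark[connect]~=4.0",  # allow maintenance releases of pyspark
    "grpcio==1.62.1",         # user-chosen pin to avoid a noisy version
]
```

With a tool like uv, running `uv lock` against such a spec resolves and pins every transitive dependency under these constraints, which is the automatic lock-file behavior Nicholas refers to.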
