TL;DR: Tian is more correct, and pinning versions with == does not achieve
the desired outcome. There are other ways to do it; I can't think of any
other Python package that works that way. This thread is conflating
different things.

While expressing a dependency as "foo>=2.0.0" can indeed be an overly broad
claim -- do you really think it will work with 5.x in 10 years? --
expressing "foo==2.0.0" is very likely overly narrow. It says "does not
work with any other version at all," which is likely more incorrect and
more problematic for users.

You can totally produce a sort of 'lock' file -- uv.lock, requirements.txt
-- expressing a known-good, recommended, specific resolved environment.
That is _not_ what Python dependency constraints are for. It's what env
lock files are for.
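As an illustrative sketch (the package name "foo" and the pinned versions
are hypothetical), the package metadata carries the broad compatibility
claim, while a lock file records one exact resolved environment:

```toml
# pyproject.toml -- the package's compatibility claim (intentionally broad)
[project]
name = "mypackage"           # hypothetical package
dependencies = ["foo~=2.0"]  # "works with foo 2.x maintenance releases"

# By contrast, a lock file (uv.lock, or a pip-compiled requirements.txt)
# records one known-good resolution, e.g.:
#   foo==2.0.2
#   bar==1.4.1
```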

To be sure, there is an art to figuring out the right dependency bounds. A
reasonable compromise is to allow maintenance releases as a default when
nothing more specific is known. That is, write "foo~=2.0.2" to mean
">=2.0.2 and <2.1".
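Since the ~= operator trips people up, here is a minimal stdlib-only sketch
of PEP 440's compatible-release semantics (a real resolver uses the
`packaging` library; this simplified version handles only plain X.Y.Z
numeric versions, no pre-releases or epochs):

```python
def satisfies_compatible_release(version: str, base: str = "2.0.2") -> bool:
    """Check whether `version` satisfies ~=base (PEP 440 compatible release).

    ~=2.0.2 expands to >=2.0.2, ==2.0.* (i.e. >=2.0.2 and <2.1).
    Sketch only: assumes plain dotted numeric versions.
    """
    v = tuple(int(p) for p in version.split("."))
    b = tuple(int(p) for p in base.split("."))
    # At least the base version, and matching all but base's last component.
    return v >= b and v[: len(b) - 1] == b[:-1]

print(satisfies_compatible_release("2.0.2"))  # True
print(satisfies_compatible_release("2.0.9"))  # True  (maintenance release)
print(satisfies_compatible_release("2.1.0"))  # False (new minor)
print(satisfies_compatible_release("2.0.1"))  # False (below the floor)
```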

The analogy to Scala/Java/Maven land does not quite hold, partly because
Maven resolution is just pretty different, but mostly because the core
Spark distribution is the 'server side' and is necessarily a 'fat jar': a
sort of statically-linked artifact that simply bundles specific versions
and can never end up with different versions due to runtime resolution
differences.


On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]>
wrote:

> I agree that a product must be usable first. Pinning the version (to a
> specific number with `==`) will make pyspark unusable.
>
> First of all, I think we can agree that many users use PySpark with other
> Python packages. If we conflict with other packages, `pip install -r
> requirements.txt` won't work. It will complain that the dependencies can't
> be resolved, which completely breaks our users' workflow. Even if the user
> locks the dependency version, it won't work. So the user has to install
> PySpark first, then the other packages, to override PySpark's dependencies.
> They can't put their dependency list in a single file - that is a horrible
> user experience.
>
> When I look at controversial topics, I always hold a strong belief that I
> can't be the only smart person in the world. If an idea is good, others
> must already be doing it. Can we find any recognized package in the market
> that pins its dependencies to a specific version? The only case where it
> works is when this package is *all* the user needs. That's why we pin versions for
> docker images, HTTP services, or standalone tools - users just need
> something that works out of the box. If we consider PySpark the dominant
> package - meaning that if a user employs it, it must be the most important
> element in their project and everything else must comply with it - pinning
> versions might be viable.
>
> I'm not familiar with Java dependency solutions or how users use Spark
> with Java, but I'm familiar with the Python ecosystem and community. If we
> pin to a specific version, we will face significant criticism. If we must
> do it, at least don't make it the default. As I said above, I don't have a
> strong opinion about having a `pyspark[pinned]` extra - if users only need
> pyspark and no other packages, they could use that. But that's extra effort
> for maintenance, and we need to think about what gets pinned. We have a lot
> of PySpark install variants.
>
> Tian Gao
>
> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>
>> I think the community has already reached consensus on freezing
>> dependencies in minor releases.
>>
>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>
>> > Clear rules for changes allowed in minor vs. major releases:
>> > - Dependencies are frozen and behavioral changes are minimized in minor
>> releases.
>>
>> I would interpret the proposed dependency policy as applying to both
>> Java/Scala and Python dependency management for Spark. If so, that means
>> PySpark will always use pinned dependency versions from 4.3.0 on. But if
>> the intention is to apply such a dependency policy only to Java/Scala, then
>> it creates a very strange situation: an extremely conservative dependency
>> management strategy for Java/Scala, and an extremely liberal one for Python.
>>
>> To Tian Gao,
>>
>> > Pinning versions is a double-edged sword, it doesn't always make us
>> more secure - that's my major point.
>>
>> A product must be usable first, then secure, performant, etc. If it
>> claims to require `foo>=2.0.0`, how do you ensure it is compatible with foo
>> `2.3.4`, `3.x.x`, `4.x.x`? Such incompatibility failures have actually
>> occurred many times, e.g. [2]. Conversely, if it claims to require
>> `foo==2.0.0`, that means it was thoroughly tested with `foo==2.0.0`, and
>> users take on their own risk when using it with other `foo` versions. For
>> example, if `foo` strictly follows semantic versioning, it should work with
>> `foo<3.0.0`, but this is not Spark's responsibility; users should assess
>> and assume the risk of incompatibility themselves.
>>
>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>> [2] https://github.com/apache/spark/pull/52633
>>
>> Thanks,
>> Cheng Pan
>>
>>
>>
>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>
>> Response inline
>>
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>>
>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>> [email protected]> wrote:
>>
>>>
>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]>
>>> wrote:
>>>
>>> One possibility would be to make the pinned version optional (eg
>>> pyspark[pinned]) or publish a separate constraints file for people to
>>> optionally use with -c?
>>>
>>>
>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>> possible today for people using modern Python packaging workflows that use
>>> lock files. In fact, it happens automatically; all transitive dependencies
>>> are pinned in the lock file, and this is by design.
>>>
>> So for someone installing a fresh venv with uv, pip, or conda, where does
>> this come from?
>>
>> The idea here is that we provide the versions we used during the release
>> stage, so if folks want a “known safe” initial starting point for a new
>> env, they’ve got one.
>>
>>>
>>> Furthermore, it is straightforward to add additional restrictions to
>>> your project spec (i.e. pyproject.toml) so that when the packaging tool
>>> builds the lock file, it does it with whatever restrictions you want that
>>> are specific to your project. That could include specific versions or
>>> version ranges of libraries to exclude, for example.
>>>
>> Yes, but as it stands we leave it to the end user to start from scratch
>> picking these versions. We can make their lives simpler by providing the
>> versions we tested against in a lock file they can choose to use, ignore,
>> or update to their desired versions.
>>
>> Also, for interactive workloads I more often see a bare requirements file
>> or even pip installs in notebook cells (but this could be sample bias).
>>
>>>
>>> I had to do this, for example, on a personal project that used PySpark
>>> Connect but which was pulling in a version of grpc that was generating
>>> a lot of log noise
>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>. I
>>> pinned the version of grpc in my project file and let the packaging tool
>>> resolve all the requirements across PySpark Connect and my custom
>>> restrictions.
>>>
>>> Nick
>>>
>>>
>>
