> If we consider PySpark the dominant package - meaning that if a user employs 
> it, it must be the most important element in their project and everything 
> else must comply with it - pinning versions might be viable.

This is not always true, but it is certainly a major use case.

> I'm not familiar with Java dependency solutions or how users use spark with 
> Java

In Java/Scala, it's rare to use dynamic versions for dependency management. 
A project declares its dependencies with pinned versions, and the package 
manager (Maven, SBT, Gradle, etc.) resolves transitive dependencies and picks 
the most reasonable version based on its resolution rules. The rules differ 
slightly among Maven, SBT, and Gradle; the Maven docs[1] explain how it works 
in Maven's case.

In short, in Java/Scala dependency management, a pinned version is more like 
a suggested version; it's easy for users to override.

As Owen pointed out, things are completely different in the Python world; 
neither a pinned version nor the latest version seems ideal. Of the following 
options:

1. pinned version (foo==2.0.0)
2. allow maintenance releases (foo~=2.0.0)
3. allow minor feature releases (foo>=2.0.0,<3.0.0)
4. latest version (foo>=2.0.0, or foo)

option 2 or 3 seems like it might be an acceptable solution. And I still 
believe we should add a disclaimer that this compatibility only holds under 
the assumption that third-party packages strictly adhere to semantic 
versioning.
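To make the trade-offs concrete, here is a small, self-contained sketch of 
which candidate versions each of the four styles above admits. It is 
hand-rolled purely for illustration (real resolvers follow PEP 440's rules; 
the base version `2.0.0` just mirrors the examples above):

```python
# Sketch: how the four dependency styles above admit candidate versions.
# Simplified, hand-rolled check for illustration only; real pip/uv resolvers
# implement the full PEP 440 rules. Versions are plain (major, minor, patch).

def parse(v):
    return tuple(int(x) for x in v.split("."))

def matches(style, candidate):
    c, base = parse(candidate), parse("2.0.0")
    if style == "pinned":        # 1. foo==2.0.0
        return c == base
    if style == "maintenance":   # 2. foo~=2.0.0  ->  >=2.0.0, ==2.0.*
        return c >= base and c[:2] == base[:2]
    if style == "minor":         # 3. foo>=2.0.0,<3.0.0
        return c >= base and c[0] < 3
    if style == "latest":        # 4. foo>=2.0.0
        return c >= base
    raise ValueError(style)

for v in ["2.0.0", "2.0.5", "2.3.4", "3.1.0"]:
    allowed = [s for s in ("pinned", "maintenance", "minor", "latest")
               if matches(s, v)]
    print(v, "->", allowed)
```

Only the "maintenance" and "minor" styles accept bug-fix or feature releases 
while still excluding a hypothetical breaking 3.x release, which is why 
options 2 and 3 look like the reasonable middle ground, assuming semantic 
versioning is followed.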

> You can totally produce a sort of 'lock' file -- uv.lock, requirements.txt -- 
> expressing a known good / recommended specific resolved environment. That is 
> _not_ what Python dependency constraints are for. It's what env lock files 
> are for.

We definitely need such a dependency list in the PySpark release. It's really 
important for users who need to set up a reproducible environment years after 
the release, and it's also a good reference for users who encounter 
third-party package bugs, or who battle dependency conflicts when installing 
lots of packages into a single environment.
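As a sketch of how such a release-time list could help: a user could compare 
their environment against the published `name==version` lines before filing a 
bug. Everything here (the lock format and the helper name) is a hypothetical 
illustration, not an existing PySpark facility:

```python
# Hypothetical sketch: compare the installed environment against a
# release-time "name==version" dependency list. The format and helper name
# are illustrative only, not an existing PySpark facility.
from importlib.metadata import PackageNotFoundError, version

def check_against_lock(lock_lines):
    """Return (name, pinned, installed) triples that deviate from the lock;
    installed is None when the package is missing entirely."""
    mismatches = []
    for line in lock_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, sep, pinned = line.partition("==")
        if not sep:
            continue  # ignore non-pinned lines in this simple sketch
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches
```

The output would tell a user (and Spark maintainers) exactly where their 
environment has drifted from the versions the release was tested with.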

[1] 
https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html

Thanks,
Cheng Pan



> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
> 
> TL;DR Tian is more correct, and == pinning versions is not achieving the 
> desired outcome. There are other ways to do it; I can't think of any other 
> Python package that works that way. This thread is conflating different 
> things.
> 
> While expressing dependence on "foo>=2.0.0" indeed can be an overly-broad 
> claim -- do you really think it works with 5.x in 10 years? -- expressing 
> "foo==2.0.0" is very likely overly narrow. That says "does not work with any 
> other version at all" which is likely more incorrect and more problematic for 
> users.
> 
> You can totally produce a sort of 'lock' file -- uv.lock, requirements.txt -- 
> expressing a known good / recommended specific resolved environment. That is 
> _not_ what Python dependency constraints are for. It's what env lock files 
> are for.
> 
> To be sure there is an art to figuring out the right dependency bounds. A 
> reasonable compromise is to allow maintenance releases, as a default when 
> there is nothing more specific known. That is, write "foo~=2.0.2" to mean 
> ">=2.0.0 and < 2.1".
> 
> The analogy to Scala/Java/Maven land does not quite work, partly because 
> Maven resolution is just pretty different, but mostly because the core Spark 
> distribution is the 'server side' and is necessarily a 'fat jar', a sort of 
> statically-compiled artifact that simply has some specific versions in them 
> and can never have different versions because of runtime resolution 
> differences. 
> 
> 
> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected] 
> <mailto:[email protected]>> wrote:
>> I agree that a product must be usable first. Pinning the version (to a 
>> specific number with `==`) will make pyspark unusable.
>> 
>> First of all, I think we can agree that many users use PySpark with other 
>> Python packages. If we conflict with other packages, `pip install -r 
>> requirements.txt` won't work. It will complain that the dependencies can't 
>> be resolved, which completely breaks our users' workflow. Even if the user 
>> locks the dependency version, it won't work. So the user has to install 
>> PySpark first, then the other packages, to override PySpark's dependency. 
>> They can't put their dependency list in a single file - that is a horrible 
>> user experience.
>> 
>> When I look at controversial topics, I always have a strong belief, that I 
>> can't be the only smart person in the world. If an idea is good, others must 
>> already be doing it. Can we find any recognized package in the market that 
>> pins its dependencies to a specific version? The only case it works is when 
>> this package is *all* the user needs. That's why we pin versions for docker 
>> images, HTTP services, or standalone tools - users just need something that 
>> works out of the box. If we consider PySpark the dominant package - meaning 
>> that if a user employs it, it must be the most important element in their 
>> project and everything else must comply with it - pinning versions might be 
>> viable.
>> 
>> I'm not familiar with Java dependency solutions or how users use spark with 
>> Java, but I'm familiar with the Python ecosystem and community. If we pin to 
>> a specific version, we will face significant criticism. If we must do it, at 
>> least don't make it default. Like I said above, I don't have a strong 
>> opinion about having a `pyspark[pinned]` - if users only need pyspark and no 
>> other packages they could use that. But that's extra effort for maintenance, 
>> and we need to think about what's pinned. We have a lot of pyspark install 
>> versions.
>> 
>> Tian Gao
>> 
>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected] 
>> <mailto:[email protected]>> wrote:
>>> I think the community has already reached consensus on freezing 
>>> dependencies in minor releases.
>>> 
>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>> 
>>> > Clear rules for changes allowed in minor vs. major releases:
>>> > - Dependencies are frozen and behavioral changes are minimized in minor 
>>> > releases.
>>> 
>>> I would interpret the proposed dependency policy as applying to both 
>>> Java/Scala and Python dependency management for Spark. If so, that means 
>>> PySpark will always use pinned dependency versions starting with 4.3.0. 
>>> But if the intention is to apply such a dependency policy only to 
>>> Java/Scala, then it creates a very strange situation: an extremely 
>>> conservative dependency management strategy for Java/Scala, and an 
>>> extremely liberal one for Python.
>>> 
>>> To Tian Gao,
>>> 
>>> > Pinning versions is a double-edged sword, it doesn't always make us more 
>>> > secure - that's my major point.
>>> 
>>> A product must be usable first, then secure, performant, etc. If it 
>>> claims to require `foo>=2.0.0`, how do you ensure it is compatible with 
>>> foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility failures 
>>> have occurred many times, e.g. [2]. On the contrary, if it claims to 
>>> require `foo==2.0.0`, that means it was thoroughly tested with 
>>> `foo==2.0.0`, and users take their own risk when using it with other 
>>> `foo` versions. For example, if `foo` strictly follows semantic 
>>> versioning, it should work with any `foo<3.0.0`, but that is not Spark's 
>>> responsibility; users should assess and assume the risk of 
>>> incompatibility themselves.
>>> 
>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>> [2] https://github.com/apache/spark/pull/52633
>>> 
>>> Thanks,
>>> Cheng Pan
>>> 
>>> 
>>> 
>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Response inline 
>>>> 
>>>> 
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/ 
>>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>>> Books (Learning Spark, High Performance Spark, etc.): 
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>> 
>>>> 
>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas 
>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>> 
>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> One possibility would be to make the pinned version optional (eg 
>>>>>> pyspark[pinned]) or publish a separate constraints file for people to 
>>>>>> optionally use with -c?
>>>>> 
>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is possible 
>>>>> today for people using modern Python packaging workflows that use lock 
>>>>> files. In fact, it happens automatically; all transitive dependencies are 
>>>>> pinned in the lock file, and this is by design.
>>>> 
>>>> So for someone installing a fresh venv with uv/pip/or conda where does 
>>>> this come from?
>>>> 
>>>> The idea here is we provide the versions we used during the release stage 
>>>> so if folks want a “known safe” initial starting point for a new env 
>>>> they’ve got one.
>>>>> 
>>>>> Furthermore, it is straightforward to add additional restrictions to your 
>>>>> project spec (i.e. pyproject.toml) so that when the packaging tool builds 
>>>>> the lock file, it does it with whatever restrictions you want that are 
>>>>> specific to your project. That could include specific versions or version 
>>>>> ranges of libraries to exclude, for example.
>>>> Yes, but as it stands we leave it to the end user to start from scratch 
>>>> picking these versions, we can make their lives simpler by providing the 
>>>> versions we tested against with a lock file they can choose to use, 
>>>> ignore, or update to their desired versions and include.
>>>> 
>>>> Also for interactive workloads I more often see a bare requirements file 
>>>> or even pip installs in nb cells (but this could be sample bias).
>>>>> 
>>>>> I had to do this, for example, on a personal project that used PySpark 
>>>>> Connect but which was pulling in a version of grpc that was generating a 
>>>>> lot of log noise 
>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>. I 
>>>>> pinned the version of grpc in my project file and let the packaging tool 
>>>>> resolve all the requirements across PySpark Connect and my custom 
>>>>> restrictions.
>>>>> 
>>>>> Nick
>>>>> 
>>> 
