I think the community has already reached consensus on freezing dependencies in
minor releases.

SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]

> Clear rules for changes allowed in minor vs. major releases:
> - Dependencies are frozen and behavioral changes are minimized in minor 
> releases.

I would interpret the proposed dependency policy as applying to both Java/Scala
and Python dependency management for Spark. If so, that means PySpark will
always use pinned dependency versions starting from 4.3.0. But if the intention
is to apply such a policy only to Java/Scala, then it creates a very strange
situation: an extremely conservative dependency management strategy for
Java/Scala, and an extremely liberal one for Python.

To Tian Gao,

> Pinning versions is a double-edged sword, it doesn't always make us more 
> secure - that's my major point.

A product must be usable first; security, performance, etc. come after. If it
declares a requirement of `foo>=2.0.0`, how do you ensure it is compatible with
foo `2.3.4`, `3.x.x`, or `4.x.x`? In fact, such incompatibility failures have
occurred many times, e.g. [2]. On the contrary, if it declares `foo==2.0.0`,
that means it was thoroughly tested with `foo==2.0.0`, and users take on the
risk themselves when using it with other `foo` versions. For example, if `foo`
strictly follows semantic versioning, it should work with `foo<3.0.0`, but that
is not Spark's responsibility; users should assess and assume the risk of
incompatibility themselves.
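As a concrete sketch of the difference between the two requirement styles
(using the third-party `packaging` library that pip itself relies on; the
package name `foo` and all version numbers below are placeholders, not real
Spark pins):

```python
# Sketch: which versions each requirement style admits.
# "foo" and the version numbers are placeholders for illustration only.
from packaging.specifiers import SpecifierSet

open_ended = SpecifierSet(">=2.0.0")           # what `foo>=2.0.0` declares
pinned = SpecifierSet("==2.0.0")               # what `foo==2.0.0` declares
user_semver = SpecifierSet(">=2.0.0,<3.0.0")   # a user's own semver-based bound

for version in ["2.0.0", "2.3.4", "3.1.0", "4.0.0"]:
    print(
        f"{version}: open-ended={version in open_ended}, "
        f"pinned={version in pinned}, semver-bounded={version in user_semver}"
    )
```

The open-ended specifier admits every future major release, which is exactly
the compatibility surface nobody has tested; the pin admits only the tested
version, and the semver-bounded range is a risk the user assesses and assumes
explicitly.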

[1] https://issues.apache.org/jira/browse/SPARK-54633
[2] https://github.com/apache/spark/pull/52633

Thanks,
Cheng Pan



> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
> 
> Response inline 
> 
> 
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
> 
> 
> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <[email protected]> wrote:
>> 
>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]> wrote:
>>> 
>>> One possibility would be to make the pinned version optional (eg 
>>> pyspark[pinned]) or publish a separate constraints file for people to 
>>> optionally use with -c?
>> 
>> Perhaps I am misunderstanding your proposal, Holden, but this is possible 
>> today for people using modern Python packaging workflows that use lock 
>> files. In fact, it happens automatically; all transitive dependencies are 
>> pinned in the lock file, and this is by design.
> 
> So for someone installing a fresh venv with uv/pip/or conda where does this 
> come from?
> 
> The idea here is we provide the versions we used during the release stage so 
> if folks want a “known safe” initial starting point for a new env they’ve got 
> one.
>> 
>> Furthermore, it is straightforward to add additional restrictions to your 
>> project spec (i.e. pyproject.toml) so that when the packaging tool builds 
>> the lock file, it does it with whatever restrictions you want that are 
>> specific to your project. That could include specific versions or version 
>> ranges of libraries to exclude, for example.
> Yes, but as it stands we leave it to the end user to start from scratch 
> picking these versions, we can make their lives simpler by providing the 
> versions we tested against with a lock file they can choose to use, ignore, 
> or update to their desired versions and include.
> 
> Also for interactive workloads I more often see a bare requirements file or 
> even pip installs in nb cells (but this could be sample bias).
>> 
>> I had to do this, for example, on a personal project that used PySpark 
>> Connect but which was pulling in a version of grpc that was generating a lot 
>> of log noise 
>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>. I 
>> pinned the version of grpc in my project file and let the packaging tool 
>> resolve all the requirements across PySpark Connect and my custom 
>> restrictions.
>> 
>> Nick
>> 
