TL;DR: Tian is more correct, and pinning versions with == does not achieve the desired outcome. There are other ways to do it; I can't think of any other Python package that works that way. This thread is conflating different things.
While expressing a dependency as "foo>=2.0.0" can indeed be an overly broad claim -- do you really think it will still work with 5.x in 10 years? -- expressing "foo==2.0.0" is very likely overly narrow. That says "does not work with any other version at all," which is likely more incorrect and more problematic for users.

You can totally produce a sort of 'lock' file -- uv.lock, requirements.txt -- expressing a known-good / recommended specific resolved environment. That is _not_ what Python dependency constraints are for; it's what env lock files are for.

To be sure, there is an art to figuring out the right dependency bounds. A reasonable compromise, as a default when nothing more specific is known, is to allow maintenance releases. That is, write "foo~=2.0.2" to mean ">=2.0.2 and <2.1".

The analogy to Scala/Java/Maven land does not quite work, partly because Maven resolution is just pretty different, but mostly because the core Spark distribution is the 'server side' and is necessarily a 'fat jar', a sort of statically-compiled artifact that simply has specific versions baked in and can never have different versions due to runtime resolution differences.

On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]> wrote:

> I agree that a product must be usable first. Pinning the version (to a
> specific number with `==`) will make pyspark unusable.
>
> First of all, I think we can agree that many users use PySpark with other
> Python packages. If we conflict with other packages, `pip install -r
> requirements.txt` won't work. It will complain that the dependencies can't
> be resolved, which completely breaks our users' workflow. Even if the user
> locks the dependency versions, it won't work. So the user would have to
> install PySpark first, then the other packages, to override PySpark's
> dependencies. They can't put their dependency list in a single file - that
> is a horrible user experience.
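As an aside, the compatible-release operator mentioned above can be made concrete with a tiny stdlib-only sketch. This is a toy model that assumes plain dotted-integer versions; real resolvers implement full PEP 440 parsing:

```python
def compatible_release(version: str, spec: str = "2.0.2") -> bool:
    """Toy model of 'foo~=2.0.2', i.e. '>=2.0.2 and <2.1'.

    Assumes plain X.Y.Z integer versions (no pre/post/dev segments);
    real tools parse versions per PEP 440.
    """
    v = tuple(int(part) for part in version.split("."))
    s = tuple(int(part) for part in spec.split("."))
    # At least the spec version, and in the same release series
    # (all but the last spec component must match).
    return v >= s and v[: len(s) - 1] == s[: len(s) - 1]

print(compatible_release("2.0.5"))  # True: a later maintenance release
print(compatible_release("2.1.0"))  # False: outside the 2.0 series
print(compatible_release("2.0.1"))  # False: older than the floor
```

The point of `~=` is exactly this middle ground: it admits patch releases by default without claiming compatibility with versions nobody has tested.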
>
> When I look at controversial topics, I always hold a strong belief that I
> can't be the only smart person in the world. If an idea is good, others
> must already be doing it. Can we find any recognized package in the market
> that pins its dependencies to specific versions? The only case where it
> works is when this package is *all* the user needs. That's why we pin
> versions for Docker images, HTTP services, or standalone tools - users just
> need something that works out of the box. If we consider PySpark the
> dominant package - meaning that if a user employs it, it must be the most
> important element in their project and everything else must comply with it
> - pinning versions might be viable.
>
> I'm not familiar with Java dependency solutions or how users use Spark
> with Java, but I'm familiar with the Python ecosystem and community. If we
> pin to specific versions, we will face significant criticism. If we must do
> it, at least don't make it the default. Like I said above, I don't have a
> strong opinion about having a `pyspark[pinned]` - if users only need
> pyspark and no other packages, they could use that. But that's extra
> maintenance effort, and we need to think about what's pinned. We have a lot
> of pyspark install versions.
>
> Tian Gao
>
> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>
>> I think the community has already reached consensus to freeze
>> dependencies in minor releases.
>>
>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>
>> > Clear rules for changes allowed in minor vs. major releases:
>> > - Dependencies are frozen and behavioral changes are minimized in minor
>> releases.
>>
>> I would interpret the proposed dependency policy as applying to both
>> Java/Scala and Python dependency management for Spark. If so, that means
>> PySpark will always use pinned dependency versions since 4.3.0.
But if the
>> intention is to apply such a dependency policy only to Java/Scala, then it
>> creates a very strange situation - an extremely conservative dependency
>> management strategy for Java/Scala, and an extremely liberal one for
>> Python.
>>
>> To Tian Gao,
>>
>> > Pinning versions is a double-edged sword, it doesn't always make us
>> more secure - that's my major point.
>>
>> A product must be usable first, then secure, performant, etc. If it
>> claims to require `foo>=2.0.0`, how do you ensure it is compatible with
>> foo `2.3.4`, `3.x.x`, `4.x.x`? Such incompatibility failures have actually
>> occurred many times, e.g. [2]. On the contrary, if it claims to require
>> `foo==2.0.0`, that means it was thoroughly tested with `foo==2.0.0`, and
>> users take on their own risk when using it with other `foo` versions. For
>> example, if `foo` strictly follows semantic versioning, it should work
>> with `foo<3.0.0`, but this is not Spark's responsibility; users should
>> assess and assume the risk of incompatibility themselves.
>>
>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>> [2] https://github.com/apache/spark/pull/52633
>>
>> Thanks,
>> Cheng Pan
>>
>>
>>
>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>
>> Response inline
>>
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>>
>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>> [email protected]> wrote:
>>
>>>
>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]>
>>> wrote:
>>>
>>> One possibility would be to make the pinned version optional (e.g.
>>> pyspark[pinned]) or publish a separate constraints file for people to
>>> optionally use with -c?
>>>
>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>> possible today for people using modern Python packaging workflows that
>>> use lock files. In fact, it happens automatically; all transitive
>>> dependencies are pinned in the lock file, and this is by design.
>>>
>> So for someone installing into a fresh venv with uv, pip, or conda, where
>> does this come from?
>>
>> The idea here is that we provide the versions we used during the release
>> stage, so if folks want a "known safe" initial starting point for a new
>> env, they've got one.
>>
>>> Furthermore, it is straightforward to add additional restrictions to
>>> your project spec (i.e. pyproject.toml) so that when the packaging tool
>>> builds the lock file, it does so with whatever restrictions you want that
>>> are specific to your project. That could include specific versions, or
>>> version ranges of libraries to exclude, for example.
>>>
>> Yes, but as it stands we leave it to the end user to start from scratch
>> picking these versions. We can make their lives simpler by providing the
>> versions we tested against in a lock file they can choose to use, ignore,
>> or update to their desired versions.
>>
>> Also, for interactive workloads I more often see a bare requirements file
>> or even pip installs in notebook cells (but this could be sample bias).
>>
>>> I had to do this, for example, on a personal project that used PySpark
>>> Connect but which was pulling in a version of grpc that was generating
>>> a lot of log noise
>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>. I
>>> pinned the version of grpc in my project file and let the packaging tool
>>> resolve all the requirements across PySpark Connect and my custom
>>> restrictions.
>>>
>>> Nick
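For what it's worth, the `-c` flow discussed in this thread would look roughly like this from the user's side. The constraints file name and its publication are hypothetical -- Spark does not ship such a file today -- and the key property is that a pip constraints file only caps resolution without adding any dependencies of its own:

```shell
# Hypothetical: a constraints file published alongside each PySpark release,
# listing the exact dependency versions that release was tested against.
# Users opt in; ignoring the file changes nothing about resolution.
pip install "pyspark[connect]" -c pyspark-4.3.0-constraints.txt

# uv's pip-compatible interface accepts the same flag:
uv pip install "pyspark[connect]" -c pyspark-4.3.0-constraints.txt
```

This keeps loose (`>=` / `~=`) bounds in PySpark's own metadata while still giving users a one-line path to the "known safe" environment the release was validated against.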
