So I think we can ship it as an optional distribution element (it's literally just another file folks can choose to download/use if they want).
Asking users is an idea too, I could put together a survey if we want?

On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <[email protected]> wrote:

> I believe "foo~=2.0.1" is syntax sugar for "foo>=2.0.1, foo==2.0.*".
> Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit and we don't
> need to focus on the syntax.
>
> I don't believe we can ship pyspark with an env lock file. That's what
> users do in their own projects. It's not part of the Python package system.
> What users normally do is install packages, test them out, then lock them
> with either pip or uv - generate a lock file for all dependencies and use
> it across their systems. It's not common for packages to list out a "known
> working dependency list" for users.
>
> However, if we really want to try it out, we can do something like `pip
> install pyspark[full-pinned]` and install every dependency pyspark requires
> with a pinned version. If our users need an out-of-the-box solution they
> can do that. We can also collect feedback and see the sentiment from users.
>
> Tian
>
> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>
>> > If we consider PySpark the dominant package - meaning that if a user
>> > employs it, it must be the most important element in their project and
>> > everything else must comply with it - pinning versions might be viable.
>>
>> This is not always true, but it is definitely a major case.
>>
>> > I'm not familiar with Java dependency solutions or how users use spark
>> > with Java
>>
>> In Java/Scala, it's rare to use dynamic versions for dependency
>> management. A product declares transitive dependencies with pinned
>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>> most reasonable version based on resolution rules. The rules are a little
>> different in Maven, SBT and Gradle; the Maven docs[1] explain how it works.
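Tian's expansion of the `~=` ("compatible release") operator can be sketched in plain Python. This is a minimal illustration of the PEP 440 rule only, not a replacement for a real specifier parser (it ignores pre-releases, epochs, and other edge cases):

```python
# Minimal sketch of PEP 440 "compatible release" (~=) semantics,
# without relying on the third-party `packaging` library.

def expand_compatible(spec: str) -> tuple[str, str]:
    """Expand "~=X.Y.Z" into the equivalent (>=, ==prefix.*) pair."""
    assert spec.startswith("~="), "only handles the ~= operator"
    version = spec[2:]
    parts = version.split(".")
    assert len(parts) >= 2, "~= requires at least two version components"
    prefix = ".".join(parts[:-1])  # drop the final release segment
    return (f">={version}", f"=={prefix}.*")

# "foo~=2.0.1" means ">=2.0.1, ==2.0.*" (i.e. >=2.0.1 and <2.1)
print(expand_compatible("~=2.0.1"))
# "foo~=2.0" means ">=2.0, ==2.*" (i.e. >=2.0 and <3.0)
print(expand_compatible("~=2.0"))
```

Note the asymmetry this illustrates: `~=2.0.1` allows only maintenance releases of the 2.0 series, while `~=2.0` allows any 2.x release.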
>>
>> In short, in Java/Scala dependency management, the pinned version is more
>> like a suggested version; it's easy for users to override.
>>
>> As Owen pointed out, things are completely different in the Python world.
>> Neither pinned versions nor latest versions seem ideal, so among
>>
>> 1. pinned version (foo==2.0.0)
>> 2. allow maintenance releases (foo~=2.0.0)
>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>> 4. latest version (foo>=2.0.0, or foo)
>>
>> it seems 2 or 3 might be an acceptable solution? And, I still believe we
>> should add a disclaimer that this compatibility only holds under the
>> assumption that 3rd-party packages strictly adhere to semantic versioning.
>>
>> > You can totally produce a sort of 'lock' file -- uv.lock,
>> > requirements.txt -- expressing a known good / recommended specific
>> > resolved environment. That is _not_ what Python dependency constraints
>> > are for. It's what env lock files are for.
>>
>> We definitely need such a dependency list in the PySpark release. It's
>> really important for users to set up a reproducible environment several
>> years after the release, and it is also a good reference for users who
>> encounter 3rd-party package bugs, or battle with dependency conflicts when
>> they install lots of packages in a single environment.
>>
>> [1]
>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>
>> TL;DR Tian is more correct, and pinning versions with == does not achieve
>> the desired outcome. There are other ways to do it; I can't think of any
>> other Python package that works that way. This thread is conflating
>> different things.
>>
>> While expressing dependence on "foo>=2.0.0" can indeed be an overly broad
>> claim -- do you really think it works with 5.x in 10 years? -- expressing
>> "foo==2.0.0" is very likely overly narrow.
>> That says "does not work with
>> any other version at all", which is likely more incorrect and more
>> problematic for users.
>>
>> You can totally produce a sort of 'lock' file -- uv.lock,
>> requirements.txt -- expressing a known good / recommended specific
>> resolved environment. That is _not_ what Python dependency constraints are
>> for. It's what env lock files are for.
>>
>> To be sure, there is an art to figuring out the right dependency bounds.
>> A reasonable compromise is to allow maintenance releases as a default when
>> nothing more specific is known. That is, write "foo~=2.0.2" to mean
>> ">=2.0.2 and <2.1".
>>
>> The analogy to Scala/Java/Maven land does not quite work, partly because
>> Maven resolution is just pretty different, but mostly because the core
>> Spark distribution is the 'server side' and is necessarily a 'fat jar', a
>> sort of statically-compiled artifact that simply ships specific versions
>> and can never have different versions because of runtime resolution
>> differences.
>>
>>
>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]>
>> wrote:
>>
>>> I agree that a product must be usable first. Pinning versions (to a
>>> specific number with `==`) will make pyspark unusable.
>>>
>>> First of all, I think we can agree that many users use PySpark with
>>> other Python packages. If we conflict with other packages, `pip install
>>> -r requirements.txt` won't work. It will complain that the dependencies
>>> can't be resolved, which completely breaks our users' workflow. Even if
>>> the user locks the dependency version, it won't work. So the user has to
>>> install PySpark first, then the other packages, to override PySpark's
>>> dependencies. They can't put their dependency list in a single file -
>>> that is a horrible user experience.
>>>
>>> When I look at controversial topics, I always hold a strong belief that
>>> I can't be the only smart person in the world.
>>> If an idea is good, others
>>> must already be doing it. Can we find any recognized package in the
>>> market that pins its dependencies to specific versions? The only case
>>> where it works is when this package is *all* the user needs. That's why
>>> we pin versions for docker images, HTTP services, or standalone tools -
>>> users just need something that works out of the box. If we consider
>>> PySpark the dominant package - meaning that if a user employs it, it must
>>> be the most important element in their project and everything else must
>>> comply with it - pinning versions might be viable.
>>>
>>> I'm not familiar with Java dependency solutions or how users use spark
>>> with Java, but I'm familiar with the Python ecosystem and community. If
>>> we pin to specific versions, we will face significant criticism. If we
>>> must do it, at least don't make it the default. Like I said above, I
>>> don't have a strong opinion about having a `pyspark[pinned]` - if users
>>> only need pyspark and no other packages they could use that. But that's
>>> extra effort for maintenance, and we need to think about what's pinned.
>>> We have a lot of pyspark install variants.
>>>
>>> Tian Gao
>>>
>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>
>>>> I think the community has already reached consensus to freeze
>>>> dependencies in minor releases.
>>>>
>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>
>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>> > - Dependencies are frozen and behavioral changes are minimized in
>>>> > minor releases.
>>>>
>>>> I would interpret the proposed dependency policy as applying to both
>>>> Java/Scala and Python dependency management for Spark. If so, that means
>>>> PySpark will always use pinned dependency versions since 4.3.0.
>>>> But if the
>>>> intention is to only apply such a dependency policy to Java/Scala, then
>>>> it creates a very strange situation - an extremely conservative
>>>> dependency management strategy for Java/Scala, and an extremely liberal
>>>> one for Python.
>>>>
>>>> To Tian Gao,
>>>>
>>>> > Pinning versions is a double-edged sword, it doesn't always make us
>>>> > more secure - that's my major point.
>>>>
>>>> A product must be usable first, then comes security, performance, etc.
>>>> If it claims to require `foo>=2.0.0`, how do you ensure it is compatible
>>>> with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility
>>>> failures have occurred many times, e.g. [2]. On the contrary, if it
>>>> claims to require `foo==2.0.0`, that means it was thoroughly tested with
>>>> `foo==2.0.0`, and users take their own risk to use it with other `foo`
>>>> versions. For example, if `foo` strictly follows semantic versioning, it
>>>> should work with `foo<3.0.0`, but this is not Spark's responsibility;
>>>> users should assess and assume the risk of incompatibility themselves.
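To make the trade-off in this thread concrete, here is a small sketch (plain Python, three-component versions only, ignoring PEP 440 subtleties such as pre-releases) of which hypothetical `foo` versions each of the four strategies listed earlier would admit:

```python
# Sketch: which foo versions each dependency strategy admits.
# Versions are modeled as simple integer tuples; this is an
# illustration, not a real PEP 440 resolver.

def v(s: str) -> tuple[int, ...]:
    """Parse "X.Y.Z" into a comparable tuple like (X, Y, Z)."""
    return tuple(int(x) for x in s.split("."))

strategies = {
    "1. foo==2.0.0":        lambda x: v(x) == v("2.0.0"),
    "2. foo~=2.0.0":        lambda x: v("2.0.0") <= v(x) < v("2.1.0"),
    "3. foo>=2.0.0,<3.0.0": lambda x: v("2.0.0") <= v(x) < v("3.0.0"),
    "4. foo>=2.0.0":        lambda x: v(x) >= v("2.0.0"),
}

# Hypothetical released versions of foo.
candidates = ["2.0.0", "2.0.5", "2.3.4", "3.1.0"]

for name, accepts in strategies.items():
    admitted = [c for c in candidates if accepts(c)]
    print(f"{name}: admits {admitted}")
```

The output makes the spectrum visible: strategy 1 admits only the exact tested version, 2 adds maintenance releases, 3 adds minor releases, and 4 admits everything including a future major release that may break compatibility.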
>>>>
>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>> [2] https://github.com/apache/spark/pull/52633
>>>>
>>>> Thanks,
>>>> Cheng Pan
>>>>
>>>>
>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>>>
>>>> Response inline
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> Pronouns: she/her
>>>>
>>>>
>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <[email protected]> wrote:
>>>>
>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]> wrote:
>>>>>
>>>>> One possibility would be to make the pinned version optional (e.g.
>>>>> pyspark[pinned]) or publish a separate constraints file for people to
>>>>> optionally use with -c?
>>>>>
>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is
>>>>> possible today for people using modern Python packaging workflows that
>>>>> use lock files. In fact, it happens automatically; all transitive
>>>>> dependencies are pinned in the lock file, and this is by design.
>>>>>
>>>> So for someone installing a fresh venv with uv/pip/conda, where does
>>>> this come from?
>>>>
>>>> The idea here is that we provide the versions we used during the release
>>>> stage, so if folks want a "known safe" initial starting point for a new
>>>> env, they've got one.
>>>>
>>>>> Furthermore, it is straightforward to add additional restrictions to
>>>>> your project spec (i.e. pyproject.toml) so that when the packaging tool
>>>>> builds the lock file, it does so with whatever restrictions you want
>>>>> that are specific to your project. That could include specific versions
>>>>> or version ranges of libraries to exclude, for example.
>>>>>
>>>> Yes, but as it stands we leave it to the end user to start from scratch
>>>> picking these versions. We can make their lives simpler by providing the
>>>> versions we tested against in a lock file they can choose to use,
>>>> ignore, or update to their desired versions.
>>>>
>>>> Also, for interactive workloads I more often see a bare requirements
>>>> file or even pip installs in notebook cells (but this could be sample
>>>> bias).
>>>>
>>>>> I had to do this, for example, on a personal project that used PySpark
>>>>> Connect but which was pulling in a version of grpc that was generating
>>>>> a lot of log noise
>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>> I pinned the version of grpc in my project file and let the packaging
>>>>> tool resolve all the requirements across PySpark Connect and my custom
>>>>> restrictions.
>>>>>
>>>>> Nick
>>>>>
>>
--
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
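As a reference for readers, the approach Nicholas describes (pin one troublesome transitive dependency in your own project spec and let the resolver handle everything else) can be sketched as a pyproject.toml fragment. The project name and every version number below are illustrative assumptions, not values taken from any real project or recommendation:

```toml
# Hypothetical end-user project spec. The user depends on PySpark Connect
# but pins grpcio themselves; the lock-file tool resolves the rest.
# All version numbers are illustrative only.
[project]
name = "my-spark-app"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = [
    "pyspark[connect]~=4.0",  # allow maintenance releases of pyspark
    "grpcio==1.62.1",         # user-chosen pin to avoid a noisy version
]
```

With a tool like uv, running `uv lock` against such a spec resolves and pins every transitive dependency under these constraints, which is the automatic lock-file behavior Nicholas refers to.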
