A single source of truth does sound desirable; let me take a look at narrowing that down a bit too.
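As a rough illustration of the single-source-of-truth idea, here is a minimal sketch that derives both a constraints file (for use with `pip install -c`) and the `==` requirements for a `pyspark[pinned]`-style extra from one `pip freeze` capture. The function names, the `FROZEN` sample, and the versions in it are all hypothetical placeholders, not anything Spark CI produces today.

```python
def parse_freeze(text):
    """Parse `pip freeze`-style text into a {name: version} mapping.

    Only simple `name==version` lines are handled; blank lines, comments,
    and other forms (editable installs, direct URLs) are skipped.
    """
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def constraints_text(pins):
    """Render the pins as a constraints file usable with `pip install -c`."""
    return "".join(f"{name}=={version}\n" for name, version in sorted(pins.items()))


def pinned_extra(pins, direct_deps):
    """Render `==` requirements for an extra such as `pyspark[pinned]`.

    Raises KeyError if a direct dependency is missing from the pins, so the
    two outputs cannot silently drift apart.
    """
    return [f"{name}=={pins[name.lower()]}" for name in direct_deps]


# Placeholder input: these versions are made up for illustration and are
# NOT the versions any Spark release was tested with.
FROZEN = """\
# hypothetical output captured from a CI venv
numpy==2.1.0
pandas==2.2.3
pyarrow==18.0.0
"""

if __name__ == "__main__":
    pins = parse_freeze(FROZEN)
    print(constraints_text(pins), end="")
    print(pinned_extra(pins, ["pandas", "pyarrow"]))
```

The point of the sketch is only that both published artifacts come from the same captured list, so they cannot disagree with each other.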
On Mon, May 18, 2026 at 4:30 PM Tian Gao via dev <[email protected]> wrote:

> We can do either a list of packages from `pip freeze` on our website, or a
> `pyspark[pinned]` that has `==`. I'm okay with either (or both).
>
> If we want to do that, we probably want to pin our package versions on our
> stable spark versions. We only partially pin our dependencies for our CI
> for maintenance branches, so we do not even have the list now (we may have
> it for a certain date, but the list could change any time in the future).
>
> I think we should come up with a more official CI system so we always test
> the released versions (4.0, 4.1 ...) with pinned versions of packages
> (which are the "known working dependencies"), and be more relaxed for dev
> branches (4.x, master) because we need to test against new releases for our
> dependencies.
>
> More importantly, it would be really nice to have a single source of
> truth. We have too many places to pin the python dependency versions.
>
> Tian
>
> On Sun, May 17, 2026 at 9:52 AM Holden Karau <[email protected]>
> wrote:
>
>> I am at PyCon USA today and the PyPI head just did a call out to audit
>> and pin dependencies because the supply chain attacks are increasing hockey
>> stick style.
>>
>> I think we don’t need to pin just yet but let’s add publishing the
>> package versions we built with during CI.
>>
>>
>> Twitter: https://twitter.com/holdenkarau
>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>> <https://www.fighthealthinsurance.com/?q=hk_email>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> Pronouns: she/her
>>
>> On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev <
>> [email protected]> wrote:
>>
>>> I think we should do something in response to the growing supply chain
>>> attacks rather than just leaving the problem to users.
>>> One alternative we could consider for Python specifically is an install
>>> target with upper bounded dependencies:
>>> `pip install "pyspark[deps-upper-bounded]"`. This wouldn't impact regular
>>> use, and seems like it would solve the other problems with publishing lock
>>> files, etc. As others have mentioned, this wouldn't *guarantee* security,
>>> but it would provide meaningful protection against the worst offenders
>>> we've recently seen.
>>>
>>> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:
>>>
>>>> > How about as a compromise, we publish (but don’t lock to) the pip
>>>> > freeze outputs of the venvs we use for testing?
>>>>
>>>> > Where do you propose to publish? Spark website? Maybe in our github
>>>> > repo somewhere?
>>>>
>>>> > I was thinking just in the publisher artifacts directory we already
>>>> > do.
>>>>
>>>> +1, I'm fine with any approach, as long as it provides sufficient info
>>>> to let users know exactly which versions of dependencies were used for
>>>> testing.
>>>>
>>>> For Java/Scala, we have a dependency list generated by a script [1] and
>>>> checked into the code repo at [2].
>>>>
>>>> [1]
>>>> https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>>>> [2]
>>>> https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>>>
>>>> Thanks,
>>>> Cheng Pan
>>>>
>>>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]> wrote:
>>>>
>>>> I was thinking just in the publisher artifacts directory we already do.
>>>>
>>>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]>
>>>> wrote:
>>>>
>>>>> Where do you propose to publish? Spark website? Maybe in our github
>>>>> repo somewhere? For python packages, users rarely look for artifacts (and
>>>>> it's difficult to find).
>>>>>
>>>>> Tian
>>>>>
>>>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I hear that. How about as a compromise, we publish (but don’t lock
>>>>>> to) the pip freeze outputs of the venvs we use for testing?
>>>>>>
>>>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think supply chain attacks are a problem, but I don’t think we
>>>>>>> want to be on the hook for a solution here, even if it’s meant just for
>>>>>>> our project.
>>>>>>>
>>>>>>> There are “good enough” approaches available today for Python that
>>>>>>> mitigate most of the risk by excluding recent releases when resolving
>>>>>>> what package versions to install.
>>>>>>>
>>>>>>> uv offers exclude-newer
>>>>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>.
>>>>>>> pip offers uploaded-prior-to
>>>>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>>>>> Poetry has an issue open
>>>>>>> <https://github.com/python-poetry/poetry/issues/10646> for a
>>>>>>> similar feature, plus at least one open PR to close it.
>>>>>>>
>>>>>>> Users concerned about supply chain attacks would probably get better
>>>>>>> results from using these options as compared to installing pinned
>>>>>>> dependencies provided by the projects they use.
>>>>>>>
>>>>>>> Nick
>>>>>>>
>>>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> So I think we can ship it as an optional distribution element (it's
>>>>>>> literally just another file folks can choose to download/use if they
>>>>>>> want).
>>>>>>>
>>>>>>> Asking users is an idea too, I could put together a survey if we
>>>>>>> want?
>>>>>>>
>>>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1,
>>>>>>>> foo==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a
>>>>>>>> nit and we don't need to focus on the syntax.
>>>>>>>>
>>>>>>>> I don't believe we can ship pyspark with an env lock file. That's
>>>>>>>> what users do in their own projects. It's not part of the Python
>>>>>>>> packaging system. What users normally do is install packages, test
>>>>>>>> them out, then lock them with either pip or uv - generate a lock file
>>>>>>>> for all dependencies and use it across their systems. It's not common
>>>>>>>> for packages to list out a "known working dependency list" for users.
>>>>>>>>
>>>>>>>> However, if we really want to try it out, we can do something like
>>>>>>>> `pip install pyspark[full-pinned]` and install every dependency pyspark
>>>>>>>> requires with a pinned version. If our users need an out-of-the-box
>>>>>>>> solution they can do that.
>>>>>>>> We can also collect feedback and see the sentiment from users.
>>>>>>>>
>>>>>>>> Tian
>>>>>>>>
>>>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> > If we consider PySpark the dominant package - meaning that if a
>>>>>>>>> > user employs it, it must be the most important element in their
>>>>>>>>> > project and everything else must comply with it - pinning versions
>>>>>>>>> > might be viable.
>>>>>>>>>
>>>>>>>>> This is not always true, but definitely a major case.
>>>>>>>>>
>>>>>>>>> > I'm not familiar with Java dependency solutions or how users use
>>>>>>>>> > spark with Java
>>>>>>>>>
>>>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>>>>> management. A product declares transitive dependencies with pinned
>>>>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>>>>>>>>> most reasonable version based on resolution rules. The rules are a
>>>>>>>>> little different in Maven, SBT and Gradle; the Maven docs[1] explain
>>>>>>>>> how it works.
>>>>>>>>>
>>>>>>>>> In short, in Java/Scala dependency management, the pinned version
>>>>>>>>> is more like a suggested version; it's easy for users to override.
>>>>>>>>>
>>>>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>>>>> world. Neither a pinned version nor the latest version seems ideal,
>>>>>>>>> so among
>>>>>>>>>
>>>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>>>
>>>>>>>>> 2 or 3 might be an acceptable solution? And, I still believe
>>>>>>>>> we should add a disclaimer that this compatibility only holds under
>>>>>>>>> the assumption that 3rd-party packages strictly adhere to semantic
>>>>>>>>> versioning.
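
To make options 2 and 3 in the list above concrete, here is a minimal sketch of how the `~=` (compatible release) operator expands under the PyPA version specifier rules: `~=X.Y.Z` behaves like `>=X.Y.Z, ==X.Y.*`, and `~=X.Y` behaves like `>=X.Y, ==X.*`. The helper names are made up for illustration, and the sketch only handles plain dotted numeric versions (no pre-releases, epochs, or local versions); for real use the `packaging` library covers the full rules.

```python
def parse_version(version):
    """Turn a plain dotted version such as '2.0.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def compatible_release_bounds(spec_version):
    """Return (lower, upper) such that `~=spec_version` means lower <= v < upper.

    The upper bound is formed by dropping the last release segment and
    bumping the one before it, e.g. 2.0.1 -> 2.1 and 2.0 -> 3.
    """
    parts = parse_version(spec_version)
    if len(parts) < 2:
        raise ValueError("~= needs at least two release segments")
    prefix = parts[:-1]
    upper = prefix[:-1] + (prefix[-1] + 1,)
    return parts, upper


def satisfies_compatible(version, spec_version):
    """Check whether `version` satisfies `~=spec_version`."""
    lower, upper = compatible_release_bounds(spec_version)
    return lower <= parse_version(version) < upper


if __name__ == "__main__":
    # Option 2: maintenance releases only.
    print(satisfies_compatible("2.0.5", "2.0.1"))  # True
    print(satisfies_compatible("2.1.0", "2.0.1"))  # False
    print(satisfies_compatible("2.0.0", "2.0.1"))  # False: below the lower bound
    # Option 3 written as ~=2.0: any 2.x release.
    print(satisfies_compatible("2.9.0", "2.0"))    # True
    print(satisfies_compatible("3.0.0", "2.0"))    # False
```

Note that the lower bound is the version written in the specifier itself, so `~=2.0.1` excludes 2.0.0.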
>>>>>>>>>
>>>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>> > requirements.txt -- expressing a known good / recommended specific
>>>>>>>>> > resolved environment. That is _not_ what Python dependency
>>>>>>>>> > constraints are for. It's what env lock files are for.
>>>>>>>>>
>>>>>>>>> We definitely need such a dependency list in each PySpark release. It's
>>>>>>>>> really important for users who need to set up a reproducible
>>>>>>>>> environment several years after the release, and it is also a good
>>>>>>>>> reference for users who encounter 3rd-party package bugs, or battle
>>>>>>>>> with dependency conflicts when they install lots of packages in a
>>>>>>>>> single environment.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Cheng Pan
>>>>>>>>>
>>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> TL;DR Tian is more correct, and == pinning versions is not
>>>>>>>>> achieving the desired outcome. There are other ways to do it; I can't
>>>>>>>>> think of any other Python package that works that way. This thread is
>>>>>>>>> conflating different things.
>>>>>>>>>
>>>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an
>>>>>>>>> overly-broad claim -- do you really think it works with 5.x in 10
>>>>>>>>> years? -- expressing "foo==2.0.0" is very likely overly narrow. That
>>>>>>>>> says "does not work with any other version at all" which is likely
>>>>>>>>> more incorrect and more problematic for users.
>>>>>>>>>
>>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>>>> are for.
>>>>>>>>> It's what env lock files are for.
>>>>>>>>>
>>>>>>>>> To be sure there is an art to figuring out the right dependency
>>>>>>>>> bounds. A reasonable compromise is to allow maintenance releases, as a
>>>>>>>>> default when there is nothing more specific known. That is, write
>>>>>>>>> "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>>>>
>>>>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>>>>>> because Maven resolution is just pretty different, but mostly because
>>>>>>>>> the core Spark distribution is the 'server side' and is necessarily a
>>>>>>>>> 'fat jar', a sort of statically-compiled artifact that simply has some
>>>>>>>>> specific versions in it and can never have different versions because
>>>>>>>>> of runtime resolution differences.
>>>>>>>>>
>>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I agree that a product must be usable first. Pinning the version
>>>>>>>>>> (to a specific number with `==`) will make pyspark unusable.
>>>>>>>>>>
>>>>>>>>>> First of all, I think we can agree that many users use PySpark
>>>>>>>>>> with other Python packages. If we conflict with other packages, `pip
>>>>>>>>>> install -r requirements.txt` won't work. It will complain that the
>>>>>>>>>> dependencies can't be resolved, which completely breaks our users'
>>>>>>>>>> workflow. Even if the user locks the dependency version, it won't
>>>>>>>>>> work. So the user has to install PySpark first, then the other
>>>>>>>>>> packages, to override PySpark's dependency. They can't put their
>>>>>>>>>> dependency list in a single file - that is a horrible user experience.
>>>>>>>>>>
>>>>>>>>>> When I look at controversial topics, I always have a strong
>>>>>>>>>> belief that I can't be the only smart person in the world. If an
>>>>>>>>>> idea is good, others must already be doing it.
>>>>>>>>>> Can we find any recognized package in the market that pins its
>>>>>>>>>> dependencies to a specific version? The only case where it works is
>>>>>>>>>> when this package is *all* the user needs. That's why we pin versions
>>>>>>>>>> for docker images, HTTP services, or standalone tools - users just
>>>>>>>>>> need something that works out of the box. If we consider PySpark the
>>>>>>>>>> dominant package - meaning that if a user employs it, it must be the
>>>>>>>>>> most important element in their project and everything else must
>>>>>>>>>> comply with it - pinning versions might be viable.
>>>>>>>>>>
>>>>>>>>>> I'm not familiar with Java dependency solutions or how users use
>>>>>>>>>> spark with Java, but I'm familiar with the Python ecosystem and
>>>>>>>>>> community. If we pin to a specific version, we will face significant
>>>>>>>>>> criticism. If we must do it, at least don't make it the default. Like
>>>>>>>>>> I said above, I don't have a strong opinion about having a
>>>>>>>>>> `pyspark[pinned]` - if users only need pyspark and no other packages
>>>>>>>>>> they could use that. But that's extra effort for maintenance, and we
>>>>>>>>>> need to think about what's pinned. We have a lot of pyspark install
>>>>>>>>>> versions.
>>>>>>>>>>
>>>>>>>>>> Tian Gao
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think the community has already reached consensus to freeze
>>>>>>>>>>> dependencies in minor releases.
>>>>>>>>>>>
>>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>>>>>
>>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>>> > - Dependencies are frozen and behavioral changes are minimized
>>>>>>>>>>> > in minor releases.
>>>>>>>>>>>
>>>>>>>>>>> I would interpret the proposed dependency policy as applying to both
>>>>>>>>>>> Java/Scala and Python dependency management for Spark. If so, that
>>>>>>>>>>> means PySpark will always use pinned dependency versions starting
>>>>>>>>>>> with 4.3.0. But if the intention is to only apply such a dependency
>>>>>>>>>>> policy to Java/Scala, then it creates a very strange situation - an
>>>>>>>>>>> extremely conservative dependency management strategy for Java/Scala,
>>>>>>>>>>> and an extremely liberal one for Python.
>>>>>>>>>>>
>>>>>>>>>>> To Tian Gao,
>>>>>>>>>>>
>>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always
>>>>>>>>>>> > make us more secure - that's my major point.
>>>>>>>>>>>
>>>>>>>>>>> A product must be usable first, then come security, performance, etc.
>>>>>>>>>>> If it claims to require `foo>=2.0.0`, how do you ensure it is
>>>>>>>>>>> compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such
>>>>>>>>>>> incompatibility failures have occurred many times, e.g., [2]. On the
>>>>>>>>>>> contrary, if it claims to require `foo==2.0.0`, that means it was
>>>>>>>>>>> thoroughly tested with `foo==2.0.0`, and users take their own risk
>>>>>>>>>>> when using it with other `foo` versions. For example, if `foo`
>>>>>>>>>>> strictly follows semantic versioning, it should work with
>>>>>>>>>>> `foo<3.0.0`, but this is not Spark's responsibility; users should
>>>>>>>>>>> assess and assume the risk of incompatibility themselves.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>
>>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Response inline
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> One possibility would be to make the pinned version optional
>>>>>>>>>>>> (eg pyspark[pinned]) or publish a separate constraints file for
>>>>>>>>>>>> people to optionally use with -c?
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this
>>>>>>>>>>>> is possible today for people using modern Python packaging
>>>>>>>>>>>> workflows that use lock files. In fact, it happens automatically;
>>>>>>>>>>>> all transitive dependencies are pinned in the lock file, and this
>>>>>>>>>>>> is by design.
>>>>>>>>>>>>
>>>>>>>>>>> So for someone installing a fresh venv with uv/pip/or conda
>>>>>>>>>>> where does this come from?
>>>>>>>>>>>
>>>>>>>>>>> The idea here is we provide the versions we used during the
>>>>>>>>>>> release stage so if folks want a “known safe” initial starting
>>>>>>>>>>> point for a new env they’ve got one.
>>>>>>>>>>>
>>>>>>>>>>>> Furthermore, it is straightforward to add additional
>>>>>>>>>>>> restrictions to your project spec (i.e. pyproject.toml) so that
>>>>>>>>>>>> when the packaging tool builds the lock file, it does it with
>>>>>>>>>>>> whatever restrictions you want that are specific to your project.
>>>>>>>>>>>> That could include specific versions or version ranges of
>>>>>>>>>>>> libraries to exclude, for example.
>>>>>>>>>>>>
>>>>>>>>>>> Yes, but as it stands we leave it to the end user to start from
>>>>>>>>>>> scratch picking these versions, we can make their lives simpler by
>>>>>>>>>>> providing the versions we tested against with a lock file they can
>>>>>>>>>>> choose to use, ignore, or update to their desired versions and
>>>>>>>>>>> include.
>>>>>>>>>>>
>>>>>>>>>>> Also for interactive workloads I more often see a bare
>>>>>>>>>>> requirements file or even pip installs in nb cells (but this could
>>>>>>>>>>> be sample bias).
>>>>>>>>>>>
>>>>>>>>>>>> I had to do this, for example, on a personal project that used
>>>>>>>>>>>> PySpark Connect but which was pulling in a version of grpc
>>>>>>>>>>>> that was generating a lot of log noise
>>>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>>>> I pinned the version of grpc in my project file and let the
>>>>>>>>>>>> packaging tool resolve all the requirements across PySpark Connect
>>>>>>>>>>>> and my custom restrictions.
>>>>>>>>>>>>
>>>>>>>>>>>> Nick

--
Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her
