I am at PyCon US today, and the head of PyPI just put out a call to audit and pin dependencies because supply chain attacks are increasing hockey-stick style.
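For concreteness, pinning in this sense means recording exact `name==version` pairs. A minimal sketch of capturing them from the running environment, similar in spirit to `pip freeze` and using only the standard library, might look like this. `freeze_lines` is a hypothetical helper name, not anything in Spark's tooling, and a real CI job would likely just call `pip freeze` or its uv equivalent.

```python
from importlib.metadata import distributions


def freeze_lines():
    """Return sorted 'name==version' lines for the running environment."""
    pins = {}
    for dist in distributions():
        meta = dist.metadata
        # Skip entries whose metadata is missing or incomplete.
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:
            pins[name.lower()] = f"{name}=={version}"
    return [pins[key] for key in sorted(pins)]


if __name__ == "__main__":
    # Print instead of writing a file; a CI job could redirect this
    # output into whatever artifact it publishes.
    print("\n".join(freeze_lines()))
```

The output is a plain list of pins, so it can be read as a record of what was tested without being installed as a hard requirement.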
I think we don’t need to pin just yet, but let’s start publishing the package versions we built with during CI.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev <[email protected]> wrote:

> I think we should do something in response to the growing supply chain attacks rather than just leaving the problem to users. One alternative we could consider for Python specifically is an install target with upper-bounded dependencies: `pip install "pyspark[deps-upper-bounded]"`. This wouldn't impact regular use, and it seems like it would solve the other problems with publishing lock files, etc. As others have mentioned, this wouldn't *guarantee* security, but it would provide meaningful protection against the worst offenders we've recently seen.
>
> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:
>
>> > How about as a compromise, we publish (but don’t lock to) the pip freeze outputs of the venvs we use for testing?
>>
>> > Where do you propose to publish? Spark website? Maybe in our github repo somewhere?
>>
>> > I was thinking just in the publisher artifacts directory we already do.
>>
>> +1, I'm fine with any approach, as long as it provides sufficient info to let users know exactly which versions of dependencies were used for testing.
>>
>> For Java/Scala, we have a script[1]-generated dependency list in the code repo, at [2].
>>
>> [1] https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>> [2] https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>
>> Thanks,
>> Cheng Pan
>>
>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]> wrote:
>>
>> I was thinking just in the publisher artifacts directory we already do.
>>
>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]> wrote:
>>
>>> Where do you propose to publish? Spark website? Maybe in our github repo somewhere? For Python packages, users rarely look for artifacts (and they're difficult to find).
>>>
>>> Tian
>>>
>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]> wrote:
>>>
>>>> I hear that. How about as a compromise, we publish (but don’t lock to) the pip freeze outputs of the venvs we use for testing?
>>>>
>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <[email protected]> wrote:
>>>>
>>>>> I think supply chain attacks are a problem, but I don’t think we want to be on the hook for a solution here, even if it’s meant just for our project.
>>>>>
>>>>> There are “good enough” approaches available today for Python that mitigate most of the risk by excluding recent releases when resolving what package versions to install.
>>>>>
>>>>> uv offers exclude-newer <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip offers uploaded-prior-to <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>. Poetry has an issue open <https://github.com/python-poetry/poetry/issues/10646> for a similar feature, plus at least one open PR to close it.
>>>>>
>>>>> Users concerned about supply chain attacks would probably get better results from using these options as compared to installing pinned dependencies provided by the projects they use.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>>>>>
>>>>> So I think we can ship it as an optional distribution element (it's literally just another file folks can choose to download/use if they want).
>>>>>
>>>>> Asking users is an idea too, I could put together a survey if we want?
>>>>>
>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <[email protected]> wrote:
>>>>>
>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1, foo==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit and we don't need to focus on the syntax.
>>>>>>
>>>>>> I don't believe we can ship pyspark with an env lock file. That's what users do in their own projects. It's not part of the Python packaging system. What users normally do is install packages, test them out, then lock them with either pip or uv: generate a lock file for all dependencies and use it across their systems. It's not common for packages to list out a "known working dependency list" for users.
>>>>>>
>>>>>> However, if we really want to try it out, we can do something like `pip install pyspark[full-pinned]` and install every dependency pyspark requires with a pinned version. If our users need an out-of-the-box solution, they can do that. We can also collect feedback and see the sentiment from users.
>>>>>>
>>>>>> Tian
>>>>>>
>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>>>>>
>>>>>>> > If we consider PySpark the dominant package - meaning that if a user employs it, it must be the most important element in their project and everything else must comply with it - pinning versions might be viable.
>>>>>>>
>>>>>>> This is not always true, but definitely a major case.
>>>>>>>
>>>>>>> > I'm not familiar with Java dependency solutions or how users use spark with Java
>>>>>>>
>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency management. A product declares transitive dependencies with pinned versions, and the package manager (Maven, SBT, Gradle, etc.) picks the most reasonable version based on resolution rules. The rules are a little different in Maven, SBT and Gradle; the Maven docs[1] explain how they work.
>>>>>>>
>>>>>>> In short, in Java/Scala dependency management, the pinned version is more like a suggested version; it's easy for users to override.
>>>>>>>
>>>>>>> As Owen pointed out, things are completely different in the Python world. Neither the pinned version nor the latest version seems ideal. Of the options
>>>>>>>
>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>
>>>>>>> 2 or 3 might be an acceptable solution?
>>>>>>> And I still believe we should add a disclaimer that this compatibility only holds under the assumption that 3rd-party packages strictly adhere to semantic versioning.
>>>>>>>
>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock, requirements.txt -- expressing a known good / recommended specific resolved environment. That is _not_ what Python dependency constraints are for. It's what env lock files are for.
>>>>>>>
>>>>>>> We definitely need such a dependency list in the PySpark release. It's really important for users who need to set up a reproducible environment several years after the release, and it is also a good reference for users who encounter bugs in 3rd-party packages or battle dependency conflicts when they install lots of packages in a single environment.
>>>>>>>
>>>>>>> [1] https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>>
>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>
>>>>>>> TL;DR Tian is more correct, and == pinning versions is not achieving the desired outcome. There are other ways to do it; I can't think of any other Python package that works that way. This thread is conflating different things.
>>>>>>>
>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an overly-broad claim -- do you really think it works with 5.x in 10 years? -- expressing "foo==2.0.0" is very likely overly narrow. That says "does not work with any other version at all" which is likely more incorrect and more problematic for users.
>>>>>>>
>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock, requirements.txt -- expressing a known good / recommended specific resolved environment.
>>>>>>> That is _not_ what Python dependency constraints are for. It's what env lock files are for.
>>>>>>>
>>>>>>> To be sure there is an art to figuring out the right dependency bounds. A reasonable compromise is to allow maintenance releases, as a default when there is nothing more specific known. That is, write "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>>
>>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly because Maven resolution is just pretty different, but mostly because the core Spark distribution is the 'server side' and is necessarily a 'fat jar', a sort of statically-compiled artifact that simply has some specific versions in it and can never have different versions because of runtime resolution differences.
>>>>>>>
>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]> wrote:
>>>>>>>
>>>>>>>> I agree that a product must be usable first. Pinning the version (to a specific number with `==`) will make pyspark unusable.
>>>>>>>>
>>>>>>>> First of all, I think we can agree that many users use PySpark with other Python packages. If we conflict with other packages, `pip install -r requirements.txt` won't work. It will complain that the dependencies can't be resolved, which completely breaks our users' workflow. Even if the user locks the dependency version, it won't work. So the user would have to install PySpark first, then the other packages, to override PySpark's dependency. They can't put their dependency list in a single file - that is a horrible user experience.
>>>>>>>>
>>>>>>>> When I look at controversial topics, I always have a strong belief that I can't be the only smart person in the world. If an idea is good, others must already be doing it.
>>>>>>>> Can we find any recognized package in the market that pins its dependencies to a specific version? The only case where it works is when this package is *all* the user needs. That's why we pin versions for docker images, HTTP services, or standalone tools - users just need something that works out of the box. If we consider PySpark the dominant package - meaning that if a user employs it, it must be the most important element in their project and everything else must comply with it - pinning versions might be viable.
>>>>>>>>
>>>>>>>> I'm not familiar with Java dependency solutions or how users use spark with Java, but I'm familiar with the Python ecosystem and community. If we pin to a specific version, we will face significant criticism. If we must do it, at least don't make it the default. Like I said above, I don't have a strong opinion about having a `pyspark[pinned]` - if users only need pyspark and no other packages they could use that. But that's extra effort for maintenance, and we need to think about what's pinned. We have a lot of pyspark install versions.
>>>>>>>>
>>>>>>>> Tian Gao
>>>>>>>>
>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think the community has already reached consensus to freeze dependencies in minor releases.
>>>>>>>>>
>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>>>
>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>> > - Dependencies are frozen and behavioral changes are minimized in minor releases.
>>>>>>>>>
>>>>>>>>> I would interpret the proposed dependency policy as applying to both Java/Scala and Python dependency management for Spark.
>>>>>>>>> If so, that means PySpark will always use pinned dependency versions starting with 4.3.0. But if the intention is to only apply such a dependency policy to Java/Scala, then it creates a very strange situation - an extremely conservative dependency management strategy for Java/Scala, and an extremely liberal one for Python.
>>>>>>>>>
>>>>>>>>> To Tian Gao,
>>>>>>>>>
>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always make us more secure - that's my major point.
>>>>>>>>>
>>>>>>>>> A product must be usable first, then come security, performance, etc. If it claims to require `foo>=2.0.0`, how do you ensure it is compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility failures have occurred many times, e.g., [2]. On the contrary, if it claims to require `foo==2.0.0`, that means it was thoroughly tested with `foo==2.0.0`, and users take their own risk to use it with other `foo` versions. For example, if `foo` strictly follows semantic versioning, it should work with `foo<3.0.0`, but this is not Spark's responsibility; users should assess and assume the risk of incompatibility themselves.
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Cheng Pan
>>>>>>>>>
>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Response inline
>>>>>>>>>
>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> One possibility would be to make the pinned version optional (eg pyspark[pinned]) or publish a separate constraints file for people to optionally use with -c?
>>>>>>>>>>
>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is possible today for people using modern Python packaging workflows that use lock files. In fact, it happens automatically; all transitive dependencies are pinned in the lock file, and this is by design.
>>>>>>>>>>
>>>>>>>>> So for someone installing a fresh venv with uv, pip, or conda, where does this come from?
>>>>>>>>>
>>>>>>>>> The idea here is we provide the versions we used during the release stage so if folks want a “known safe” initial starting point for a new env they’ve got one.
>>>>>>>>>
>>>>>>>>>> Furthermore, it is straightforward to add additional restrictions to your project spec (i.e. pyproject.toml) so that when the packaging tool builds the lock file, it does it with whatever restrictions you want that are specific to your project. That could include specific versions or version ranges of libraries to exclude, for example.
>>>>>>>>>>
>>>>>>>>> Yes, but as it stands we leave it to the end user to start from scratch picking these versions. We can make their lives simpler by providing the versions we tested against in a lock file they can choose to use, ignore, or update to their desired versions and include.
>>>>>>>>>
>>>>>>>>> Also for interactive workloads I more often see a bare requirements file or even pip installs in notebook cells (but this could be sample bias).
>>>>>>>>>
>>>>>>>>>> I had to do this, for example, on a personal project that used PySpark Connect but which was pulling in a version of grpc that was generating a lot of log noise <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>. I pinned the version of grpc in my project file and let the packaging tool resolve all the requirements across PySpark Connect and my custom restrictions.
>>>>>>>>>>
>>>>>>>>>> Nick

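The specifier forms compared in the thread (`==`, `~=`, and explicit ranges) can be checked with a small sketch. This is a simplified, standard-library-only model of the PEP 440 compatible-release operator (`~=`) that handles plain numeric versions only. The names `parse`, `pad`, and `compatible_release` and the candidate list are illustrative assumptions; real tools rely on the `packaging` library rather than anything like this.

```python
def parse(version):
    """Turn '2.0.1' into (2, 0, 1). Only plain numeric releases are handled."""
    return tuple(int(part) for part in version.split("."))


def pad(release, width):
    """Right-pad a release tuple with zeros, so 2.0 compares equal to 2.0.0."""
    return release + (0,) * (width - len(release))


def compatible_release(spec, candidate):
    """Simplified model of 'foo~=spec' for plain numeric versions.

    PEP 440 defines ~=X.Y.Z as >=X.Y.Z, ==X.Y.* (drop the last segment
    of the spec and require the candidate to share the remaining prefix).
    """
    base, cand = parse(spec), parse(candidate)
    if len(base) < 2:
        raise ValueError("~= needs at least two release segments")
    width = max(len(base), len(cand))
    prefix = base[:-1]
    at_least = pad(cand, width) >= pad(base, width)
    same_prefix = pad(cand, width)[:len(prefix)] == prefix
    return at_least and same_prefix


candidates = ["2.0.0", "2.0.1", "2.0.9", "2.1.0", "3.0.0"]
maintenance_only = [c for c in candidates if compatible_release("2.0.1", c)]
minor_allowed = [c for c in candidates if compatible_release("2.0", c)]
print(maintenance_only)  # ['2.0.1', '2.0.9']
print(minor_allowed)     # ['2.0.0', '2.0.1', '2.0.9', '2.1.0']
```

Under this model, `~=2.0.1` admits only 2.0.x maintenance releases at or above 2.0.1, while `~=2.0` admits any 2.x release, which matches the `>=2.0.0,<3.0.0` range for these candidates.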