The one I filed was https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-56924

I did not see https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-31167 since that’s not about locking dependencies, but it probably does make sense to address both at the same time.

Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Tue, May 19, 2026 at 6:01 AM Nicholas Chammas <[email protected]> wrote:

> Holden, you didn’t mention or link to the ticket you filed.
>
> This is the ticket I filed about roughly the same issue back in 2020: SPARK-31167 / associated PR <https://github.com/apache/spark/pull/27928>
>
> On May 18, 2026, at 8:12 PM, Holden Karau <[email protected]> wrote:
>
> Awesome, I started on one but it's super rough, so I’ll leave it to you, Tian :) (I filed a JIRA, so grab the existing JIRA for coordination.)
>
> On Mon, May 18, 2026 at 5:03 PM Tian Gao <[email protected]> wrote:
>
>> I can work on a prototype. My thought is that we should keep the dependency list in `pyproject.toml`. We can have dependency groups for all the different scenarios (test/dev, minimum/lint/docs, etc.). Then for generating docker images, we include `pyproject.toml` and pip install based on that. I believe we can keep the single source of truth in that file (which is a common way to do things) and still be flexible.
>>
>> On Mon, May 18, 2026 at 4:55 PM Holden Karau <[email protected]> wrote:
>>
>>> Single source of truth does sound desirable, let me take a look at narrowing that down a bit too.
>>>
>>> On Mon, May 18, 2026 at 4:30 PM Tian Gao via dev <[email protected]> wrote:
>>>
>>>> We can do either a list of packages from `pip freeze` on our website, or a `pyspark[pinned]` that has `==`. I'm okay with either (or both).
>>>>
>>>> If we want to do that, we probably want to pin our package versions on our stable spark versions. We only partially pin our dependencies for our CI for maintenance branches, so we do not even have the list now (we may have it for a certain date, but the list could change any time in the future).
>>>>
>>>> I think we should come up with a more official CI system so we always test the released versions (4.0, 4.1 ...) with pinned versions of packages (which are the "known working dependencies"), and be more relaxed for dev branches (4.x, master) because we need to test against new releases of our dependencies.
>>>>
>>>> More importantly, it would be really nice to have a single source of truth. We have too many places to pin the python dependency versions.
>>>>
>>>> Tian
>>>>
>>>> On Sun, May 17, 2026 at 9:52 AM Holden Karau <[email protected]> wrote:
>>>>
>>>>> I am at PyCon USA today and the PyPI head just did a call out to audit and pin dependencies because the supply chain attacks are increasing hockey stick style.
>>>>>
>>>>> I think we don’t need to pin just yet but let’s add publishing the package versions we built with during CI.
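As a rough illustration of publishing the package versions a CI run was built with, the snapshot could be as small as the sketch below. It uses only the standard library's `importlib.metadata`; the function name `freeze_snapshot` is hypothetical, and nothing here reflects what Spark's CI actually does today.

```python
from importlib import metadata


def freeze_snapshot() -> list[str]:
    """Return sorted `name==version` lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installs whose metadata has no readable name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Print a `pip freeze`-style listing of the current environment.
    print("\n".join(freeze_snapshot()))
```

A CI job could redirect this output to a file and upload it alongside the other build artifacts, which gives users a record of the tested versions without constraining what they install.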
>>>>>
>>>>> On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev <[email protected]> wrote:
>>>>>
>>>>>> I think we should do something in response to the growing supply chain attacks rather than just leaving the problem to users. One alternative we could consider for Python specifically is an install target with upper bounded dependencies: `pip install "pyspark[deps-upper-bounded]"`. This wouldn't impact regular use, and seems like it would solve the other problems with publishing lock files, etc. As others have mentioned, this wouldn't *guarantee* security, but it would provide meaningful protection against the worst offenders we've recently seen.
>>>>>>
>>>>>> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:
>>>>>>
>>>>>>> > How about as a compromise, we publish (but don’t lock to) the pip freeze outputs of the venvs we use for testing?
>>>>>>>
>>>>>>> > Where do you propose to publish? Spark website? Maybe in our github repo somewhere?
>>>>>>>
>>>>>>> > I was thinking just in the publisher artifacts directory we already do.
>>>>>>>
>>>>>>> +1, I'm fine with any approach, as long as it provides sufficient info to let users know exactly which versions of dependencies were used for testing.
>>>>>>>
>>>>>>> For Java/Scala, we have a script-generated [1] dependency list in the code repo, at [2].
>>>>>>>
>>>>>>> [1] https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>>>>>>> [2] https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>>
>>>>>>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]> wrote:
>>>>>>>
>>>>>>> I was thinking just in the publisher artifacts directory we already do.
>>>>>>>
>>>>>>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]> wrote:
>>>>>>>
>>>>>>>> Where do you propose to publish? Spark website? Maybe in our github repo somewhere? For python packages, users rarely look for artifacts (and they're difficult to find).
>>>>>>>>
>>>>>>>> Tian
>>>>>>>>
>>>>>>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I hear that. How about as a compromise, we publish (but don’t lock to) the pip freeze outputs of the venvs we use for testing?
>>>>>>>>>
>>>>>>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I think supply chain attacks are a problem, but I don’t think we want to be on the hook for a solution here, even if it’s meant just for our project.
>>>>>>>>>>
>>>>>>>>>> There are “good enough” approaches available today for Python that mitigate most of the risk by excluding recent releases when resolving what package versions to install.
>>>>>>>>>>
>>>>>>>>>> uv offers exclude-newer <https://docs.astral.sh/uv/reference/settings/#exclude-newer>.
>>>>>>>>>> pip offers uploaded-prior-to <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>>>>>>>> Poetry has an issue open <https://github.com/python-poetry/poetry/issues/10646> for a similar feature, plus at least one open PR to close it.
>>>>>>>>>>
>>>>>>>>>> Users concerned about supply chain attacks would probably get better results from using these options as compared to installing pinned dependencies provided by the projects they use.
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> So I think we can ship it as an optional distribution element (it's literally just another file folks can choose to download/use if they want).
>>>>>>>>>>
>>>>>>>>>> Asking users is an idea too, I could put together a survey if we want?
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1, foo==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a nit and we don't need to focus on the syntax.
>>>>>>>>>>>
>>>>>>>>>>> I don't believe we can ship pyspark with an env lock file. That's what users do in their own projects. It's not part of the python package system. What users normally do is install packages, test them out, then lock them with either pip or uv - generate a lock file for all dependencies and use it across their systems. It's not common for packages to list out a "known working dependency list" for users.
>>>>>>>>>>>
>>>>>>>>>>> However, if we really want to try it out, we can do something like `pip install pyspark[full-pinned]` and install every dependency pyspark requires with a pinned version. If our users need an out-of-box solution they can do that. We can also collect feedback and see the sentiment from users.
>>>>>>>>>>>
>>>>>>>>>>> Tian
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > If we consider PySpark the dominant package - meaning that if a user employs it, it must be the most important element in their project and everything else must comply with it - pinning versions might be viable.
>>>>>>>>>>>>
>>>>>>>>>>>> This is not always true, but definitely a major case.
>>>>>>>>>>>>
>>>>>>>>>>>> > I'm not familiar with Java dependency solutions or how users use spark with Java
>>>>>>>>>>>>
>>>>>>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency management. A product declares transitive dependencies with pinned versions, and the package manager (Maven, SBT, Gradle, etc.) picks the most reasonable version based on resolution rules. The rules are a little different in Maven, SBT and Gradle; the Maven docs[1] explain how it works.
>>>>>>>>>>>>
>>>>>>>>>>>> In short, in Java/Scala dependency management, the pinned version is more like a suggested version, and it's easy for users to override.
>>>>>>>>>>>>
>>>>>>>>>>>> As Owen pointed out, things are completely different in the Python world: neither a pinned version nor the latest version seems ideal. Of these options,
>>>>>>>>>>>>
>>>>>>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>>>>>>
>>>>>>>>>>>> it seems 2 or 3 might be an acceptable solution? And I still believe we should add a disclaimer that this compatibility only holds under the assumption that 3rd-party packages strictly adhere to semantic versioning.
>>>>>>>>>>>>
>>>>>>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock, requirements.txt -- expressing a known good / recommended specific resolved environment. That is _not_ what Python dependency constraints are for. It's what env lock files are for.
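To make the specifier semantics discussed in this thread concrete, here is a minimal sketch of the compatible-release operator (`~=`) as PEP 440 defines it. It handles only plain dotted numeric releases (no pre-, post-, or dev-releases), and the helper names are hypothetical; real tooling would normally use the `packaging` library instead.

```python
def parse(version: str) -> tuple[int, ...]:
    """Parse a plain dotted release such as "2.0.1" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def compatible_release_range(version: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Return the (inclusive lower, exclusive upper) bounds implied by `~=version`."""
    release = parse(version)
    if len(release) < 2:
        raise ValueError("~= needs at least two release segments, e.g. 2.0")
    # Drop the last segment and bump the one before it: 2.0.1 -> 2.1, 2.0 -> 3.
    upper = release[:-2] + (release[-2] + 1,)
    return release, upper


def satisfies_compatible(candidate: str, version: str) -> bool:
    """True if `candidate` satisfies `~=version` for plain dotted releases."""
    lower, upper = compatible_release_range(version)
    cand = parse(candidate)
    # Zero-pad so that "2.0" and "2.0.0" compare as the same release.
    cand += (0,) * (len(lower) - len(cand))
    return lower <= cand < upper


print(satisfies_compatible("2.0.5", "2.0.1"))  # True: maintenance release
print(satisfies_compatible("2.1.0", "2.0.1"))  # False: minor release excluded
print(satisfies_compatible("2.9.3", "2.0"))    # True: ~=2.0 allows any 2.x
print(satisfies_compatible("3.0.0", "2.0"))    # False: next major excluded
```

In other words, `foo~=2.0.1` accepts `>=2.0.1, <2.1`, while `foo~=2.0` accepts `>=2.0, <3`.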
>>>>>>>>>>>>
>>>>>>>>>>>> We definitely need such a dependency list in each PySpark release. It's really important for users who need to set up a reproducible environment several years after the release, and it is also a good reference for users who encounter bugs in 3rd-party packages, or who battle dependency conflicts when they install lots of packages in a single environment.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> TL;DR Tian is more correct, and == pinning versions is not achieving the desired outcome. There are other ways to do it; I can't think of any other Python package that works that way. This thread is conflating different things.
>>>>>>>>>>>>
>>>>>>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an overly-broad claim -- do you really think it works with 5.x in 10 years? -- expressing "foo==2.0.0" is very likely overly narrow. That says "does not work with any other version at all" which is likely more incorrect and more problematic for users.
>>>>>>>>>>>>
>>>>>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock, requirements.txt -- expressing a known good / recommended specific resolved environment. That is _not_ what Python dependency constraints are for. It's what env lock files are for.
>>>>>>>>>>>>
>>>>>>>>>>>> To be sure there is an art to figuring out the right dependency bounds.
>>>>>>>>>>>> A reasonable compromise is to allow maintenance releases, as a default when there is nothing more specific known. That is, write "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>>>>>>>
>>>>>>>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly because Maven resolution is just pretty different, but mostly because the core Spark distribution is the 'server side' and is necessarily a 'fat jar', a sort of statically-compiled artifact that simply has some specific versions in it and can never have different versions because of runtime resolution differences.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that a product must be usable first. Pinning the version (to a specific number with `==`) will make pyspark unusable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> First of all, I think we can agree that many users use PySpark with other Python packages. If we conflict with other packages, `pip install -r requirements.txt` won't work. It will complain that the dependencies can't be resolved, which completely breaks our user's workflow. Even if the user locks the dependency version, it won't work. So the user has to install PySpark first, then the other packages, to override PySpark's dependency. They can't put their dependency list in a single file - that is a horrible user experience.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I look at controversial topics, I always have a strong belief that I can't be the only smart person in the world.
>>>>>>>>>>>>> If an idea is good, others must already be doing it. Can we find any recognized package in the market that pins its dependencies to a specific version? The only case where it works is when this package is *all* the user needs. That's why we pin versions for docker images, HTTP services, or standalone tools - users just need something that works out of the box. If we consider PySpark the dominant package - meaning that if a user employs it, it must be the most important element in their project and everything else must comply with it - pinning versions might be viable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not familiar with Java dependency solutions or how users use spark with Java, but I'm familiar with the Python ecosystem and community. If we pin to a specific version, we will face significant criticism. If we must do it, at least don't make it the default. Like I said above, I don't have a strong opinion about having a `pyspark[pinned]` - if users only need pyspark and no other packages they could use that. But that's extra effort for maintenance, and we need to think about what's pinned. We have a lot of pyspark install versions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tian Gao
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the community has already reached consensus to freeze dependencies in minor releases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence [1]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>>>>>> > - Dependencies are frozen and behavioral changes are minimized in minor releases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would interpret the proposed dependency policy as applying to both Java/Scala and Python dependency management for Spark. If so, that means PySpark will always use pinned dependency versions starting with 4.3.0. But if the intention is to apply such a dependency policy only to Java/Scala, then it creates a very strange situation - an extremely conservative dependency management strategy for Java/Scala, and an extremely liberal one for Python.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To Tian Gao,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always make us more secure - that's my major point.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A product must be usable first, then come security, performance, etc. If it claims to require `foo>=2.0.0`, how do you ensure it is compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility failures have occurred many times, e.g., [2].
>>>>>>>>>>>>>> On the contrary, if it claims to require `foo==2.0.0`, that means it was thoroughly tested with `foo==2.0.0`, and users take their own risk when using it with other `foo` versions. For example, if `foo` strictly follows semantic versioning, it should work with `foo<3.0.0`, but this is not Spark's responsibility; users should assess and assume the risk of incompatibility themselves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Response inline
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One possibility would be to make the pinned version optional (e.g. pyspark[pinned]) or publish a separate constraints file for people to optionally use with -c?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this is possible today for people using modern Python packaging workflows that use lock files. In fact, it happens automatically; all transitive dependencies are pinned in the lock file, and this is by design.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So for someone installing a fresh venv with uv, pip, or conda, where does this come from?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The idea here is we provide the versions we used during the release stage so if folks want a “known safe” initial starting point for a new env they’ve got one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Furthermore, it is straightforward to add additional restrictions to your project spec (i.e. pyproject.toml) so that when the packaging tool builds the lock file, it does it with whatever restrictions you want that are specific to your project. That could include specific versions or version ranges of libraries to exclude, for example.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, but as it stands we leave it to the end user to start from scratch picking these versions. We can make their lives simpler by providing the versions we tested against in a lock file they can choose to use, ignore, or update to their desired versions and include.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also for interactive workloads I more often see a bare requirements file or even pip installs in nb cells (but this could be sample bias).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I had to do this, for example, on a personal project that used PySpark Connect but which was pulling in a version of grpc that was generating a lot of log noise <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>. I pinned the version of grpc in my project file and let the packaging tool resolve all the requirements across PySpark Connect and my custom restrictions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nick
