Holden, you didn’t mention or link to the ticket you filed. This is the ticket I filed about roughly the same issue back in 2020: SPARK-31167 / associated PR <https://github.com/apache/spark/pull/27928>
> On May 18, 2026, at 8:12 PM, Holden Karau <[email protected]> wrote:
>
> Awesome, I started on one but it’s super rough so I’ll leave it to you Tian :)
> (filed a JIRA so grab the existing JIRA for coordination)
>
>
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
>
> On Mon, May 18, 2026 at 5:03 PM Tian Gao <[email protected]> wrote:
>> I can work on a prototype. My thought is that we should keep the dependency
>> list in `pyproject.toml`. We can have dependency groups for all different
>> scenarios (test/dev, minimum/lint/docs etc). Then for generating docker
>> images, we include `pyproject.toml` and pip install based on that. I believe
>> we can keep the single source of truth in that file (which is a common way
>> to do things) and still be flexible.
>>
>> On Mon, May 18, 2026 at 4:55 PM Holden Karau <[email protected]> wrote:
>>> Single source of truth does sound desirable, let me take a look at
>>> narrowing that down a bit too.
>>>
>>> On Mon, May 18, 2026 at 4:30 PM Tian Gao via dev <[email protected]> wrote:
>>>> We can do either a list of packages from `pip freeze` on our website, or a
>>>> `pyspark[pinned]` that has `==`. I'm okay with either (or both).
>>>>
>>>> If we want to do that, we probably want to pin our package versions on our
>>>> stable spark versions. We only partially pin our dependencies for our CI
>>>> for maintenance branches, so we do not even have the list now (we may have
>>>> it for a certain date, but the list could change any time in the future).
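For reference, a rough sketch of what producing such a `==` list could look like using only the standard library's `importlib.metadata`. It is similar in spirit to `pip freeze` but not identical to it (editable installs and direct URL references are not handled), and the helper names `installed_packages` and `to_pins` are made up for illustration; they are not part of Spark or pip.

```python
from importlib.metadata import distributions


def installed_packages():
    """Yield (name, version) for every distribution visible to this interpreter."""
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata lacks a name
            yield name, dist.version


def to_pins(packages):
    """Turn (name, version) pairs into sorted, de-duplicated 'name==version' lines."""
    # Keyed by lower-cased name so the same project is not listed twice.
    unique = {name.lower(): f"{name}=={version}" for name, version in packages}
    return [unique[key] for key in sorted(unique)]


# Usage inside the test venv (hypothetical CI step):
#     print("\n".join(to_pins(installed_packages())))
```

Running this in the venv used for a release test run would give a list that could be published next to the release artifacts without being enforced as an install requirement.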
>>>>
>>>> I think we should come up with a more official CI system so we always test
>>>> the released versions (4.0, 4.1 ...) with pinned versions of packages
>>>> (which are the "known working dependencies"), and be more relaxed for dev
>>>> branches (4.x, master) because we need to test against new releases for
>>>> our dependencies.
>>>>
>>>> More importantly, it would be really nice to have a single source of
>>>> truth. We have too many places to pin the python dependency versions.
>>>>
>>>> Tian
>>>>
>>>> On Sun, May 17, 2026 at 9:52 AM Holden Karau <[email protected]> wrote:
>>>>> I am at PyCon USA today and the PyPI head just did a call out to audit
>>>>> and pin dependencies because the supply chain attacks are increasing
>>>>> hockey stick style.
>>>>>
>>>>> I think we don’t need to pin just yet but let’s add publishing the
>>>>> package versions we built with during CI.
>>>>>
>>>>> On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev
>>>>> <[email protected]> wrote:
>>>>>> I think we should do something in response to the growing supply chain
>>>>>> attacks rather than just leaving the problem to users. One alternative
>>>>>> we could consider for Python specifically is an install target with
>>>>>> upper bounded dependencies: `pip install "pyspark[deps-upper-bounded]"`.
>>>>>> This wouldn't impact regular use, and seems like it would solve the
>>>>>> other problems with publishing lock files, etc.
>>>>>> As others have mentioned, this wouldn't *guarantee* security, but it
>>>>>> would provide meaningful protection against the worst offenders we've
>>>>>> recently seen.
>>>>>>
>>>>>> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:
>>>>>>> > How about as a compromise, we publish (but don’t lock to) the pip
>>>>>>> > freeze outputs of the venvs we use for testing?
>>>>>>>
>>>>>>> > Where do you propose to publish? Spark website? Maybe in our github
>>>>>>> > repo somewhere?
>>>>>>>
>>>>>>> > I was thinking just in the publisher artifacts directory we already
>>>>>>> > do.
>>>>>>>
>>>>>>> +1, I'm fine with any approach, as long as it provides sufficient info
>>>>>>> to let users know exactly which versions of dependencies were used for
>>>>>>> testing.
>>>>>>>
>>>>>>> For Java/Scala, we have a script-generated[1] dependency list in the
>>>>>>> code repo, at [2]
>>>>>>>
>>>>>>> [1] https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>>>>>>> [2] https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>>
>>>>>>>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I was thinking just in the publisher artifacts directory we already do.
>>>>>>>>
>>>>>>>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]> wrote:
>>>>>>>>> Where do you propose to publish? Spark website?
>>>>>>>>> Maybe in our github repo somewhere? For python packages, users rarely
>>>>>>>>> look for artifacts (and they're difficult to find).
>>>>>>>>>
>>>>>>>>> Tian
>>>>>>>>>
>>>>>>>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]> wrote:
>>>>>>>>>> I hear that. How about as a compromise, we publish (but don’t lock
>>>>>>>>>> to) the pip freeze outputs of the venvs we use for testing?
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> I think supply chain attacks are a problem, but I don’t think we
>>>>>>>>>>> want to be on the hook for a solution here, even if it’s meant just
>>>>>>>>>>> for our project.
>>>>>>>>>>>
>>>>>>>>>>> There are “good enough” approaches available today for Python that
>>>>>>>>>>> mitigate most of the risk by excluding recent releases when
>>>>>>>>>>> resolving what package versions to install.
>>>>>>>>>>>
>>>>>>>>>>> uv offers exclude-newer
>>>>>>>>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip
>>>>>>>>>>> offers uploaded-prior-to
>>>>>>>>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>>>>>>>>> Poetry has an issue open
>>>>>>>>>>> <https://github.com/python-poetry/poetry/issues/10646> for a
>>>>>>>>>>> similar feature, plus at least one open PR to close it.
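The idea behind those options (ignore anything uploaded too recently) can be sketched in a few lines. This is only an illustration of the filtering step, not how uv or pip implement it; `cooldown_cutoff` and `releases_before` are hypothetical helpers and the release data below is invented for the example.

```python
from datetime import datetime, timedelta, timezone


def cooldown_cutoff(now, days):
    """Return the timestamp before which a release must have been uploaded."""
    return now - timedelta(days=days)


def releases_before(releases, cutoff):
    """Keep only (version, uploaded_at) pairs uploaded strictly before `cutoff`."""
    return [(version, uploaded) for version, uploaded in releases if uploaded < cutoff]


# Invented sample data: one older release and one uploaded the day before "now".
now = datetime(2026, 3, 30, tzinfo=timezone.utc)
sample = [
    ("1.0.0", datetime(2026, 1, 5, tzinfo=timezone.utc)),
    ("1.0.1", datetime(2026, 3, 29, tzinfo=timezone.utc)),
]
eligible = releases_before(sample, cooldown_cutoff(now, days=7))
print([version for version, _ in eligible])
```

With a seven-day cooldown the resolver would only consider the older release, which gives the ecosystem time to notice and pull a compromised upload.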
>>>>>>>>>>>
>>>>>>>>>>> Users concerned about supply chain attacks would probably get
>>>>>>>>>>> better results from using these options as compared to installing
>>>>>>>>>>> pinned dependencies provided by the projects they use.
>>>>>>>>>>>
>>>>>>>>>>> Nick
>>>>>>>>>>>
>>>>>>>>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> So I think we can ship it as an optional distribution element
>>>>>>>>>>>> (it's literally just another file folks can choose to download/use
>>>>>>>>>>>> if they want).
>>>>>>>>>>>>
>>>>>>>>>>>> Asking users is an idea too, I could put together a survey if we
>>>>>>>>>>>> want?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1,
>>>>>>>>>>>>> foo==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This
>>>>>>>>>>>>> is a nit and we don't need to focus on the syntax.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't believe we can ship pyspark with an env lock file. That's
>>>>>>>>>>>>> what users do in their own projects. It's not part of the python
>>>>>>>>>>>>> package system. What users normally do is install packages, test
>>>>>>>>>>>>> them out, then lock them with either pip or uv - generate a lock
>>>>>>>>>>>>> file for all dependencies and use it across their systems. It's not
>>>>>>>>>>>>> common for packages to list out a "known working dependency list"
>>>>>>>>>>>>> for users.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, if we really want to try it out, we can do something
>>>>>>>>>>>>> like `pip install pyspark[full-pinned]` and install every
>>>>>>>>>>>>> dependency pyspark requires with a pinned version. If our users
>>>>>>>>>>>>> need an out-of-box solution they can do that. We can also
>>>>>>>>>>>>> collect feedback and see the sentiment from users.
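Since the `~=` expansion keeps coming up in this thread, here is a minimal sketch of it for plain numeric versions. It assumes the usual reading of the compatible-release operator (drop the last segment, bump the one before it) and ignores pre-release, post-release, and epoch handling; `compatible_release_bounds` is a made-up helper, not something from pip or `packaging`.

```python
def compatible_release_bounds(version: str) -> tuple[str, str]:
    """Expand `~=version` into (inclusive lower bound, exclusive upper bound).

    Only plain numeric releases such as "2.0.1" are handled.
    """
    parts = [int(piece) for piece in version.split(".")]
    if len(parts) < 2:
        raise ValueError("~= needs at least two release segments, e.g. 2.0")
    prefix = parts[:-1]   # drop the last segment ...
    prefix[-1] += 1       # ... and bump the new last one
    return version, ".".join(str(piece) for piece in prefix)


for spec in ("2.0.1", "2.0"):
    lower, upper = compatible_release_bounds(spec)
    print(f"foo~={spec} is roughly foo>={lower},<{upper}")
```

So `foo~=2.0.1` admits only 2.0.x maintenance releases from 2.0.1 upward, while `foo~=2.0` admits any 2.x release.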
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tian
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]> wrote:
>>>>>>>>>>>>>> > If we consider PySpark the dominant package - meaning that if
>>>>>>>>>>>>>> > a user employs it, it must be the most important element in
>>>>>>>>>>>>>> > their project and everything else must comply with it -
>>>>>>>>>>>>>> > pinning versions might be viable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is not always true, but definitely a major case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > I'm not familiar with Java dependency solutions or how users
>>>>>>>>>>>>>> > use spark with Java
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>>>>>>>>>> management. A product declares transitive dependencies with pinned
>>>>>>>>>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.)
>>>>>>>>>>>>>> picks the most reasonable version based on resolution rules. The
>>>>>>>>>>>>>> rules are a little different in Maven, SBT and Gradle; the Maven
>>>>>>>>>>>>>> docs[1] explain how it works.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In short, in Java/Scala dependency management, the pinned
>>>>>>>>>>>>>> version is more like a suggested version; it's easy for users
>>>>>>>>>>>>>> to override.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>>>>>>>>>> world. Neither a pinned version nor the latest version seems
>>>>>>>>>>>>>> ideal, so of these options:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>>>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>>>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>>>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> option 2 or 3 might be an acceptable solution?
>>>>>>>>>>>>>> And, I still believe we should add a disclaimer that this
>>>>>>>>>>>>>> compatibility only holds under the assumption that 3rd-party
>>>>>>>>>>>>>> packages strictly adhere to semantic versioning.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>>>>>>> > requirements.txt -- expressing a known good / recommended
>>>>>>>>>>>>>> > specific resolved environment. That is _not_ what Python
>>>>>>>>>>>>>> > dependency constraints are for. It's what env lock files are
>>>>>>>>>>>>>> > for.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We definitely need such a dependency list in the PySpark release.
>>>>>>>>>>>>>> It's really important for users who need to set up a reproducible
>>>>>>>>>>>>>> environment several years after the release, and it is also a
>>>>>>>>>>>>>> good reference for users who encounter bugs in 3rd-party packages,
>>>>>>>>>>>>>> or battle with dependency conflicts when they install lots of
>>>>>>>>>>>>>> packages in a single environment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> TL;DR Tian is more correct, and == pinning versions is not
>>>>>>>>>>>>>>> achieving the desired outcome. There are other ways to do it; I
>>>>>>>>>>>>>>> can't think of any other Python package that works that way.
>>>>>>>>>>>>>>> This thread is conflating different things.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an
>>>>>>>>>>>>>>> overly-broad claim -- do you really think it works with 5.x in
>>>>>>>>>>>>>>> 10 years? -- expressing "foo==2.0.0" is very likely overly
>>>>>>>>>>>>>>> narrow.
>>>>>>>>>>>>>>> That says "does not work with any other version at all"
>>>>>>>>>>>>>>> which is likely more incorrect and more problematic for users.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>>>>>>>> requirements.txt -- expressing a known good / recommended
>>>>>>>>>>>>>>> specific resolved environment. That is _not_ what Python
>>>>>>>>>>>>>>> dependency constraints are for. It's what env lock files are
>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To be sure there is an art to figuring out the right dependency
>>>>>>>>>>>>>>> bounds. A reasonable compromise is to allow maintenance
>>>>>>>>>>>>>>> releases, as a default when there is nothing more specific
>>>>>>>>>>>>>>> known. That is, write "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The analogy to Scala/Java/Maven land does not quite work,
>>>>>>>>>>>>>>> partly because Maven resolution is just pretty different, but
>>>>>>>>>>>>>>> mostly because the core Spark distribution is the 'server side'
>>>>>>>>>>>>>>> and is necessarily a 'fat jar', a sort of statically-compiled
>>>>>>>>>>>>>>> artifact that simply has some specific versions in it and can
>>>>>>>>>>>>>>> never have different versions because of runtime resolution
>>>>>>>>>>>>>>> differences.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev
>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>> I agree that a product must be usable first. Pinning the
>>>>>>>>>>>>>>>> version (to a specific number with `==`) will make pyspark
>>>>>>>>>>>>>>>> unusable.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First of all, I think we can agree that many users use PySpark
>>>>>>>>>>>>>>>> with other Python packages. If we conflict with other
>>>>>>>>>>>>>>>> packages, `pip install -r requirements.txt` won't work.
>>>>>>>>>>>>>>>> It will complain that the dependencies can't be resolved, which
>>>>>>>>>>>>>>>> completely breaks our users' workflow. Even if the user locks
>>>>>>>>>>>>>>>> the dependency version, it won't work. So the user has to
>>>>>>>>>>>>>>>> install PySpark first, then the other packages, to override
>>>>>>>>>>>>>>>> PySpark's dependency. They can't put their dependency list in
>>>>>>>>>>>>>>>> a single file - that is a horrible user experience.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I look at controversial topics, I always have a strong
>>>>>>>>>>>>>>>> belief that I can't be the only smart person in the world. If
>>>>>>>>>>>>>>>> an idea is good, others must already be doing it. Can we find
>>>>>>>>>>>>>>>> any recognized package in the market that pins its
>>>>>>>>>>>>>>>> dependencies to a specific version? The only case where it works
>>>>>>>>>>>>>>>> is when this package is *all* the user needs. That's why we pin
>>>>>>>>>>>>>>>> versions for docker images, HTTP services, or standalone tools
>>>>>>>>>>>>>>>> - users just need something that works out of the box. If we
>>>>>>>>>>>>>>>> consider PySpark the dominant package - meaning that if a user
>>>>>>>>>>>>>>>> employs it, it must be the most important element in their
>>>>>>>>>>>>>>>> project and everything else must comply with it - pinning
>>>>>>>>>>>>>>>> versions might be viable.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not familiar with Java dependency solutions or how users
>>>>>>>>>>>>>>>> use spark with Java, but I'm familiar with the Python
>>>>>>>>>>>>>>>> ecosystem and community. If we pin to a specific version, we
>>>>>>>>>>>>>>>> will face significant criticism. If we must do it, at least
>>>>>>>>>>>>>>>> don't make it default. Like I said above, I don't have a
>>>>>>>>>>>>>>>> strong opinion about having a `pyspark[pinned]` - if users
>>>>>>>>>>>>>>>> only need pyspark and no other packages they could use that.
>>>>>>>>>>>>>>>> But that's extra effort for maintenance, and we need to think
>>>>>>>>>>>>>>>> about what's pinned. We have a lot of pyspark install versions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Tian Gao
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]> wrote:
>>>>>>>>>>>>>>>>> I think the community has already reached consensus to
>>>>>>>>>>>>>>>>> freeze dependencies in minor releases.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>>>>>>>>> > - Dependencies are frozen and behavioral changes are
>>>>>>>>>>>>>>>>> > minimized in minor releases.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would interpret the proposed dependency policy as applying to
>>>>>>>>>>>>>>>>> both Java/Scala and Python dependency management for Spark.
>>>>>>>>>>>>>>>>> If so, that means PySpark will always use pinned dependency
>>>>>>>>>>>>>>>>> versions since 4.3.0. But if the intention is to only apply
>>>>>>>>>>>>>>>>> such a dependency policy to Java/Scala, then it creates a
>>>>>>>>>>>>>>>>> very strange situation - an extremely conservative dependency
>>>>>>>>>>>>>>>>> management strategy for Java/Scala, and an extremely liberal
>>>>>>>>>>>>>>>>> one for Python.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To Tian Gao,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always
>>>>>>>>>>>>>>>>> > make us more secure - that's my major point.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A product must be usable first, then security, performance,
>>>>>>>>>>>>>>>>> etc. If it claims to require `foo>=2.0.0`, how do you ensure it
>>>>>>>>>>>>>>>>> is compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually,
>>>>>>>>>>>>>>>>> such incompatibility failures have occurred many times, e.g., [2].
>>>>>>>>>>>>>>>>> On the contrary, if it claims to require `foo==2.0.0`, that means
>>>>>>>>>>>>>>>>> it was thoroughly tested with `foo==2.0.0`, and users take
>>>>>>>>>>>>>>>>> their own risk to use it with other `foo` versions. For
>>>>>>>>>>>>>>>>> example, if `foo` strictly follows semantic versioning, it
>>>>>>>>>>>>>>>>> should work with `foo<3.0.0`, but this is not Spark's
>>>>>>>>>>>>>>>>> responsibility; users should assess and assume the risk of
>>>>>>>>>>>>>>>>> incompatibility themselves.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Response inline
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas
>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau
>>>>>>>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> One possibility would be to make the
>>>>>>>>>>>>>>>>>>>> pinned version optional (eg pyspark[pinned]) or publish a
>>>>>>>>>>>>>>>>>>>> separate constraints file for people to optionally use with -c?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but
>>>>>>>>>>>>>>>>>>> this is possible today for people using modern Python
>>>>>>>>>>>>>>>>>>> packaging workflows that use lock files. In fact, it
>>>>>>>>>>>>>>>>>>> happens automatically; all transitive dependencies are
>>>>>>>>>>>>>>>>>>> pinned in the lock file, and this is by design.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So for someone installing a fresh venv with uv/pip/or conda
>>>>>>>>>>>>>>>>>> where does this come from?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The idea here is we provide the versions we used during the
>>>>>>>>>>>>>>>>>> release stage so if folks want a “known safe” initial
>>>>>>>>>>>>>>>>>> starting point for a new env they’ve got one.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Furthermore, it is straightforward to add additional
>>>>>>>>>>>>>>>>>>> restrictions to your project spec (i.e. pyproject.toml) so
>>>>>>>>>>>>>>>>>>> that when the packaging tool builds the lock file, it does
>>>>>>>>>>>>>>>>>>> it with whatever restrictions you want that are specific to
>>>>>>>>>>>>>>>>>>> your project. That could include specific versions or
>>>>>>>>>>>>>>>>>>> version ranges of libraries to exclude, for example.
>>>>>>>>>>>>>>>>>> Yes, but as it stands we leave it to the end user to start
>>>>>>>>>>>>>>>>>> from scratch picking these versions, we can make their lives
>>>>>>>>>>>>>>>>>> simpler by providing the versions we tested against with a
>>>>>>>>>>>>>>>>>> lock file they can choose to use, ignore, or update to their
>>>>>>>>>>>>>>>>>> desired versions and include.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also for interactive workloads I more often see a bare
>>>>>>>>>>>>>>>>>> requirements file or even pip installs in nb cells (but this
>>>>>>>>>>>>>>>>>> could be sample bias).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I had to do this, for example, on a personal project that
>>>>>>>>>>>>>>>>>>> used PySpark Connect but which was pulling in a version of
>>>>>>>>>>>>>>>>>>> grpc that was generating a lot of log noise
>>>>>>>>>>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>>>>>>>>>>> I pinned the version of grpc in my project file and let
>>>>>>>>>>>>>>>>>>> the packaging tool resolve all the requirements across
>>>>>>>>>>>>>>>>>>> PySpark Connect and my custom restrictions.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Nick
