Re: [discuss] Pinning PySpark dependencies?

Holden Karau Tue, 19 May 2026 07:33:35 -0700

The one I filed was
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-56924
I did not see
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-31167
because that one is not about locking dependencies, but it probably makes
sense to address both at the same time.




Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Tue, May 19, 2026 at 6:01 AM Nicholas Chammas <[email protected]>
wrote:

> Holden, you didn’t mention or link to the ticket you filed.
>
> This is the ticket I filed about roughly the same issue back in 2020:
> SPARK-31167 / associated PR <https://github.com/apache/spark/pull/27928>
>
>
>
> On May 18, 2026, at 8:12 PM, Holden Karau <[email protected]> wrote:
>
> Awesome, I started on one but it’s super rough so I’ll leave it to you, Tian
> :) (I filed a JIRA, so grab the existing JIRA for coordination)
>
> On Mon, May 18, 2026 at 5:03 PM Tian Gao <[email protected]> wrote:
>
>> I can work on a prototype. My thought is that we should keep the
>> dependency list in `pyproject.toml`. We can have dependency groups for all
>> the different scenarios (test/dev, minimum/lint/docs, etc.). Then, for
>> generating docker images, we include `pyproject.toml` and pip install based
>> on that. I believe we can keep the single source of truth in that file
>> (which is a common way to do things) and still be flexible.
>>
>> On Mon, May 18, 2026 at 4:55 PM Holden Karau <[email protected]>
>> wrote:
>>
>>> Single source of truth does sound desirable, let me take a look at
>>> narrowing that down a bit too.
>>>
>>> On Mon, May 18, 2026 at 4:30 PM Tian Gao via dev <[email protected]>
>>> wrote:
>>>
>>>> We can do either a list of packages from `pip freeze` on our website,
>>>> or a `pyspark[pinned]` that has `==`. I'm okay with either (or both).
>>>>
>>>> If we want to do that, we probably want to pin our package versions on
>>>> our stable spark versions. We only partially pin our dependencies for our
>>>> CI for maintenance branches, so we do not even have the list now (we may
>>>> have it for a certain date, but the list could change any time in the
>>>> future).
>>>>
>>>> I think we should come up with a more official CI system so we always
>>>> test the released versions (4.0, 4.1 ...) with pinned versions of
>>>> packages (which are the "known working dependencies"), and be more relaxed
>>>> for dev branches (4.x, master) because we need to test against new releases
>>>> of our dependencies.
>>>>
>>>> More importantly, it would be really nice to have a single source of
>>>> truth. We have too many places to pin the Python dependency versions.
>>>>
>>>> Tian
>>>>
>>>> On Sun, May 17, 2026 at 9:52 AM Holden Karau <[email protected]>
>>>> wrote:
>>>>
>>>>> I am at PyCon US today and the PyPI head just did a call out to audit
>>>>> and pin dependencies because supply chain attacks are increasing
>>>>> hockey-stick style.
>>>>>
>>>>> I think we don’t need to pin just yet but let’s add publishing the
>>>>> package versions we built with during CI.
>>>>>
>>>>> On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I think we should do something in response to the growing supply
>>>>>> chain attacks rather than just leaving the problem to users. One
>>>>>> alternative we could consider for Python specifically is an install
>>>>>> target with upper bounded dependencies: `pip install
>>>>>> "pyspark[deps-upper-bounded]"`. This wouldn't impact regular use, and
>>>>>> seems like it would solve the other problems with publishing lock files,
>>>>>> etc. As others have mentioned, this wouldn't *guarantee* security, but it
>>>>>> would provide meaningful protection against the worst offenders we've
>>>>>> recently seen.
>>>>>>
>>>>>> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:
>>>>>>
>>>>>>> > How about as a compromise, we publish (but don’t lock to) the pip
>>>>>>> freeze outputs of the venvs we use for testing?
>>>>>>>
>>>>>>> > Where do you propose to publish? Spark website? Maybe in our
>>>>>>> github repo somewhere?
>>>>>>>
>>>>>>> > I was thinking just in the publisher artifacts directory we
>>>>>>> already do.
>>>>>>>
>>>>>>> +1, I'm fine with any approach, as long as it provides sufficient
>>>>>>> info to let users know exactly which versions of dependencies were used
>>>>>>> for testing.
>>>>>>>
>>>>>>> For Java/Scala, we have a script-generated[1] dependency list in the
>>>>>>> code repo, at [2].
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>>>>>>> [2]
>>>>>>> https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I was thinking just in the publisher artifacts directory we already
>>>>>>> do.
>>>>>>>
>>>>>>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Where do you propose to publish? Spark website? Maybe in our github
>>>>>>>> repo somewhere? For Python packages, users rarely look for artifacts
>>>>>>>> (and they're difficult to find).
>>>>>>>>
>>>>>>>> Tian
>>>>>>>>
>>>>>>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I hear that. How about as a compromise, we publish (but don’t lock
>>>>>>>>> to) the pip freeze outputs of the venvs we use for testing?
>>>>>>>>>
>>>>>>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I think supply chain attacks are a problem, but I don’t think we
>>>>>>>>>> want to be on the hook for a solution here, even if it’s meant just
>>>>>>>>>> for our project.
>>>>>>>>>>
>>>>>>>>>> There are “good enough” approaches available today for Python
>>>>>>>>>> that mitigate most of the risk by excluding recent releases when
>>>>>>>>>> resolving what package versions to install.
>>>>>>>>>>
>>>>>>>>>> uv offers exclude-newer
>>>>>>>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>.
>>>>>>>>>> pip offers uploaded-prior-to
>>>>>>>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>>>>>>>> Poetry has an issue open
>>>>>>>>>> <https://github.com/python-poetry/poetry/issues/10646> for a
>>>>>>>>>> similar feature, plus at least one open PR to close it.
>>>>>>>>>>
>>>>>>>>>> Users concerned about supply chain attacks would probably get
>>>>>>>>>> better results from using these options as compared to installing
>>>>>>>>>> pinned dependencies provided by the projects they use.
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> So I think we can ship it as an optional distribution element
>>>>>>>>>> (it's literally just another file folks can choose to download/use
>>>>>>>>>> if they want).
>>>>>>>>>>
>>>>>>>>>> Asking users is an idea too, I could put together a survey if we
>>>>>>>>>> want?
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1,
>>>>>>>>>>> foo==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is
>>>>>>>>>>> a nit and we don't need to focus on the syntax.
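For anyone who wants to check the equivalence, here is a minimal sketch of how the compatible-release operator expands under PEP 440. The helper name is made up for illustration, and it only handles plain numeric versions (no pre-release or post-release suffixes).

```python
def expand_compatible_release(spec: str) -> str:
    """Expand 'name~=X.Y[.Z]' into the equivalent '>=' plus prefix-match pair."""
    name, version = (part.strip() for part in spec.split("~="))
    segments = version.split(".")
    if len(segments) < 2:
        # PEP 440 does not allow ~= with a single-segment version such as "2".
        raise ValueError("~= needs at least two release segments, e.g. 2.0")
    # Drop the last segment and match everything under the remaining prefix.
    prefix = ".".join(segments[:-1])
    return f"{name}>={version}, =={prefix}.*"


print(expand_compatible_release("foo~=2.0.1"))  # foo>=2.0.1, ==2.0.*
print(expand_compatible_release("foo~=2.0"))    # foo>=2.0, ==2.*
```

So `~=2.0.1` allows only maintenance releases, while `~=2.0` allows any 2.x release from 2.0 up.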
>>>>>>>>>>>
>>>>>>>>>>> I don't believe we can ship pyspark with an env lock file. That's
>>>>>>>>>>> what users do in their own projects. It's not part of the Python
>>>>>>>>>>> package system. What users normally do is install packages, test them
>>>>>>>>>>> out, then lock them with either pip or uv: generate a lock file for
>>>>>>>>>>> all dependencies and use it across their systems. It's not common for
>>>>>>>>>>> packages to list out a "known working dependency list" for users.
>>>>>>>>>>>
>>>>>>>>>>> However, if we really want to try it out, we can do something
>>>>>>>>>>> like `pip install pyspark[full-pinned]` and install every dependency
>>>>>>>>>>> pyspark requires with a pinned version. If our users need an
>>>>>>>>>>> out-of-the-box solution they can do that. We can also collect feedback
>>>>>>>>>>> and see the sentiment from users.
>>>>>>>>>>>
>>>>>>>>>>> Tian
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > If we consider PySpark the dominant package - meaning that if
>>>>>>>>>>>> a user employs it, it must be the most important element in their
>>>>>>>>>>>> project and everything else must comply with it - pinning versions
>>>>>>>>>>>> might be viable.
>>>>>>>>>>>>
>>>>>>>>>>>> This is not always true, but definitely a major case.
>>>>>>>>>>>>
>>>>>>>>>>>> > I'm not familiar with Java dependency solutions or how users
>>>>>>>>>>>> use spark with Java
>>>>>>>>>>>>
>>>>>>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>>>>>>>> management. A product declares transitive dependencies with pinned
>>>>>>>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>>>>>>>>>>>> most reasonable version based on resolution rules. The rules are a
>>>>>>>>>>>> little different in Maven, SBT and Gradle; the Maven docs[1] explain
>>>>>>>>>>>> how it works.
>>>>>>>>>>>>
>>>>>>>>>>>> In short, in Java/Scala dependency management, the pinned
>>>>>>>>>>>> version is more like a suggested version; it's easy for users to
>>>>>>>>>>>> override.
>>>>>>>>>>>>
>>>>>>>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>>>>>>>> world. Neither the pinned version nor the latest version seems ideal.
>>>>>>>>>>>> Of these options:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>>>>>>
>>>>>>>>>>>> 2 or 3 might be an acceptable solution? And I still believe we
>>>>>>>>>>>> should add a disclaimer that this compatibility only holds under
>>>>>>>>>>>> the assumption that 3rd-party packages strictly adhere to semantic
>>>>>>>>>>>> versioning.
>>>>>>>>>>>>
>>>>>>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>>>>>>> are for. It's what env lock files are for.
>>>>>>>>>>>>
>>>>>>>>>>>> We definitely need such a dependency list in the PySpark release.
>>>>>>>>>>>> It's really important for users who need to set up a reproducible
>>>>>>>>>>>> environment several years after the release, and it is also a good
>>>>>>>>>>>> reference for users who encounter bugs in 3rd-party packages, or
>>>>>>>>>>>> battle dependency conflicts when they install lots of packages in a
>>>>>>>>>>>> single environment.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> TL;DR Tian is more correct, and == pinning versions is not
>>>>>>>>>>>> achieving the desired outcome. There are other ways to do it; I
>>>>>>>>>>>> can't think of any other Python package that works that way. This
>>>>>>>>>>>> thread is conflating different things.
>>>>>>>>>>>>
>>>>>>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an
>>>>>>>>>>>> overly-broad claim -- do you really think it works with 5.x in 10
>>>>>>>>>>>> years? -- expressing "foo==2.0.0" is very likely overly narrow. That
>>>>>>>>>>>> says "does not work with any other version at all" which is likely
>>>>>>>>>>>> more incorrect and more problematic for users.
>>>>>>>>>>>>
>>>>>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>>>>>>> are for. It's what env lock files are for.
>>>>>>>>>>>>
>>>>>>>>>>>> To be sure there is an art to figuring out the right dependency
>>>>>>>>>>>> bounds. A reasonable compromise is to allow maintenance releases, as
>>>>>>>>>>>> a default when there is nothing more specific known. That is, write
>>>>>>>>>>>> "foo~=2.0.2" to mean ">=2.0.2 and < 2.1".
>>>>>>>>>>>>
>>>>>>>>>>>> The analogy to Scala/Java/Maven land does not quite work,
>>>>>>>>>>>> partly because Maven resolution is just pretty different, but mostly
>>>>>>>>>>>> because the core Spark distribution is the 'server side' and is
>>>>>>>>>>>> necessarily a 'fat jar', a sort of statically-compiled artifact that
>>>>>>>>>>>> simply has some specific versions in it and can never have different
>>>>>>>>>>>> versions because of runtime resolution differences.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that a product must be usable first. Pinning the
>>>>>>>>>>>>> version (to a specific number with `==`) will make pyspark
>>>>>>>>>>>>> unusable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> First of all, I think we can agree that many users use PySpark
>>>>>>>>>>>>> with other Python packages. If we conflict with other packages, `pip
>>>>>>>>>>>>> install -r requirements.txt` won't work. It will complain that the
>>>>>>>>>>>>> dependencies can't be resolved, which completely breaks our users'
>>>>>>>>>>>>> workflow. Even if the user locks the dependency version, it won't
>>>>>>>>>>>>> work. So the user has to install PySpark first, then the other
>>>>>>>>>>>>> packages, to override PySpark's dependency. They can't put their
>>>>>>>>>>>>> dependency list in a single file - that is a horrible user experience.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I look at controversial topics, I always have a strong
>>>>>>>>>>>>> belief that I can't be the only smart person in the world. If an
>>>>>>>>>>>>> idea is good, others must already be doing it. Can we find any
>>>>>>>>>>>>> recognized package in the market that pins its dependencies to a
>>>>>>>>>>>>> specific version? The only case where it works is when this package
>>>>>>>>>>>>> is *all* the user needs. That's why we pin versions for docker images,
>>>>>>>>>>>>> HTTP services, or standalone tools - users just need something that
>>>>>>>>>>>>> works out of the box. If we consider PySpark the dominant package -
>>>>>>>>>>>>> meaning that if a user employs it, it must be the most important
>>>>>>>>>>>>> element in their project and everything else must comply with it -
>>>>>>>>>>>>> pinning versions might be viable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not familiar with Java dependency solutions or how users
>>>>>>>>>>>>> use spark with Java, but I'm familiar with the Python ecosystem and
>>>>>>>>>>>>> community. If we pin to a specific version, we will face significant
>>>>>>>>>>>>> criticism. If we must do it, at least don't make it the default. Like
>>>>>>>>>>>>> I said above, I don't have a strong opinion about having a
>>>>>>>>>>>>> `pyspark[pinned]` - if users only need pyspark and no other packages
>>>>>>>>>>>>> they could use that. But that's extra effort for maintenance, and we
>>>>>>>>>>>>> need to think about what's pinned. We have a lot of pyspark install
>>>>>>>>>>>>> versions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tian Gao
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the community has already reached consensus to
>>>>>>>>>>>>>> freeze dependencies in minor releases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>>>>>> > - Dependencies are frozen and behavioral changes are
>>>>>>>>>>>>>> minimized in minor releases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would interpret the proposed dependency policy as applying to
>>>>>>>>>>>>>> both Java/Scala and Python dependency management for Spark. If so,
>>>>>>>>>>>>>> that means PySpark will always use pinned dependency versions from
>>>>>>>>>>>>>> 4.3.0 onward. But if the intention is to only apply such a dependency
>>>>>>>>>>>>>> policy to Java/Scala, then it creates a very strange situation - an
>>>>>>>>>>>>>> extremely conservative dependency management strategy for Java/Scala,
>>>>>>>>>>>>>> and an extremely liberal one for Python.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To Tian Gao,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always
>>>>>>>>>>>>>> make us more secure - that's my major point.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A product must be usable first, then come security, performance,
>>>>>>>>>>>>>> etc. If it claims to require `foo>=2.0.0`, how do you ensure it is
>>>>>>>>>>>>>> compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such
>>>>>>>>>>>>>> incompatibility failures have occurred many times, e.g., [2]. On the
>>>>>>>>>>>>>> contrary, if it claims to require `foo==2.0.0`, that means it was
>>>>>>>>>>>>>> thoroughly tested with `foo==2.0.0`, and users take their own risk
>>>>>>>>>>>>>> when using it with other `foo` versions. For example, if `foo`
>>>>>>>>>>>>>> strictly follows semantic versioning, it should work with
>>>>>>>>>>>>>> `foo<3.0.0`, but this is not Spark's responsibility; users should
>>>>>>>>>>>>>> assess and assume the risk of incompatibility themselves.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Response inline
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One possibility would be to make the pinned version optional
>>>>>>>>>>>>>>> (e.g. pyspark[pinned]) or publish a separate constraints file for
>>>>>>>>>>>>>>> people to optionally use with -c?
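A minimal sketch of the constraints-file option, assuming the input is the text that `pip freeze` prints in the test venv. The `freeze_to_constraints` helper is hypothetical and the sample versions are placeholders, not versions Spark was actually tested with.

```python
def freeze_to_constraints(freeze_output: str) -> list[str]:
    """Keep only 'name==version' pins from pip-freeze-style text."""
    pins = []
    for line in freeze_output.splitlines():
        line = line.strip()
        # Skip blanks, comments, editable installs and direct URL references.
        if not line or line.startswith(("#", "-e ")) or " @ " in line:
            continue
        if "==" in line:
            pins.append(line)
    return sorted(pins, key=str.lower)


# Placeholder input for illustration only.
SAMPLE = """\
# placeholder versions
pandas==2.2.3
-e git+https://example.invalid/repo.git#egg=local_pkg
numpy==2.1.0
py4j==0.10.9.9
"""

print("\n".join(freeze_to_constraints(SAMPLE)))
```

Written out as `constraints.txt`, the result could be used with `pip install pyspark -c constraints.txt`. A constraints file only limits versions of packages that get installed anyway, so users who ignore it are unaffected.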
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but
>>>>>>>>>>>>>>> this is possible today for people using modern Python packaging
>>>>>>>>>>>>>>> workflows that use lock files. In fact, it happens automatically; all
>>>>>>>>>>>>>>> transitive dependencies are pinned in the lock file, and this is by
>>>>>>>>>>>>>>> design.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So for someone installing a fresh venv with uv, pip, or conda,
>>>>>>>>>>>>>> where does this come from?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The idea here is we provide the versions we used during the
>>>>>>>>>>>>>> release stage so if folks want a “known safe” initial starting point
>>>>>>>>>>>>>> for a new env they’ve got one.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Furthermore, it is straightforward to add additional
>>>>>>>>>>>>>>> restrictions to your project spec (i.e. pyproject.toml) so that when
>>>>>>>>>>>>>>> the packaging tool builds the lock file, it does it with whatever
>>>>>>>>>>>>>>> restrictions you want that are specific to your project. That could
>>>>>>>>>>>>>>> include specific versions or version ranges of libraries to exclude,
>>>>>>>>>>>>>>> for example.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, but as it stands we leave it to the end user to start
>>>>>>>>>>>>>> from scratch picking these versions. We can make their lives simpler
>>>>>>>>>>>>>> by providing the versions we tested against in a lock file they can
>>>>>>>>>>>>>> choose to use, ignore, or update to their desired versions and include.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also for interactive workloads I more often see a bare
>>>>>>>>>>>>>> requirements file or even pip installs in notebook cells (but this
>>>>>>>>>>>>>> could be sample bias).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I had to do this, for example, on a personal project that
>>>>>>>>>>>>>>> used PySpark Connect but which was pulling in a version of
>>>>>>>>>>>>>>> grpc that was generating a lot of log noise
>>>>>>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>>>>>>> I pinned the version of grpc in my project file and let the
>>>>>>>>>>>>>>> packaging tool resolve all the requirements across PySpark Connect and
>>>>>>>>>>>>>>> my custom restrictions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>
>>
>
