Re: [discuss] Pinning PySpark dependencies?

Holden Karau Mon, 18 May 2026 17:13:24 -0700

Awesome, I started on one but it’s super rough so I’ll leave it to you Tian
:) (filed a JIRA so grab the existing JIRA for coordination)



Twitter: https://twitter.com/holdenkarau
Fight Health Insurance: https://www.fighthealthinsurance.com/
<https://www.fighthealthinsurance.com/?q=hk_email>
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Pronouns: she/her

On Mon, May 18, 2026 at 5:03 PM Tian Gao <[email protected]> wrote:

> I can work on a prototype. My thought is that we should keep the
> dependency list in `pyproject.toml`. We can have dependency groups for all
> different scenarios (test/dev, minimum/lint/docs etc). Then for generating
> docker images, we include `pyproject.toml` and pip install based on that. I
> believe we can keep the only truth in that file (which is a common way to
> do things) and still be flexible.
>
> On Mon, May 18, 2026 at 4:55 PM Holden Karau <[email protected]>
> wrote:
>
>> Single source of truth does sound desirable, let me take a look at
>> narrowing that down a bit too.
>>
>> On Mon, May 18, 2026 at 4:30 PM Tian Gao via dev <[email protected]>
>> wrote:
>>
>>> We can do either a list of packages from `pip freeze` on our website, or
>>> a `pyspark[pinned]` that has `==`. I'm okay with either (or both).
>>>
>>> If we want to do that, we probably want to pin our package versions for
>>> our stable Spark versions. We only partially pin our dependencies for our
>>> CI for maintenance branches, so we do not even have the list now (we may
>>> have it for a certain date, but the list could change any time in the
>>> future).
>>>
>>> I think we should come up with a more official CI system so we always
>>> test the released versions (4.0, 4.1 ...) with pinned versions of
>>> packages (which are the "known working dependencies"), and be more relaxed
>>> for dev branches (4.x, master) because we need to test against new releases
>>> for our dependencies.
>>>
>>> More importantly, it would be really nice to have a single source of
>>> truth. We have too many places to pin the Python dependency versions.
>>>
>>> Tian
>>>
>>> On Sun, May 17, 2026 at 9:52 AM Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> I am at PyCon US today and the PyPI head just did a call out to audit
>>>> and pin dependencies because supply chain attacks are increasing hockey
>>>> stick style.
>>>>
>>>> I think we don’t need to pin just yet but let’s add publishing the
>>>> package versions we built with during CI.
>>>>
>>>> On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev <
>>>> [email protected]> wrote:
>>>>
>>>>> I think we should do something in response to the growing supply chain
>>>>> attacks rather than just leaving the problem to users. One alternative we
>>>>> could consider for Python specifically is an install target with upper
>>>>> bounded dependencies: `pip install "pyspark[deps-upper-bounded]"`. This
>>>>> wouldn't impact regular use, and seems like it would solve the other
>>>>> problems with publishing lock files, etc. As others have mentioned, this
>>>>> wouldn't *guarantee* security, but it would provide meaningful protection
>>>>> against the worst offenders we've recently seen.
>>>>>
>>>>> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:
>>>>>
>>>>>> > How about as a compromise, we publish (but don’t lock to) the pip
>>>>>> freeze outputs of the venvs we use for testing?
>>>>>>
>>>>>> > Where do you propose to publish? Spark website? Maybe in our github
>>>>>> repo somewhere?
>>>>>>
>>>>>> > I was thinking just in the publisher artifacts directory we already
>>>>>> do.
>>>>>>
>>>>>> +1, I'm fine with any approach, as long as it provides sufficient
>>>>>> info to let users know exactly which versions of dependencies were used
>>>>>> for testing.
>>>>>>
>>>>>> For Java/Scala, we have a dependency list generated by a script[1] in
>>>>>> the code repo, at [2]
>>>>>>
>>>>>> [1]
>>>>>> https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>>>>>> [2]
>>>>>> https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>>>>>
>>>>>> Thanks,
>>>>>> Cheng Pan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> I was thinking just in the publisher artifacts directory we already
>>>>>> do.
>>>>>>
>>>>>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Where do you propose to publish? Spark website? Maybe in our github
>>>>>>> repo somewhere? For python packages, users rarely look for artifacts
>>>>>>> (and they're difficult to find).
>>>>>>>
>>>>>>> Tian
>>>>>>>
>>>>>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I hear that. How about as a compromise, we publish (but don’t lock
>>>>>>>> to) the pip freeze outputs of the venvs we use for testing?
>>>>>>>>
>>>>>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think supply chain attacks are a problem, but I don’t think we
>>>>>>>>> want to be on the hook for a solution here, even if it’s meant just
>>>>>>>>> for our project.
>>>>>>>>>
>>>>>>>>> There are “good enough” approaches available today for Python that
>>>>>>>>> mitigate most of the risk by excluding recent releases when resolving
>>>>>>>>> what package versions to install.
>>>>>>>>>
>>>>>>>>> uv offers exclude-newer
>>>>>>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>.
>>>>>>>>> pip offers uploaded-prior-to
>>>>>>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>>>>>>> Poetry has an issue open
>>>>>>>>> <https://github.com/python-poetry/poetry/issues/10646> for a
>>>>>>>>> similar feature, plus at least one open PR to close it.
>>>>>>>>>
>>>>>>>>> Users concerned about supply chain attacks would probably get
>>>>>>>>> better results from using these options as compared to installing
>>>>>>>>> pinned dependencies provided by the projects they use.
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> So I think we can ship it as an optional distribution element
>>>>>>>>> (it's literally just another file folks can choose to download/use if
>>>>>>>>> they want).
>>>>>>>>>
>>>>>>>>> Asking users is an idea too, I could put together a survey if we
>>>>>>>>> want?
>>>>>>>>>
>>>>>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1,
>>>>>>>>>> ==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is
>>>>>>>>>> a nit and we don't need to focus on the syntax.
>>>>>>>>>>
>>>>>>>>>> I don't believe we can ship pyspark with an env lock file. That's
>>>>>>>>>> what users do in their own projects. It's not part of the Python
>>>>>>>>>> packaging system. What users normally do is install packages, test
>>>>>>>>>> them out, then lock them with either pip or uv: generate a lock file
>>>>>>>>>> for all dependencies and use it across their systems. It's not common
>>>>>>>>>> for packages to list out a "known working dependency list" for users.
>>>>>>>>>>
>>>>>>>>>> However, if we really want to try it out, we can do something
>>>>>>>>>> like `pip install pyspark[full-pinned]` and install every dependency
>>>>>>>>>> pyspark requires with a pinned version. If our users need an
>>>>>>>>>> out-of-the-box solution they can do that. We can also collect feedback
>>>>>>>>>> and see the sentiment from users.
>>>>>>>>>>
>>>>>>>>>> Tian
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> > If we consider PySpark the dominant package - meaning that if
>>>>>>>>>>> a user employs it, it must be the most important element in their
>>>>>>>>>>> project and everything else must comply with it - pinning versions
>>>>>>>>>>> might be viable.
>>>>>>>>>>>
>>>>>>>>>>> This is not always true, but definitely a major case.
>>>>>>>>>>>
>>>>>>>>>>> > I'm not familiar with Java dependency solutions or how users
>>>>>>>>>>> use spark with Java
>>>>>>>>>>>
>>>>>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>>>>>>> management. A product declares transitive dependencies with pinned
>>>>>>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>>>>>>>>>>> most reasonable version based on resolution rules. The rules are a
>>>>>>>>>>> little different in Maven, SBT and Gradle; the Maven docs[1] explain
>>>>>>>>>>> how it works.
>>>>>>>>>>>
>>>>>>>>>>> In short, in Java/Scala dependency management, the pinned
>>>>>>>>>>> version is more like a suggested version; it's easy for users to
>>>>>>>>>>> override.
>>>>>>>>>>>
>>>>>>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>>>>>>> world. Neither a pinned version nor the latest version seems ideal.
>>>>>>>>>>> Of the options
>>>>>>>>>>>
>>>>>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>>>>>
>>>>>>>>>>> 2 or 3 might be an acceptable solution? And, I still
>>>>>>>>>>> believe we should add a disclaimer that this compatibility only holds
>>>>>>>>>>> under the assumption that 3rd-party packages strictly adhere to
>>>>>>>>>>> semantic versioning.
>>>>>>>>>>>
>>>>>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>>>>>> are for. It's what env lock files are for.
>>>>>>>>>>>
>>>>>>>>>>> We definitely need such a dependency list in the PySpark release.
>>>>>>>>>>> It's really important for users who need to set up a reproducible
>>>>>>>>>>> environment several years after the release, and it is also a good
>>>>>>>>>>> reference for users who encounter 3rd-party package bugs, or battle
>>>>>>>>>>> with dependency conflicts when they install lots of packages in a
>>>>>>>>>>> single environment.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> TL;DR Tian is more correct, and == pinning versions is not
>>>>>>>>>>> achieving the desired outcome. There are other ways to do it; I can't
>>>>>>>>>>> think of any other Python package that works that way. This thread is
>>>>>>>>>>> conflating different things.
>>>>>>>>>>>
>>>>>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an
>>>>>>>>>>> overly-broad claim -- do you really think it works with 5.x in 10
>>>>>>>>>>> years? -- expressing "foo==2.0.0" is very likely overly narrow. That
>>>>>>>>>>> says "does not work with any other version at all" which is likely
>>>>>>>>>>> more incorrect and more problematic for users.
>>>>>>>>>>>
>>>>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>>>>>> are for. It's what env lock files are for.
>>>>>>>>>>>
>>>>>>>>>>> To be sure there is an art to figuring out the right dependency
>>>>>>>>>>> bounds. A reasonable compromise is to allow maintenance releases, as a
>>>>>>>>>>> default when there is nothing more specific known. That is, write
>>>>>>>>>>> "foo~=2.0.2" to mean ">=2.0.2 and <2.1".
>>>>>>>>>>>
>>>>>>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>>>>>>>> because Maven resolution is just pretty different, but mostly because
>>>>>>>>>>> the core Spark distribution is the 'server side' and is necessarily a
>>>>>>>>>>> 'fat jar', a sort of statically-compiled artifact that simply has some
>>>>>>>>>>> specific versions in it and can never have different versions because
>>>>>>>>>>> of runtime resolution differences.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree that a product must be usable first. Pinning the
>>>>>>>>>>>> version (to a specific number with `==`) will make pyspark unusable.
>>>>>>>>>>>>
>>>>>>>>>>>> First of all, I think we can agree that many users use PySpark
>>>>>>>>>>>> with other Python packages. If we conflict with other packages, `pip
>>>>>>>>>>>> install -r requirements.txt` won't work. It will complain that the
>>>>>>>>>>>> dependencies can't be resolved, which completely breaks our users'
>>>>>>>>>>>> workflow. Even if the user locks the dependency version, it won't
>>>>>>>>>>>> work. So the user has to install PySpark first, then the other
>>>>>>>>>>>> packages, to override PySpark's dependency. They can't put their
>>>>>>>>>>>> dependency list in a single file - that is a horrible user experience.
>>>>>>>>>>>>
>>>>>>>>>>>> When I look at controversial topics, I always have a strong
>>>>>>>>>>>> belief that I can't be the only smart person in the world. If an idea
>>>>>>>>>>>> is good, others must already be doing it. Can we find any recognized
>>>>>>>>>>>> package in the market that pins its dependencies to a specific
>>>>>>>>>>>> version? The only case where it works is when this package is *all*
>>>>>>>>>>>> the user needs. That's why we pin versions for docker images, HTTP
>>>>>>>>>>>> services, or standalone tools - users just need something that works
>>>>>>>>>>>> out of the box. If we consider PySpark the dominant package - meaning
>>>>>>>>>>>> that if a user employs it, it must be the most important element in
>>>>>>>>>>>> their project and everything else must comply with it - pinning
>>>>>>>>>>>> versions might be viable.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not familiar with Java dependency solutions or how users
>>>>>>>>>>>> use spark with Java, but I'm familiar with the Python ecosystem and
>>>>>>>>>>>> community. If we pin to a specific version, we will face significant
>>>>>>>>>>>> criticism. If we must do it, at least don't make it default. Like I
>>>>>>>>>>>> said above, I don't have a strong opinion about having a
>>>>>>>>>>>> `pyspark[pinned]` - if users only need pyspark and no other packages
>>>>>>>>>>>> they could use that. But that's extra effort for maintenance, and we
>>>>>>>>>>>> need to think about what's pinned. We have a lot of pyspark install
>>>>>>>>>>>> versions.
>>>>>>>>>>>>
>>>>>>>>>>>> Tian Gao
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think the community has already reached consensus to
>>>>>>>>>>>>> freeze dependencies in minor releases.
>>>>>>>>>>>>>
>>>>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>
>>>>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>>>>> > - Dependencies are frozen and behavioral changes are
>>>>>>>>>>>>> minimized in minor releases.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would interpret the proposed dependency policy as applying to
>>>>>>>>>>>>> both Java/Scala and Python dependency management for Spark. If so,
>>>>>>>>>>>>> that means PySpark will always use pinned dependency versions
>>>>>>>>>>>>> starting with 4.3.0. But if the intention is to only apply such a
>>>>>>>>>>>>> dependency policy to Java/Scala, then it creates a very strange
>>>>>>>>>>>>> situation - an extremely conservative dependency management strategy
>>>>>>>>>>>>> for Java/Scala, and an extremely liberal one for Python.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To Tian Gao,
>>>>>>>>>>>>>
>>>>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always
>>>>>>>>>>>>> make us more secure - that's my major point.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A product must be usable first, then come security, performance,
>>>>>>>>>>>>> etc. If it claims to require `foo>=2.0.0`, how do you ensure it is
>>>>>>>>>>>>> compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such
>>>>>>>>>>>>> incompatibility failures have occurred many times, e.g., [2]. On the
>>>>>>>>>>>>> contrary, if it claims to require `foo==2.0.0`, that means it was
>>>>>>>>>>>>> thoroughly tested with `foo==2.0.0`, and users take their own risk
>>>>>>>>>>>>> when using it with other `foo` versions. For example, if `foo`
>>>>>>>>>>>>> strictly follows semantic versioning, it should work with
>>>>>>>>>>>>> `foo<3.0.0`, but this is not Spark's responsibility; users should
>>>>>>>>>>>>> assess and assume the risk of incompatibility themselves.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Response inline
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One possibility would be to make the pinned version optional
>>>>>>>>>>>>>> (eg pyspark[pinned]) or publish a separate constraints file for
>>>>>>>>>>>>>> people to optionally use with -c?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this
>>>>>>>>>>>>>> is possible today for people using modern Python packaging workflows
>>>>>>>>>>>>>> that use lock files. In fact, it happens automatically; all
>>>>>>>>>>>>>> transitive dependencies are pinned in the lock file, and this is by
>>>>>>>>>>>>>> design.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> So for someone installing a fresh venv with uv, pip, or conda,
>>>>>>>>>>>>> where does this come from?
>>>>>>>>>>>>>
>>>>>>>>>>>>> The idea here is we provide the versions we used during the
>>>>>>>>>>>>> release stage so if folks want a “known safe” initial starting point
>>>>>>>>>>>>> for a new env they’ve got one.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Furthermore, it is straightforward to add additional
>>>>>>>>>>>>>> restrictions to your project spec (i.e. pyproject.toml) so that when
>>>>>>>>>>>>>> the packaging tool builds the lock file, it does it with whatever
>>>>>>>>>>>>>> restrictions you want that are specific to your project. That could
>>>>>>>>>>>>>> include specific versions or version ranges of libraries to exclude,
>>>>>>>>>>>>>> for example.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, but as it stands we leave it to the end user to start
>>>>>>>>>>>>> from scratch picking these versions. We can make their lives simpler
>>>>>>>>>>>>> by providing the versions we tested against with a lock file they can
>>>>>>>>>>>>> choose to use, ignore, or update to their desired versions and
>>>>>>>>>>>>> include.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also for interactive workloads I more often see a bare
>>>>>>>>>>>>> requirements file or even pip installs in notebook cells (but this
>>>>>>>>>>>>> could be sample bias).
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I had to do this, for example, on a personal project that
>>>>>>>>>>>>>> used PySpark Connect but which was pulling in a version of
>>>>>>>>>>>>>> grpc that was generating a lot of log noise
>>>>>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>>>>>> I pinned the version of grpc in my project file and let the packaging
>>>>>>>>>>>>>> tool resolve all the requirements across PySpark Connect and my
>>>>>>>>>>>>>> custom restrictions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>
>
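The specifier equivalences discussed in the thread above (`~=` as shorthand for a lower bound plus a prefix match, per PEP 440) can be sketched with the standard library alone. This is a simplified illustration, not a full PEP 440 implementation: it only handles plain numeric versions such as `2.0.1` (no pre-releases, epochs or local versions), and both function names are made up for this example.

```python
def expand_compatible_release(version: str) -> tuple[str, str]:
    """Expand `~=version` into its (lower bound, prefix match) clauses."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError("~= needs at least two release segments, e.g. 2.0")
    prefix = ".".join(parts[:-1])  # drop the last segment: 2.0.1 -> 2.0
    return f">={version}", f"=={prefix}.*"


def _release(version: str) -> tuple[int, ...]:
    """Parse a plain numeric version like '2.0.1' into (2, 0, 1)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(candidate: str, version: str) -> bool:
    """Check whether `candidate` satisfies `~=version` (numeric versions only)."""
    base = _release(version)
    if len(base) < 2:
        raise ValueError("~= needs at least two release segments, e.g. 2.0")
    cand = _release(candidate)
    # Pad with zeros so that 2.0 and 2.0.0 compare as equal.
    width = max(len(base), len(cand))
    cand_padded = cand + (0,) * (width - len(cand))
    base_padded = base + (0,) * (width - len(base))
    prefix = base[:-1]
    return cand_padded >= base_padded and cand_padded[: len(prefix)] == prefix
```

Under these assumptions, `~=2.0.1` admits `2.0.5` but not `2.1.0`, while `~=2.0` admits any `2.x` release and stops at `3.0.0`.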

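The "publish the pip freeze output" compromise from the thread amounts to recording `name==version` for everything installed in the test venv. A rough standard-library sketch of that listing follows. It only approximates what `pip freeze` prints for ordinary installs (editable and URL installs are not handled), and the function names are illustrative only.

```python
from importlib import metadata


def installed_pairs():
    """Yield (name, version) for each distribution visible to this interpreter."""
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            yield name, version


def freeze_lines(pairs):
    """Format (name, version) pairs as sorted, de-duplicated `name==version` lines."""
    unique = set(pairs)
    ordered = sorted(unique, key=lambda pair: (pair[0].lower(), pair[1]))
    return [f"{name}=={version}" for name, version in ordered]
```

A CI job could write `"\n".join(freeze_lines(installed_pairs()))` to a file next to the other release artifacts, so users can see which versions were tested without being locked to them.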