Re: [discuss] Pinning PySpark dependencies?

Tian Gao via dev Mon, 18 May 2026 17:03:26 -0700

I can work on a prototype. My thought is that we should keep the dependency
list in `pyproject.toml`. We can have dependency groups for all the different
scenarios (test/dev, minimum/lint/docs, etc.). Then for generating docker
images, we include `pyproject.toml` and pip install based on that. I
believe we can keep the single source of truth in that file (which is a
common way to do things) and still be flexible.


On Mon, May 18, 2026 at 4:55 PM Holden Karau <[email protected]> wrote:

> Single source of truth does sound desirable, let me take a look at
> narrowing that down a bit too.
>
> On Mon, May 18, 2026 at 4:30 PM Tian Gao via dev <[email protected]>
> wrote:
>
>> We can do either a list of packages from `pip freeze` on our website, or
>> a `pyspark[pinned]` that has `==`. I'm okay with either (or both).
>>
>> If we want to do that, we probably want to pin our package versions on
>> our stable spark versions. We only partially pin our dependencies for our
>> CI for maintenance branches, so we do not even have the list now (we may
>> have it for a certain date, but the list could change any time in the
>> future).
>>
>> I think we should come up with a more official CI system so we always
>> test the released versions (4.0, 4.1 ...) with pinned versions of
>> packages (which are the "known working dependencies"), and be more relaxed
>> for dev branches (4.x, master) because we need to test against new releases
>> for our dependencies.
>>
>> More importantly, it would be really nice to have a single source of
>> truth. We have too many places to pin the Python dependency versions.
>>
>> Tian
>>
>> On Sun, May 17, 2026 at 9:52 AM Holden Karau <[email protected]>
>> wrote:
>>
>>> I am at PyCon US today and the PyPI head just did a call out to audit
>>> and pin dependencies because the supply chain attacks are increasing hockey
>>> stick style.
>>>
>>> I think we don’t need to pin just yet but let’s add publishing the
>>> package versions we built with during CI.
>>>
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>> <https://www.fighthealthinsurance.com/?q=hk_email>
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> Pronouns: she/her
>>>
>>> On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev <
>>> [email protected]> wrote:
>>>
>>>> I think we should do something in response to the growing supply chain
>>>> attacks rather than just leaving the problem to users. One alternative we
>>>> could consider for Python specifically is an install target with upper
>>>> bounded dependencies: `pip install "pyspark[deps-upper-bounded]"`. This
>>>> wouldn't impact regular use, and seems like it would solve the other
>>>> problems with publishing lock files, etc. As others have mentioned, this
>>>> wouldn't *guarantee* security, but it would provide meaningful protection
>>>> against the worst offenders we've recently seen.
>>>>
>>>> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected]> wrote:
>>>>
>>>>> > How about as a compromise, we publish (but don’t lock to) the pip
>>>>> freeze outputs of the venvs we use for testing?
>>>>>
>>>>> > Where do you propose to publish? Spark website? Maybe in our github
>>>>> repo somewhere?
>>>>>
>>>>> > I was thinking just in the publisher artifacts directory we already
>>>>> do.
>>>>>
>>>>> +1, I'm fine with any approach, as long as it provides sufficient info
>>>>> to let users know exactly which versions of dependencies were used for
>>>>> testing.
>>>>>
>>>>> For Java/Scala, we have a dependency list in the code repo [2],
>>>>> generated by a script [1].
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>>>>> [2]
>>>>> https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>>
>>>>>
>>>>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected]>
>>>>> wrote:
>>>>>
>>>>> I was thinking just in the publisher artifacts directory we already do.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Where do you propose to publish? Spark website? Maybe in our github
>>>>>> repo somewhere? For python packages, users rarely look for artifacts (and
>>>>>> they're difficult to find).
>>>>>>
>>>>>> Tian
>>>>>>
>>>>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I hear that. How about as a compromise, we publish (but don’t lock
>>>>>>> to) the pip freeze outputs of the venvs we use for testing?
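
As a rough sketch of what such a published list could contain, the snippet below
builds `name==version` lines for the current environment using only the standard
library. It approximates `pip freeze` and is not a replacement for it (editable
and VCS installs are not handled); the helper name `freeze_lines` is hypothetical.

```python
from importlib.metadata import distributions


def freeze_lines():
    """Return sorted 'name==version' lines for the current environment."""
    pins = set()
    for dist in distributions():
        meta = dist.metadata
        if not meta:
            # Skip distributions whose metadata is missing or unreadable.
            continue
        name = meta.get("Name")
        if name:
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # A CI job could redirect this output into a file stored with the
    # other release artifacts.
    print("\n".join(freeze_lines()))
```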
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I think supply chain attacks are a problem, but I don’t think we
>>>>>>>> want to be on the hook for a solution here, even if it’s meant just
>>>>>>>> for our project.
>>>>>>>>
>>>>>>>> There are “good enough” approaches available today for Python that
>>>>>>>> mitigate most of the risk by excluding recent releases when resolving
>>>>>>>> what package versions to install.
>>>>>>>>
>>>>>>>> uv offers exclude-newer
>>>>>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip
>>>>>>>> offers uploaded-prior-to
>>>>>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>>>>>> Poetry has an issue open
>>>>>>>> <https://github.com/python-poetry/poetry/issues/10646> for a
>>>>>>>> similar feature, plus at least one open PR to close it.
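
The idea behind these options can be sketched as a cutoff filter: ignore anything
uploaded after a chosen date, then take the newest remaining release. This only
illustrates the concept and is not how uv or pip implement it; the upload history
for `foo` and the helper `newest_before` are made up.

```python
from datetime import datetime, timezone

# Hypothetical upload history for a package "foo"; the dates are made up.
RELEASES = {
    "2.0.0": datetime(2026, 1, 10, tzinfo=timezone.utc),
    "2.0.1": datetime(2026, 2, 20, tzinfo=timezone.utc),
    "2.1.0": datetime(2026, 3, 28, tzinfo=timezone.utc),
}


def newest_before(releases, cutoff):
    """Pick the highest version uploaded strictly before the cutoff."""
    eligible = [v for v, uploaded in releases.items() if uploaded < cutoff]
    if not eligible:
        return None
    # Compare versions numerically, so "2.10.0" would sort above "2.9.0".
    return max(eligible, key=lambda v: tuple(int(p) for p in v.split(".")))


# Only accept releases that have been public for at least a week.
CUTOFF = datetime(2026, 3, 23, tzinfo=timezone.utc)
print(newest_before(RELEASES, CUTOFF))
```

With this cutoff, the release uploaded on March 28 is skipped and the resolver
falls back to the previous one.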
>>>>>>>>
>>>>>>>> Users concerned about supply chain attacks would probably get
>>>>>>>> better results from using these options as compared to installing
>>>>>>>> pinned dependencies provided by the projects they use.
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> So I think we can ship it as an optional distribution element (it's
>>>>>>>> literally just another file folks can choose to download/use if they
>>>>>>>> want).
>>>>>>>>
>>>>>>>> Asking users is an idea too, I could put together a survey if we
>>>>>>>> want?
>>>>>>>>
>>>>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1,
>>>>>>>>> foo==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This is a
>>>>>>>>> nit and we don't need to focus on the syntax.
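
For reference, the expansion described above can be written out as a small
sketch. It handles only plain numeric release segments (no pre-, post-, or dev
releases), and `expand_compatible_release` is a hypothetical helper, not an API
from pip or any packaging library.

```python
def expand_compatible_release(requirement):
    """Rewrite 'name~=X.Y.Z' as the equivalent '>=' plus prefix-match pair."""
    name, version = (part.strip() for part in requirement.split("~="))
    segments = version.split(".")
    if len(segments) < 2:
        # "~=2" is not a valid compatible-release specifier.
        raise ValueError("~= needs at least two release segments")
    # Drop the last segment and match everything under the remaining prefix.
    prefix = ".".join(segments[:-1])
    return f"{name}>={version}, =={prefix}.*"


print(expand_compatible_release("foo~=2.0.1"))  # foo>=2.0.1, ==2.0.*
print(expand_compatible_release("foo~=2.0"))    # foo>=2.0, ==2.*
```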
>>>>>>>>>
>>>>>>>>> I don't believe we can ship pyspark with an env lock file. That's
>>>>>>>>> what users do in their own projects. It's not part of the Python
>>>>>>>>> packaging system. What users normally do is install packages, test them
>>>>>>>>> out, then lock them with either pip or uv: generate a lock file for all
>>>>>>>>> dependencies and use it across their systems. It's not common for
>>>>>>>>> packages to list out a "known working dependency list" for users.
>>>>>>>>>
>>>>>>>>> However, if we really want to try it out, we can do something like
>>>>>>>>> `pip install pyspark[full-pinned]` and install every dependency pyspark
>>>>>>>>> requires with a pinned version. If our users need an out-of-the-box
>>>>>>>>> solution, they can do that. We can also collect feedback and see the
>>>>>>>>> sentiment from users.
>>>>>>>>>
>>>>>>>>> Tian
>>>>>>>>>
>>>>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> > If we consider PySpark the dominant package - meaning that if a
>>>>>>>>>> user employs it, it must be the most important element in their
>>>>>>>>>> project and everything else must comply with it - pinning versions
>>>>>>>>>> might be viable.
>>>>>>>>>>
>>>>>>>>>> This is not always true, but definitely a major case.
>>>>>>>>>>
>>>>>>>>>> > I'm not familiar with Java dependency solutions or how users
>>>>>>>>>> use spark with Java
>>>>>>>>>>
>>>>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency
>>>>>>>>>> management. A product declares transitive dependencies with pinned
>>>>>>>>>> versions, and the package manager (Maven, SBT, Gradle, etc.) picks the
>>>>>>>>>> most reasonable version based on resolution rules. The rules are a
>>>>>>>>>> little different in Maven, SBT and Gradle; the Maven docs[1] explain
>>>>>>>>>> how it works.
>>>>>>>>>>
>>>>>>>>>> In short, in Java/Scala dependency management, the pinned version
>>>>>>>>>> is more like a suggested version; it's easy for users to override.
>>>>>>>>>>
>>>>>>>>>> As Owen pointed out, things are completely different in the Python
>>>>>>>>>> world: neither a pinned version nor the latest version seems ideal. Of
>>>>>>>>>> these options:
>>>>>>>>>>
>>>>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>>>>
>>>>>>>>>> 2 or 3 might be an acceptable solution? And I still believe we
>>>>>>>>>> should add a disclaimer that this compatibility only holds under the
>>>>>>>>>> assumption that 3rd-party packages strictly adhere to semantic
>>>>>>>>>> versioning.
>>>>>>>>>>
>>>>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>>>>> are for. It's what env lock files are for.
>>>>>>>>>>
>>>>>>>>>> We definitely need such a dependency list in the PySpark release.
>>>>>>>>>> It's really important for users who need to set up a reproducible
>>>>>>>>>> environment several years after the release, and it is also a good
>>>>>>>>>> reference for users who encounter bugs in 3rd-party packages, or battle
>>>>>>>>>> with dependency conflicts when they install lots of packages in a single
>>>>>>>>>> environment.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Cheng Pan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> TL;DR Tian is more correct, and == pinning versions is not
>>>>>>>>>> achieving the desired outcome. There are other ways to do it; I can't
>>>>>>>>>> think of any other Python package that works that way. This thread is
>>>>>>>>>> conflating different things.
>>>>>>>>>>
>>>>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an
>>>>>>>>>> overly-broad claim -- do you really think it works with 5.x in 10
>>>>>>>>>> years? -- expressing "foo==2.0.0" is very likely overly narrow. That
>>>>>>>>>> says "does not work with any other version at all" which is likely more
>>>>>>>>>> incorrect and more problematic for users.
>>>>>>>>>>
>>>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock,
>>>>>>>>>> requirements.txt -- expressing a known good / recommended specific
>>>>>>>>>> resolved environment. That is _not_ what Python dependency constraints
>>>>>>>>>> are for. It's what env lock files are for.
>>>>>>>>>>
>>>>>>>>>> To be sure there is an art to figuring out the right dependency
>>>>>>>>>> bounds. A reasonable compromise is to allow maintenance releases, as a
>>>>>>>>>> default when there is nothing more specific known. That is, write
>>>>>>>>>> "foo~=2.0.2" to mean ">=2.0.2 and < 2.1".
>>>>>>>>>>
>>>>>>>>>> The analogy to Scala/Java/Maven land does not quite work, partly
>>>>>>>>>> because Maven resolution is just pretty different, but mostly because
>>>>>>>>>> the core Spark distribution is the 'server side' and is necessarily a
>>>>>>>>>> 'fat jar', a sort of statically-compiled artifact that simply has some
>>>>>>>>>> specific versions in it and can never have different versions because
>>>>>>>>>> of runtime resolution differences.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree that a product must be usable first. Pinning the version
>>>>>>>>>>> (to a specific number with `==`) will make pyspark unusable.
>>>>>>>>>>>
>>>>>>>>>>> First of all, I think we can agree that many users use PySpark
>>>>>>>>>>> with other Python packages. If we conflict with other packages, `pip
>>>>>>>>>>> install -r requirements.txt` won't work. It will complain that the
>>>>>>>>>>> dependencies can't be resolved, which completely breaks our users'
>>>>>>>>>>> workflow. Even if the user locks the dependency version, it won't work.
>>>>>>>>>>> So the user would have to install PySpark first, then the other
>>>>>>>>>>> packages, to override PySpark's dependency. They can't put their
>>>>>>>>>>> dependency list in a single file - that is a horrible user experience.
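
The resolution failure described above can be shown with a toy model: a resolver
has to find one version of `foo` that every requirement accepts. This is a
deliberately simplified sketch, not pip's resolver; the candidate versions, the
second package's requirement, and the helper names are hypothetical.

```python
from operator import eq, ge, lt

OPS = {"==": eq, ">=": ge, "<": lt}


def parse(version):
    """Turn '2.1.0' into (2, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def allowed(version, constraints):
    """True if the version meets every (operator, bound) pair."""
    return all(OPS[op](parse(version), parse(bound)) for op, bound in constraints)


def resolve(candidates, *requirement_sets):
    """Return the candidate versions acceptable to every requirement set."""
    return [v for v in candidates if all(allowed(v, r) for r in requirement_sets)]


CANDIDATES = ["2.0.0", "2.0.1", "2.1.0", "2.4.0", "3.0.0"]
other_package = [(">=", "2.1.0")]           # some other library needs a newer foo

pinned = [("==", "2.0.0")]                  # foo==2.0.0
ranged = [(">=", "2.0.0"), ("<", "3.0.0")]  # foo>=2.0.0,<3.0.0

print(resolve(CANDIDATES, pinned, other_package))  # []
print(resolve(CANDIDATES, ranged, other_package))  # ['2.1.0', '2.4.0']
```

With the `==` pin there is no version left to install, while the bounded range
still leaves the resolver room to satisfy both packages.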
>>>>>>>>>>>
>>>>>>>>>>> When I look at controversial topics, I always have a strong
>>>>>>>>>>> belief that I can't be the only smart person in the world. If an idea
>>>>>>>>>>> is good, others must already be doing it. Can we find any recognized
>>>>>>>>>>> package in the market that pins its dependencies to a specific version?
>>>>>>>>>>> The only case where it works is when this package is *all* the user
>>>>>>>>>>> needs. That's why we pin versions for docker images, HTTP services, or
>>>>>>>>>>> standalone tools - users just need something that works out of the box.
>>>>>>>>>>> If we consider PySpark the dominant package - meaning that if a user
>>>>>>>>>>> employs it, it must be the most important element in their project and
>>>>>>>>>>> everything else must comply with it - pinning versions might be viable.
>>>>>>>>>>>
>>>>>>>>>>> I'm not familiar with Java dependency solutions or how users use
>>>>>>>>>>> spark with Java, but I'm familiar with the Python ecosystem and
>>>>>>>>>>> community. If we pin to a specific version, we will face significant
>>>>>>>>>>> criticism. If we must do it, at least don't make it default. Like I said
>>>>>>>>>>> above, I don't have a strong opinion about having a `pyspark[pinned]` -
>>>>>>>>>>> if users only need pyspark and no other packages they could use that.
>>>>>>>>>>> But that's extra effort for maintenance, and we need to think about
>>>>>>>>>>> what's pinned. We have a lot of pyspark install versions.
>>>>>>>>>>>
>>>>>>>>>>> Tian Gao
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think the community has already reached consensus to freeze
>>>>>>>>>>>> dependencies in minor releases.
>>>>>>>>>>>>
>>>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence
>>>>>>>>>>>> [1]
>>>>>>>>>>>>
>>>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>>>> > - Dependencies are frozen and behavioral changes are
>>>>>>>>>>>> minimized in minor releases.
>>>>>>>>>>>>
>>>>>>>>>>>> I would interpret the proposed dependency policy as applying to
>>>>>>>>>>>> both Java/Scala and Python dependency management for Spark. If so, that
>>>>>>>>>>>> means PySpark will always use pinned dependency versions starting with
>>>>>>>>>>>> 4.3.0. But if the intention is to only apply such a dependency policy
>>>>>>>>>>>> to Java/Scala, then it creates a very strange situation - an extremely
>>>>>>>>>>>> conservative dependency management strategy for Java/Scala, and an
>>>>>>>>>>>> extremely liberal one for Python.
>>>>>>>>>>>>
>>>>>>>>>>>> To Tian Gao,
>>>>>>>>>>>>
>>>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always
>>>>>>>>>>>> make us more secure - that's my major point.
>>>>>>>>>>>>
>>>>>>>>>>>> A product must be usable first, then come security, performance, etc.
>>>>>>>>>>>> If it claims to require `foo>=2.0.0`, how do you ensure it is compatible
>>>>>>>>>>>> with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, such incompatibility
>>>>>>>>>>>> failures have occurred many times, e.g., [2]. On the contrary, if it
>>>>>>>>>>>> claims to require `foo==2.0.0`, that means it was thoroughly tested with
>>>>>>>>>>>> `foo==2.0.0`, and users take their own risk when using it with other
>>>>>>>>>>>> `foo` versions. For example, if `foo` strictly follows semantic
>>>>>>>>>>>> versioning, it should work with `foo<3.0.0`, but this is not Spark's
>>>>>>>>>>>> responsibility; users should assess and assume the risk of
>>>>>>>>>>>> incompatibility themselves.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Response inline
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> One possibility would be to make the pinned version optional
>>>>>>>>>>>>> (eg pyspark[pinned]) or publish a separate constraints file for people
>>>>>>>>>>>>> to optionally use with -c?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but this
>>>>>>>>>>>>> is possible today for people using modern Python packaging workflows
>>>>>>>>>>>>> that use lock files. In fact, it happens automatically; all transitive
>>>>>>>>>>>>> dependencies are pinned in the lock file, and this is by design.
>>>>>>>>>>>>>
>>>>>>>>>>>> So for someone installing a fresh venv with uv, pip, or conda,
>>>>>>>>>>>> where does this come from?
>>>>>>>>>>>>
>>>>>>>>>>>> The idea here is we provide the versions we used during the
>>>>>>>>>>>> release stage so if folks want a “known safe” initial starting point
>>>>>>>>>>>> for a new env they’ve got one.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Furthermore, it is straightforward to add additional
>>>>>>>>>>>>> restrictions to your project spec (i.e. pyproject.toml) so that when
>>>>>>>>>>>>> the packaging tool builds the lock file, it does it with whatever
>>>>>>>>>>>>> restrictions you want that are specific to your project. That could
>>>>>>>>>>>>> include specific versions or version ranges of libraries to exclude,
>>>>>>>>>>>>> for example.
>>>>>>>>>>>>>
>>>>>>>>>>>> Yes, but as it stands we leave it to the end user to start from
>>>>>>>>>>>> scratch picking these versions. We can make their lives simpler by
>>>>>>>>>>>> providing the versions we tested against with a lock file they can
>>>>>>>>>>>> choose to use, ignore, or update to their desired versions and include.
>>>>>>>>>>>>
>>>>>>>>>>>> Also for interactive workloads I more often see a bare
>>>>>>>>>>>> requirements file or even pip installs in nb cells (but this could be
>>>>>>>>>>>> sample bias).
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I had to do this, for example, on a personal project that used
>>>>>>>>>>>>> PySpark Connect but which was pulling in a version of grpc
>>>>>>>>>>>>> that was generating a lot of log noise
>>>>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>>>>> I pinned the version of grpc in my project file and let the packaging
>>>>>>>>>>>>> tool resolve all the requirements across PySpark Connect and my custom
>>>>>>>>>>>>> restrictions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>
>
