Re: [discuss] Pinning PySpark dependencies?

Nicholas Chammas Tue, 19 May 2026 06:01:45 -0700

Holden, you didn’t mention or link to the ticket you filed.

This is the ticket I filed about roughly the same issue back in 2020: 
SPARK-31167 / associated PR <https://github.com/apache/spark/pull/27928>



> On May 18, 2026, at 8:12 PM, Holden Karau <[email protected]> wrote:
> 
> Awesome, I started on one but it’s super rough so I’ll leave it to you Tian :) 
> (filed a JIRA so grab the existing JIRA for coordination)
> 
> 
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/ 
> <https://www.fighthealthinsurance.com/?q=hk_email>
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
>  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her
> 
> On Mon, May 18, 2026 at 5:03 PM Tian Gao <[email protected] 
> <mailto:[email protected]>> wrote:
>> I can work on a prototype. My thought is that we should keep the dependency 
>> list in `pyproject.toml`. We can have dependency groups for all different 
>> scenarios (test/dev, minimum/lint/docs etc). Then for generating docker 
>> images, we include `pyproject.toml` and pip install based on that. I believe 
>> we can keep the only truth in that file (which is a common way to do things) 
>> and still be flexible.
>> 
>> On Mon, May 18, 2026 at 4:55 PM Holden Karau <[email protected] 
>> <mailto:[email protected]>> wrote:
>>> Single source of truth does sound desirable, let me take a look at 
>>> narrowing that down a bit too.
>>> 
>>> On Mon, May 18, 2026 at 4:30 PM Tian Gao via dev <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>>> We can do either a list of packages from `pip freeze` on our website, or a 
>>>> `pyspark[pinned]` that has `==`. I'm okay with either (or both).
>>>> 
>>>> If we want to do that, we probably want to pin our package versions on our 
>>>> stable spark versions. We only partially pin our dependencies for our CI 
>>>> for maintenance branches, so we do not even have the list now (we may have 
>>>> it for a certain date, but the list could change any time in the future).
>>>> 
>>>> I think we should come up with a more official CI system so we always test 
>>>> the released versions (4.0, 4.1 ...) with pinned versions of packages 
>>>> (which are the "known working dependencies"), and be more relaxed for dev 
>>>> branches (4.x, master) because we need to test against new releases for 
>>>> our dependencies.
>>>> 
>>>> More importantly, it would be really nice to have a single source of 
>>>> truth. We have too many places to pin the python dependency versions.
>>>> 
>>>> Tian
>>>> 
>>>> On Sun, May 17, 2026 at 9:52 AM Holden Karau <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>>> I am at PyCon US today and the PyPI head just did a call out to audit 
>>>>> and pin dependencies because the supply chain attacks are increasing 
>>>>> hockey stick style.
>>>>> 
>>>>> I think we don’t need to pin just yet but let’s add publishing the 
>>>>> package versions we built with during CI.
>>>>> 
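A minimal sketch of what "publishing the package versions we built with" could look like, assuming the list is taken from whatever environment the script runs in. It uses only the standard library's `importlib.metadata` to produce `name==version` lines similar to `pip freeze`; the function name `freeze_lines` is hypothetical and this is not an existing Spark script.

```python
from importlib import metadata


def freeze_lines():
    """Return 'name==version' lines for every installed distribution."""
    pins = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            pins[name.lower()] = f"{name}=={dist.version}"
    # Sort by lowercased package name for a stable, diff-friendly listing.
    return [pins[key] for key in sorted(pins)]


lines = freeze_lines()
print("\n".join(lines))
```

A CI job could write these lines to a file next to the other release artifacts, which users could then pass to `pip install -c` if they want the tested versions.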
>>>>> 
>>>>> 
>>>>> On Wed, Apr 1, 2026 at 7:48 AM Devin Petersohn via dev 
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>> I think we should do something in response to the growing supply chain 
>>>>>> attacks rather than just leaving the problem to users. One alternative 
>>>>>> we could consider for Python specifically is an install target with 
>>>>>> upper bounded dependencies: `pip install "pyspark[deps-upper-bounded]"`. 
>>>>>> This wouldn't impact regular use, and seems like it would solve the 
>>>>>> other problems with publishing lock files, etc. As others have 
>>>>>> mentioned, this wouldn't *guarantee* security, but it would provide 
>>>>>> meaningful protection against the worst offenders we've recently seen.
>>>>>> 
>>>>>> On Wed, Apr 1, 2026 at 9:37 AM Cheng Pan <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> > How about as a compromise, we publish (but don’t lock to) the pip 
>>>>>>> > freeze outputs of the venvs we use for testing?
>>>>>>> 
>>>>>>> > Where do you propose to publish? Spark website? Maybe in our github 
>>>>>>> > repo somewhere?
>>>>>>> 
>>>>>>> > I was thinking just in the publisher artifacts directory we already 
>>>>>>> > do.
>>>>>>> 
>>>>>>> +1, I'm fine with any approach, as long as it provides sufficient info 
>>>>>>> to let users know exactly which versions of dependencies were used for 
>>>>>>> testing. 
>>>>>>> 
>>>>>>> For Java/Scala, we have a dependency list generated by a script[1] in 
>>>>>>> the code repo, at [2]
>>>>>>> 
>>>>>>> [1] 
>>>>>>> https://github.com/apache/spark/blob/branch-4.1/dev/test-dependencies.sh
>>>>>>> [2] 
>>>>>>> https://github.com/apache/spark/blob/branch-4.1/dev/deps/spark-deps-hadoop-3-hive-2.3
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Cheng Pan
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Mar 31, 2026, at 03:12, Holden Karau <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> 
>>>>>>>> I was thinking just in the publisher artifacts directory we already do.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Mar 30, 2026 at 10:26 AM Tian Gao <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>> Where do you propose to publish? Spark website? Maybe in our github 
>>>>>>>>> repo somewhere? For python packages, users rarely look for artifacts 
>>>>>>>>> (and it's difficult to find).
>>>>>>>>> 
>>>>>>>>> Tian
>>>>>>>>> 
>>>>>>>>> On Mon, Mar 30, 2026 at 10:04 AM Holden Karau <[email protected] 
>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>> I hear that. How about as a compromise, we publish (but don’t lock 
>>>>>>>>>> to) the pip freeze outputs of the venvs we use for testing?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Mar 30, 2026 at 8:04 AM Nicholas Chammas 
>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>> wrote:
>>>>>>>>>>> I think supply chain attacks are a problem, but I don’t think we 
>>>>>>>>>>> want to be on the hook for a solution here, even if it’s meant just 
>>>>>>>>>>> for our project.
>>>>>>>>>>> 
>>>>>>>>>>> There are “good enough” approaches available today for Python that 
>>>>>>>>>>> mitigate most of the risk by excluding recent releases when 
>>>>>>>>>>> resolving what package versions to install.
>>>>>>>>>>> 
>>>>>>>>>>> uv offers exclude-newer 
>>>>>>>>>>> <https://docs.astral.sh/uv/reference/settings/#exclude-newer>. pip 
>>>>>>>>>>> offers uploaded-prior-to 
>>>>>>>>>>> <https://pip.pypa.io/en/stable/cli/pip_index/#cmdoption-uploaded-prior-to>.
>>>>>>>>>>>  Poetry has an issue open 
>>>>>>>>>>> <https://github.com/python-poetry/poetry/issues/10646> for a 
>>>>>>>>>>> similar feature, plus at least one open PR to close it.
>>>>>>>>>>> 
>>>>>>>>>>> Users concerned about supply chain attacks would probably get 
>>>>>>>>>>> better results from using these options as compared to installing 
>>>>>>>>>>> pinned dependencies provided by the projects they use.
>>>>>>>>>>> 
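The "exclude recent releases" idea can be sketched roughly as follows. The versions and upload dates for the imaginary package `foo` are made up, and the helper `pick_release` is hypothetical. Real resolvers such as uv and pip also apply version constraints, which this sketch ignores; it only shows the cutoff filter.

```python
from datetime import datetime, timezone

# Hypothetical upload times for an imaginary package "foo".
UPLOADS = {
    "2.0.0": datetime(2026, 1, 10, tzinfo=timezone.utc),
    "2.0.1": datetime(2026, 2, 20, tzinfo=timezone.utc),
    "2.1.0": datetime(2026, 3, 29, tzinfo=timezone.utc),
}


def pick_release(uploads, cutoff):
    """Return the most recently uploaded version published before cutoff."""
    eligible = {version: when for version, when in uploads.items() if when < cutoff}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


cutoff = datetime(2026, 3, 1, tzinfo=timezone.utc)
print(pick_release(UPLOADS, cutoff))  # prints 2.0.1
```

A release uploaded after the cutoff, such as `2.1.0` here, is never considered, so a compromised upload has time to be noticed and yanked before anyone using the cutoff installs it.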
>>>>>>>>>>> Nick
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Mar 30, 2026, at 3:31 AM, Holden Karau <[email protected] 
>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> So I think we can ship it as an optional distribution element 
>>>>>>>>>>>> (it's literally just another file folks can choose to download/use 
>>>>>>>>>>>> if they want).
>>>>>>>>>>>> 
>>>>>>>>>>>> Asking users is an idea too, I could put together a survey if we 
>>>>>>>>>>>> want?
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sun, Mar 29, 2026 at 11:14 PM Tian Gao via dev 
>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>> I believe "foo~=2.0.1" is syntactic sugar for "foo>=2.0.1, 
>>>>>>>>>>>>> foo==2.0.*". Similarly, "foo>=2.0.0, <3.0.0" is "foo~=2.0". This 
>>>>>>>>>>>>> is a nit and we don't need to focus on the syntax.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I don't believe we can ship pyspark with an env lock file. That's 
>>>>>>>>>>>>> what users do in their own projects. It's not part of python 
>>>>>>>>>>>>> package system. What users do is normally install packages, test 
>>>>>>>>>>>>> it out, then lock it with either pip or uv - generate a lock file 
>>>>>>>>>>>>> for all dependencies and use it across their systems. It's not 
>>>>>>>>>>>>> common for packages to list out a "known working dependency list" 
>>>>>>>>>>>>> for users.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> However, if we really want to try it out, we can do something 
>>>>>>>>>>>>> like `pip install pyspark[full-pinned]` and install every 
>>>>>>>>>>>>> dependency pyspark requires with a pinned version. If our users 
>>>>>>>>>>>>> need an out-of-the-box solution they can do that. We can also 
>>>>>>>>>>>>> collect feedback and see the sentiment from users.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Tian 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Mar 29, 2026 at 10:29 PM Cheng Pan <[email protected] 
>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>> > If we consider PySpark the dominant package - meaning that if 
>>>>>>>>>>>>>> > a user employs it, it must be the most important element in 
>>>>>>>>>>>>>> > their project and everything else must comply with it - 
>>>>>>>>>>>>>> > pinning versions might be viable.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is not always true, but definitely a major case.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> > I'm not familiar with Java dependency solutions or how users 
>>>>>>>>>>>>>> > use spark with Java
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In Java/Scala, it's rare to use dynamic versions for dependency 
>>>>>>>>>>>>>> management. A product declares its transitive dependencies with 
>>>>>>>>>>>>>> pinned versions, and the package manager (Maven, SBT, Gradle, 
>>>>>>>>>>>>>> etc.) picks the most reasonable version based on resolution 
>>>>>>>>>>>>>> rules. The rules are a little different in Maven, SBT and 
>>>>>>>>>>>>>> Gradle; the Maven docs[1] explain how they work.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In short, in Java/Scala dependency management, the pinned 
>>>>>>>>>>>>>> version is more like a suggested version; users can easily 
>>>>>>>>>>>>>> override it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As Owen pointed out, things are completely different in the 
>>>>>>>>>>>>>> Python world. Neither a pinned version nor the latest version 
>>>>>>>>>>>>>> seems ideal. Of the options
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. pinned version (foo==2.0.0)
>>>>>>>>>>>>>> 2. allow maintenance releases (foo~=2.0.0)
>>>>>>>>>>>>>> 3. allow minor feature releases (foo>=2.0.0,<3.0.0)
>>>>>>>>>>>>>> 4. latest version (foo>=2.0.0, or foo)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2 or 3 might be an acceptable solution? And I still 
>>>>>>>>>>>>>> believe we should add a disclaimer that this compatibility only 
>>>>>>>>>>>>>> holds under the assumption that 3rd-party packages strictly 
>>>>>>>>>>>>>> adhere to semantic versioning.
>>>>>>>>>>>>>> 
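To make the four options above concrete, here is a small sketch that checks which candidate versions each specifier style would accept. It is a simplified stand-in for the real version-specifier rules: it only understands plain three-part numeric versions, and the helper names are hypothetical. The candidate versions are made up.

```python
def parse(version):
    """Turn '2.0.5' into (2, 0, 5). Only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split("."))


def pinned(candidate, base):
    """foo==base"""
    return parse(candidate) == parse(base)


def compatible(candidate, base):
    """foo~=base: at least base, with all but the last component unchanged."""
    c, b = parse(candidate), parse(base)
    return c >= b and c[: len(b) - 1] == b[:-1]


def below_next_major(candidate, base):
    """foo>=base,<(major+1).0.0"""
    c, b = parse(candidate), parse(base)
    return b <= c < (b[0] + 1,)


def at_least(candidate, base):
    """foo>=base"""
    return parse(candidate) >= parse(base)


CANDIDATES = ["2.0.0", "2.0.5", "2.3.4", "3.1.0"]
STRATEGIES = {
    "foo==2.0.0": pinned,
    "foo~=2.0.0": compatible,
    "foo>=2.0.0,<3.0.0": below_next_major,
    "foo>=2.0.0": at_least,
}

accepted = {
    spec: [v for v in CANDIDATES if check(v, "2.0.0")]
    for spec, check in STRATEGIES.items()
}
for spec, versions in accepted.items():
    print(f"{spec:<20} accepts {versions}")
```

Each step down the list admits strictly more versions, which is the trade-off being discussed: fewer surprises from untested releases versus fewer resolution conflicts with a user's other packages.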
>>>>>>>>>>>>>> > You can totally produce a sort of 'lock' file -- uv.lock, 
>>>>>>>>>>>>>> > requirements.txt -- expressing a known good / recommended 
>>>>>>>>>>>>>> > specific resolved environment. That is _not_ what Python 
>>>>>>>>>>>>>> > dependency constraints are for. It's what env lock files are 
>>>>>>>>>>>>>> > for.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We definitely need such a dependency list in each PySpark release. 
>>>>>>>>>>>>>> It's really important for users who need to set up a reproducible 
>>>>>>>>>>>>>> environment several years after the release, and it is also a 
>>>>>>>>>>>>>> good reference for users who encounter bugs in 3rd-party packages 
>>>>>>>>>>>>>> or battle dependency conflicts when they install lots of 
>>>>>>>>>>>>>> packages in a single environment.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [1] 
>>>>>>>>>>>>>> https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mar 30, 2026, at 11:13, Sean Owen <[email protected] 
>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> TL;DR Tian is more correct, and == pinning versions is not 
>>>>>>>>>>>>>>> achieving the desired outcome. There are other ways to do it; I 
>>>>>>>>>>>>>>> can't think of any other Python package that works that way. 
>>>>>>>>>>>>>>> This thread is conflating different things.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> While expressing dependence on "foo>=2.0.0" indeed can be an 
>>>>>>>>>>>>>>> overly-broad claim -- do you really think it works with 5.x in 
>>>>>>>>>>>>>>> 10 years? -- expressing "foo==2.0.0" is very likely overly 
>>>>>>>>>>>>>>> narrow. That says "does not work with any other version at all" 
>>>>>>>>>>>>>>> which is likely more incorrect and more problematic for users.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> You can totally produce a sort of 'lock' file -- uv.lock, 
>>>>>>>>>>>>>>> requirements.txt -- expressing a known good / recommended 
>>>>>>>>>>>>>>> specific resolved environment. That is _not_ what Python 
>>>>>>>>>>>>>>> dependency constraints are for. It's what env lock files are 
>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> To be sure there is an art to figuring out the right dependency 
>>>>>>>>>>>>>>> bounds. A reasonable compromise is to allow maintenance 
>>>>>>>>>>>>>>> releases, as a default when there is nothing more specific 
>>>>>>>>>>>>>>> known. That is, write "foo~=2.0.2" to mean ">=2.0.2 and < 2.1".
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The analogy to Scala/Java/Maven land does not quite work, 
>>>>>>>>>>>>>>> partly because Maven resolution is just pretty different, but 
>>>>>>>>>>>>>>> mostly because the core Spark distribution is the 'server side' 
>>>>>>>>>>>>>>> and is necessarily a 'fat jar', a sort of statically-compiled 
>>>>>>>>>>>>>>> artifact that simply has some specific versions in them and can 
>>>>>>>>>>>>>>> never have different versions because of runtime resolution 
>>>>>>>>>>>>>>> differences. 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sun, Mar 29, 2026 at 10:02 PM Tian Gao via dev 
>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>> I agree that a product must be usable first. Pinning the 
>>>>>>>>>>>>>>>> version (to a specific number with `==`) will make pyspark 
>>>>>>>>>>>>>>>> unusable.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> First of all, I think we can agree that many users use PySpark 
>>>>>>>>>>>>>>>> with other Python packages. If we conflict with other 
>>>>>>>>>>>>>>>> packages, `pip install -r requirements.txt` won't work. It 
>>>>>>>>>>>>>>>> will complain that the dependencies can't be resolved, which 
>>>>>>>>>>>>>>>> completely breaks our user's workflow. Even if the user locks 
>>>>>>>>>>>>>>>> the dependency version, it won't work. So the user has to 
>>>>>>>>>>>>>>>> install PySpark first, then the other packages, to override 
>>>>>>>>>>>>>>>> PySpark's dependency. They can't put their dependency list in 
>>>>>>>>>>>>>>>> a single file - that is a horrible user experience.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> When I look at controversial topics, I always have a strong 
>>>>>>>>>>>>>>>> belief, that I can't be the only smart person in the world. If 
>>>>>>>>>>>>>>>> an idea is good, others must already be doing it. Can we find 
>>>>>>>>>>>>>>>> any recognized package in the market that pins its 
>>>>>>>>>>>>>>>> dependencies to a specific version? The only case it works is 
>>>>>>>>>>>>>>>> when this package is *all* the user needs. That's why we pin 
>>>>>>>>>>>>>>>> versions for docker images, HTTP services, or standalone tools 
>>>>>>>>>>>>>>>> - users just need something that works out of the box. If we 
>>>>>>>>>>>>>>>> consider PySpark the dominant package - meaning that if a user 
>>>>>>>>>>>>>>>> employs it, it must be the most important element in their 
>>>>>>>>>>>>>>>> project and everything else must comply with it - pinning 
>>>>>>>>>>>>>>>> versions might be viable.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I'm not familiar with Java dependency solutions or how users 
>>>>>>>>>>>>>>>> use spark with Java, but I'm familiar with the Python 
>>>>>>>>>>>>>>>> ecosystem and community. If we pin to a specific version, we 
>>>>>>>>>>>>>>>> will face significant criticism. If we must do it, at least 
>>>>>>>>>>>>>>>> don't make it default. Like I said above, I don't have a 
>>>>>>>>>>>>>>>> strong opinion about having a `pyspark[pinned]` - if users 
>>>>>>>>>>>>>>>> only need pyspark and no other packages they could use that. 
>>>>>>>>>>>>>>>> But that's extra effort for maintenance, and we need to think 
>>>>>>>>>>>>>>>> about what's pinned. We have a lot of pyspark install versions.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Tian Gao
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sun, Mar 29, 2026 at 7:12 PM Cheng Pan <[email protected] 
>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>> I think the community has already reached consensus to 
>>>>>>>>>>>>>>>>> freeze dependencies in minor releases.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> SPARK-54633 - SPIP: Accelerating Apache Spark Release Cadence 
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> > Clear rules for changes allowed in minor vs. major releases:
>>>>>>>>>>>>>>>>> > - Dependencies are frozen and behavioral changes are 
>>>>>>>>>>>>>>>>> > minimized in minor releases.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I would interpret the proposed dependency policy as applying to 
>>>>>>>>>>>>>>>>> both Java/Scala and Python dependency management for Spark. 
>>>>>>>>>>>>>>>>> If so, that means PySpark will always use pinned dependency 
>>>>>>>>>>>>>>>>> versions starting with 4.3.0. But if the intention is to only apply 
>>>>>>>>>>>>>>>>> such a dependency policy to Java/Scala, then it creates a 
>>>>>>>>>>>>>>>>> very strange situation - an extremely conservative dependency 
>>>>>>>>>>>>>>>>> management strategy for Java/Scala, and an extremely liberal 
>>>>>>>>>>>>>>>>> one for Python.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> To Tian Gao,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> > Pinning versions is a double-edged sword, it doesn't always 
>>>>>>>>>>>>>>>>> > make us more secure - that's my major point.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> A product must be usable first, then come security, performance, 
>>>>>>>>>>>>>>>>> etc. If it claims to require `foo>=2.0.0`, how do you ensure it 
>>>>>>>>>>>>>>>>> is compatible with foo `2.3.4`, `3.x.x`, `4.x.x`? Actually, 
>>>>>>>>>>>>>>>>> such incompatibility failures have occurred many times, e.g., [2]. 
>>>>>>>>>>>>>>>>> On the contrary, if it claims to require `foo==2.0.0`, that means 
>>>>>>>>>>>>>>>>> it was thoroughly tested with `foo==2.0.0`, and users take 
>>>>>>>>>>>>>>>>> their own risk when using it with other `foo` versions. For 
>>>>>>>>>>>>>>>>> example, if `foo` strictly follows semantic versioning, it 
>>>>>>>>>>>>>>>>> should work with `foo<3.0.0`, but this is not Spark's 
>>>>>>>>>>>>>>>>> responsibility; users should assess and assume the risk of 
>>>>>>>>>>>>>>>>> incompatibility themselves.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/SPARK-54633
>>>>>>>>>>>>>>>>> [2] https://github.com/apache/spark/pull/52633
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Cheng Pan
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Mar 28, 2026, at 06:59, Holden Karau 
>>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Response inline 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Fri, Mar 27, 2026 at 1:01 PM Nicholas Chammas 
>>>>>>>>>>>>>>>>>> <[email protected] 
>>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Mar 27, 2026, at 12:31 PM, Holden Karau 
>>>>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> One possibility would be to make the pinned version 
>>>>>>>>>>>>>>>>>>>> optional (eg pyspark[pinned]) or publish a separate 
>>>>>>>>>>>>>>>>>>>> constraints file for people to optionally use with -c?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Perhaps I am misunderstanding your proposal, Holden, but 
>>>>>>>>>>>>>>>>>>> this is possible today for people using modern Python 
>>>>>>>>>>>>>>>>>>> packaging workflows that use lock files. In fact, it 
>>>>>>>>>>>>>>>>>>> happens automatically; all transitive dependencies are 
>>>>>>>>>>>>>>>>>>> pinned in the lock file, and this is by design.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> So for someone installing a fresh venv with uv, pip, or conda, 
>>>>>>>>>>>>>>>>>> where does this come from?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The idea here is we provide the versions we used during the 
>>>>>>>>>>>>>>>>>> release stage so if folks want a “known safe” initial 
>>>>>>>>>>>>>>>>>> starting point for a new env they’ve got one.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Furthermore, it is straightforward to add additional 
>>>>>>>>>>>>>>>>>>> restrictions to your project spec (i.e. pyproject.toml) so 
>>>>>>>>>>>>>>>>>>> that when the packaging tool builds the lock file, it does 
>>>>>>>>>>>>>>>>>>> it with whatever restrictions you want that are specific to 
>>>>>>>>>>>>>>>>>>> your project. That could include specific versions or 
>>>>>>>>>>>>>>>>>>> version ranges of libraries to exclude, for example.
>>>>>>>>>>>>>>>>>> Yes, but as it stands we leave it to the end user to start 
>>>>>>>>>>>>>>>>>> from scratch picking these versions. We can make their lives 
>>>>>>>>>>>>>>>>>> simpler by providing the versions we tested against in a 
>>>>>>>>>>>>>>>>>> lock file they can choose to use, ignore, or update to their 
>>>>>>>>>>>>>>>>>> desired versions and include.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Also for interactive workloads I more often see a bare 
>>>>>>>>>>>>>>>>>> requirements file or even pip installs in nb cells (but this 
>>>>>>>>>>>>>>>>>> could be sample bias).
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I had to do this, for example, on a personal project that 
>>>>>>>>>>>>>>>>>>> used PySpark Connect but which was pulling in a version of 
>>>>>>>>>>>>>>>>>>> grpc that was generating a lot of log noise 
>>>>>>>>>>>>>>>>>>> <https://github.com/grpc/grpc/issues/38336#issuecomment-2588422915>.
>>>>>>>>>>>>>>>>>>>  I pinned the version of grpc in my project file and let 
>>>>>>>>>>>>>>>>>>> the packaging tool resolve all the requirements across 
>>>>>>>>>>>>>>>>>>> PySpark Connect and my custom restrictions.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Nick
>>>>>>>>>>>>>>>>>>> 
