I concur with you, Sean.

If I understand the point raised by the thread owner correctly, in the
heterogeneous environments we work in, it is up to the practitioner to
ensure version compatibility among the OS, the Spark version and the
target artefact in question. For example, if I try to connect to Google
BigQuery from Spark 3.4.0, my OS (or, for that matter, the Docker
container) needs to run Java 8 regardless of the Java version Spark
itself was built against; otherwise the connection will fail.

I think these details should be left to the trenches, because these
arguments about versioning become tangential in the big picture. Case in
point: my OS Scala version is 2.13.8, yet it works fine with a Spark
distribution built on Scala 2.12.17.
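
As a quick sanity check (a minimal sketch, assuming only that Spark is on
the classpath; the object name VersionCheck is purely illustrative), one
can print the Spark, Scala and Java versions actually in play before
wiring up any connector:

import org.apache.spark.sql.SparkSession

object VersionCheck {
  def main(args: Array[String]): Unit = {
    // Local session, used only to read the versions in play
    val spark = SparkSession.builder()
      .appName("version-check")
      .master("local[*]")
      .getOrCreate()

    // Spark version of the runtime
    println(s"Spark version             : ${spark.version}")
    // Scala version actually on the classpath, i.e. what Spark was built with
    println(s"Scala (classpath) version : ${scala.util.Properties.versionNumberString}")
    // JVM the driver is running on
    println(s"Java (driver JVM) version : ${System.getProperty("java.version")}")

    spark.stop()
  }
}

If the Scala or Java line disagrees with what the target artefact
expects, the incompatibility shows up before any job is even submitted.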

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 6 Jun 2023 at 01:37, Sean Owen <sro...@gmail.com> wrote:

> I think the issue is whether a distribution of Spark is so materially
> different from OSS that it causes problems for the larger community of
> users. There's a legitimate question of whether such a thing can be called
> "Apache Spark + changes", as describing it that way becomes meaningfully
> inaccurate. And if it's inaccurate, then it's a trademark usage issue, and
> a matter for the PMC to act on. I certainly recall this type of problem
> from the early days of Hadoop - the project itself had 2 or 3 live branches
> in development (was it 0.20.x vs 0.23.x vs 1.x? YARN vs no YARN?) picked up
> by different vendors and it was unclear what "Apache Hadoop" meant in a
> vendor distro. Or frankly, upstream.
>
> In comparison, variation in Scala maintenance release seems trivial. I'm
> not clear from the thread what actual issue this causes for users. Is there
> more to it - does this go hand in hand with JDK version and Ammonite, or
> are those separate? What's an example of the practical user issue? Like, I
> compile vs Spark 3.4.0 and because of Scala version differences it doesn't
> run on some vendor distro? That's not great, but seems like a vendor
> problem. Unless you tell me we are getting tons of bug reports to OSS Spark
> as a result or something.
>
> Is the implication that something in OSS Spark is being blocked to prefer
> some set of vendor choices? Because the changes you're pointing to seem to
> be going into Apache Spark, actually. It'd be more useful to be specific
> and name names at this point; that seems fine.
>
> The rest of this is just a discussion about Databricks choices. (If it's
> not clear, I'm at Databricks but do not work on the Spark distro). We can
> discuss but it seems off-topic _if_ it can't be connected to a problem for
> OSS Spark. Anyway:
>
> If it helps, _some_ important patches are described at
> https://docs.databricks.com/release-notes/runtime/maintenance-updates.html
> ; I don't think this is exactly hidden.
>
> Out of curiosity, how would you describe this software in the UI instead?
> "3.4.0" is shorthand, because this is a little dropdown menu; the terminal
> output is likewise not a place to list all patches. Would you propose
> calling this "3.4.0 + patches"? That's the best I can think of,
> but I don't think it addresses what you're getting at anyway. I think you'd
> just prefer Databricks make a different choice, which is legitimate, but,
> an issue to take up with Databricks, not here.
>
>
> On Mon, Jun 5, 2023 at 6:58 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, Sean.
>>
>> "+ patches" or "powered by Apache Spark 3.4.0" is not a problem as you
>> mentioned. For the record, I also didn't bring up any old story here.
>>
>> > "Apache Spark 3.4.0 + patches"
>>
>> However, "including Apache Spark 3.4.0" still causes confusion, albeit in
>> a different way, because of those missing patches, SPARK-40436 (Upgrade
>> Scala to 2.12.17) and SPARK-39414 (Upgrade Scala to 2.12.16). Technically,
>> Databricks Runtime doesn't include Apache Spark 3.4.0, even though it
>> claims to the users that it does.
>>
>> [image: image.png]
>>
>> It's a sad story from the Apache Spark Scala perspective because the
>> users cannot even try to use the correct Scala 2.12.17 version in the
>> runtime.
>>
>> All the items I've shared are connected by a single theme: hurting Apache
>> Spark Scala users, from (1) building Spark, to (2) creating a fragmented
>> Scala Spark runtime environment, to (3) hidden user-facing documentation.
>>
>> Of course, I don't think those were designed in an organized way
>> intentionally; it all just happened at the same time.
>>
>> Based on your comments, let me ask you two questions. (1) When Databricks
>> builds its internal Spark from its private code repository, is it company
>> policy to always expose "Apache 3.4.0" to the users, as in the following,
>> while ignoring all changes (whatever they are)? And (2) do you insist that
>> this is normative and clear to the users and the community?
>>
>> > - The runtime logs "23/06/05 04:23:27 INFO SparkContext: Running Spark
>> version 3.4.0"
>> > - UI shows Apache Spark logo and `3.4.0`.
>>
