Re: Using bundler for Jekyll?

2021-02-12 Thread Sean Owen
Seems fine to me. How about just regenerating the whole site once with the
latest version and requiring that?

On Fri, Feb 12, 2021 at 7:09 AM attilapiros 
wrote:

> I run into the same problem today and tried to find the version where the
> diff is minimal, so I wrote a script:
>
> ```
> #!/bin/zsh
>
> versions=('3.7.3' '3.7.2' '3.7.0' '3.6.3' '3.6.2' '3.6.1' '3.6.0' '3.5.2'
> '3.5.1' '3.5.0' '3.4.5' '3.4.4' '3.4.3' '3.4.2' '3.4.1' '3.4.0')
>
> for i in $versions; do
>   # start from a clean slate: remove every installed jekyll/rouge version first
>   gem uninstall -a -x jekyll rouge
>   gem install jekyll --version $i
>   jekyll build
>   # how far does this version's output drift from the committed site?
>   git diff --stat
>   git reset --hard HEAD
> done
> ```
>
> Based on this, the best version is jekyll 3.6.3:
>
> ```
>  site/community.html |  2 +-
>  site/sitemap.xml    | 14 +++---
>  2 files changed, 8 insertions(+), 8 deletions(-)
> ```
>
> What about changing the README.md [1] and specifying this exact version?
>
> Moreover, the install command could be changed to:
>
> ```
>  gem install jekyll --version 3.6.3
> ```
>
> This installs the right rouge version, as it is a dependency.
>
> Finally, I would also give this command as a prerequisite:
>
> ```
>   gem uninstall -a -x jekyll rouge
> ```
>
> This is because gem keeps all installed versions while only one of them is used.
>
>
> [1]
>
> https://github.com/apache/spark-website/blob/6a5fc2ccaa5ad648dc0b25575ff816c10e648bdf/README.md#L5
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
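
Since the subject here is bundler, one way to pin whatever version gets settled on would be a small Gemfile in spark-website. A rough sketch (the 3.6.3 below is only the version suggested above, not a decision):

```
# create a Gemfile pinning jekyll, then build through bundler so everyone gets the same version
cat > Gemfile <<'EOF'
source 'https://rubygems.org'
gem 'jekyll', '3.6.3'
gem 'rouge'
EOF
bundle install
bundle exec jekyll build
```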


Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Sean Owen
Sounds like a fine time to me, sure.

On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> As of today, `branch-3.0` has 307 patches (including 25 correctness
> patches) since v3.0.1 tag (released on September 8th, 2020).
>
> Since we have stabilized branch-3.0 during the 3.1.x preparation,
> it would be great to start the Apache Spark 3.0.2 release next week.
> And, I'd like to volunteer for Apache Spark 3.0.2 release manager.
>
> What do you think about the Apache Spark 3.0.2 release?
>
> Bests,
> Dongjoon.
>
>
> --
> SPARK-31511 Make BytesToBytesMap iterator() thread-safe
> SPARK-32635 When pyspark.sql.functions.lit() function is used with
> dataframe cache, it returns wrong result
> SPARK-32753 Deduplicating and repartitioning the same column create
> duplicate rows with AQE
> SPARK-32764 compare of -0.0 < 0.0 return true
> SPARK-32840 Invalid interval value can happen to be just adhesive with the
> unit
> SPARK-32908 percentile_approx() returns incorrect results
> SPARK-33019 Use
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
> SPARK-33183 Bug in optimizer rule EliminateSorts
> SPARK-33260 SortExec produces incorrect results if sortOrder is a Stream
> SPARK-33290 REFRESH TABLE should invalidate cache even though the table
> itself may not be cached
> SPARK-33358 Spark SQL CLI command processing loop can't exit while one
> command fails
> SPARK-33404 "date_trunc" expression returns incorrect results
> SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
> SPARK-33591 NULL is recognized as the "null" string in partition specs
> SPARK-33593 Vector reader got incorrect data with binary partition value
> SPARK-33726 Duplicate field names causes wrong answers during aggregation
> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
> SPARK-34187 Use available offset range obtained during polling when
> checking offset validation
> SPARK-34212 For parquet table, after changing the precision and scale of
> decimal type in hive, spark reads incorrect value
> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
> SPARK-34229 Avro should read decimal values with the file schema
> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
>


Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread Sean Owen
I think I'm +1 on this, in that I don't see any more test failures than I
usually do, and I think they're due to my local env, but is anyone seeing
these failures?
- includes jars passed in through --jars *** FAILED ***
  Process returned with exit code 1. See the log4j logs for more detail.
(SparkSubmitSuite.scala:1517)
- includes jars passed in through --packages *** FAILED ***
  Process returned with exit code 1. See the log4j logs for more detail.
(SparkSubmitSuite.scala:1517)
- includes jars passed through spark.jars.packages and
spark.jars.repositories *** FAILED ***
  Process returned with exit code 1. See the log4j logs for more detail.
(SparkSubmitSuite.scala:1517)
- correctly builds R packages included in a jar with --packages !!! IGNORED
!!!
- include an external JAR in SparkR *** FAILED ***
  Process returned with exit code 1. See the log4j logs for more detail.
(SparkSubmitSuite.scala:1517)



- SPARK-8368: includes jars passed in through --jars *** FAILED ***
  spark-submit returned with exit code 1.
  Command line: './bin/spark-submit' '--class'
'org.apache.spark.sql.hive.SparkSubmitClassLoaderTest' '--name'
'SparkSubmitClassLoaderTest' '--master' 'local-cluster[2,1,1024]' '--conf'
'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false'
'--driver-java-options' '-Dderby.system.durability=test' '--jars'
'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-d5238bae-f0c8-4e26-8e0d-e7fc3a830de4/testJar-1613607380770.jar,file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-5007fa06-28c3-4816-afe0-09f5885a201c/testJar-1613607380989.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-contrib-2.3.7.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-hcatalog-core-2.3.7.jar'
'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-27131c51-8387-455a-a23d-e5c41e5448a3/testJar-1613607380546.jar'
'SparkSubmitClassA' 'SparkSubmitClassB'



 external shuffle service *** FAILED ***
  FAILED did not equal FINISHED (stdout/stderr was not captured)
(BaseYarnClusterSuite.scala:199)
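
(For anyone comparing environments: one way to re-run just the suites above locally is sketched below. The sbt project/suite names are my best guess, and the profile list should match whatever your full build uses.)

```
# re-run only the suites mentioned above
./build/sbt -Phive -Pyarn \
  "core/testOnly org.apache.spark.deploy.SparkSubmitSuite" \
  "hive/testOnly org.apache.spark.sql.hive.HiveSparkSubmitSuite"
```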


On Tue, Feb 16, 2021 at 1:52 AM Dongjoon Hyun 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.0.2.
>
> The vote is open until February 19th 9AM (PST) and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.0.2-rc1 (commit
> 648457905c4ea7d00e3d88048c63f360045f0714):
> https://github.com/apache/spark/tree/v3.0.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1366/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/
>
> The list of bug fixes going into 3.0.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12348739
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.2?
> ===
>
> The current list of open tickets targeted at 3.0.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
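
(A concrete way to do the "add the staging repository to your resolvers" check above, sketched as a one-off smoke test rather than a real project change; the artifact coordinate is just an example.)

```
# resolve a staged 3.0.2 artifact through the RC staging repo using this release's spark-shell
./bin/spark-shell \
  --repositories https://repository.apache.org/content/repositories/orgapachespark-1366/ \
  --packages org.apache.spark:spark-avro_2.12:3.0.2
# clean up afterwards so later builds don't pick up a stale RC from the ivy cache
rm -rf ~/.ivy2/cache/org.apache.spark
rm -f ~/.ivy2/jars/org.apache.spark_spark-avro_2.12-*.jar
```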


Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread Sean Owen
I'm on Ubuntu 20, Java 8, Maven, with most every profile enabled (Hive,
YARN, Mesos, K8S, SparkR, etc). I think it's probably transient or specific
to my env; just checking if anyone else sees this. Obviously the main test
builds do not fail on Jenkins.

On Wed, Feb 17, 2021 at 10:47 PM Dongjoon Hyun 
wrote:

> I didn't see them. Could you describe your environment: OS, Java,
> Maven/SBT, profiles?
>
> On Wed, Feb 17, 2021 at 6:26 PM Sean Owen  wrote:
>
>> I think I'm +1 on this, in that I don't see any more test failures than I
>> usually do, and I think they're due to my local env, but is anyone seeing
>> these failures?
>> - includes jars passed in through --jars *** FAILED ***
>>   Process returned with exit code 1. See the log4j logs for more detail.
>> (SparkSubmitSuite.scala:1517)
>> - includes jars passed in through --packages *** FAILED ***
>>   Process returned with exit code 1. See the log4j logs for more detail.
>> (SparkSubmitSuite.scala:1517)
>> - includes jars passed through spark.jars.packages and
>> spark.jars.repositories *** FAILED ***
>>   Process returned with exit code 1. See the log4j logs for more detail.
>> (SparkSubmitSuite.scala:1517)
>> - correctly builds R packages included in a jar with --packages !!!
>> IGNORED !!!
>> - include an external JAR in SparkR *** FAILED ***
>>   Process returned with exit code 1. See the log4j logs for more detail.
>> (SparkSubmitSuite.scala:1517)
>>
>>
>>
>> - SPARK-8368: includes jars passed in through --jars *** FAILED ***
>>   spark-submit returned with exit code 1.
>>   Command line: './bin/spark-submit' '--class'
>> 'org.apache.spark.sql.hive.SparkSubmitClassLoaderTest' '--name'
>> 'SparkSubmitClassLoaderTest' '--master' 'local-cluster[2,1,1024]' '--conf'
>> 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false'
>> '--driver-java-options' '-Dderby.system.durability=test' '--jars'
>> 'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-d5238bae-f0c8-4e26-8e0d-e7fc3a830de4/testJar-1613607380770.jar,file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-5007fa06-28c3-4816-afe0-09f5885a201c/testJar-1613607380989.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-contrib-2.3.7.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-hcatalog-core-2.3.7.jar'
>> 'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-27131c51-8387-455a-a23d-e5c41e5448a3/testJar-1613607380546.jar'
>> 'SparkSubmitClassA' 'SparkSubmitClassB'
>>
>>
>>
>>  external shuffle service *** FAILED ***
>>   FAILED did not equal FINISHED (stdout/stderr was not captured)
>> (BaseYarnClusterSuite.scala:199)
>>
>>
>> On Tue, Feb 16, 2021 at 1:52 AM Dongjoon Hyun 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.0.2.
>>>
>>> The vote is open until February 19th 9AM (PST) and passes if a majority
>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.0.2
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v3.0.2-rc1 (commit
>>> 648457905c4ea7d00e3d88048c63f360045f0714):
>>> https://github.com/apache/spark/tree/v3.0.2-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1366/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/
>>>
>>> The list of bug fixes going into 3.0.2 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12348739
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, y

Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-18 Thread Sean Owen
I think it's OK to raise particular instances. It's hard for me to evaluate
further in the abstract.

I don't think we use Assignee much at all, except to kinda give credit when
something is done. No piece of code or work can be solely owned by one
person; this is just ASF policy.

I think we've seen the occasional opposite case too: someone starts working
on an issue, and then someone else also starts working on it with a
competing fix or change.

These are ultimately issues of communication. If an issue is pretty
stalled, and you have a proposal, nothing wrong with just going ahead with
a proposal. There may be no disagreement. It might result in the
other person joining your PR. As I say, not sure if there's a deeper issue
than that if even this hasn't been tried?

On Mon, Feb 15, 2021 at 8:35 PM Jungtaek Lim 
wrote:

> Thanks for the input, Hyukjin!
>
> I have been keeping my own policy among all discussions I have raised - I
> would provide the hypothetical example closer to the actual one and avoid
> pointing out directly. The main purpose of the discussion is to ensure our
> policy / consensus makes sense, no more. I can provide a more detailed
> explanation if someone feels the explanation wasn't sufficient to
> understand.
>
> Probably this discussion could play as a "reminder" to every committers if
> similar discussion was raised before and it succeeded to build consensus.
> If there's some point we don't build consensus yet, it'd be a good time to
> discuss further. I don't know what exactly was the discussion and the
> result so what is new here, but I guess this might be a duplicated one as
> you say similar issue.
>
>
>
> On Tue, Feb 16, 2021 at 11:09 AM Hyukjin Kwon  wrote:
>
>> I remember I raised a similar issue a long time ago in the dev mailing
>> list. I agree that setting no assignee makes sense in most of the cases,
>> and also think we share similar thoughts about the assignee on
>> umbrella JIRAs, followup tasks, the case when it's clear with a design doc,
>> etc.
>> It makes me think that the actual issue by setting an assignee happens
>> rarely, and it is an issue to several specific cases that would need a look
>> case-by-case.
>> Were there specific cases that made you concerned?
>>
>>
>> 2021년 2월 15일 (월) 오전 8:58, Jungtaek Lim 님이
>> 작성:
>>
>>> Hi devs,
>>>
>>> I'd like to raise a discussion and hear voices on the "assignee"
>>> practice on committers which may lead issues on preemption.
>>>
> I feel this is one of the major unfairnesses between contributors and
>>> committers if used improperly, especially when someone assigns themselves
>>> with multiple JIRA issues.
>>>
>>> Let's say there're features A and B, which may take a month for each (or
>>> require design doc) - both are individual major features, not subtasks or
>>> some sort of "follow-up".
>>>
> Technically, committers can file two JIRA issues and assign both of the
> issues, "without actually making any progress", and implicitly ensure no one
> works on these issues for a couple of months. Even just a plan on the backlog
> can prevent others from taking them up.
>>>
> I don't think this is fair to contributors, because contributors don't
> tend to file a JIRA issue unless they have made a lot of progress. (I'd like to
>>> remind you, competition from contributor's position is quite tense and
>>> stressful.) Say they already spent a month working on it and testing it in
>>> production. They feel ready and visit JIRA, and realize the JIRA issue was
>>> made and assigned to someone, while there's no progress on the JIRA issue.
>>> No idea how much progress "someone" makes. They "might" ask about the
>>> progress, but nothing will change if "someone" simply says "I'm still
>>> working on this" (with even 1% of progress). Isn't this actually against
>>> the reason we don't allow setting assignee to contributor?
>>>
>>> For sure, assigning the issue would make sense if the issue is a subtask
>>> or follow-up, or the issue made explicit progress like design doc is being
>>> put. In other cases I don't see any reason assigning the issue explicitly.
>>> Someone may say to contributors, just leave a comment "I'm working on it",
>>> but isn't it also something committers can also do when they are "actually"
>>> working?
>>>
>>> I think committers should have no advantage on the possible competition
>>> on contribution, and setting assignee without explicit progress makes me
>>> worried.
>>> To make it fair, either we should allow contributors to assign them or
>>> don't allow committers to assign them unless extreme cases - they can still
>>> use the approach contributors do.
>>> (Again I'd feel OK to assign if there's a design doc proving that they
>>> really spent non-trivial effort already. My point is preempting JIRA issues
>>> with only sketched ideas or even just rationalizations.)
>>>
>>> Would like to hear everyone's voices.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> ps. better yet, probably it's better the

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Sean Owen
Holden is absolutely correct - pinging relevant individuals is probably
your best bet. I skim the 40-50 PRs that have activity each day and look
into a few that look like I would know something about by the title, but,
easy to miss something I could weigh in on.

There is no way to force people to review or commit something of course.
And keep in mind we get a lot of, shall we say, unuseful pull requests.
There is occasionally some blowback to closing someone's PR, so the path of
least resistance is often the timeout / 'soft close'. That is, it takes a
lot more time to satisfactorily debate down the majority of PRs that
probably shouldn't get merged, and there just isn't that much bandwidth.
That said of course it's bad if lots of good PRs are getting lost in the
shuffle and I am sure there are some.

One other aspect is that a committer is taking some degree of
responsibility for merging a change, so the ask is more than just a few
minutes of eyeballing. If it breaks something the merger pretty much owns
resolving it, and, the whole project owns any consequence of the change for
the future.

I think it might just be committers that can reopen at this point, not sure
if that changed. But you'll probably need someone's attention anyway to
make progress.

Without knowing specific PRs, I can't say whether there was a good reason,
bad reason, or no particular reason it wasn't engaged. I think it's OK to
float a PR or two you really believe should have gotten attention to dev@,
but yeah in the end you need to find the person who has most touched that
code really.

The general advice from https://spark.apache.org/contributing.html is still
valuable. Clear fixes are easier to say 'yes' to than big refactorings.
Improving docs, tests, existing features is better than adding big new
things, etc.


On Thu, Feb 18, 2021 at 8:58 AM Enrico Minack 
wrote:

> Hi Spark Developers,
>
> I have a fundamental question on the process of contributing to Apache
> Spark from outside the circle of committers.
>
> I have gone through a number of pull requests and I always found it hard
> to get feedback, especially from committers. I understand there is a very
> high competition for getting attention of those few committers. Given
> Spark's code base is so huge, only very few people will feel comfortable
> approving code changes for a specific code section. Still, the motivation
> of those that want to contribute suffers from this.
>
> In particular I am getting annoyed by the auto-closing PR feature on
> GitHub. I understand the usefulness of this feature for such a frequent
> project, but I personally am impacted by the weaknesses of this approach. I
> hope, this can be improved.
>
> The feature first warns in advance that it is "... closing this PR because
> it hasn't been updated in a while". This comment looks a bit silly in
> situations where the contributor is waiting for committers' feedback.
>
> *What is the approved way to ...*
>
> *... prevent it from being auto-closed?* Committing and commenting to the
> PR does not prevent it from being closed the next day.
> *...** re-open it? *The comment says "If you'd like to revive this PR,
> please reopen it ...", but there is no re-open button anywhere on the PR!
>
> *... remove the Stale tag?* The comment says "...  ask a committer to
> remove the Stale tag!". Where can I find a list of committers and their
> contact details? What is the best way to contact them? E-Mail? Mentioning
> them in a PR comment?
>
> *... find the right committer to review a PR?* The contributors page
> states "ping likely reviewers", but it does not state how to identify
> likely reviewers. Do you recommend git-blaming the relevant code section?
> What if those committers are not available any more? Whom to ask next?
>
> *... contact committers to get their attention?* Cc'ing them in PR
> comments? Sending E-Mails? Doesn't that contribute to their cognitive load?
>
> What is the expected contributor's response to a PR that does not get
> feedback? Giving up?
>
> Are there processes in place to increase the probability PRs do not get
> forgotten, auto-closed and lost?
>
>
> This is not about my specific pull requests or reviewers of those. I
> appreciate their time and engagement.
> This is about the general process of getting feedback and needed
> improvements for it in order to increase contributor community happiness.
>
> Cheers,
> Enrico
>


Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-18 Thread Sean Owen
I don't believe Assignee has ever been used for anything except to give a
bit of informal credit to the person who drove most of the work on the
issue, when it's resolved.
If that's the question - does Assignee mean only that person can work on
the issue? then no, it has never meant that.

You say you have an example, one that was resolved. Is this a single case
or systemic? I don't think I recall seeing problems of this form.

We _have_ had multiple incompatible PRs for a JIRA before, occasionally.
We have also definitely had people file huge umbrella JIRAs, parts of which
_nobody_ ever completes, but, for lack of any interest from the filer or
anyone else.

I think it's fair to give a person a reasonable shot at producing a
solution if they propose a problem or feature.
We have had instances where a new contributor files a relatively simple
issue, and finds another contributor opened the obvious PR before they had
a chance (maybe they needed a day to get the PR together). That seemed a
bit discourteous.

 If you need a solution as well, and one isn't forthcoming, just open a PR
and propose your own? I don't hear that anyone told you not to, but I also
don't know what this is about. You can always propose a PR as an
alternative to compare with, to facilitate collaboration. Nothing wrong
with that.

On Thu, Feb 18, 2021 at 10:45 PM Jungtaek Lim 
wrote:

> (Actually the real world case was fixed somehow and I wouldn't like to
> point out a fixed one. I just would like to make sure what I think is
> correct and is considered as "consensus".)
>
> Just consider the case as simple - someone files two different JIRA issues
> for new features and assigns to him/herself altogether, without sharing
> anything about the ongoing efforts someone has made. (So you have no idea
> even someone just files two different JIRA issues without "any" progress
> and has them in a backlog.) The new features are not new and are likely
> something others could work in parallel.
>
> That said, committers can explicitly represent "I'm working on this so
> please refrain from making redundant efforts." via assigning the issue,
> which is actually similar to the comment "I'm working on this".
> Unfortunately, this works only when the feature is something only the person
> who filed the JIRA issue is working on. Occasional opposite cases aren't always
> a matter of ignoring the "I'm working on this" signal; there are also
> coincidences where two different individuals/teams are working on exactly the
> same thing at the same time.
>
> My concern is that "assignment" might be considered pretty much stronger
> than just commenting "I'm working on this" - it's like "Regardless of your
> current progress, I started working on this so don't consider your effort
> to be proposable. You should have filed the JIRA issue before I file one."
> Is it possible for contributors to do the same? I guess not.
>
> The other problem is the multiple assignments in parallel. I wouldn't like
> to guess someone over-uses the power of assignments, but technically it's
> simply possible that someone can file JIRA issues on his/her backlog which
> can be done in a couple of months or so with assigning to him/herself,
> which effectively blocks others from working or proposing the same. I
> consider this as preemptive which sounds bad and even unfair.
>
> On Fri, Feb 19, 2021 at 12:14 AM Sean Owen  wrote:
>
>> I think it's OK to raise particular instances. It's hard for me to
>> evaluate further in the abstract.
>>
>> I don't think we use Assignee much at all, except to kinda give credit
>> when something is done. No piece of code or work can be solely owned by one
>> person; this is just ASF policy.
>>
>> I think we've seen the occasional opposite case too: someone starts
>> working on an issue, and then someone else also starts working on it with a
>> competing fix or change.
>>
>> These are ultimately issues of communication. If an issue is pretty
>> stalled, and you have a proposal, nothing wrong with just going ahead with
>> a proposal. There may be no disagreement. It might result in the
>> other person joining your PR. As I say, not sure if there's a deeper issue
>> than that if even this hasn't been tried?
>>
>> On Mon, Feb 15, 2021 at 8:35 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks for the input, Hyukjin!
>>>
>>> I have been keeping my own policy among all discussions I have raised -
>>> I would provide the hypothetical example closer to the actual one and avoid
>>> pointi

Re: Java Code Style

2021-02-20 Thread Sean Owen
Do you just mean you want to adjust the code style rules? Yes you can do
that in IJ, just a matter of finding the indent rule to adjust.
The Spark style is pretty normal stuff, though not 100% consistent. I prefer
the first style in this case. Sometimes it's a matter of judgment when to
differ from a standard style for better readability.

On Sat, Feb 20, 2021 at 8:53 AM Pis Kevin  wrote:

> Hi,
>
>
>
> I use the Google Java code style in IntelliJ IDEA. But when I reformat the
> following code, it's inconsistent with the code in Spark.
>
>
>
> Before reformat:
>
>
>
> After reformat:
>
>
>
>
>
> Why? And how can I fix the issue?
>


Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-22 Thread Sean Owen
+1 LGTM, same results as last time. Does anyone see the error below? It is
probably env-specific as the Jenkins jobs don't hit this. Just checking.

 SPARK-29604 external listeners should be initialized with Spark
classloader *** FAILED ***
  java.lang.RuntimeException: [download failed:
tomcat#jasper-compiler;5.5.23!jasper-compiler.jar, download failed:
tomcat#jasper-runtime;5.5.23!jasper-runtime.jar, download failed:
commons-el#commons-el;1.0!commons-el.jar, download failed:
org.apache.hive#hive-exec;2.3.7!hive-exec.jar]
  at
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1420)
  at
org.apache.spark.sql.hive.client.IsolatedClientLoader$.$anonfun$downloadVersion$2(IsolatedClientLoader.scala:122)
  at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
  at
org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:122)
  at
org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:64)
  at
org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:63)
  at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:439)
  at
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:352)
  at
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:71)
  at
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:70)

On Mon, Feb 22, 2021 at 12:57 AM Hyukjin Kwon  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.1.1.
>
> The vote is open until February 24th 11PM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.1.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.1.1-rc3 (commit
> 1d550c4e90275ab418b9161925049239227f3dc9):
> https://github.com/apache/spark/tree/v3.1.1-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> 
> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1367
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>
> The list of bug fixes going into 3.1.1 can be found at the following URL:
> https://s.apache.org/41kf2
>
> This release is using the release script of the tag v3.1.1-rc3.
>
> FAQ
>
> ===
> What happened to 3.1.0?
> ===
>
> There was a technical issue during Apache Spark 3.1.0 preparation, and it
> was discussed and decided to skip 3.1.0.
> Please see
> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
> more details.
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC via "pip install
> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz
> "
> and see if anything important breaks.
> In the Java/Scala, you can add the staging repository to your projects
> resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.1.1?
> ===
>
> The current list of open tickets targeted at 3.1.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.1.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>


Re: Auto-closing PRs or How to get reviewers' attention

2021-02-23 Thread Sean Owen
Yes, committers are added regularly. I don't think that changes the
situation for most PRs that perhaps just aren't suitable to merge.
Again the best thing you can do is make it as easy to merge as possible and
find people who have touched the code for review. This often works out.

On Tue, Feb 23, 2021 at 4:06 AM Enrico Minack 
wrote:

> Am 18.02.21 um 16:34 schrieb Sean Owen:
> > One other aspect is that a committer is taking some degree of
> > responsibility for merging a change, so the ask is more than just a
> > few minutes of eyeballing. If it breaks something the merger pretty
> > much owns resolving it, and, the whole project owns any consequence of
> > the change for the future.
>
> I think this explains the hesitation pretty well: Committers take
> ownership of the change. It is understandable that PRs then have to be
> very convincing with low risk/benefit ratio.
>
> Are there plans or initiatives to proactively widen the base of
> committers to mitigate the current situation?
>
> Enrico
>
>


Re: K8s integration test failure ("credentials Jenkins is using is probably wrong...")

2021-02-23 Thread Sean Owen
Shane would you know? May be a problem with a single worker.

On Tue, Feb 23, 2021 at 8:46 AM Phillip Henry 
wrote:

>
> Hi,
>
> Silly question: the Jenkins build for my PR is failing but it seems
> outside of my control. What must I do to remedy this?
>
> I've submitted
>
> https://github.com/apache/spark/pull/31535
>
> but Spark QA is telling me "Kubernetes integration test status failure".
>
> The Jenkins job says "SUCCESS" but also barfs with:
>
> FileNotFoundException means that the credentials Jenkins is using is probably 
> wrong. Or the user account does not have write access to the repo.
>
>
> See
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/consoleFull
>
> Can anybody please advise?
>
> Thanks in advance.
>
> Phillip
>
>
>


Re: Apache Spark 3.2 Expectation

2021-02-25 Thread Sean Owen
I'd roughly expect 3.2 in, say, July of this year, given the usual cadence.
No reason it couldn't be a little sooner or later. There is already some
good stuff in 3.2 and will be a good minor release in 5-6 months.

On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch since
> December 2020, March seems to be a good time to share our thoughts and
> aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
> seems to be the last minor release of this year. Given the timeframe, we
> might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but slipped
> out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and
> investigating the publishing issue. Thank you for your contributions and
> feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If it
> succeeds to revive it, we can keep publishing. Otherwise, I believe we had
> better drop it from the releasing work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default instead
> of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely via
> SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s client
> dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to support
> K8s model 1.19.
>
> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
> with Kafka Client 2.8 hopefully.
>
> # Some Features
>
> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
> Iceberg integration. Especially, we hope the on-going function catalog SPIP
> and up-coming storage partitioned join SPIP can be delivered as a part of
> Spark 3.2 and become an additional foundation.
>
> - Columnar Encryption: As of today, Apache Spark master branch supports
> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
> Apache Spark 3.2 is going to be the first release to have this feature
> officially. Any feedback is welcome.
>
> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
> too. I'm expecting more benefits.
>
> - Structured Streaming with RocksDB backend: According to the latest
> update, it looks active enough for merging to master branch in Spark 3.2.
>
> Please share your thoughts and let's build better Apache Spark 3.2
> together.
>
> Bests,
> Dongjoon.
>
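
(A small illustration of trying the ZSTD items above on a 3.2 snapshot. The config/option names reflect my reading of SPARK-34503 and SPARK-33978 and are worth double-checking against the docs.)

```
# event log compression with ZSTD (SPARK-34503 makes zstd the default codec in 3.2),
# plus ZSTD as the ORC compression codec (SPARK-33978); names are best-effort
./bin/spark-shell \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.compress=true \
  --conf spark.eventLog.compression.codec=zstd \
  --conf spark.sql.orc.compression.codec=zstd
```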


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Sean Owen
For reference, 2.3.x was maintained from February 2018 (2.3.0) to Sep 2019
(2.3.4), or about 19 months. The 2.4 branch should probably be maintained
longer than that, as the final 2.x branch. 2.4.0 was released in Nov 2018.
A final release in, say, April 2021 would be about 30 months. That feels
about right timing-wise.

We should in any event release 2.4.8, yes. We can of course choose to
release a 2.4.9 if some critical issue is found, later.

But yeah based on the velocity of back-ports to 2.4.x, it seems about time
to call it EOL.

Sean


On Wed, Mar 3, 2021 at 12:05 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> We successfully completed Apache Spark 3.1.1 and 3.0.2 releases and
> started 3.2.0 discussion already.
>
> Let's talk about branch-2.4 because there exists some discussions on JIRA
> and GitHub about skipping backporting to 2.4.
>
> Since `branch-2.4` has been maintained well as LTS, I'd like to suggest
> having an Apache Spark 2.4.8 release as the official EOL release of the 2.4 line
> in order to focus more on 3.x from now on. Please note that `branch-2.4` will
> be frozen officially like `branch-2.3` after EOL release.
>
> - Apache Spark 2.4.0 was released on November 2, 2018.
> - Apache Spark 2.4.7 was released on September 12, 2020.
> - Since v2.4.7 tag, `branch-2.4` has 134 commits including the following
> 12 correctness issues.
>
> ## CORRECTNESS ISSUE
> SPARK-30201 HiveOutputWriter standardOI should use
> ObjectInspectorCopyOption.DEFAULT
> SPARK-30228 Update zstd-jni to 1.4.4-3
> SPARK-30894 The nullability of Size function should not depend on
> SQLConf.get
> SPARK-32635 When pyspark.sql.functions.lit() function is used with
> dataframe cache, it returns wrong result
> SPARK-32908 percentile_approx() returns incorrect results
> SPARK-33183 Bug in optimizer rule EliminateSorts
> SPARK-33290 REFRESH TABLE should invalidate cache even though the table
> itself may not be cached
> SPARK-33593 Vector reader got incorrect data with binary partition value
> SPARK-33726 Duplicate field names causes wrong answers during aggregation
> SPARK-34187 Use available offset range obtained during polling when
> checking offset validation
> SPARK-34212 For parquet table, after changing the precision and scale of
> decimal type in hive, spark reads incorrect value
> SPARK-34229 Avro should read decimal values with the file schema
>
> ## SECURITY ISSUE
> SPARK-3 Upgrade Jetty to 9.4.28.v20200408
> SPARK-33831 Update to jetty 9.4.34
> SPARK-34449 Upgrade Jetty to fix CVE-2020-27218
>
> What do you think about this?
>
> Bests,
> Dongjoon.
>


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Sean Owen
Sure, I'm even arguing that 2.4.8 could possibly be the final release. No
objection of course to continuing to backport to 2.4.x where appropriate
and cutting 2.4.9 later in the year as a final EOL release, either.

On Wed, Mar 3, 2021 at 12:59 PM Dongjoon Hyun 
wrote:

> Thank you, Sean.
>
> Ya, exactly, we can release 2.4.8 as a normal release first and use 2.4.9
> as the EOL release.
>
> Since 2.4.7 was released almost 6 months ago, 2.4.8 is a little late in
> terms of the cadence.
>
> Bests,
> Dongjoon.
>
>
> On Wed, Mar 3, 2021 at 10:55 AM Sean Owen  wrote:
>
>> For reference, 2.3.x was maintained from February 2018 (2.3.0) to Sep
>> 2019 (2.3.4), or about 19 months. The 2.4 branch should probably be
>> maintained longer than that, as the final 2.x branch. 2.4.0 was released in
>> Nov 2018. A final release in, say, April 2021 would be about 30 months.
>> That feels about right timing-wise.
>>
>> We should in any event release 2.4.8, yes. We can of course choose to
>> release a 2.4.9 if some critical issue is found, later.
>>
>> But yeah based on the velocity of back-ports to 2.4.x, it seems about
>> time to call it EOL.
>>
>> Sean
>>
>


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Sean Owen
I like koalas a lot. Playing devil's advocate, why not just let it continue
to live as an add on? Usually the argument is it'll be maintained better in
Spark but it's well maintained. It adds some overhead to maintaining Spark
conversely. On the upside it makes it a little more discoverable. Are there
more 'synergies'?

On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon  wrote:

> Hi all,
>
> I would like to start the discussion on supporting pandas API layer on
> Spark.
>
>
>
> If we have a general consensus on having it in PySpark, I will initiate
> and drive an SPIP with a detailed explanation about the implementation’s
> overview and structure.
>
> I would appreciate it if I can know whether you guys support this or not
> before starting the SPIP.
> What do you want to propose?
>
> I have been working on the Koalas 
> project that is essentially: pandas API support on Spark, and I would like
> to propose embracing Koalas in PySpark.
>
>
>
> More specifically, I am thinking about adding a separate package to
> PySpark for pandas APIs on PySpark. Therefore it wouldn’t break anything in
> the existing code. The overview would look as below:
>
> pyspark_dataframe.[... PySpark APIs ...]
> pandas_dataframe.[... pandas APIs (local) ...]
>
> # The package names will change in the final proposal and during review.
> koalas_dataframe = koalas.from_pandas(pandas_dataframe)
> koalas_dataframe = koalas.from_spark(pyspark_dataframe)
> koalas_dataframe.[... pandas APIs on Spark ...]
>
> pyspark_dataframe = koalas_dataframe.to_spark()
> pandas_dataframe = koalas_dataframe.to_pandas()
>
> Koalas provides a pandas API layer on PySpark. It supports almost the same
> API usages. Users can leverage their existing Spark cluster to scale their
> pandas workloads. It works interchangeably with PySpark by allowing both
> pandas and PySpark APIs to users.
>
> The project has grown separately for more than two years, and this has been
> going successfully. With version 1.7.0, Koalas has greatly improved maturity
> and stability. Its usability has been proven with numerous users’ adoptions
> and by reaching more than 75% API coverage in pandas’ Index, Series and
> DataFrame.
>
> I strongly think this is the direction we should go for Apache Spark, and
> it is a win-win strategy for the growth of both Apache Spark and pandas.
> Please see the reasons below.
> Why do we need it?
>
>-
>
>Python has grown dramatically in the last few years and became one of
>the most popular languages, see also StackOverFlow trend
>
>for Python, Java, R and Scala languages.
>-
>
>pandas became almost the standard library of data science. Please also
>see the StackOverFlow trend
>
>for pandas, Apache Spark and PySpark.
>-
>
>PySpark is not Pythonic enough. At least I myself hear a lot of
>complaints. That initiated Project Zen
>, and we have
>greatly improved PySpark usability and made it more Pythonic.
>
> Nevertheless, data scientists tend to prefer pandas libraries according to
> the trends but APIs are hard to change in PySpark. We should redesign all
> APIs and improve them from scratch, which is very difficult.
>
> One straightforward and fast approach is to benchmark a successful case,
> and pandas does not support distributed execution. Once PySpark supports
> pandas-like APIs, it can be a good option for pandas users to scale their
> workloads easily. I do believe this is a win-win strategy for the growth of
> both pandas and PySpark.
>
> In fact, there are already similar tries such as Dask 
> and Modin  (other than Koalas
> ). They are all growing fast and
> successfully, and I find that people compare it to PySpark from time to
> time, for example, see Beyond Pandas: Spark, Dask, Vaex and other big
> data technologies battling head to head
> 
> .
>
>
>
>-
>
>There are many important features missing that are very common in data
>science. One of the most important features is plotting and drawing a
>chart. Almost every data scientist plots and draws a chart to understand
>their data quickly and visually in their daily work but this is missing in
>PySpark. Please see one example in pandas:
>
>
>
>
> I do recommend taking a quick look at the blog posts and talks made for
> pandas on Spark:
> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
> They explain why we need this far better.
>
>


Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Sean Owen
Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all
profiles enabled.
I still get an odd failure in the Hive versions suite, but I keep seeing
that in my env and think it's something odd about my setup.
+1


Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-12 Thread Sean Owen
+1 same result as last RC for me.

On Mon, Apr 12, 2021, 12:53 AM Liang-Chi Hsieh  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.8.
>
> The vote is open until Apr 15th at 9AM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.8
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.8 (try project = SPARK AND
> "Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.8-rc2 (commit
> a0ab27ca6b46b8e5a7ae8bb91e30546082fc551c):
> https://github.com/apache/spark/tree/v2.4.8-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1373/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-docs/
>
> The list of bug fixes going into 2.4.8 can be found at the following URL:
> https://s.apache.org/spark-v2.4.8-rc2
>
> This release is using the release script of the tag v2.4.8-rc2.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.8?
> ===
>
> The current list of open tickets targeted at 2.4.8 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.8
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: mvn auto-downloading on fresh clone

2021-04-21 Thread Sean Owen
I agree, it looks like the automatic redirector has changed behavior. It
still sends you to an HTML page for the mirror, but previously that link
would cause it to redirect straight to the download.
While the script can fall back to archive.apache.org, it doesn't, because the
HTML downloads successfully -- it just is not the distribution!
Either we detect this or have to hack this more to get the mirror URL from
the redirector, then attach it to the path.
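
A rough sketch of that second option is below; the as_json/preferred behavior of closer.lua is my assumption and should be verified.

```
# ask the redirector for a concrete mirror, build the full download URL ourselves,
# and fall back to archive.apache.org if the mirror fetch fails; Maven version is illustrative
MVN_VER=3.6.3
MIRROR=$(curl -s 'https://www.apache.org/dyn/closer.lua?as_json=1' |
  grep -o '"preferred": *"[^"]*"' | sed 's/.*"preferred": *"//; s/"$//')
TARBALL="maven/maven-3/${MVN_VER}/binaries/apache-maven-${MVN_VER}-bin.tar.gz"
curl -fL -o "apache-maven-${MVN_VER}-bin.tar.gz" "${MIRROR}${TARBALL}" ||
  curl -fL -o "apache-maven-${MVN_VER}-bin.tar.gz" "https://archive.apache.org/dist/${TARBALL}"
```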

On Wed, Apr 21, 2021 at 12:51 PM Bruce Robbins 
wrote:

> Is it just me, or does the auto-download of maven on a fresh Spark clone
> no longer work? It looks like
> https://www.apache.org/dyn/closer.lua?action=download&filename= is not
> functioning anymore (or for the moment) for any piece of Apache software.
>
> I noted this in https://issues.apache.org/jira/browse/SPARK-35178.
>
> I tried on two unrelated networks, so I don't think I am being rate
> limited.
>
> When I changed the mirror in build/mvn to the direct download page
> (https://downloads.apache.org), the build worked. I assume there
> is a good reason build/mvn doesn't use that url (I suppose the current url
> chooses a mirror, maybe?).
>


Re: [VOTE] Release Spark 2.4.8 (RC3)

2021-04-28 Thread Sean Owen
+1 from me too, same result as last time.

On Wed, Apr 28, 2021 at 11:33 AM Liang-Chi Hsieh  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.8.
>
> The vote is open until May 4th at 9AM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.8
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.8 (try project = SPARK AND
> "Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.8-rc3 (commit
> e89526d2401b3a04719721c923a6f630e555e286):
> https://github.com/apache/spark/tree/v2.4.8-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1377/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc3-docs/
>
> The list of bug fixes going into 2.4.8 can be found at the following URL:
> https://s.apache.org/spark-v2.4.8-rc3
>
> This release is using the release script of the tag v2.4.8-rc3.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.8?
> ===
>
> The current list of open tickets targeted at 2.4.8 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.8
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Should we add built in support for bouncy castle EC w/Kube

2021-04-29 Thread Sean Owen
I recall that Bouncy Castle has some crypto export implications. If it's in
the distro then I think we'd have to update
https://www.apache.org/licenses/exports/ to reflect that Bouncy Castle is
again included in the product. But that's doable. Just have to recall how
one updates that.

On Thu, Apr 29, 2021 at 1:08 PM Holden Karau  wrote:

> Hi Folks,
>
> I've deployed a new version of K3s locally and I ran into an issue
> with the key format not being supported out of the box. We delegate to
> fabric8 which has bouncy castle EC as an optional dependency. Adding
> it would add ~6mb to the Kube jars. What do folks think?
>
> Cheers,
>
> Holden
>
> P.S.
>
> If you're running K3s in your lab as well and get "Exception in thread
> "main" io.fabric8.kubernetes.client.KubernetesClientException:
> JcaPEMKeyConverter is provided by BouncyCastle, an optional
> dependency. To use support for EC Keys you must explicitly add this
> dependency to classpath." I worked around it by adding
>
> https://repo1.maven.org/maven2/org/bouncycastle/bcpkix-jdk15on/1.68/bcpkix-jdk15on-1.68.jar
> &
> 
> https://repo1.maven.org/maven2/org/bouncycastle/bcprov-jdk15on/1.68/bcprov-jdk15on-1.68.jar
> to my class path.
>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
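
(A scripted version of the workaround in the P.S. above, for anyone else hitting this with K3s. It only fetches the two optional Bouncy Castle jars from the links in the mail and drops them onto the classpath via the distribution's jars/ directory; this is not an endorsement of bundling them.)

```
# fetch the optional Bouncy Castle artifacts and put them on the launcher classpath
BC_VER=1.68
for a in bcpkix-jdk15on bcprov-jdk15on; do
  curl -fLO "https://repo1.maven.org/maven2/org/bouncycastle/${a}/${BC_VER}/${a}-${BC_VER}.jar"
done
mv bcpkix-jdk15on-${BC_VER}.jar bcprov-jdk15on-${BC_VER}.jar jars/
```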


Re: [apache/spark-website] Update contributing to include code of conduct section (#335)

2021-05-04 Thread Sean Owen
Just FYI - proposed update to the CoC for the project. Looks reasonable to
simply adopt the ASF code of conduct, per the PR.

On Tue, May 4, 2021 at 2:02 AM Jungtaek Lim 
wrote:

> I think the rationalization is great, but why not go through the dev@
> mailing list? Many contributors subscribe to the dev@ mailing list as well,
> and it would also be a good time to remind people of the CoC through your
> idea/discussion thread.
>
> I assume getting consensus to add it here is just a matter of time (the CoC
> is already something the ASF requires of its projects, and we are just making
> it explicit here), but it might be ideal to reach a wider audience.
>
>


Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-10 Thread Sean Owen
It looks like the repository is "open" - it doesn't publish until "closed"
after all artifacts are uploaded. Is that it?
Otherwise +1 from me.

On Mon, May 10, 2021 at 1:10 AM Liang-Chi Hsieh  wrote:

> Yea, I don't know why it happens.
>
> I remember RC1 also has the same issue. But RC2 and RC3 don't.
>
> Does it affect the RC?
>
>
> John Zhuge wrote
> > Got this error when browsing the staging repository:
> >
> > 404 - Repository "orgapachespark-1383 (staging: open)"
> > [id=orgapachespark-1383] exists but is not exposed.
> >
> > John Zhuge
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Sean Owen
Hm, yes I see it at
http://pool.sks-keyservers.net/pks/lookup?search=0x653c2301fea493ee&fingerprint=on&op=index
but not on keyserver.ubuntu.com for some reason.
What happens if you try to close it again, perhaps even manually in the UI
there? I don't want to click it unless it messes up the workflow

On Tue, May 11, 2021 at 11:34 AM Liang-Chi Hsieh  wrote:

> I did upload my public key in
> https://dist.apache.org/repos/dist/dev/spark/KEYS.
> I also uploaded it to a public keyserver before cutting RC1.
>
> I also just tried searching for the public key and can find it.
>
>
>
> cloud0fan wrote
> > [image: image.png]
> >
> > I checked the log in https://repository.apache.org/#stagingRepositories,
> > seems the gpg key is not uploaded to the public keyserver. Liang-Chi can
> > you take a look?
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Sean Owen
Is there a separate process that pushes to maven central? That's what we
have to have in the end.

On Tue, May 11, 2021, 12:31 PM Liang-Chi Hsieh  wrote:

> I don't know what will happen if I manually close it now.
>
> Not sure if the current status causes a problem? If not, maybe leave it as
> is?
>
>
> Sean Owen-2 wrote
> > Hm, yes I see it at
> >
> http://pool.sks-keyservers.net/pks/lookup?search=0x653c2301fea493ee&fingerprint=on&op=index
> > but not on keyserver.ubuntu.com for some reason.
> > What happens if you try to close it again, perhaps even manually in the
> UI
> > there? I don't want to click it unless it messes up the workflow
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Resolves too old JIRAs as incomplete

2021-05-19 Thread Sean Owen
I agree. Such old JIRAs are 99% obsolete. If anyone objects to a particular
issue being closed, they can comment and we can reopen. It's a very
reversible thing. There is value in keeping JIRA up to date with reality.

On Wed, May 19, 2021 at 8:47 PM Takeshi Yamamuro 
wrote:

> Hi, dev,
>
> As you know, we have too many open JIRAs now:
> # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
> Progress", Reopened)'
>
> We've recently released v2.4.8 (EOL), so I'd like to bulk-close very old
> JIRAs to make the backlog manageable.
>
> As Hyukjin did the same action two years ago (for details, see:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
> I'm planning to use a similar JQL below to close them:
>
> project = SPARK AND status in (Open, "In Progress", Reopened) AND
> (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
> AND updated <= -52w
>
> The total number of matched JIRAs is 741.
> Or, we might be able to close them more aggressively by removing the
> version condition:
>
> project = SPARK AND status in (Open, "In Progress", Reopened) AND updated
> <= -52w
>
> The matched number is 1484 (almost half of the current open JIRAs).
>
> If there is no objection, I'd like to do it next week or later.
> Any thoughts?
>
> Bests,
> Takeshi
> --
> ---
> Takeshi Yamamuro
>


Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-25 Thread Sean Owen
+1 same result as in previous tests

On Mon, May 24, 2021 at 1:14 AM Dongjoon Hyun 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.1.2.
>
> The vote is open until May 27th 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.1.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.1.2-rc1 (commit
> de351e30a90dd988b133b3d00fa6218bfcaba8b8):
> https://github.com/apache/spark/tree/v3.1.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1384/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-docs/
>
> The list of bug fixes going into 3.1.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349602
>
> This release is using the release script of the tag v3.1.2-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.1.2?
> ===
>
> The current list of open tickets targeted at 3.1.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.1.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


How to think about SparkPullRequestBuilder-K8s?

2021-06-11 Thread Sean Owen
I find that somewhat often, the K8S PR builders will fail on a PR:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/

... when the PR seems totally unrelated to K8S. I've kind of learned to
ignore them in that case, but that seems wrong. Are they just kind of flaky?
Am I imagining things? I'm just trying to figure out how accurate they are
at catching real vs. false failures.


Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-17 Thread Sean Owen
+1 same result as ever. Signatures are OK, tags look good, tests pass.

On Thu, Jun 17, 2021 at 5:11 AM Yi Wu  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.0.3.
>
> The vote is open until Jun 21st 3AM (PST) and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.0.3-rc1 (commit
> 65ac1e75dc468f53fc778cd2ce1ba3f21067aab8):
> https://github.com/apache/spark/tree/v3.0.3-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1386/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-docs/
>
> The list of bug fixes going into 3.0.3 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349723
>
> This release is using the release script of the tag v3.0.3-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.3?
> ===
>
> The current list of open tickets targeted at 3.0.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [DISCUSS] Rename hadoop-3.2/hadoop-2.7 profile to hadoop-3/hadoop-2?

2021-06-24 Thread Sean Owen
The downside here is that it would break downstream builds that set
hadoop-3.2 if it's now called hadoop-3. That's not a huge deal. We can
retain dummy profiles under the old names that do nothing, but that would
be a quieter 'break'. I suppose this naming is only of importance to
developers, who might realize that hadoop-3.2 means "hadoop-3.2 or later".
And maybe the current naming leaves the possibility for a "hadoop-3.5" or
something if that needed to be different.

I don't feel strongly but would default to leaving it, very slightly.

On Thu, Jun 24, 2021 at 1:42 PM Chao Sun  wrote:

> Hi,
>
> As Spark master has upgraded to Hadoop-3.3.1, the current Maven profile
> name hadoop-3.2 is no longer accurate, and it may confuse Spark users when
> they realize the actual version is not Hadoop 3.2.x. Therefore, I created
> https://issues.apache.org/jira/browse/SPARK-33880 to change the profile
> name to hadoop-3 and hadoop-2 respectively. What do you think? Is this
> something worth doing as part of Spark 3.2.0 release?
>
> Best,
> Chao
>


Re: Removing references to Master

2021-07-09 Thread Sean Owen
We maybe don't need to litigate this one again. I do think this point of
view is legitimate, as is the point of view that 'master' is inextricably
linked to 'master/slave' as an unfortunate term of art; it did not
originate in reference to mastery of a skill but of another entity. Even if
one viewed this as a symbolic gesture at best, that has value. Nobody
believes this is going to end racism or anything, but it at least has value
as a signal of community values.

This of course does have to balance against practical concerns. We don't
want to break APIs and code over this, I believe.

On Fri, Jul 9, 2021 at 9:55 AM Mich Talebzadeh 
wrote:

>
> Hi,
>
> I am afraid I have to differ on this if I may. This is the gist of it.
>
> The term master has been a part of English and English-speaking culture
> for a very long time. Unlike "slave", the term "master" has no
> connotation of anything related to race, creed and so forth. It is widely
> used in the literature.
>
> I believe this has been discussed in Spark forums before. Otherwise we may
> end up replacing master tournaments with leader tournaments, master class
> with leader class, master craftsman  with leader craftsman and so forth.
> Personally I  don't believe it will make any significant contribution. We
> should bear in mind these terminologies are interpreted as they are used
> and within the context they are used.
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 9 Jul 2021 at 15:24, Thomas Graves  wrote:
>
>> Hey everyone,
>>
>> Looking at this again since we cut spark 3.2 branch thinking this
>> might be something to target for Spark 3.3.
>>
>> Based on the feedback, I'd like to propose using "Leader" to replace
>> "Master".   If there are objections to this please let me know in the
>> next few days.
>>
>> Thanks,
>> Tom
>>
>> On Tue, Jan 19, 2021 at 10:13 AM Tom Graves
>>  wrote:
>> >
>> > Thanks for the interest. I haven't had time to work on replacing
>> > Master - hopefully for the next release, but it's time dependent. If you
>> > follow the Jira - https://issues.apache.org/jira/browse/SPARK-32333 - I
>> > will post there when I start, or if someone else picks it up you should
>> > see activity there.
>> >
>> > Tom
>> >
>> > On Saturday, January 16, 2021, 07:56:14 AM CST, João Paulo Leonidas
>> Fernandes Dias da Silva  wrote:
>> >
>> >
>> > So, it looks like slave was already replaced in the docs. Waiting for a
>> definition on the replacement(s) for master so I can create a PR for the
>> docs only.
>> >
>> > On Sat, Jan 16, 2021 at 8:30 AM jpaulorio  wrote:
>> >
>> > What about updating the documentation as well? Does it depend on the
>> > codebase changes or can we treat it as a separate issue? I volunteer to
>> > update both Master and Slave terms when there's an agreement on what
>> > should be used as replacement. Since [SPARK-32004] was already resolved,
>> > can I start replacing slave with worker?
>> >
>> >
>> >
>> > --
>> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Spark 3: Resource Discovery

2021-07-17 Thread Sean Owen
At the moment this is really about discovering GPUs, so that the scheduler
can schedule tasks that need to allocate whole GPUs.
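
For illustration, a minimal sketch of what that looks like from the user side. The
resource amounts, the discovery-script path (/opt/spark/getGpus.sh), and the job body
are assumptions invented for the example, not something taken from this thread:

```scala
// Sketch only: amounts, script path, and job body are illustrative assumptions.
// The executor configs advertise GPUs to the scheduler, the task config makes
// each task claim a whole GPU, and TaskContext exposes the address(es) that
// were assigned to the running task.
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
  .set("spark.task.resource.gpu.amount", "1") // one whole GPU per task

val spark = SparkSession.builder.config(conf).getOrCreate()

spark.sparkContext.parallelize(1 to 4, numSlices = 4).foreach { _ =>
  val gpuAddresses = TaskContext.get().resources()("gpu").addresses
  println(s"This task was assigned GPU(s): ${gpuAddresses.mkString(",")}")
}
```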

On Sat, Jul 17, 2021 at 5:14 PM ayan guha  wrote:

> Hi
>
> As I was going through the Spark 3 config params, I noticed the following group
> of params. I could not understand what they are for. Can anyone please point
> me in the right direction?
>
> spark.driver.resource.{resourceName}.amount (default: 0, since 3.0.0)
>   Amount of a particular resource type to use on the driver. If this is used,
>   you must also specify spark.driver.resource.{resourceName}.discoveryScript
>   for the driver to find the resource on startup.
>
> spark.driver.resource.{resourceName}.discoveryScript (default: None, since 3.0.0)
>   A script for the driver to run to discover a particular resource type. This
>   should write to STDOUT a JSON string in the format of the
>   ResourceInformation class, which has a name and an array of addresses. For a
>   client-submitted driver, the discovery script must assign different resource
>   addresses to this driver compared to other drivers on the same host.
>
> spark.driver.resource.{resourceName}.vendor (default: None, since 3.0.0)
>   Vendor of the resources to use for the driver. This option is currently only
>   supported on Kubernetes and is actually both the vendor and domain following
>   the Kubernetes device plugin naming convention. (e.g. for GPUs on Kubernetes
>   this config would be set to nvidia.com or amd.com)
>
> spark.resources.discoveryPlugin (default:
>   org.apache.spark.resource.ResourceDiscoveryScriptPlugin, since 3.0.0)
>   Comma-separated list of class names implementing
>   org.apache.spark.api.resource.ResourceDiscoveryPlugin to load into the
>   application. This is for advanced users to replace the resource discovery
>   class with a custom implementation. Spark will try each class specified
>   until one of them returns the resource information for that resource. It
>   tries the discovery script last if none of the plugins return information
>   for that resource.
>
> --
> Best Regards,
> Ayan Guha
>


Re: TreeNode.exists?

2021-08-11 Thread Sean Owen
If this is repeated in a bunch of places in the code, sure, a utility method
could be good.
I think .find(x).isDefined isn't even optimal - .exists(x) is a little
easier to read and may be slightly faster.
If you see a chance for refactoring, sure, open a minor PR.
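
For illustration, a self-contained sketch of what such a utility boils down to. The
Node class and the sample tree below are toys invented for the example, not Catalyst's
actual TreeNode API:

```scala
// Toy example only: Node and its find are stand-ins for TreeNode.
case class Node(name: String, children: Seq[Node] = Nil) {
  // Depth-first search for the first node matching the predicate.
  def find(f: Node => Boolean): Option[Node] =
    if (f(this)) Some(this)
    else children.view.map(_.find(f)).collectFirst { case Some(n) => n }

  // The proposed utility is just a readable wrapper over find.
  def exists(f: Node => Boolean): Boolean = find(f).isDefined
}

val plan = Node("Project", Seq(Node("Filter", Seq(Node("Command")))))

plan.find(_.name == "Command").isDefined // current call-site pattern: true
plan.exists(_.name == "Command")         // proposed shorthand: also true
```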

On Wed, Aug 11, 2021 at 9:42 AM Jacek Laskowski  wrote:

> Hi,
>
> I've already run into code like the following a couple of times ([1]):
>
> val isCommand = plan.find {
>   case _: Command | _: ParsedStatement | _: InsertIntoDir => true
>   case _ => false
> }.isDefined
>
> I think that this and the other places beg (scream?) for TreeNode.exists
> that could do the simplest thing possible:
>
>   find(f).isDefined
>
> or even
>
>   collectFirst(...).isDefined
>
> It'd surely help a lot for people like myself reading the code. WDYT?
>
> [1]
> https://github.com/apache/spark/pull/33671/files#diff-4d16a733f8741de9a4b839ee7c356c3e9b439b4facc70018f5741da1e930c6a8R51-R54
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>


Re: Access to Apache GitHub

2021-08-15 Thread Sean Owen
No, we can't give write access to Apache repos of course, not to anyone but
committers.
People contribute by opening pull requests.


On Sun, Aug 15, 2021 at 10:11 AM Mich Talebzadeh 
wrote:

>
> Hi,
>
>
> With reference to recent threads/discussions on creating ready-made
> docker images for spark, spark-py and spark-R, it would be great to create
> a project/repository in the Apache Spark GitHub specifically for this purpose.
>
>
> Would some administrator of this https://github.com/apache/spark
>
> give my email mich.talebza...@gmail.com appropriate access to create the
> said repository or please let me know who I need to contact?
>
>
> Regards,
>
>
> Mich
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Nabble archive is down

2021-08-17 Thread Sean Owen
If the links are down and not evidently coming back, yeah let's change any
website links. Probably best to depend on ASF resources foremost, but, the
ASF archive isn't searchable:
https://mail-archives.apache.org/mod_mbox/spark-user/

What about things like https://www.mail-archive.com/user@spark.apache.org/
for search? I don't have any experience or preference about third-party
archives to recommend for it.

Go ahead and open a PR, if you're willing.


On Tue, Aug 17, 2021 at 2:45 PM Maciej  wrote:

> Hi everyone,
>
> It seems like Nabble is downsizing and nX.nabble.com servers, including
> one with Spark user and dev lists, are already down. Do we plan to ask them to
> preserve the content (I haven't seen any related requests on their
> support forum) or should we update website links to point to the ASF
> archives?
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>
>


Re: Nabble archive is down

2021-08-17 Thread Sean Owen
Oh duh, right, much better idea!

On Tue, Aug 17, 2021 at 2:56 PM Micah Kornfield 
wrote:

> https://lists.apache.org/list.html?u...@spark.apache.org should be
> searchable (although the UI is a little clunky).
>
> On Tue, Aug 17, 2021 at 12:52 PM Sean Owen  wrote:
>
>> If the links are down and not evidently coming back, yeah let's change
>> any website links. Probably best to depend on ASF resources foremost, but,
>> the ASF archive isn't searchable:
>> https://mail-archives.apache.org/mod_mbox/spark-user/
>>
>> What about things like
>> https://www.mail-archive.com/user@spark.apache.org/ for search? I don't
>> have any experience or preference about third-party archives to recommend
>> for it.
>>
>> Go ahead and open a PR, if you're willing.
>>
>>
>> On Tue, Aug 17, 2021 at 2:45 PM Maciej  wrote:
>>
>>> Hi everyone,
>>>
>>> It seems like Nabble is downsizing and nX.nabble.com servers, including
>>> one with Spark user and dev lists, are already down. Do we plan to ask them to
>>> preserve the content (I haven't seen any related requests on their
>>> support forum) or should we update website links to point to the ASF
>>> archives?
>>>
>>> --
>>> Best regards,
>>> Maciej Szymkiewicz
>>>
>>> Web: https://zero323.net
>>> Keybase: https://keybase.io/zero323
>>> Gigs: https://www.codementor.io/@zero323
>>> PGP: A30CEF0C31A501EC
>>>
>>>
>>>


Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-22 Thread Sean Owen
So far, I've tested Java 8 + Scala 2.12, Scala 2.13 and the results look
good per usual.
Good to see Scala 2.13 artifacts!! Unless I've forgotten something we're OK
for Scala 2.13 now, and Java 11 (and, IIRC, Java 14 works fine minus some
very minor corners of the project's deps)

I think we're going to have to have this fix, which just missed the 3.2 RC:
https://github.com/apache/spark/commit/c441c7e365cdbed4bae55e9bfdf94fa4a118fb21
I think that means we shouldn't release this RC, but, of course let's test.



On Fri, Aug 20, 2021 at 12:05 PM Gengliang Wang  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.0.
>
> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.0-rc1 (commit
> 6bb3523d8e838bd2082fb90d7f3741339245c044):
> https://github.com/apache/spark/tree/v3.2.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1388
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/
>
> The list of bug fixes going into 3.2.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>
> This release is using the release script of the tag v3.2.0-rc1.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.0?
> ===
> The current list of open tickets targeted at 3.2.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>


Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-22 Thread Sean Owen
Jackson was bumped from 2.10.x to 2.12.x, which could well explain it if
you're exposed to the Spark classpath and have your own different Jackson
dep.
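
For what it's worth, a hedged sketch of one common downstream mitigation, in sbt
syntax: pin your build's Jackson modules to the version the Spark 3.2 artifacts pull
in so the test classpath isn't mixing releases. The 2.12.3 version below is an
assumption to verify against Spark's actual dependency list:

```scala
// build.sbt sketch: the version number is an assumption; check the Jackson
// version the Spark 3.2 artifacts actually depend on and pin to that.
ThisBuild / dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core"    % "jackson-core"         % "2.12.3",
  "com.fasterxml.jackson.core"    % "jackson-annotations"  % "2.12.3",
  "com.fasterxml.jackson.core"    % "jackson-databind"     % "2.12.3",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.12.3"
)
```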

On Sun, Aug 22, 2021 at 1:21 PM Michael Heuer  wrote:

> We're seeing runtime classpath issues with Avro 1.10.2, Parquet 1.12.0,
> and Spark 3.2.0 RC1.
>
> Our dependency tree is deep though, and will require further investigation.
>
> https://github.com/bigdatagenomics/adam/pull/2289
>
> $ mvn test
> ...
> *** RUN ABORTED ***
>   java.lang.NoClassDefFoundError: com/fasterxml/jackson/annotation/JsonKey
>   at
> com.fasterxml.jackson.databind.introspect.JacksonAnnotationIntrospector.hasAsKey(JacksonAnnotationIntrospector.java:1080)
>   at
> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.hasAsKey(AnnotationIntrospectorPair.java:611)
>   at
> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.hasAsKey(AnnotationIntrospectorPair.java:611)
>   at
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:495)
>   at
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.collectAll(POJOPropertiesCollector.java:421)
>   at
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector.getJsonValueAccessor(POJOPropertiesCollector.java:270)
>   at
> com.fasterxml.jackson.databind.introspect.BasicBeanDescription.findJsonValueAccessor(BasicBeanDescription.java:258)
>   at
> com.fasterxml.jackson.databind.ser.BasicSerializerFactory.findSerializerByAnnotations(BasicSerializerFactory.java:391)
>   at
> com.fasterxml.jackson.databind.ser.BeanSerializerFactory._createSerializer2(BeanSerializerFactory.java:220)
>   at
> com.fasterxml.jackson.databind.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:169)
>   at
> com.fasterxml.jackson.databind.SerializerProvider._createUntypedSerializer(SerializerProvider.java:1473)
>   at
> com.fasterxml.jackson.databind.SerializerProvider._createAndCacheUntypedSerializer(SerializerProvider.java:1421)
>   at
> com.fasterxml.jackson.databind.SerializerProvider.findValueSerializer(SerializerProvider.java:520)
>   at
> com.fasterxml.jackson.databind.SerializerProvider.findTypedValueSerializer(SerializerProvider.java:798)
>   at
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:308)
>   at
> com.fasterxml.jackson.databind.ObjectMapper._writeValueAndClose(ObjectMapper.java:4487)
>   at
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3742)
>   at
> org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:52)
>   at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:145)
>   at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
>   at
> org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1239)
>   at
> org.bdgenomics.adam.ds.ADAMContext.readVcfRecords(ADAMContext.scala:2668)
>   at org.bdgenomics.adam.ds.ADAMContext.loadVcf(ADAMContext.scala:2686)
>   at
> org.bdgenomics.adam.ds.ADAMContext.loadVariants(ADAMContext.scala:3608)
>   at
> org.bdgenomics.adam.ds.variant.VariantDatasetSuite.$anonfun$new$1(VariantDatasetSuite.scala:128)
>   at
> org.bdgenomics.utils.misc.SparkFunSuite.$anonfun$sparkTest$1(SparkFunSuite.scala:111)
>
>
>
> On Aug 22, 2021, at 10:58 AM, Sean Owen  wrote:
>
> So far, I've tested Java 8 + Scala 2.12, Scala 2.13 and the results look
> good per usual.
> Good to see Scala 2.13 artifacts!! Unless I've forgotten something we're
> OK for Scala 2.13 now, and Java 11 (and, IIRC, Java 14 works fine minus
> some very minor corners of the project's deps)
>
> I think we're going to have to have this fix, which just missed the 3.2 RC:
>
> https://github.com/apache/spark/commit/c441c7e365cdbed4bae55e9bfdf94fa4a118fb21
> I think that means we shouldn't release this RC, but, of course let's test.
>
>
>
> On Fri, Aug 20, 2021 at 12:05 PM Gengliang Wang  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.2.0.
>>
>> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.2.0-rc1 (commit
>> 6bb3523d8e838bd2082fb90d7f3741339245c044):
>> https://github.com/apache/spark

Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-24 Thread Sean Owen
I think we'll need this revert:
https://github.com/apache/spark/pull/33819

Between that and a few other minor but important issues I think I'd say -1
myself and ask for another RC.

On Tue, Aug 24, 2021 at 1:01 PM Jacek Laskowski  wrote:

> Hi Yi Wu,
>
> Looks like the issue has been resolved as Won't Fix. How about your -1?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>
>
> On Mon, Aug 23, 2021 at 4:58 AM Yi Wu  wrote:
>
>> -1. I found a bug (https://issues.apache.org/jira/browse/SPARK-36558) in
>> the push-based shuffle, which could lead to job hang.
>>
>> Bests,
>> Yi
>>
>> On Sat, Aug 21, 2021 at 1:05 AM Gengliang Wang  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>>  version 3.2.0.
>>>
>>> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.2.0-rc1 (commit
>>> 6bb3523d8e838bd2082fb90d7f3741339245c044):
>>> https://github.com/apache/spark/tree/v3.2.0-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1388
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/
>>>
>>> The list of bug fixes going into 3.2.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>>
>>> This release is using the release script of the tag v3.2.0-rc1.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.2.0?
>>> ===
>>> The current list of open tickets targeted at 3.2.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.2.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>


Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-26 Thread Sean Owen
Did you run ./dev/change-scala-version.sh 2.13 ? that's required first to
update POMs. It works fine for me.

On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy 
wrote:

> Hi all,
>
> Being adventurous I have built the RC1 code with:
>
> -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver -Phive-2.3
> -Pscala-2.13 -Dhadoop.version=3.2.2
>
>
> And then attempted to build my Java based spark application.
>
> However, I found a number of our unit tests were failing with:
>
> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>
> at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
> at org.apache.spark.SparkContext.union(SparkContext.scala:1406)
> at
> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:698)
> at
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
> …
>
>
> I tracked this down to a missing dependency:
>
> <dependency>
>   <groupId>org.scala-lang.modules</groupId>
>   <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
> </dependency>
>
>
> which unfortunately appears only in a profile in the pom files associated
> with the various spark dependencies.
>
> As far as I know it is not possible to activate profiles in dependencies
> in maven builds.
>
> Therefore I suspect that right now a Scala 2.13 migration is not quite as
> seamless as we would like.
>
> I stress that this is only an issue for developers that write unit tests
> for their applications, as the Spark runtime environment will always have
> the necessary dependencies available to it.
>
> (You might consider upgrading the
> org.scala-lang.modules:scala-parallel-collections_2.13 version from 0.2 to
> 1.0.3 though!)
>
> Cheers and thanks for the great work!
>
> Steve Coy
>
>
> On 21 Aug 2021, at 3:05 am, Gengliang Wang  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.2.0.
>
> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
>
> The tag to be voted on is v3.2.0-rc1 (commit
> 6bb3523d8e838bd2082fb90d7f3741339245c044):
> https://github.com/apache/spark/tree/v3.2.0-rc1
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-bin/
> 
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1388
> 

Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-26 Thread Sean Owen
OK right, you would have seen a different error otherwise.

Yes profiles are only a compile-time thing, but they should affect the
effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
scala-parallel-collections as a dependency in the POM as expected (not in a
profile). However I see what you see in the .pom in the release repo, and
in my local repo after building - it's just sitting there as a profile as
if it weren't activated or something.

I'm confused then, that shouldn't be what happens. I'd say maybe there is a
problem with the release script, but seems to affect a simple local build.
Anyone else more expert in this see the problem, while I try to debug more?
The binary distro may actually be fine, I'll check; it may even not matter
much for users who generally just treat Spark as a compile-time-only
dependency either. But I can see it would break exactly your case,
something like a self-contained test job.

On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy  wrote:

> I did indeed.
>
> The generated spark-core_2.13-3.2.0.pom that is created alongside the jar
> file in the local repo contains:
>
> <profile>
>   <id>scala-2.13</id>
>   <dependencies>
>     <dependency>
>       <groupId>org.scala-lang.modules</groupId>
>       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>     </dependency>
>   </dependencies>
> </profile>
>
> which means this dependency will be missing for unit tests that create
> SparkSessions from library code only, a technique inspired by Spark’s own
> unit tests.
>
> Cheers,
>
> Steve C
>
> On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:
>
> Did you run ./dev/change-scala-version.sh 2.13 ? that's required first to
> update POMs. It works fine for me.
>
> On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy 
> wrote:
>
>> Hi all,
>>
>> Being adventurous I have built the RC1 code with:
>>
>> -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver -Phive-2.3
>> -Pscala-2.13 -Dhadoop.version=3.2.2
>>
>>
>> And then attempted to build my Java based spark application.
>>
>> However, I found a number of our unit tests were failing with:
>>
>> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>>
>> at org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>> at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
>> at org.apache.spark.SparkContext.union(SparkContext.scala:1406)
>> at
>> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:698)
>> at
>> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>> …
>>
>>
>> I tracked this down to a missing dependency:
>>
>> <dependency>
>>   <groupId>org.scala-lang.modules</groupId>
>>   <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>> </dependency>
>>
>>
>> which unfortunately appears only in a profile in the pom files associated
>> with the various spark dependencies.
>>
>> As far as I know it is not possible to activate profiles in dependencies
>> in maven builds.
>>
>> Therefore I suspect that right now a Scala 2.13 migration is not quite as
>> seamless as we would like.
>>
>> I stress that this is only an issue for developers that write unit tests
>> for their applications, as the Spark runtime environment will always have
>> the necessary dependencies available to it.
>>
>> (You might consider upgrading the
>> org.scala-lang.modules:scala-parallel-collections_2.13 version from 0.2 to
>> 1.0.3 though!)
>>
>> Cheers and thanks for the great work!
>>
>> Steve Coy
>>
>>
>> On 21 Aug 2021, at 3:05 am, Gengliang Wang  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.2.0.
>>
>> The vote is open until 11:59pm Pacific time Aug 25 and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/

Re: [VOTE] Release Spark 3.2.0 (RC1)

2021-08-27 Thread Sean Owen
Maybe, I'm just confused why it's needed at all. Other profiles that add a
dependency seem OK, but something's different here.

One thing we can/should change is to simply remove the dependencies
block in the profile. It should always be a direct dep in Scala 2.13 (which
lets us take out the profiles in submodules, which just repeat that).
We can also update the version, by the by.

I tried this and the resulting POM still doesn't look like what I expect
though.

(The binary release is OK, FWIW - it gets pulled in as a JAR as expected)

On Thu, Aug 26, 2021 at 11:34 PM Stephen Coy  wrote:

> Hi Sean,
>
> I think that maybe the https://www.mojohaus.org/flatten-maven-plugin/ will
> help you out here.
>
> Cheers,
>
> Steve C
>
> On 27 Aug 2021, at 12:29 pm, Sean Owen  wrote:
>
> OK right, you would have seen a different error otherwise.
>
> Yes profiles are only a compile-time thing, but they should affect the
> effective POM for the artifact. mvn -Pscala-2.13 help:effective-pom shows
> scala-parallel-collections as a dependency in the POM as expected (not in a
> profile). However I see what you see in the .pom in the release repo, and
> in my local repo after building - it's just sitting there as a profile as
> if it weren't activated or something.
>
> I'm confused then, that shouldn't be what happens. I'd say maybe there is
> a problem with the release script, but seems to affect a simple local
> build. Anyone else more expert in this see the problem, while I try to
> debug more?
> The binary distro may actually be fine, I'll check; it may even not matter
> much for users who generally just treat Spark as a compile-time-only
> dependency either. But I can see it would break exactly your case,
> something like a self-contained test job.
>
> On Thu, Aug 26, 2021 at 8:41 PM Stephen Coy  wrote:
>
>> I did indeed.
>>
>> The generated spark-core_2.13-3.2.0.pom that is created alongside the jar
>> file in the local repo contains:
>>
>> <profile>
>>   <id>scala-2.13</id>
>>   <dependencies>
>>     <dependency>
>>       <groupId>org.scala-lang.modules</groupId>
>>       <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>>     </dependency>
>>   </dependencies>
>> </profile>
>>
>> which means this dependency will be missing for unit tests that create
>> SparkSessions from library code only, a technique inspired by Spark’s own
>> unit tests.
>>
>> Cheers,
>>
>> Steve C
>>
>> On 27 Aug 2021, at 11:33 am, Sean Owen  wrote:
>>
>> Did you run ./dev/change-scala-version.sh 2.13 ? that's required first to
>> update POMs. It works fine for me.
>>
>> On Thu, Aug 26, 2021 at 8:33 PM Stephen Coy <
>> s...@infomedia.com.au.invalid> wrote:
>>
>>> Hi all,
>>>
>>> Being adventurous I have built the RC1 code with:
>>>
>>> -Pyarn -Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver
>>> -Phive-2.3 -Pscala-2.13 -Dhadoop.version=3.2.2
>>>
>>>
>>> And then attempted to build my Java based spark application.
>>>
>>> However, I found a number of our unit tests were failing with:
>>>
>>> java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport
>>>
>>> at
>>> org.apache.spark.SparkContext.$anonfun$union$1(SparkContext.scala:1412)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>> at org.apache.spark.SparkContext.withScope(SparkContext.scala:789)
>>> at org.apache.spark.SparkContext.union(SparkContext.scala:1406)
>>> at
>>> org.apache.spark.sql.execution.UnionExec.doExecute(basicPhysicalOperators.scala:698)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
>>> …
>>>
>>>
>>> I tracked this down to a missing dependency:
>>>
>>> <dependency>
>>>   <groupId>org.scala-lang.modules</groupId>
>>>   <artifactId>scala-parallel-collections_${scala.binary.version}</artifactId>
>>> </dependency>
>>>
>>>
>>> which unfortunately appears only in a profile in the pom files
>>> associated with the various spark dependencies.
>>>
>>> As far as I know it is not possible to activate profiles in dependencies
>>> in maven builds.
>>>
>>> Therefore I suspect that right now a Scala 2.13 migration is not quite
>>> as seamless as we would like.
>>>
>>> I stress that this is only an issue for developers that write unit tests
>>> for their applications, as the Spark runt

Re: Question using multiple partition for Window cumulative functions when partition is not specified.

2021-08-30 Thread Sean Owen
You just have 1 partition here because the input is so small. You can
always repartition this further for parallelism.
Is the issue that you're not partitioning the window itself, maybe?
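
For illustration, a minimal sketch in the Scala API of partitioning the window
itself; the "group" column is an invented key and an active SparkSession named
spark is assumed:

```scala
// Sketch only: "group" is a made-up partitioning key. Partitioning the window
// avoids the single-partition warning, with the trade-off that the sum is
// cumulative within each group rather than over the whole dataset.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val sdf = spark.range(10).withColumn("group", col("id") % 2)

val w = Window
  .partitionBy("group")
  .orderBy("id")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

sdf.select(col("id"), col("group"), sum("id").over(w).alias("running_sum")).show()
```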

On Mon, Aug 30, 2021 at 12:59 AM Haejoon Lee 
wrote:

> Hi all,
>
> I noticed that Spark uses only one partition when performing Window
> cumulative functions without specifying the partition, so all the dataset
> is moved into a single partition which easily causes OOM or serious
> performance degradation.
>
> See the example below:
>
> >>> from pyspark.sql import functions as F, Window
> >>> sdf = spark.range(10)
> >>> sdf.select(F.sum(sdf["id"]).over(Window.rowsBetween(Window.unboundedPreceding,
> >>>  Window.currentRow))).show()
> ...
> WARN WindowExec: No Partition Defined for Window operation! Moving all data 
> to a single partition, this can cause serious performance degradation.
> ...
> +---+
> |sum(id) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|
> +---+
> |  0|
> |  1|
> |  3|
> |  6|
> | 10|
> | 15|
> | 21|
> | 28|
> | 36|
> | 45|
> +---+
>
> As shown in the example, the window cumulative function requires the
> result of the previous operation to be used for the next operation. In
> Spark, it is calculated by simply moving all data to one partition if a
> partition is not specified.
>
> To overcome this, Dask, for example, introduces the concept of Overlapping
> Computations, which creates copies of the entire dataset across multiple
> blocks and performs the cumulative function sequentially when the dataset
> exceeds the memory size.
>
> Of course, this method requires more cost for creating the copies and
> communication of each block, but it allows performing cumulative functions
> when even the size of the dataset exceeds the size of the memory, rather
> than causing the OOM.
>
> So it's a way to simply resolve the out-of-memory issue, though without any
> performance advantage.
>
> I think maybe this kind of use case is pretty common in data science, but
> I wonder how frequent these use cases are in Apache Spark.
>
> Would it be helpful to implement this in Apache Spark for performing
> Window cumulative functions on out-of-memory data without specifying a
> partition?
>
> Check here, where the issue was first raised, for more detail.
>
>
> Best,
>
> Haejoon.
>


Re: [VOTE] Release Spark 3.2.0 (RC2)

2021-09-01 Thread Sean Owen
This RC looks OK to me too, understanding we may need to have RC3 for the
outstanding issues though.

The issue with the Scala 2.13 POM is still there; I wasn't able to figure
it out (anyone?), though it may not affect 'normal' usage (and is
work-around-able in other uses, it seems), so may be sufficient if Scala
2.13 support is experimental as of 3.2.0 anyway.


On Wed, Sep 1, 2021 at 2:08 AM Gengliang Wang  wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.2.0.
>
> The vote is open until 11:59pm Pacific time September 3 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.0-rc2 (commit
> 6bb3523d8e838bd2082fb90d7f3741339245c044):
> https://github.com/apache/spark/tree/v3.2.0-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1389
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc2-docs/
>
> The list of bug fixes going into 3.2.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>
> This release is using the release script of the tag v3.2.0-rc2.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.0?
> ===
> The current list of open tickets targeted at 3.2.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [SQL][AQE] Advice needed: a trivial code change with a huge reading impact?

2021-09-08 Thread Sean Owen
That does seem pointless. The body could just be .flatten()-ed to achieve
the same result. Maybe it was just written that way for symmetry with the
block above. You could open a PR to change it.
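
For what it's worth, a self-contained sketch of why the wrapper is a no-op; the
splitPoints/numMappers values are made up and plain tuples stand in for
CoalescedMapperPartitionSpec:

```scala
// Sketch only: flatMapping over a one-element range just yields the inner
// collection, so both forms produce the same sequence.
val splitPoints = Seq(0, 3, 7)
val numMappers  = 10

val wrapped = (0 until 1).flatMap { _ =>
  (splitPoints :+ numMappers).sliding(2).map { case Seq(start, end) => (start, end) }
}

val direct = (splitPoints :+ numMappers).sliding(2).map {
  case Seq(start, end) => (start, end)
}.toSeq

wrapped == direct // true: both are Seq((0,3), (3,7), (7,10))
```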

On Wed, Sep 8, 2021 at 4:31 AM Jacek Laskowski  wrote:

> Hi Spark Devs,
>
> I'm curious what your take on this code [1] would be if you were me trying
> to understand it:
>
>   (0 until 1).flatMap { _ =>
> (splitPoints :+ numMappers).sliding(2).map {
>   case Seq(start, end) => CoalescedMapperPartitionSpec(start, end,
> numReducers)
> }
>   }
>
> There's something important going on here but it's so convoluted that my
> Scala coding skills seem not enough (not to mention AQE skills themselves).
>
> I'm tempted to change (0 until 1) to Seq(0), but Seq(0).flatMap feels
> awkward too. Is this Seq(0).flatMap even needed?! Even with no splitPoints
> we've got numReducers > 0.
>
> Looks like the above is as simple as
>
> (splitPoints :+ numMappers).sliding(2).map {
>   case Seq(start, end) => CoalescedMapperPartitionSpec(start, end,
> numReducers)
>  }
>
> Correct?
>
> I'm mentally used up and can't seem to think straight. Would a PR with
> such a change be acceptable? (Sean I'm looking at you :D)
>
> [1]
> https://github.com/apache/spark/blob/8d817dcf3084d56da22b909d578a644143f775d5/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeShuffleWithLocalRead.scala#L89-L93
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>


Re: Adding Spark 4 to JIRA for targetted versions

2021-09-13 Thread Sean Owen
Sure, doesn't hurt to have a placeholder.

On Mon, Sep 13, 2021, 5:32 PM Holden Karau  wrote:

> Hi Folks,
>
> I'm going through the Spark 3.2 tickets just to make sure we're not missing
> anything important, and I was wondering what folks' thoughts are on adding
> Spark 4 so we can target API-breaking changes to the next major version and
> avoid losing track of the issue.
>
> Cheers,
>
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Release Spark 3.2.0 (RC3)

2021-09-20 Thread Sean Owen
+1 from me, same results as the last RC from my side.
The Scala 2.13 POM issue was resolved and the 2.13 build appears to be OK.

On Sat, Sep 18, 2021 at 10:19 PM Gengliang Wang  wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.2.0.
>
> The vote is open until 11:59pm Pacific time September 24 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.0-rc3 (commit
> 96044e97353a079d3a7233ed3795ca82f3d9a101):
> https://github.com/apache/spark/tree/v3.2.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1390
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc3-docs/
>
> The list of bug fixes going into 3.2.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>
> This release is using the release script of the tag v3.2.0-rc3.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.0?
> ===
> The current list of open tickets targeted at 3.2.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] Release Spark 3.2.0 (RC3)

2021-09-21 Thread Sean Owen
Hm yeah I tend to agree. See https://github.com/apache/spark/pull/33912
This _is_ a test-only dependency which makes it less of an issue.
I'm guessing it's not in Maven as it's a small one-off utility; we _could_
just inline the ~100 lines of code in test code instead?

On Tue, Sep 21, 2021 at 12:33 AM Stephen Coy 
wrote:

> Hi there,
>
> I was going to -1 this because of the com.github.rdblue:brotli-codec:0.1.1
> dependency, which is not available on Maven Central, and therefore is not
> available from our repository manager (Nexus).
>
> Historically, most places I have worked have avoided other public Maven
> repositories because they are not well curated, i.e. artifacts with the same
> GAV have been known to change over time, which never happens with Maven
> Central.
>
> I know that I can address this by changing my settings.xml file.
>
> Anyway, I can see this biting other people so I thought that I would
> mention it.
>
> Steve C
>
> On 19 Sep 2021, at 1:18 pm, Gengliang Wang  wrote:
>
> Please vote on releasing the following candidate as
> Apache Spark version 3.2.0.
>
> The vote is open until 11:59pm Pacific time September 24 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
>
> The tag to be voted on is v3.2.0-rc3 (commit
> 96044e97353a079d3a7233ed3795ca82f3d9a101):
> https://github.com/apache/spark/tree/v3.2.0-rc3
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc3-bin/
> 
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1390
> 
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc3-docs/
> 
>
> The list of bug fixes going into 3.2.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>
> 

Re: [VOTE] Release Spark 3.2.0 (RC5)

2021-09-27 Thread Sean Owen
Hm... it does just affect macOS (?) and only if you don't have JAVA_HOME
set (which people often do set), and it only affects build/mvn, as opposed to
a separately installed Maven (which people often have). It only affects those
building. I'm on the fence about whether it blocks 3.2.0, as it doesn't affect
downstream users and is easily resolvable.

On Mon, Sep 27, 2021 at 10:26 AM sarutak  wrote:

> Hi All,
>
> SPARK-35887 seems to have introduced another issue where building with
> build/mvn on macOS gets stuck, and SPARK-36856 will resolve this issue.
> Should we include the fix in 3.2.0?
>
> - Kousuke
>
> > Please vote on releasing the following candidate as Apache Spark
> > version 3.2.0.
> >
> > The vote is open until 11:59pm Pacific time September 29 and passes if
> > a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.2.0
> >
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v3.2.0-rc5 (commit
> > 49aea14c5afd93ae1b9d19b661cc273a557853f5):
> >
> > https://github.com/apache/spark/tree/v3.2.0-rc5
> >
> > The release files, including signatures, digests, etc. can be found
> > at:
> >
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> >
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> >
> > https://repository.apache.org/content/repositories/orgapachespark-1392
> >
> > The documentation corresponding to this release can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-docs/
> >
> > The list of bug fixes going into 3.2.0 can be found at the following
> > URL:
> >
> > https://issues.apache.org/jira/projects/SPARK/versions/12349407
> >
> > This release is using the release script of the tag v3.2.0-rc5.
> >
> > FAQ
> >
> > =
> >
> > How can I help test this release?
> >
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> >
> > an existing Spark workload and running on this release candidate, then
> >
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> >
> > the current RC and see if anything important breaks, in the Java/Scala
> >
> > you can add the staging repository to your project's resolvers and test
> >
> > with the RC (make sure to clean up the artifact cache before/after so
> >
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> >
> > What should happen to JIRA tickets still targeting 3.2.0?
> >
> > ===
> >
> > The current list of open tickets targeted at 3.2.0 can be found at:
> >
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > Version/s" = 3.2.0
> >
> > Committers should look at those and triage. Extremely important bug
> >
> > fixes, documentation, and API tweaks that impact compatibility should
> >
> > be worked on immediately. Everything else please retarget to an
> >
> > appropriate release.
> >
> > ==
> >
> > But my bug isn't fixed?
> >
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> >
> > release unless the bug in question is a regression from the previous
> >
> > release. That being said, if there is something which is a regression
> >
> > that has not been correctly targeted please ping me or a committer to
> >
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Spark 3.2.0 (RC5)

2021-09-27 Thread Sean Owen
Has anyone seen a StackOverflowError when running tests? It happens in
compilation. I heard from another user who hit this earlier, and I had not,
until just today testing this:

[ERROR] ## Exception when compiling 495 sources to
/mnt/data/testing/spark-3.2.0/sql/catalyst/target/scala-2.12/classes
java.lang.StackOverflowError
scala.tools.nsc.transform.TypingTransformers$TypingTransformer.atOwner(TypingTransformers.scala:38)
scala.reflect.internal.Trees.itransform(Trees.scala:1420)
scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
...

Upping the JVM thread stack size to, say, 16m from 4m in the pom.xml file
made it work. I presume this could be somehow env-specific, as clearly the
CI/CD tests and release process built successfully. Just checking if it's
"just me".


On Mon, Sep 27, 2021 at 7:56 AM Gengliang Wang  wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.2.0.
>
> The vote is open until 11:59pm Pacific time September 29 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.0-rc5 (commit
> 49aea14c5afd93ae1b9d19b661cc273a557853f5):
> https://github.com/apache/spark/tree/v3.2.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1392
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-docs/
>
> The list of bug fixes going into 3.2.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>
> This release is using the release script of the tag v3.2.0-rc5.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.0?
> ===
> The current list of open tickets targeted at 3.2.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] Release Spark 3.2.0 (RC5)

2021-09-27 Thread Sean Owen
Another "is anyone else seeing this"? in compiling common/yarn-network:

[ERROR] [Error]
/mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
package com.google.common.annotations does not exist
[ERROR] [Error]
/mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
package com.google.common.base does not exist
[ERROR] [Error]
/mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
package com.google.common.collect does not exist
...

I didn't see this in RC4, so, I wonder if a recent change affected
something, but there are barely any changes since RC4. Anything touching
YARN or Guava maybe, like:
https://github.com/apache/spark/commit/540e45c3cc7c64e37aa5c1673c03a0f2d7462878
?



On Mon, Sep 27, 2021 at 7:56 AM Gengliang Wang  wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.2.0.
>
> The vote is open until 11:59pm Pacific time September 29 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.0-rc5 (commit
> 49aea14c5afd93ae1b9d19b661cc273a557853f5):
> https://github.com/apache/spark/tree/v3.2.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1392
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-docs/
>
> The list of bug fixes going into 3.2.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>
> This release is using the release script of the tag v3.2.0-rc5.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.0?
> ===
> The current list of open tickets targeted at 3.2.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] Release Spark 3.2.0 (RC5)

2021-09-27 Thread Sean Owen
I'm building and testing with

mvn -Phadoop-3.2 -Phive -Phive-2.3 -Phive-thriftserver -Pkinesis-asl
-Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl
-Psparkr -Pyarn ...

I did a '-DskipTests clean install' and then 'test'; the problem arises
only in 'test'.

On Mon, Sep 27, 2021 at 6:58 PM Chao Sun  wrote:

> Hmm it may be related to the commit. Sean: how do I reproduce this?
>
> On Mon, Sep 27, 2021 at 4:56 PM Sean Owen  wrote:
>
>> Another "is anyone else seeing this"? in compiling common/yarn-network:
>>
>> [ERROR] [Error]
>> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
>> package com.google.common.annotations does not exist
>> [ERROR] [Error]
>> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
>> package com.google.common.base does not exist
>> [ERROR] [Error]
>> /mnt/data/testing/spark-3.2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
>> package com.google.common.collect does not exist
>> ...
>>
>> I didn't see this in RC4, so, I wonder if a recent change affected
>> something, but there are barely any changes since RC4. Anything touching
>> YARN or Guava maybe, like:
>>
>> https://github.com/apache/spark/commit/540e45c3cc7c64e37aa5c1673c03a0f2d7462878
>> ?
>>
>>
>>
>> On Mon, Sep 27, 2021 at 7:56 AM Gengliang Wang  wrote:
>>
>>> Please vote on releasing the following candidate as
>>> Apache Spark version 3.2.0.
>>>
>>> The vote is open until 11:59pm Pacific time September 29 and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.2.0-rc5 (commit
>>> 49aea14c5afd93ae1b9d19b661cc273a557853f5):
>>> https://github.com/apache/spark/tree/v3.2.0-rc5
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1392
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc5-docs/
>>>
>>> The list of bug fixes going into 3.2.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>>>
>>> This release is using the release script of the tag v3.2.0-rc5.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.2.0?
>>> ===
>>> The current list of open tickets targeted at 3.2.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.2.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>


Re: [VOTE] Release Spark 3.2.0 (RC6)

2021-09-29 Thread Sean Owen
+1 looks good to me as before, now that a few recent issues are resolved.


On Tue, Sep 28, 2021 at 10:45 AM Gengliang Wang  wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.2.0.
>
> The vote is open until 11:59pm Pacific time September 30 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.0-rc6 (commit
> dde73e2e1c7e55c8e740cb159872e081ddfa7ed6):
> https://github.com/apache/spark/tree/v3.2.0-rc6
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1393
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc6-docs/
>
> The list of bug fixes going into 3.2.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>
> This release is using the release script of the tag v3.2.0-rc6.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.0?
> ===
> The current list of open tickets targeted at 3.2.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>


Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-07 Thread Sean Owen
+1 again. Looks good in Scala 2.12, 2.13, and in Java 11.
I note that the mem requirements for Java 11 tests seem to need to be
increased but we're handling that separately. It doesn't really affect
users.

On Wed, Oct 6, 2021 at 11:49 AM Gengliang Wang  wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.2.0.
>
> The vote is open until 11:59pm Pacific time October 11 and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.2.0-rc7 (commit
> 5d45a415f3a29898d92380380cfd82bfc7f579ea):
> https://github.com/apache/spark/tree/v3.2.0-rc7
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1394
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc7-docs/
>
> The list of bug fixes going into 3.2.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349407
>
> This release is using the release script of the tag v3.2.0-rc7.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.0?
> ===
> The current list of open tickets targeted at 3.2.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE][RESULT] Release Spark 3.2.0 (RC7)

2021-10-17 Thread Sean Owen
That is the final 3.2.0 release. The rest of the release process is still
completing.

On Sun, Oct 17, 2021 at 5:39 AM Alex Ott  wrote:

> Hi
>
> I see Spark 3.2.0 released to Maven Central already:
> https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.12/3.2.0
> - is it RC7?
>
> On Thu, Oct 14, 2021 at 3:13 PM Gengliang Wang  wrote:
>
>> Hi all,
>>
>> FYI the size of the PySpark tarball exceeds the file size limit of PyPI.
>> I am still waiting for the issue
>> https://github.com/pypa/pypi-support/issues/1374 to be resolved.
>>
>> Gengliang
>>
>> On Tue, Oct 12, 2021 at 3:26 PM Bode, Meikel, NMA-CFD <
>> meikel.b...@bertelsmann.de> wrote:
>>
>>> Yes, Gengliang, many thanks.
>>>
>>>
>>>
>>> *From:* Mich Talebzadeh 
>>> *Sent:* Tuesday, October 12, 2021 09:25
>>> *To:* Gengliang Wang 
>>> *Cc:* dev 
>>> *Subject:* Re: [VOTE][RESULT] Release Spark 3.2.0 (RC7)
>>>
>>>
>>>
>>> great work Gengliang. Thanks for your tremendous contribution!
>>>
>>>
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 12 Oct 2021 at 08:15, Gengliang Wang  wrote:
>>>
>>> The vote passes with 28 +1s (10 binding +1s).
>>> Thanks to all who helped with the release!
>>>
>>>
>>>
>>> (* = binding)
>>> +1:
>>>
>>> - Gengliang Wang
>>>
>>> - Michael Heuer
>>>
>>> - Mridul Muralidharan *
>>>
>>> - Sean Owen *
>>>
>>> - Ruifeng Zheng
>>>
>>> - Dongjoon Hyun *
>>>
>>> - Yuming Wang
>>>
>>> - Reynold Xin *
>>>
>>> - Cheng Su
>>>
>>> - Peter Toth
>>>
>>> - Mich Talebzadeh
>>>
>>> - Maxim Gekk
>>>
>>> - Chao Sun
>>>
>>> - Xinli Shang
>>>
>>> - Huaxin Gao
>>>
>>> - Kent Yao
>>>
>>> - Liang-Chi Hsieh *
>>>
>>> - Kousuke Saruta *
>>>
>>> - Ye Zhou
>>>
>>> - Cheng Pan
>>>
>>> - Angers Zhu
>>>
>>> - Wenchen Fan *
>>>
>>> - Holden Karau *
>>>
>>> - Yi Wu
>>>
>>> - Ricardo Almeida
>>>
>>> - DB Tsai *
>>>
>>> - Thomas Graves *
>>>
>>> - Terry Kim
>>>
>>>
>>>
>>> +0: None
>>>
>>> -1: None
>>>
>>>
>
> --
> With best wishes,Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
>


Re: Update Spark 3.3 release window?

2021-10-27 Thread Sean Owen
Seems fine to me - as good a placeholder as anything.
Would that be about time to call 2.x end-of-life?

On Wed, Oct 27, 2021 at 9:36 PM Hyukjin Kwon  wrote:

> Hi all,
>
> Spark 3.2. is out. Shall we update the release window
> https://spark.apache.org/versioning-policy.html?
> I am thinking of Mid March 2022 (5 months after the 3.2 release) for code
> freeze and onward.
>
>


Re: Jira components cleanup

2021-11-15 Thread Sean Owen
Done. Now let's see if that generated 86 update emails!

On Mon, Nov 15, 2021 at 11:03 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

>
> https://issues.apache.org/jira/projects/SPARK?selectedItem=com.atlassian.jira.jira-projects-plugin:components-page
>
> I think the "docs" component should be merged into "Documentation".
>
> Likewise, the "k8" component should be merged into "Kubernetes".
>
> I think anyone can technically update tags, but I think mass retagging
> should be limited to admins (or at least, to someone who got prior approval
> from an admin).
>
> Nick
>
>


Re: Scala 3 support approach

2021-12-03 Thread Sean Owen
I don't think anyone's tested it or tried it, but if it's pretty compatible
with 2.13, it may already work, or mostly.

See my answer below, which still stands: if it's not pretty compatible with
2.13 and needs a new build, this effectively means dropping 2.12 support,
as supporting 3 Scala versions is a bit too much at once.
And the downstream library dependencies are still likely a partial problem.

Have you or anyone interested in this tried it out? that's the best way to
make progress.
I do not think this would go into any Spark release on the horizon.

On Fri, Dec 3, 2021 at 12:04 PM Igor Dvorzhak  wrote:

> Are there any plans to support Scala 3 in the upcoming Spark 3.3 release?
>
> On Sun, Oct 18, 2020 at 11:10 PM Dongjoon Hyun 
> wrote:
>
>> Hi, Koert.
>>
>> We know, welcome, and believe it. However, it's only Scala community's
>> roadmap so far. It doesn't mean Apache Spark supports Scala 3 officially.
>>
>> For example, Apache Spark 3.0.1 supports Scala 2.12.10 but not 2.12.12
>> due to Scala issue.
>>
>> In Apache Spark community, we had better focus on 2.13. After that, we
>> will see what is needed for Scala 3.
>>
>> Bests,
>> Dongjoon.
>>
>> On Sun, Oct 18, 2020 at 1:33 PM Koert Kuipers  wrote:
>>
>>> i think scala 3.0 will be able to use libraries built with Scala 2.13
>>> (as long as they dont use macros)
>>>
>>> see:
>>> https://www.scala-lang.org/2019/12/18/road-to-scala-3.html
>>>
>>> On Sun, Oct 18, 2020 at 9:54 AM Sean Owen  wrote:
>>>
>>>> Spark depends on a number of Scala libraries, so needs them all to
>>>> support version X before Spark can. This only happened for 2.13 about 4-5
>>>> months ago. I wonder if even a fraction of the necessary libraries have 3.0
>>>> support yet?
>>>>
>>>> It can be difficult to test and support multiple Scala versions
>>>> simultaneously. 2.11 has already been dropped and 2.13 is coming, but it
>>>> might be hard to have a code base that works for 2.12, 2.13, and 3.0.
>>>>
>>>> So one dependency could be, when can 2.12 be dropped? And with Spark
>>>> supporting 2.13 only early next year, and user apps migrating over a year
>>>> or more, it seems difficult to do that anytime soon.
>>>>
>>>> I think Spark 3 support is eventually desirable, so maybe the other way
>>>> to resolve that is to show that Spark 3 support doesn't interfere much with
>>>> maintenance of 2.12/2.13 support. I am a little bit skeptical of it, just
>>>> because the 2.11->2.12 and 2.12->2.13 changes were fairly significant, let
>>>> alone 2.13->3.0 I'm sure, but I don't know.
>>>>
>>>> That is, if we start to have to implement workarounds are parallel code
>>>> trees and so on for 3.0 support, and if it can't be completed for a while
>>>> to come because of downstream dependencies, then it may not be worth
>>>> iterating in the code base yet or even considering.
>>>>
>>>> You can file an umbrella JIRA to track it, yes, with a possible target
>>>> of Spark 4.0. Non-intrusive changes can go in anytime. We may not want to
>>>> get into major ones until later.
>>>>
>>>> On Sat, Oct 17, 2020 at 8:49 PM gemelen  wrote:
>>>>
>>>>> Hi all!
>>>>>
>>>>> I'd like to ask for an opinion and discuss the next thing:
>>>>> at this moment in general Spark could be built with Scala 2.11 and
>>>>> 2.12 (mostly), and close to the point to have support for Scala 2.13. On
>>>>> the other hand, Scala 3 is going into the pre-release phase (with 3.0.0-M1
>>>>> released at the beginning of October).
>>>>>
>>>>> Previously, support of the current Scala version by Spark was a bit
>>>>> behind of desired state, dictated by all circumstances. To move things
>>>>> differently with Scala 3 I'd like to contribute my efforts (and help 
>>>>> others
>>>>> if there would be any) to support it starting as soon as possible (ie to
>>>>> have Spark build compiled with Scala 3 and to have release artifacts when
>>>>> it would be possible).
>>>>>
>>>>> I suggest that it would require to add an experimental profile to the
>>>>> build file so further changes to compile, test and run other tasks could 
>>>>> be
>>>>> don

Re: Time for Spark 3.2.1?

2021-12-06 Thread Sean Owen
Always fine by me if someone wants to roll a release.

It's been ~6 months since the last 3.0.x and 3.1.x releases, too; a new
release of those wouldn't hurt either, if any of our release managers have
the time or inclination. 3.0.x is reaching unofficial end-of-life around
now anyway.


On Mon, Dec 6, 2021 at 6:55 PM Hyukjin Kwon  wrote:

> Hi all,
>
> It's been two months since Spark 3.2.0 release, and we have resolved many
> bug fixes and regressions. What do you guys think about rolling Spark 3.2.1
> release?
>
> cc @huaxin gao  FYI who I happened to overhear
> that is interested in rolling the maintenance release :-).
>


Re: Log4j 1.2.17 spark CVE

2021-12-12 Thread Sean Owen
Check the CVE - the log4j vulnerability appears to affect log4j 2, not 1.x.
There was mention that it could affect 1.x when used with JNDI or JMS
handlers, but Spark does neither. (unless anyone can think of something I'm
missing, but never heard or seen that come up at all in 7 years in Spark)

The big issue would be applications that themselves configure log4j 2.x,
but that's not a Spark issue per se.
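For anyone who wants to confirm which logging backend their own deployment actually binds to, a quick check from spark-shell along these lines works; it uses only the standard SLF4J API, and the factory class names in the comments are the usual log4j 1.x and 2.x bindings.

```
// paste into spark-shell: report the SLF4J binding that is actually in use
import org.slf4j.LoggerFactory

val binding = LoggerFactory.getILoggerFactory.getClass.getName
println(s"SLF4J bound to: $binding")
// org.slf4j.impl.Log4jLoggerFactory            -> log4j 1.x binding
// org.apache.logging.slf4j.Log4jLoggerFactory  -> log4j 2.x binding
```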

On Sun, Dec 12, 2021 at 10:46 PM Pralabh Kumar 
wrote:

> Hi developers,  users
>
> Spark is built using log4j 1.2.17 . Is there a plan to upgrade based on
> recent CVE detected ?
>
>
> Regards
> Pralabh kumar
>


Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Sean Owen
This has come up several times over years - search JIRA. The very short
summary is: Spark does not use log4j 1.x, but its dependencies do, and
that's the issue.
Anyone that can successfully complete the surgery at this point is welcome
to, but I failed ~2 years ago.

On Mon, Dec 13, 2021 at 10:02 AM Jörn Franke  wrote:

> Is it in any case appropriate to use log4j 1.x which is not maintained
> anymore and has other security vulnerabilities which won’t be fixed anymore
> ?
>
> On Dec 13, 2021, at 6:06 AM, Sean Owen  wrote:
>
> 
> Check the CVE - the log4j vulnerability appears to affect log4j 2, not
> 1.x. There was mention that it could affect 1.x when used with JNDI or JMS
> handlers, but Spark does neither. (unless anyone can think of something I'm
> missing, but never heard or seen that come up at all in 7 years in Spark)
>
> The big issue would be applications that themselves configure log4j 2.x,
> but that's not a Spark issue per se.
>
> On Sun, Dec 12, 2021 at 10:46 PM Pralabh Kumar 
> wrote:
>
>> Hi developers,  users
>>
>> Spark is built using log4j 1.2.17 . Is there a plan to upgrade based on
>> recent CVE detected ?
>>
>>
>> Regards
>> Pralabh kumar
>>
>


Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread Sean Owen
You would want to shade this dependency in your app, in which case you
would be using log4j 2. If you don't shade and just include it, you will
also be using log4j 2 as some of the API classes are different. If they
overlap with log4j 1, you will probably hit errors anyway.
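A sketch of the shading approach, assuming the application is built with sbt-assembly; the relocation prefix "myapp.shaded" is made up for this example, and wiring up configuration for the relocated log4j 2 classes is left out.

```
// build.sbt (sbt-assembly plugin enabled): relocate the app's log4j 2.x
// classes so they cannot clash with the logging classes on Spark's classpath
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("org.apache.logging.log4j.**" -> "myapp.shaded.log4j.@1").inAll
)
```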

On Mon, Dec 13, 2021 at 6:33 PM James Yu  wrote:

> Question: Spark use log4j 1.2.17, if my application jar contains log4j 2.x
> and gets submitted to the Spark cluster.  Which version of log4j gets
> actually used during the Spark session?
> --
> *From:* Sean Owen 
> *Sent:* Monday, December 13, 2021 8:25 AM
> *To:* Jörn Franke 
> *Cc:* Pralabh Kumar ; dev ;
> user.spark 
> *Subject:* Re: Log4j 1.2.17 spark CVE
>
> This has come up several times over years - search JIRA. The very short
> summary is: Spark does not use log4j 1.x, but its dependencies do, and
> that's the issue.
> Anyone that can successfully complete the surgery at this point is welcome
> to, but I failed ~2 years ago.
>
> On Mon, Dec 13, 2021 at 10:02 AM Jörn Franke  wrote:
>
> Is it in any case appropriate to use log4j 1.x which is not maintained
> anymore and has other security vulnerabilities which won’t be fixed anymore
> ?
>
> On Dec 13, 2021, at 6:06 AM, Sean Owen  wrote:
>
> 
> Check the CVE - the log4j vulnerability appears to affect log4j 2, not
> 1.x. There was mention that it could affect 1.x when used with JNDI or JMS
> handlers, but Spark does neither. (unless anyone can think of something I'm
> missing, but never heard or seen that come up at all in 7 years in Spark)
>
> The big issue would be applications that themselves configure log4j 2.x,
> but that's not a Spark issue per se.
>
> On Sun, Dec 12, 2021 at 10:46 PM Pralabh Kumar 
> wrote:
>
> Hi developers,  users
>
> Spark is built using log4j 1.2.17 . Is there a plan to upgrade based on
> recent CVE detected ?
>
>
> Regards
> Pralabh kumar
>
>


Re: Log4j 1.2.17 spark CVE

2021-12-14 Thread Sean Owen
FWIW here is the Databricks statement on it. Not the same as Spark but
includes Spark of course.

https://databricks.com/blog/2021/12/13/log4j2-vulnerability-cve-2021-44228-research-and-assessment.html

Yes the question is almost surely more whether user apps are affected, not
Spark itself.

On Tue, Dec 14, 2021, 7:55 AM Steve Loughran 
wrote:

> log4j 1.2.17 is not vulnerable. There is an existing CVE there from a log
> aggregation servlet; Cloudera products ship a patched release with that
> servlet stripped...asf projects are not allowed to do that.
>
> But: some recent Cloudera Products do include log4j 2.x, so colleagues of
> mine are busy patching and retesting everything. If anyone replaces the
> vulnerable jars themselves, remember to look in spark.tar.gz on hdfs to
> make sure it is safe.
>
>
> hadoop stayed on log4j 1.2.17 because 2.x
> * would have broken all cluster management tools which configured
> log4j.properties files
> * wouldn't let us use System properties to configure logging... That is
> really useful when you want to run a job with debug logging
> * didn't support the no capture we use in mockito and functional tests
>
> But: the SLF4J it's used throughout; spark doesn't need to be held back by
> that choice and can use any backend you want
>
> I don't know what we will do now; akira has just suggested logback
> https://issues.apache.org/jira/browse/HADOOP-12956
>
> had I not just broken a collar bone and so been unable to code, I would have
> added a new command to audit the Hadoop classpath to verify it wasn't
> vulnerable. Someone could do the same for Spark - where you would want an
> RDD where the probe would also take place in worker tasks to validate the
> cluster safety more broadly, including the tarball.
>
> meanwhile, if your product is not exposed -probably worth mentioning on
> the users mailing list so as to help people focus their attention. It's
> probably best to work with everyone who produces spark based Products so
> that you can have a single summary.
>
> On Tue, 14 Dec 2021 at 01:31, Qian Sun  wrote:
>
>> My understanding is that we don’t need to do anything. Log4j2-core is not
>> used in Spark.
>>
>> > On Dec 13, 2021, at 12:45 PM, Pralabh Kumar  wrote:
>> >
>> > Hi developers,  users
>> >
>> > Spark is built using log4j 1.2.17 . Is there a plan to upgrade based on
>> recent CVE detected ?
>> >
>> >
>> > Regards
>> > Pralabh kumar
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Sean Owen
It might imply that this is a way to fund Spark alone, and it isn't.
Probably no big deal either way but maybe not worth it. It won't be a
mystery how to find and fund the ASF for the few orgs that want to, as
compared to a small project.

On Wed, Dec 15, 2021, 8:34 AM Maciej  wrote:

> Hi All,
>
> Just wondering ‒ would it make sense to add .github/FUNDING.yml with
> custom link pointing to one (or both) of these:
>
>- https://www.apache.org/foundation/sponsorship.html
>- https://www.apache.org/foundation/contributing.html
>
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>


Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Sean Owen
Parquet or ORC already have the necessary stats to make this fast, but that
only helps if you want the median of sorted data as stored on disk, rather
than the general case. I'm not sure you can do better than roughly what a sort
entails if you want the exact median.
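For reference, the approximate route suggested further down the thread is a one-liner; a hedged sketch, with a made-up table and column name, assuming an existing SparkSession as in spark-shell:

```
// approximate median via percentile_approx (cheap, single pass);
// the third argument trades memory for accuracy of the estimate
import org.apache.spark.sql.functions.expr

spark.table("events")
  .agg(expr("percentile_approx(latency_ms, 0.5, 10000)").as("approx_median"))
  .show()
```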

On Wed, Dec 15, 2021, 8:56 AM Pol Santamaria  wrote:

> Correct me if I am wrong, but If the dataset was indexed by the given
> column, you could get the median without reading the whole dataset,
> shuffling, and so on. Disclaimer (I work in Qbeast). So the issue is more
> on the data format and the possibility to push down the operation to the
> data source.
>
> On our side, we are working on an open data format that supports indexing
> and efficient sampling on data lakes (Qbeast Format), but I also know about
> other initiatives (Microsoft Hyperspace) to allow consuming indexed
> datasets with Apache Spark.
>
> If you are interested in experimenting with the median aggregate, I have
> some ideas on how to implement it for the Spark data source of Qbeast
> Format in an efficient way.
>
> [Qbeast-spark] https://github.com/Qbeast-io/qbeast-spark
> [Microsoft Hyperspace] https://github.com/microsoft/hyperspace
>
> Bests,
>
> Pol Santamaria
>
>
> On Tue, Dec 14, 2021 at 4:42 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Yeah, I think approximate percentile is good enough most of the time.
>>
>> I don't have a specific need for a precise median. I was interested in
>> implementing it more as a Catalyst learning exercise, but it turns out I
>> picked a bad learning exercise to solve. :)
>>
>> On Mon, Dec 13, 2021 at 9:46 PM Reynold Xin  wrote:
>>
>>> tl;dr: there's no easy way to implement aggregate expressions that'd
>>> require multiple pass over data. It is simply not something that's
>>> supported and doing so would be very high cost.
>>>
>>> Would you be OK using approximate percentile? That's relatively cheap.
>>>
>>>
>>>
>>> On Mon, Dec 13, 2021 at 6:43 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 No takers here? :)

 I can see now why a median function is not available in most data
 processing systems. It's pretty annoying to implement!

 On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> I'm trying to create a new aggregate function. It's my first time
> working with Catalyst, so it's exciting---but I'm also in a bit over my
> head.
>
> My goal is to create a function to calculate the median
> .
>
> As a very simple solution, I could just define median to be an alias
> of `Percentile(col, 0.5)`. However, the leading comment on the
> Percentile expression
> 
> highlights that it's very memory-intensive and can easily lead to
> OutOfMemory errors.
>
> So instead of using Percentile, I'm trying to create an Expression
> that calculates the median without needing to hold everything in memory at
> once. I'm considering two different approaches:
>
> 1. Define Median as a combination of existing expressions: The median
> can perhaps be built out of the existing expressions for Count
> 
> and NthValue
> 
> .
>
> I don't see a template I can follow for building a new expression out
> of existing expressions (i.e. without having to implement a bunch of
> methods for DeclarativeAggregate or ImperativeAggregate). I also don't 
> know
> how I would wrap NthValue to make it usable as a regular aggregate
> function. The wrapped NthValue would need an implicit window that provides
> the necessary ordering.
>
>
> Is there any potential to this idea? Any pointers on how to implement
> it?
>
>
> 2. Another memory-light approach to calculating the median requires
> multiple passes over the data to converge on the answer. The approach is 
> described
> here
> .
> (I posted a sketch implementation of this approach using Spark's 
> user-level
> API here
> 
> .)
>
> I am 

Re: spark jdbc

2021-12-17 Thread Sean Owen
I'm not sure we want to do that. If you "SELECT foo AS bar", then the
column name is foo but the column label is bar. We probably want to return
the latter.
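To make the distinction concrete, a small JDBC sketch; behaviour varies by driver, but per the JDBC spec the label reflects the alias while the name is the underlying column. The connection URL and table are placeholders.

```
// illustrates getColumnName vs getColumnLabel for an aliased column
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
val rs   = conn.createStatement().executeQuery("SELECT foo AS bar FROM t")
val md   = rs.getMetaData
println(md.getColumnName(1))   // typically "foo" -- the underlying column
println(md.getColumnLabel(1))  // typically "bar" -- the alias / display title
conn.close()
```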

On Fri, Dec 17, 2021 at 9:07 AM Gary Liu  wrote:

> In spark sql jdbc module, it's using getColumnLabel to get column names
> from the remote database, but in some databases, like SAS, it returns
> column description instead. Should getColumnName be used?
>
> This is from the SAS technical support:
>
> In the documentation,
> https://docs.oracle.com/javase/7/docs/api/java/sql/ResultSetMetaData.html
> (we adhere to the JDBC spec for the driver code )
>
>
>
> getColumnLabel() Gets the designated column's suggested title for use in
> printouts and displays.
>
> getColumnName() Get the designated column's name.
>
>
>
> In the spark code, they use
>
>
>
> while (i < ncols) {
>
>   val columnName = rsmd.getColumnLabel(i + 1)
>
>
>
> The appropriate method should be rsmd.getColumnName(i+1).
>
> --
> Gary Liu
>


Re: ivy unit test case filing for Spark

2021-12-21 Thread Sean Owen
You would have to make it available? This doesn't seem like a spark issue.

On Tue, Dec 21, 2021, 10:48 AM Pralabh Kumar  wrote:

> Hi Spark Team
>
> I am building Spark inside a VPN, but the unit test case below is failing.
> It points to an Ivy location which cannot be reached from within the VPN. Any
> help would be appreciated.
>
> test("SPARK-33084: Add jar support Ivy URI -- default transitive = true")
> {
>   *sc *= new SparkContext(new 
> SparkConf().setAppName("test").setMaster("local-cluster[3,
> 1, 1024]"))
>   *sc*.addJar("*ivy://org.apache.hive:hive-storage-api:2.7.0*")
>   assert(*sc*.listJars().exists(_.contains(
> "org.apache.hive_hive-storage-api-2.7.0.jar")))
>   assert(*sc*.listJars().exists(_.contains(
> "commons-lang_commons-lang-2.6.jar")))
> }
>
> Error
>
> - SPARK-33084: Add jar support Ivy URI -- default transitive = true ***
> FAILED ***
> java.lang.RuntimeException: [unresolved dependency:
> org.apache.hive#hive-storage-api;2.7.0: not found]
> at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(
> SparkSubmit.scala:1447)
> at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
> DependencyUtils.scala:185)
> at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(
> DependencyUtils.scala:159)
> at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
> at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
> at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.
> scala:1041)
> at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
>
> Regards
> Pralabh Kumar
>
>
>


Re: About contribution

2022-01-05 Thread Sean Owen
(There is no project chat)
See https://spark.apache.org/contributing.html

On Tue, Jan 4, 2022 at 11:42 PM Dennis Jung  wrote:

> Hello, I hope this is not a silly question.
> (I couldn't find any chat room for the Spark project, so I'm asking by mail.)
>
> It has been about a year since I started using Spark at work, and I would
> like to make a contribution to this project.
>
> I'm currently looking at the documentation in more detail, and checking the
> issues in JIRA now. Do you have any suggestions for reviewing the code?
>
> - Which part of the code would be a good place to start?
> - What would be most helpful for the project?
>
> Thanks.
>


Re: [VOTE] Release Spark 3.2.1 (RC1)

2022-01-11 Thread Sean Owen
+1 looks good to me. I ran all tests with scala 2.12 and 2.13 and had the
same results as 3.2.0 testing.

On Mon, Jan 10, 2022 at 12:10 PM huaxin gao  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.1.
>
> The vote is open until Jan. 13th at 12 PM PST (8 PM UTC) and passes if a
> majority
> +1 PMC votes are cast, with a minimum of 3 + 1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.2.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 3.2.1 (try project = SPARK AND
> "Target Version/s" = "3.2.1" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v3.2.1-rc1 (commit
> 2b0ee226f8dd17b278ad11139e62464433191653):
> https://github.com/apache/spark/tree/v3.2.1-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1395/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc1-docs/
>
> The list of bug fixes going into 3.2.1 can be found at the following URL:
> https://s.apache.org/7tzik
>
> This release is using the release script of the tag v3.2.1-rc1.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.2.1?
> ===
>
> The current list of open tickets targeted at 3.2.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: Spark on Oracle available as an Apache licensed open source repo

2022-01-13 Thread Sean Owen
-user
Thank you for this, but just a small but important point about the use of
the Spark name. Please take a look at
https://spark.apache.org/trademarks.html
Specifically, this should reference "Apache Spark" at least once
prominently with a link to the project.
It's also advisable to avoid using "Spark" in a project or product name
entirely. "Oracle Translator for Apache Spark" or something like that would
be more in line with trademark guidance.

On Thu, Jan 13, 2022 at 6:50 PM Harish Butani 
wrote:

> Spark on Oracle is now available as an open source Apache licensed github
> repo. Build and deploy it as an extension jar in your Spark clusters.
>
> Use it to combine Apache Spark programs with data in your existing Oracle
> databases without expensive data copying or query time data movement.
>
> The core capability is Optimizer extensions that collapse SQL operator
> sub-graphs to an OraScan that executes equivalent SQL in Oracle. Physical
> plan parallelism can be controlled to split Spark tasks to operate on
> Oracle data block ranges, or on resultset pages or on table partitions.
>
> We pushdown large parts of Spark SQL to Oracle, for example 95 of 99 TPCDS
> queries are completely pushed to Oracle.
>
> With Spark SQL macros you can write custom Spark UDFs that get translated
> and pushed as Oracle SQL expressions.
>
> With DML pushdown, inserts in Spark SQL get pushed as transactionally
> consistent inserts/updates on Oracle tables.
>
> See the Quick Start Guide on how to set up an Oracle free tier ADW
> instance, load it with TPCDS data and try out the Spark on Oracle Demo on
> your Spark cluster.
>
> More details can be found in our blog and the project wiki.
>
> regards,
> Harish Butani
>


Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Sean Owen
(Are you suggesting this is a regression, or is it a general question? Here
we're trying to figure out whether there are critical bugs introduced in
3.2.1 vs 3.2.0.)

On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen 
wrote:

> Hi, I am wondering if it's a bug or not.
>
> I do have a lot of json files, where they have some columns that are all
> "null" on.
>
> I start spark with
>
> from pyspark import pandas as ps
> import re
> import numpy as np
> import os
> import pandas as pd
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
> from pyspark.sql.types import StructType, StructField,
> StringType,IntegerType
>
> os.environ["PYARROW_IGNORE_TIMEZONE"]="1"
>
> def get_spark_session(app_name: str, conf: SparkConf):
> conf.setMaster('local[*]')
> conf \
>   .set('spark.driver.memory', '64g')\
>   .set("fs.s3a.access.key", "minio") \
>   .set("fs.s3a.secret.key", "") \
>   .set("fs.s3a.endpoint", "http://192.168.1.127:9000";) \
>   .set("spark.hadoop.fs.s3a.impl",
> "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>   .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>   .set("spark.sql.repl.eagerEval.enabled", "True") \
>   .set("spark.sql.adaptive.enabled", "True") \
>   .set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer") \
>   .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
>   .set("sc.setLogLevel", "error")
>
> return
> SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>
> spark = get_spark_session("Falk", SparkConf())
>
> d3 =
> spark.read.option("multiline","true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>
> import pyspark
> def sparkShape(dataFrame):
> return (dataFrame.count(), len(dataFrame.columns))
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
>
>
> (653610, 267)
>
>
> d3.write.json("d3.json")
>
>
> d3 = spark.read.json("d3.json/*.json")
>
> import pyspark
> def sparkShape(dataFrame):
> return (dataFrame.count(), len(dataFrame.columns))
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
>
> (653610, 186)
>
>
> So spark is deleting 81 columns. I think that all of these 81 deleted
> columns have only Null in them.
>
> Is this a bug, or is this behaviour intentional?
>
>
> On Fri, Jan 21, 2022 at 4:59 AM huaxin gao  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.2.1. The vote is open until 8:00pm Pacific time January 25 and passes if
>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.2.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.2.1-rc2 (commit
>> 4f25b3f71238a00508a356591553f2dfa89f8290):
>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1398/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>>
>> The list of bug fixes going into 3.2.1 can be found at the following URL:
>> https://s.apache.org/yu0cy
>>
>> This release is using the release script of the tag v3.2.1-rc2.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.2.1?
>> ===
>> The current list of open tickets targeted at 3.2.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.2.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Sean Owen
(Bjorn - unless this is a regression, it would not block a release, even if
it's a bug)
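If the columns are disappearing because the JSON writer drops null-valued fields by default (the ignoreNullFields option), then all-null columns never make it into the written files and schema inference cannot recover them on read. Assuming that is indeed the cause, a hedged sketch of two workarounds for the round trip described below:

```
// df stands in for the original DataFrame (d3 in the example below)

// option 1: keep null fields in the output so all-null columns survive
df.write.option("ignoreNullFields", "false").json("d3_with_nulls.json")

// option 2: re-read with the original DataFrame's schema instead of inference
val restored = spark.read.schema(df.schema).json("d3.json")
```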

On Fri, Jan 21, 2022 at 5:09 PM Bjørn Jørgensen 
wrote:

> [x] -1 Do not release this package because it deletes all my columns that
> contain only null.
>
> I have opened https://issues.apache.org/jira/browse/SPARK-37981 for this
> bug.
>
>
>
>
> fre. 21. jan. 2022 kl. 21:45 skrev Sean Owen :
>
>> (Are you suggesting this is a regression, or is it a general question?
>> here we're trying to figure out whether there are critical bugs introduced
>> in 3.2.1 vs 3.2.0)
>>
>> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen 
>> wrote:
>>
>>> Hi, I am wondering if it's a bug or not.
>>>
>>> I do have a lot of json files, where they have some columns that are all
>>> "null" on.
>>>
>>> I start spark with
>>>
>>> from pyspark import pandas as ps
>>> import re
>>> import numpy as np
>>> import os
>>> import pandas as pd
>>>
>>> from pyspark import SparkContext, SparkConf
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
>>> from pyspark.sql.types import StructType, StructField,
>>> StringType,IntegerType
>>>
>>> os.environ["PYARROW_IGNORE_TIMEZONE"]="1"
>>>
>>> def get_spark_session(app_name: str, conf: SparkConf):
>>> conf.setMaster('local[*]')
>>> conf \
>>>   .set('spark.driver.memory', '64g')\
>>>   .set("fs.s3a.access.key", "minio") \
>>>   .set("fs.s3a.secret.key", "") \
>>>   .set("fs.s3a.endpoint", "http://192.168.1.127:9000";) \
>>>   .set("spark.hadoop.fs.s3a.impl",
>>> "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>>   .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>>   .set("spark.sql.repl.eagerEval.enabled", "True") \
>>>   .set("spark.sql.adaptive.enabled", "True") \
>>>   .set("spark.serializer",
>>> "org.apache.spark.serializer.KryoSerializer") \
>>>   .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
>>>   .set("sc.setLogLevel", "error")
>>>
>>> return
>>> SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>>
>>> spark = get_spark_session("Falk", SparkConf())
>>>
>>> d3 =
>>> spark.read.option("multiline","true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>>
>>> import pyspark
>>> def sparkShape(dataFrame):
>>> return (dataFrame.count(), len(dataFrame.columns))
>>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>>> print(d3.shape())
>>>
>>>
>>> (653610, 267)
>>>
>>>
>>> d3.write.json("d3.json")
>>>
>>>
>>> d3 = spark.read.json("d3.json/*.json")
>>>
>>> import pyspark
>>> def sparkShape(dataFrame):
>>> return (dataFrame.count(), len(dataFrame.columns))
>>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>>> print(d3.shape())
>>>
>>> (653610, 186)
>>>
>>>
>>> So spark is deleting 81 columns. I think that all of these 81 deleted
>>> columns have only Null in them.
>>>
>>> Is this a bug or has this been made on purpose?
>>>
>>>
>>> fre. 21. jan. 2022 kl. 04:59 skrev huaxin gao :
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 3.2.1. The vote is open until 8:00pm Pacific time January 25 and
>>>> passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [
>>>> ] +1 Release this package as Apache Spark 3.2.1[ ] -1 Do not release
>>>> this package because ... To learn more about Apache Spark, please see
>>>> http://spark.apache.org/ The tag to be voted on is v3.2.1-rc2 (commit
>>>> 4f25b3f71238a00508a356591553f2dfa89f8290):
>>>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
&

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Sean Owen
Continue on the ticket - I am not sure this is established. We would block
a release for critical problems that are not regressions. This is not a
data loss / 'deleting data' issue even if valid.
You're welcome to provide feedback but votes are for the PMC.
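
For what it's worth, a plausible (unconfirmed) explanation: by default the JSON
writer drops null-valued fields (the "ignoreNullFields" write option and the
spark.sql.jsonGenerator.ignoreNullFields config default to true in 3.x), so an
all-null column never appears in the written files, and schema inference cannot
recover it on re-read. A rough sketch of two workarounds, assuming d3 and spark
are the objects from your snippet, with d3 being the original DataFrame before
the write:

```python
# Hedged sketch, not a confirmed diagnosis of SPARK-37981; "d3", "spark" and the
# paths below are assumptions based on the earlier snippet.

# 1) Keep null fields in the written JSON so inference can still see every column.
d3.write.option("ignoreNullFields", "false").json("d3_with_nulls.json")

# 2) Or re-read using the original schema instead of relying on schema inference;
#    fields missing from the files simply come back as nulls.
d3_back = spark.read.schema(d3.schema).json("d3.json/*.json")
print((d3_back.count(), len(d3_back.columns)))  # expect the full column count again
```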

On Fri, Jan 21, 2022 at 5:24 PM Bjørn Jørgensen 
wrote:

> Ok, but deleting users' data without them knowing it is never a good idea.
> That's why I give this RC -1.
>
> lør. 22. jan. 2022 kl. 00:16 skrev Sean Owen :
>
>> (Bjorn - unless this is a regression, it would not block a release, even
>> if it's a bug)
>>
>> On Fri, Jan 21, 2022 at 5:09 PM Bjørn Jørgensen 
>> wrote:
>>
>>> [x] -1 Do not release this package because, deletes all my columns with
>>> only Null in it.
>>>
>>> I have opened https://issues.apache.org/jira/browse/SPARK-37981 for
>>> this bug.
>>>
>>>
>>>
>>>
>>> fre. 21. jan. 2022 kl. 21:45 skrev Sean Owen :
>>>
>>>> (Are you suggesting this is a regression, or is it a general question?
>>>> here we're trying to figure out whether there are critical bugs introduced
>>>> in 3.2.1 vs 3.2.0)
>>>>
>>>> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen <
>>>> bjornjorgen...@gmail.com> wrote:
>>>>
>>>>> Hi, I am wondering if it's a bug or not.
>>>>>
>>>>> I do have a lot of json files, where they have some columns that are
>>>>> all "null" on.
>>>>>
>>>>> I start spark with
>>>>>
>>>>> from pyspark import pandas as ps
>>>>> import re
>>>>> import numpy as np
>>>>> import os
>>>>> import pandas as pd
>>>>>
>>>>> from pyspark import SparkContext, SparkConf
>>>>> from pyspark.sql import SparkSession
>>>>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim,
>>>>> expr
>>>>> from pyspark.sql.types import StructType, StructField,
>>>>> StringType,IntegerType
>>>>>
>>>>> os.environ["PYARROW_IGNORE_TIMEZONE"]="1"
>>>>>
>>>>> def get_spark_session(app_name: str, conf: SparkConf):
>>>>> conf.setMaster('local[*]')
>>>>> conf \
>>>>>   .set('spark.driver.memory', '64g')\
>>>>>   .set("fs.s3a.access.key", "minio") \
>>>>>   .set("fs.s3a.secret.key", "") \
>>>>>   .set("fs.s3a.endpoint", "http://192.168.1.127:9000";) \
>>>>>   .set("spark.hadoop.fs.s3a.impl",
>>>>> "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>>>>   .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>>>>   .set("spark.sql.repl.eagerEval.enabled", "True") \
>>>>>   .set("spark.sql.adaptive.enabled", "True") \
>>>>>   .set("spark.serializer",
>>>>> "org.apache.spark.serializer.KryoSerializer") \
>>>>>   .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
>>>>>   .set("sc.setLogLevel", "error")
>>>>>
>>>>> return
>>>>> SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>>>>
>>>>> spark = get_spark_session("Falk", SparkConf())
>>>>>
>>>>> d3 =
>>>>> spark.read.option("multiline","true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>>>>
>>>>> import pyspark
>>>>> def sparkShape(dataFrame):
>>>>> return (dataFrame.count(), len(dataFrame.columns))
>>>>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>>>>> print(d3.shape())
>>>>>
>>>>>
>>>>> (653610, 267)
>>>>>
>>>>>
>>>>> d3.write.json("d3.json")
>>>>>
>>>>>
>>>>> d3 = spark.read.json("d3.json/*.json")
>>>>>
>>>>> import pyspark
>>>>> def sparkShape(dataFrame):
>>>>> return (dataFrame.count(), len(dataFrame.columns))
>>>>> pyspark.sql.dataframe.DataFram

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-21 Thread Sean Owen
+1 with same result as last time.

On Thu, Jan 20, 2022 at 9:59 PM huaxin gao  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.2.1. The vote is open until 8:00pm Pacific time January 25 and passes if
> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1
> Release this package as Apache Spark 3.2.1[ ] -1 Do not release this
> package because ... To learn more about Apache Spark, please see
> http://spark.apache.org/ The tag to be voted on is v3.2.1-rc2 (commit
> 4f25b3f71238a00508a356591553f2dfa89f8290):
> https://github.com/apache/spark/tree/v3.2.1-rc2
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS The staging repository
> for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1398/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
> The list of bug fixes going into 3.2.1 can be found at the following URL:
> https://s.apache.org/yu0cy
>
> This release is using the release script of the tag v3.2.1-rc2. FAQ
> = How can I help test this release?
> = If you are a Spark user, you can help us test
> this release by taking an existing Spark workload and running on this
> release candidate, then reporting any regressions. If you're working in
> PySpark you can set up a virtual env and install the current RC and see if
> anything important breaks, in the Java/Scala you can add the staging
> repository to your projects resolvers and test with the RC (make sure to
> clean up the artifact cache before/after so you don't end up building with
> an out of date RC going forward).
> === What should happen to JIRA
> tickets still targeting 3.2.1? ===
> The current list of open tickets targeted at 3.2.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.2.1 Committers should look at those and triage. Extremely
> important bug fixes, documentation, and API tweaks that impact
> compatibility should be worked on immediately. Everything else please
> retarget to an appropriate release. == But my bug isn't
> fixed? == In order to make timely releases, we will
> typically not hold the release unless the bug in question is a regression
> from the previous release. That being said, if there is something which is
> a regression that has not been correctly targeted please ping me or a
> committer to help target the issue.
>


Re: Log likelhood in GeneralizedLinearRegression

2022-01-22 Thread Sean Owen
This already exists in MulticlassClassificationEvaluator instead
(which can also be used for binary classification). Does that work?
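
A rough sketch of what I mean (the tiny dataset and LogisticRegression model
below are purely illustrative assumptions; the point is the "logLoss" metric,
which is the mean negative log-likelihood computed from the probability column):

```python
# Hedged sketch: "logLoss" in MulticlassClassificationEvaluator (Spark 3.0+) is the
# mean negative log-likelihood, so the total log likelihood is its negation scaled
# by the row count. The data and model here are illustrative assumptions.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("loglik-sketch").getOrCreate()

# Tiny illustrative dataset; replace with your own features/label columns.
df = spark.createDataFrame(
    [(0.0, Vectors.dense(0.1, 1.2)),
     (1.0, Vectors.dense(1.5, 0.3)),
     (0.0, Vectors.dense(0.2, 0.9)),
     (1.0, Vectors.dense(1.1, 0.4))],
    ["label", "features"])

preds = LogisticRegression().fit(df).transform(df)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", probabilityCol="probability", metricName="logLoss")
total_log_likelihood = -evaluator.evaluate(preds) * preds.count()
print(total_log_likelihood)
```

For a GLM with a non-binomial family this wouldn't apply directly, so a dedicated
implementation may still be worth discussing.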

On Sat, Jan 22, 2022 at 4:36 AM Phillip Henry 
wrote:

> Hi,
>
> As far as I know, there is no function to generate the log likelihood from
> a GeneralizedLinearRegression model. Are there any plans to implement one?
>
> I've coded my own in PySpark and in testing it agrees with the values we
> get from the Python library StatsModels to one part in a million. It's
> kinda yucky code as it relies on some inefficient UDFs but I could port it
> to Scala.
>
> Would anybody be interested in me raising a PR and coding an efficient
> Scala implementation that can be called from PySpark?
>
> Regards,
>
> Phillip
>
>


Re: Log4j upgrade in spark binary from 1.2.17 to 2.17.1

2022-01-31 Thread Sean Owen
(BTW you are sending to the Spark incubator list, and Spark has not been in
incubation for about 7 years. Use u...@spark.apache.org)

What update are you looking for? This has been discussed extensively on the
Spark mailing list.
Spark is not evidently vulnerable to this. 3.3.0 will include log4j 2.17
anyway.

The ticket you cite points you to the correct one:
https://issues.apache.org/jira/browse/SPARK-6305

On Mon, Jan 31, 2022 at 10:53 AM KS, Rajabhupati
 wrote:

> Hi Team ,
>
>
>
> Is there any update on this request ?
>
>
>
> We did see Jira https://issues.apache.org/jira/browse/SPARK-37630 for
> this request but we see it closed .
>
>
>
> Regards
>
> Raja
>
>
>
> *From:* KS, Rajabhupati 
> *Sent:* Sunday, January 30, 2022 9:03 AM
> *To:* u...@spark.incubator.apache.org
> *Subject:* Log4j upgrade in spark binary from 1.2.17 to 2.17.1
>
>
>
> Hi Team,
>
>
>
> We were checking for log4j upgrade in Open source spark version to avoid
> the recent vulnerability in the spark binary . Do we have any new release
> which is planned to upgrade the log4j from 1.2.17 to 2.17.1.Any sooner
> response is appreciated ?
>
>
>
>
>
> Regards
>
> Rajabhupati
>


Re: [VOTE] Spark 3.1.3 RC3

2022-02-02 Thread Sean Owen
+1 from me, same result as the last release on my end.
I think releasing 3.1.3 is fine; it's been 7 months since 3.1.2.


On Tue, Feb 1, 2022 at 7:12 PM Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.1.3.
>
> The vote is open until Feb. 4th at 5 PM PST (1 AM UTC + 1 day) and passes
> if a majority
> +1 PMC votes are cast, with a minimum of 3 + 1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.1.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no open issues targeting 3.1.3 in Spark's JIRA
> https://issues.apache.org/jira/browse
> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in (Open,
> Reopened, "In Progress"))
> at https://s.apache.org/n79dw
>
>
>
> The tag to be voted on is v3.1.3-rc3 (commit
> b8c0799a8cef22c56132d94033759c9f82b0cc86):
> https://github.com/apache/spark/tree/v3.1.3-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at
> :https://repository.apache.org/content/repositories/orgapachespark-1400/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc3-docs/
>
> The list of bug fixes going into 3.1.3 can be found at the following URL:
> https://s.apache.org/x0q9b
>
> This release is using the release script in master as
> of ddc77fb906cb3ce1567d277c2d0850104c89ac25
> The release docker container was rebuilt since the previous version didn't
> have the necessary components to build the R documentation.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.1.3?
> ===
>
> The current list of open tickets targeted at 3.1.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.1.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something that is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> ==
> What happened to RC1 & RC2?
> ==
>
> When I first went to build RC1 the build process failed due to the
> lack of the R markdown package in my local rm container. By the time
> I had time to debug and rebuild there was already another bug fix commit in
> branch-3.1 so I decided to skip ahead to RC2 and pick it up directly.
> When I went to go send the RC2 vote e-mail I noticed a correctness issue
> had
> been fixed in branch-3.1 so I rolled RC3 to contain the correctness fix.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Problem building spark-catalyst_2.12 with Maven

2022-02-10 Thread Sean Owen
Yes, I've seen this; the JVM stack size needs to be increased. I'm not sure
if it's env specific (though you and I at least have hit it, and I think
others have too), or whether we need to change our build script.
In the pom.xml file, find the "-Xss..." settings and make them something like
"-Xss4m", and see if that works.

On Thu, Feb 10, 2022 at 8:54 AM Martin Grigorov 
wrote:

> Hi,
>
> I am not able to build Spark due to the following error :
>
> ERROR] ## Exception when compiling 543 sources to
> /home/martin/git/apache/spark/sql/catalyst/target/scala-2.12/classes
> java.lang.BootstrapMethodError: call site initialization exception
> java.lang.invoke.CallSite.makeSite(CallSite.java:341)
>
> java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
>
> java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
> scala.tools.nsc.typechecker.Typers$Typer.typedBlock(Typers.scala:2504)
>
> scala.tools.nsc.typechecker.Typers$Typer.$anonfun$typed1$103(Typers.scala:5711)
>
> scala.tools.nsc.typechecker.Typers$Typer.typedOutsidePatternMode$1(Typers.scala:500)
> scala.tools.nsc.typechecker.Typers$Typer.typed1(Typers.scala:5746)
> scala.tools.nsc.typechecker.Typers$Typer.typed(Typers.scala:5781)
> ...
> Caused by: java.lang.StackOverflowError
> at java.lang.ref.Reference. (Reference.java:303)
> at java.lang.ref.WeakReference. (WeakReference.java:57)
> at
> java.lang.invoke.MethodType$ConcurrentWeakInternSet$WeakEntry.
> (MethodType.java:1269)
> at java.lang.invoke.MethodType$ConcurrentWeakInternSet.get
> (MethodType.java:1216)
> at java.lang.invoke.MethodType.makeImpl (MethodType.java:302)
> at java.lang.invoke.MethodType.dropParameterTypes (MethodType.java:573)
> at java.lang.invoke.MethodType.replaceParameterTypes
> (MethodType.java:467)
> at java.lang.invoke.MethodHandle.asSpreader (MethodHandle.java:875)
> at java.lang.invoke.Invokers.spreadInvoker (Invokers.java:158)
> at java.lang.invoke.CallSite.makeSite (CallSite.java:324)
> at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl
> (MethodHandleNatives.java:307)
> at java.lang.invoke.MethodHandleNatives.linkCallSite
> (MethodHandleNatives.java:297)
> at scala.tools.nsc.typechecker.Typers$Typer.typedBlock
> (Typers.scala:2504)
> at scala.tools.nsc.typechecker.Typers$Typer.$anonfun$typed1$103
> (Typers.scala:5711)
> at scala.tools.nsc.typechecker.Typers$Typer.typedOutsidePatternMode$1
> (Typers.scala:500)
> at scala.tools.nsc.typechecker.Typers$Typer.typed1 (Typers.scala:5746)
> at scala.tools.nsc.typechecker.Typers$Typer.typed (Typers.scala:5781)
>
> I have played a lot with the scala-maven-plugin jvmArg settings at [1] but
> so far nothing helps.
> Same error for Scala 2.12 and 2.13.
>
> The command I use is: ./build/mvn install -Pkubernetes -DskipTests
>
> I need to create a distribution from master branch.
>
> Java: 1.8.0_312
> Maven: 3.8.4
> OS: Ubuntu 21.10
>
> Any hints ?
> Thank you!
>
> 1.
> https://github.com/apache/spark/blob/50256bde9bdf217413545a6d2945d6c61bf4cfff/pom.xml#L2845-L2849
>


Re: Help needed to locate the csv parser (for Spark bug reporting/fixing)

2022-02-10 Thread Sean Owen
It starts in org.apache.spark.sql.execution.datasources.csv.CSVDataSource.
Yes, univocity (the univocity-parsers library) is used for much of the parsing.
I am not sure of the cause of the bug, but it does look like one indeed. In
one case the parser is asked to read all fields, and in the other, to skip one.
The pushdown helps efficiency, but something is going wrong.

On Thu, Feb 10, 2022 at 10:34 AM Marnix van den Broek <
marnix.van.den.br...@bundlesandbatches.io> wrote:

> hi all,
>
> Yesterday I filed a CSV parsing bug [1] for Spark, that leads to data
> incorrectness when data contains sequences similar to the one in the
> report.
>
> I wanted to take a look at the parsing logic to see if I could spot the
> error to update the issue with more information and to possibly contribute
> a PR with a bug fix, but I got completely lost navigating my way down the
> dependencies in the Spark repository. Can someone point me in the right
> direction?
>
> I am looking for the csv parser itself, which is likely a dependency?
>
> The next question might need too much knowledge about Spark internals to
> know where to look or understand what I'd be looking at, but I am also
> looking to see if and why the implementation of the CSV parsing is
> different when columns are projected as opposed to the processing of the
> full dataframe/ The issue only occurs when projecting columns and this
> inconsistency is a worry in itself.
>
> Many thanks,
>
> Marnix
>
> 1. https://issues.apache.org/jira/browse/SPARK-38167
>
>


Re: Problem building spark-catalyst_2.12 with Maven

2022-02-10 Thread Sean Owen
I think it's another occurrence where I had to change that setting or set
MAVEN_OPTS. I think this one occurs in a way that setting doesn't affect,
though I don't quite understand why. Try increasing the stack size in the
test runner configs.

On Thu, Feb 10, 2022, 2:02 PM Martin Grigorov  wrote:

> Hi Sean,
>
> On Thu, Feb 10, 2022 at 5:37 PM Sean Owen  wrote:
>
>> Yes I've seen this; the JVM stack size needs to be increased. I'm not
>> sure if it's env specific (though you and I at least have hit it, I think
>> others), or whether we need to change our build script.
>> In the pom.xml file, find "-Xss..." settings and make them something like
>> "-Xss4m", see if that works.
>>
>
> It is already a much bigger value - 128m (
> https://github.com/apache/spark/blob/50256bde9bdf217413545a6d2945d6c61bf4cfff/pom.xml#L2845
> )
> I've tried smaller and bigger values for all jvmArgs next to this one.
> None helped!
> I also have the feeling it is something in my environment that overrides
> these values but so far I cannot identify anything.
>
>
>
>>
>> On Thu, Feb 10, 2022 at 8:54 AM Martin Grigorov 
>> wrote:
>>
>>> Hi,
>>>
>>> I am not able to build Spark due to the following error :
>>>
>>> ERROR] ## Exception when compiling 543 sources to
>>> /home/martin/git/apache/spark/sql/catalyst/target/scala-2.12/classes
>>> java.lang.BootstrapMethodError: call site initialization exception
>>> java.lang.invoke.CallSite.makeSite(CallSite.java:341)
>>>
>>> java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
>>>
>>> java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
>>> scala.tools.nsc.typechecker.Typers$Typer.typedBlock(Typers.scala:2504)
>>>
>>> scala.tools.nsc.typechecker.Typers$Typer.$anonfun$typed1$103(Typers.scala:5711)
>>>
>>> scala.tools.nsc.typechecker.Typers$Typer.typedOutsidePatternMode$1(Typers.scala:500)
>>> scala.tools.nsc.typechecker.Typers$Typer.typed1(Typers.scala:5746)
>>> scala.tools.nsc.typechecker.Typers$Typer.typed(Typers.scala:5781)
>>> ...
>>> Caused by: java.lang.StackOverflowError
>>> at java.lang.ref.Reference. (Reference.java:303)
>>> at java.lang.ref.WeakReference. (WeakReference.java:57)
>>> at
>>> java.lang.invoke.MethodType$ConcurrentWeakInternSet$WeakEntry.
>>> (MethodType.java:1269)
>>> at java.lang.invoke.MethodType$ConcurrentWeakInternSet.get
>>> (MethodType.java:1216)
>>> at java.lang.invoke.MethodType.makeImpl (MethodType.java:302)
>>> at java.lang.invoke.MethodType.dropParameterTypes
>>> (MethodType.java:573)
>>> at java.lang.invoke.MethodType.replaceParameterTypes
>>> (MethodType.java:467)
>>> at java.lang.invoke.MethodHandle.asSpreader (MethodHandle.java:875)
>>> at java.lang.invoke.Invokers.spreadInvoker (Invokers.java:158)
>>> at java.lang.invoke.CallSite.makeSite (CallSite.java:324)
>>> at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl
>>> (MethodHandleNatives.java:307)
>>> at java.lang.invoke.MethodHandleNatives.linkCallSite
>>> (MethodHandleNatives.java:297)
>>> at scala.tools.nsc.typechecker.Typers$Typer.typedBlock
>>> (Typers.scala:2504)
>>> at scala.tools.nsc.typechecker.Typers$Typer.$anonfun$typed1$103
>>> (Typers.scala:5711)
>>> at
>>> scala.tools.nsc.typechecker.Typers$Typer.typedOutsidePatternMode$1
>>> (Typers.scala:500)
>>> at scala.tools.nsc.typechecker.Typers$Typer.typed1
>>> (Typers.scala:5746)
>>> at scala.tools.nsc.typechecker.Typers$Typer.typed (Typers.scala:5781)
>>>
>>> I have played a lot with the scala-maven-plugin jvmArg settings at [1]
>>> but so far nothing helps.
>>> Same error for Scala 2.12 and 2.13.
>>>
>>> The command I use is: ./build/mvn install -Pkubernetes -DskipTests
>>>
>>> I need to create a distribution from master branch.
>>>
>>> Java: 1.8.0_312
>>> Maven: 3.8.4
>>> OS: Ubuntu 21.10
>>>
>>> Any hints ?
>>> Thank you!
>>>
>>> 1.
>>> https://github.com/apache/spark/blob/50256bde9bdf217413545a6d2945d6c61bf4cfff/pom.xml#L2845-L2849
>>>
>>


Re: [VOTE] Spark 3.1.3 RC4

2022-02-14 Thread Sean Owen
Looks good to me, same results as last RC, +1

On Mon, Feb 14, 2022 at 2:55 PM Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.1.3.
>
> The vote is open until Feb. 18th at 1 PM pacific (9 PM GMT) and passes if
> a majority
> +1 PMC votes are cast, with a minimum of 3 + 1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.1.3
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no open issues targeting 3.1.3 in Spark's JIRA
> https://issues.apache.org/jira/browse
> (try project = SPARK AND "Target Version/s" = "3.1.3" AND status in (Open,
> Reopened, "In Progress"))
> at https://s.apache.org/n79dw
>
>
>
> The tag to be voted on is v3.1.3-rc4 (commit
> d1f8a503a26bcfb4e466d9accc5fa241a7933667):
> https://github.com/apache/spark/tree/v3.1.3-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at
> https://repository.apache.org/content/repositories/orgapachespark-1401
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-docs/
>
> The list of bug fixes going into 3.1.3 can be found at the following URL:
> https://s.apache.org/x0q9b
>
> This release is using the release script from 3.1.3
> The release docker container was rebuilt since the previous version didn't
> have the necessary components to build the R documentation.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.1.3?
> ===
>
> The current list of open tickets targeted at 3.1.3 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.1.3
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something that is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Note: I added an extra day to the vote since I know some folks are likely
> busy on the 14th with partner(s).
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Which manufacturers' GPUs support Spark?

2022-02-16 Thread Sean Owen
Spark itself does not use GPUs, and it is agnostic to which GPUs exist on a
cluster; they are scheduled by the resource manager and used by the application.
In practice, virtually all GPU-related use cases (for deep learning for
example) use CUDA, and this is NVIDIA-specific. Certainly, RAPIDS is from
NVIDIA.
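
To make the scheduling part concrete, here is a rough sketch of how an
application requests GPUs generically. The config keys are the standard
resource-scheduling ones; the discovery script path and amounts are assumptions
for illustration, and the cluster manager must actually expose the GPUs:

```python
# Hedged sketch: Spark only schedules a generic "gpu" resource; nothing below is
# vendor-specific. The discovery script path and the amounts are illustrative
# assumptions, not required values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("generic-gpu-scheduling")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/examples/src/main/scripts/getGpusResources.sh")
    .getOrCreate()
)

# Inside a task, the assigned GPU addresses are visible via the TaskContext, e.g.:
#   from pyspark import TaskContext
#   TaskContext.get().resources()["gpu"].addresses
```

Whether the code running in those tasks actually uses NVIDIA, AMD, or other
hardware is entirely up to the libraries the tasks call.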

On Wed, Feb 16, 2022 at 7:03 AM 15927907...@163.com <15927907...@163.com>
wrote:

> Hello,
> We have done some Spark GPU accelerated work using the spark-rapids
> component(https://github.com/NVIDIA/spark-rapids). However, we found that
> this component currently only supports Nvidia GPU, and on the official
> Spark website, we did not see the manufacturer's description of the GPU
> supported by spark(
> https://spark.apache.org/docs/3.2.1/configuration.html#custom-resource-scheduling-and-configuration-overview).
> So, Can Spark also support GPUs from other manufacturers? such as AMD.
> Looking forward to your reply.
>
> --
> 15927907...@163.com
>


Re: Apache Spark 3.3 Release

2022-03-03 Thread Sean Owen
I think it's fine to pursue the existing plan - code freeze in two weeks
and try to close off key remaining issues. The final release depends on how
those go, and on testing, but it's fine to get the ball rolling.

On Thu, Mar 3, 2022 at 12:45 PM Maxim Gekk
 wrote:

> Hello All,
>
> I would like to bring on the table the theme about the new Spark release
> 3.3. According to the public schedule at
> https://spark.apache.org/versioning-policy.html, we planned to start the
> code freeze and release branch cut on March 15th, 2022. Since this date is
> coming soon, I would like to take your attention on the topic and gather
> objections that you might have.
>
> Bellow is the list of ongoing and active SPIPs:
>
> Spark SQL:
> - [SPARK-31357] DataSourceV2: Catalog API for view metadata
> - [SPARK-35801] Row-level operations in Data Source V2
> - [SPARK-37166] Storage Partitioned Join
>
> Spark Core:
> - [SPARK-20624] Add better handling for node shutdown
> - [SPARK-25299] Use remote storage for persisting shuffle data
>
> PySpark:
> - [SPARK-26413] RDD Arrow Support in Spark Core and PySpark
>
> Kubernetes:
> - [SPARK-36057] Support Customized Kubernetes Schedulers
>
> Probably, we should finish if there are any remaining works for Spark 3.3,
> and switch to QA mode, cut a branch and keep everything on track. I would
> like to volunteer to help drive this process.
>
> Best regards,
> Max Gekk
>


Re: bazel and external/

2022-03-17 Thread Sean Owen
Just checking - there is no way to tell bazel to look somewhere else for
whatever 'external' means to it?
It's a kinda big ugly change but it's not a functional change. If anything
it might break some downstream builds that rely on the current structure
too. But such is life for developers? I don't have a strong reason we can't.

On Thu, Mar 17, 2022 at 1:47 PM Alkis Evlogimenos
 wrote:

> Hi Spark devs.
>
> The Apache Spark repo has a top level external/ directory. This is a
> reserved name for the bazel build system and it causes all sorts of
> problems: some can be worked around and some cannot (for some details on
> one that cannot see
> https://github.com/hedronvision/bazel-compile-commands-extractor/issues/30
> ).
>
> Some forks of Apache Spark use bazel as a build system. It would be nice
> if we can make this change in Apache Spark without resorting to
> complex renames/merges whenever changes are pulled from upstream.
>
> As such I proposed to rename external/ directory to want to rename the
> external/ directory to something else [SPARK-38569
> ]. I also sent a
> tentative [PR-35874 ] that
> renames external/ to vendor/.
>
> My questions to you are:
> 1. Are there any objections to renaming external to X?
> 2. Is vendor a good new name for external?
>
> Cheers,
>


Re: bazel and external/

2022-03-17 Thread Sean Owen
I sympathize, but it might be less change to just rename the dir. There is
more in there, like the Avro reader; it's kind of miscellaneous. I think we
might want fewer rather than more top-level dirs.

On Thu, Mar 17, 2022 at 7:33 PM Jungtaek Lim 
wrote:

> We seem to just focus on how to avoid the conflict with the name
> "external" used in bazel. Since we consider the possibility of renaming,
> why not revisit the modules "external" contains?
>
> Looks like kinds of the modules external directory contains are 1) Docker
> 2) Connectors 3) Sink on Dropwizard metrics (only ganglia here, and it
> seems to be just that Ganglia is LGPL)
>
> Would it make sense if each kind deserves a top directory? We can probably
> give better generalized names, and as a side-effect we will no longer have
> "external".
>
> On Fri, Mar 18, 2022 at 5:45 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for posting this, Alkis.
>>
>> Before the question (1) and (2), I'm curious if the Apache Spark
>> community has other downstreams using Bazel.
>>
>> To All. If there are some Bazel users with Apache Spark code, could you
>> share your practice? If you are using renaming, what is your renamed
>> directory name?
>>
>> Dongjoon.
>>
>>
>> On Thu, Mar 17, 2022 at 11:56 AM Alkis Evlogimenos
>>  wrote:
>>
>>> AFAIK there is not. `external` has been baked in bazel since the
>>> beginning and there is no plan from bazel devs to attempt to fix this
>>> <https://github.com/bazelbuild/bazel/issues/4508#issuecomment-724055371>
>>> .
>>>
>>> On Thu, Mar 17, 2022 at 7:52 PM Sean Owen  wrote:
>>>
>>>> Just checking - there is no way to tell bazel to look somewhere else
>>>> for whatever 'external' means to it?
>>>> It's a kinda big ugly change but it's not a functional change. If
>>>> anything it might break some downstream builds that rely on the current
>>>> structure too. But such is life for developers? I don't have a strong
>>>> reason we can't.
>>>>
>>>> On Thu, Mar 17, 2022 at 1:47 PM Alkis Evlogimenos
>>>>  wrote:
>>>>
>>>>> Hi Spark devs.
>>>>>
>>>>> The Apache Spark repo has a top level external/ directory. This is a
>>>>> reserved name for the bazel build system and it causes all sorts of
>>>>> problems: some can be worked around and some cannot (for some details on
>>>>> one that cannot see
>>>>> https://github.com/hedronvision/bazel-compile-commands-extractor/issues/30
>>>>> ).
>>>>>
>>>>> Some forks of Apache Spark use bazel as a build system. It would be
>>>>> nice if we can make this change in Apache Spark without resorting to
>>>>> complex renames/merges whenever changes are pulled from upstream.
>>>>>
>>>>> As such I proposed to rename external/ directory to want to rename the
>>>>> external/ directory to something else [SPARK-38569
>>>>> <https://issues.apache.org/jira/browse/SPARK-38569>]. I also sent a
>>>>> tentative [PR-35874 <https://github.com/apache/spark/pull/35874>]
>>>>> that renames external/ to vendor/.
>>>>>
>>>>> My questions to you are:
>>>>> 1. Are there any objections to renaming external to X?
>>>>> 2. Is vendor a good new name for external?
>>>>>
>>>>> Cheers,
>>>>>
>>>>


Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Sean Owen
I think we can assume that someone upgrading Kafka will be responsible for
thinking through the breaking changes. We can help by listing anything we
know could affect Spark-Kafka usage and calling those out in a release
note, for sure. I don't think we need to get into items that would affect
Kafka usage itself; focus on the connector-related issues.

On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim 
wrote:

> CORRECTION: in option 2, we enumerate KIPs which may bring incompatibility
> with older brokers (not all KIPs).
>
> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim 
> wrote:
>
>> Hi dev,
>>
>> I would like to initiate the discussion about how to deal with the
>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>> 3.3.
>>
>> We didn't care much about the upgrade of Kafka dependency since our
>> belief on Kafka client has been that the new Kafka client version should
>> have no compatibility issues with older brokers. Based on semantic
>> versioning, upgrading major versions rings an alarm for me.
>>
>> I haven't gone through changes that happened between versions, but found
>> one KIP (KIP-679
>> )
>> which may not work with older brokers with specific setup. (It's described
>> in the "Compatibility, Deprecation, and Migration Plan" section of the KIP).
>>
>> This may not be problematic for the users who upgrade both client and
>> broker altogether, but end users of Spark may be unlikely the case.
>> Computation engines are relatively easier to upgrade. Storage systems
>> aren't. End users would think the components are independent.
>>
>> I looked through the notable changes in the Kafka doc, and it does
>> mention this KIP, but it just says the default config has changed and
>> doesn't mention about the impacts. There is a link to
>> KIP, that said, everyone needs to read through the KIP wiki page for
>> details.
>>
>> Based on the context, what would be the best way to notice end users for
>> the major version upgrade of Kafka? I can imagine several options
>> including...
>>
>> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
>> the noticeable changes in the Kafka doc in the migration guide.
>> 2. Do 1 & spend more effort to read through all KIPs and check
>> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
>> KIPs (or even summarize) in the migration guide.
>> 3. Do 2 & actively override the default configs to be compatible with
>> older versions if the change of the default configs in Kafka 3.0 is
>> backward incompatible. End users should set these configs explicitly to
>> override them back.
>> 4. Do not care. End users can indicate the upgrade in the release note,
>> and we expect end users to actively check the notable changes (& KIPs) from
>> Kafka doc.
>> 5. Options not described above...
>>
>> Please take a look and provide your voice on this.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> ps. Probably this would be applied to all non-bugfix versions of
>> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
>> for minor versions, though.
>>
>


Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-23 Thread Sean Owen
Well, yes, but if it requires a Kafka server-side update, then that is what it
requires, and that is out of scope for us to document.
It is important that we document if and how (if we know) the client update
will impact existing Kafka installations (does it require a server-side
update or not?), and document the change itself for sure along with any
Spark-side migration notes.

On Fri, Mar 18, 2022 at 8:47 PM Jungtaek Lim 
wrote:

> The thing is, it is “us” who upgrades Kafka client and makes possible
> divergence between client and broker in end users’ production env.
>
> Someone can claim that end users can downgrade the kafka-client artifact
> when building their app so that the version can be matched, but we don’t
> test anything against downgrading kafka-client version for kafka connector.
> That sounds to me we defer our work to end users.
>
> It sounds to me “someone” should refer to us, and then it is no longer a
> matter of “help”. It is a matter of “responsibility”, as you said.
>
> 2022년 3월 18일 (금) 오후 10:15, Sean Owen 님이 작성:
>
>> I think we can assume that someone upgrading Kafka will be responsible
>> for thinking through the breaking changes. We can help by listing anything
>> we know could affect Spark-Kafka usage and calling those out in a release
>> note, for sure. I don't think we need to get into items that would affect
>> Kafka usage itself; focus on the connector-related issues.
>>
>> On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> CORRECTION: in option 2, we enumerate KIPs which may bring
>>> incompatibility with older brokers (not all KIPs).
>>>
>>> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> I would like to initiate the discussion about how to deal with the
>>>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>>>> 3.3.
>>>>
>>>> We didn't care much about the upgrade of Kafka dependency since our
>>>> belief on Kafka client has been that the new Kafka client version should
>>>> have no compatibility issues with older brokers. Based on semantic
>>>> versioning, upgrading major versions rings an alarm for me.
>>>>
>>>> I haven't gone through changes that happened between versions, but
>>>> found one KIP (KIP-679
>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default>)
>>>> which may not work with older brokers with specific setup. (It's described
>>>> in the "Compatibility, Deprecation, and Migration Plan" section of the 
>>>> KIP).
>>>>
>>>> This may not be problematic for the users who upgrade both client and
>>>> broker altogether, but end users of Spark may be unlikely the case.
>>>> Computation engines are relatively easier to upgrade. Storage systems
>>>> aren't. End users would think the components are independent.
>>>>
>>>> I looked through the notable changes in the Kafka doc, and it does
>>>> mention this KIP, but it just says the default config has changed and
>>>> doesn't mention about the impacts. There is a link to
>>>> KIP, that said, everyone needs to read through the KIP wiki page for
>>>> details.
>>>>
>>>> Based on the context, what would be the best way to notice end users
>>>> for the major version upgrade of Kafka? I can imagine several options
>>>> including...
>>>>
>>>> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
>>>> the noticeable changes in the Kafka doc in the migration guide.
>>>> 2. Do 1 & spend more effort to read through all KIPs and check
>>>> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
>>>> KIPs (or even summarize) in the migration guide.
>>>> 3. Do 2 & actively override the default configs to be compatible with
>>>> older versions if the change of the default configs in Kafka 3.0 is
>>>> backward incompatible. End users should set these configs explicitly to
>>>> override them back.
>>>> 4. Do not care. End users can indicate the upgrade in the release note,
>>>> and we expect end users to actively check the notable changes (& KIPs) from
>>>> Kafka doc.
>>>> 5. Options not described above...
>>>>
>>>> Please take a look and provide your voice on this.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> ps. Probably this would be applied to all non-bugfix versions of
>>>> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
>>>> for minor versions, though.
>>>>
>>>


Re: Tools for regression testing

2022-03-24 Thread Sean Owen
Hm, then what are you looking for besides all the tests in Spark?

On Thu, Mar 24, 2022, 2:34 PM Mich Talebzadeh 
wrote:

> Thanks
>
> I know what unit testing is. The question was not about unit testing. it
> was specific to regression testing
> 
>  artifacts .
>
>
> cheers,
>
>
> Mich
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 24 Mar 2022 at 19:02, Bjørn Jørgensen 
> wrote:
>
>> Yes, Spark uses unit tests.
>>
>> https://app.codecov.io/gh/apache/spark
>>
>> https://en.wikipedia.org/wiki/Unit_testing
>>
>>
>>
>> man. 21. mar. 2022 kl. 15:46 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> Hi,
>>>
>>> As a matter of interest do Spark releases deploy a specific regression
>>> testing tool?
>>>
>>> Thanks
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: Deluge of GitBox emails

2022-04-04 Thread Sean Owen
I think this must be related to the Gitbox migration that just happened. It
does seem like I'm getting more emails - some are on PRs I'm attached to,
but some I don't recognize. The thing is, I'm not yet clear whether they
duplicate the normal GitHub emails - that is, if we turn them off, do we still
get anything?

On Mon, Apr 4, 2022 at 8:44 AM Nicholas Chammas 
wrote:

> I assume I’m not the only one getting these new emails from GitBox. Is
> there a story behind that that I missed?
>
> I’d rather not get these emails on the dev list. I assume most of the list
> would agree with me.
>
> GitHub has a good set of options for following activity on the repo.
> People who want to follow conversations can easily do that without
> involving the whole dev list.
>
> Do we know who is responsible for these GitBox emails? Perhaps we need to
> file an Apache INFRA ticket?
>
> Nick
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Deluge of GitBox emails

2022-04-04 Thread Sean Owen
https://issues.apache.org/jira/browse/INFRA-23082 for those following.

On Mon, Apr 4, 2022 at 9:32 AM Nicholas Chammas 
wrote:

> I’m not familiar with GitBox, but it must be an independent thing. When
> you participate in a PR, GitHub emails you notifications directly.
>
> The GitBox emails, on the other hand, are going to the dev list. They seem
> like something setup as a repo-wide setting, or perhaps as an Apache bot
> that monitors repo activity and converts it into emails. (I’ve seen other
> projects -- I think Hadoop -- where GitHub activity is converted into
> comments on Jira.
>
> Turning off these GitBox emails should not have in impact on the usual
> GitHub emails we are all already familiar with.
>
>
> On Apr 4, 2022, at 9:47 AM, Sean Owen  wrote:
>
> I think this must be related to the Gitbox migration that just happened.
> It does seem like I'm getting more emails - some are on PRs I'm attached
> to, but some I don't recognize. The thing is, I'm not yet clear if they
> duplicate the normal Github emails - that is if we turn them off do we have
> anything?
>
> On Mon, Apr 4, 2022 at 8:44 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I assume I’m not the only one getting these new emails from GitBox. Is
>> there a story behind that that I missed?
>>
>> I’d rather not get these emails on the dev list. I assume most of the
>> list would agree with me.
>>
>> GitHub has a good set of options for following activity on the repo.
>> People who want to follow conversations can easily do that without
>> involving the whole dev list.
>>
>> Do we know who is responsible for these GitBox emails? Perhaps we need to
>> file an Apache INFRA ticket?
>>
>> Nick
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: Spark 3.0.1 and spark 3.2 compatibility

2022-04-07 Thread Sean Owen
(Please don't cross-post.)
Generally, you definitely want to compile and test against what you're running on.
There shouldn't be many binary or source incompatibilities -- these are
avoided in a major release where possible. So it may need no code change.
But I would certainly recompile just on principle!

On Thu, Apr 7, 2022 at 12:28 PM Pralabh Kumar 
wrote:

> Hi spark community
>
> I have quick question .I am planning to migrate from spark 3.0.1 to spark
> 3.2.
>
> Do I need to recompile my application with 3.2 dependencies or application
> compiled with 3.0.1 will work fine on 3.2 ?
>
>
> Regards
> Pralabh kumar
>
>


Re: CVE -2020-28458, How to upgrade datatables dependency

2022-04-13 Thread Sean Owen
You can see the files in core/src/main/resources/org/apache/spark/ui/static
- you can try dropping in the new minified versions and see if the UI is
OK.
If it works, you can open a pull request to update it, in case this affects
Spark.
It looks like the smaller upgrade to 1.10.22 is also sufficient.

On Wed, Apr 13, 2022 at 7:43 AM Pralabh Kumar 
wrote:

> Hi Dev Team
>
> Spark 3.2 (and 3.3 might also) have CVE 2020-28458.  Therefore  in my
> local repo of Spark I would like to update DataTables to 1.11.5.
>
> Can you please help me to point out where I should upgrade DataTables
> dependency ?.
>
> Regards
> Pralabh Kumar
>


Re: CVE-2021-38296: Apache Spark Key Negotiation Vulnerability - 2.4 Backport?

2022-04-14 Thread Sean Owen
It does affect 2.4.x, yes. 2.4.x was EOL a while ago, so there wouldn't be
a new release of 2.4.x in any event. It's recommended to update instead, at
least to 3.1.3.

On Thu, Apr 14, 2022 at 12:07 PM Chris Nauroth  wrote:

> A fix for CVE-2021-38296 was committed and released in Apache Spark 3.1.3.
> I'm curious, is the issue relevant to the 2.4 version line, and if so, are
> there any plans for a backport?
>
> https://lists.apache.org/thread/70x8fw2gx3g9ty7yk0f2f1dlpqml2smd
>
> Chris Nauroth
>


Re: CVE -2020-28458, How to upgrade datatables dependency

2022-04-16 Thread Sean Owen
FWIW here's an update to 1.10.25: https://github.com/apache/spark/pull/36226


On Wed, Apr 13, 2022 at 8:28 AM Sean Owen  wrote:

> You can see the files in
> core/src/main/resources/org/apache/spark/ui/static - you can try dropping
> in the new minified versions and see if the UI is OK.
> You can open a pull request if it works to update it, in case this affects
> Spark.
> It looks like the smaller upgrade to 1.10.22 is also sufficient.
>
> On Wed, Apr 13, 2022 at 7:43 AM Pralabh Kumar 
> wrote:
>
>> Hi Dev Team
>>
>> Spark 3.2 (and 3.3 might also) have CVE 2020-28458.  Therefore  in my
>> local repo of Spark I would like to update DataTables to 1.11.5.
>>
>> Can you please help me to point out where I should upgrade DataTables
>> dependency ?.
>>
>> Regards
>> Pralabh Kumar
>>
>


Re: CVE-2021-22569

2022-05-04 Thread Sean Owen
Sure, did you search the JIRA?
https://issues.apache.org/jira/browse/SPARK-38340

Does this affect Spark's usage of protobuf?

Looks like it can't be updated to 3.x -- this is really not a direct dependency
of Spark but of its underlying dependencies.
Feel free to re-attempt a change that might work, at least with Hadoop 3 if
possible.

On Wed, May 4, 2022 at 10:46 AM Pralabh Kumar 
wrote:

> Hi Dev Team
>
> Spark is using protobuf 2.5.0 which is vulnerable to CVE-2021-22569. CVE
> recommends to use protobuf 3.19.2
>
> Please let me know , if there is a jira to track the update w.r.t CVE and
> Spark or should I create the one ?
>
> Regards
> Pralabh Kumar
>


Re: CVE-2020-13936

2022-05-05 Thread Sean Owen
This is a Velocity issue. Spark doesn't use it, although it looks like Avro
does. From reading the CVE, I do not believe it would impact Avro's usage -
the Velocity templates it may use for codegen aren't exposed, as far as I know.
Is there a known relationship to Spark here? That is the key question in
security reports like this.

In any event, to pursue an update, it would likely have to start with Avro
updating Velocity, if it hasn't already; if it has, then pursue upgrading Avro
in Spark -- provided the supported Hadoop versions work with it.

On Thu, May 5, 2022 at 12:32 PM Pralabh Kumar 
wrote:

> Hi Dev Team
>
> Please let me know if  there is a jira to track this CVE changes with
> respect to Spark  . Searched jira but couldn't find anything.
>
> Please help
>
> Regards
> Pralabh Kumar
>


Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-05 Thread Sean Owen
I'm seeing test failures; is anyone seeing ones like this? This is Java 8 /
Scala 2.12 / Ubuntu 22.04:

- SPARK-37618: Sub dirs are group writable when removing from shuffle
service enabled *** FAILED ***
  [OWNER_WRITE, GROUP_READ, GROUP_WRITE, GROUP_EXECUTE, OTHERS_READ,
OWNER_READ, OTHERS_EXECUTE, OWNER_EXECUTE] contained GROUP_WRITE
(DiskBlockManagerSuite.scala:155)

- Check schemas for expression examples *** FAILED ***
  396 did not equal 398 Expected 396 blocks in result file but got 398. Try
regenerating the result files. (ExpressionsSchemaSuite.scala:161)

 Function 'bloom_filter_agg', Expression class
'org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate'
"" did not start with "
  Examples:
  " (ExpressionInfoSuite.scala:142)

On Thu, May 5, 2022 at 6:01 AM Maxim Gekk 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.3.0.
>
> The vote is open until 11:59pm Pacific time May 10th and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc1 (commit
> 482b7d54b522c4d1e25f3e84eabbc78126f22a3d):
> https://github.com/apache/spark/tree/v3.3.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1402
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc1.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>


Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-10 Thread Sean Owen
There's a -1 vote here, so I think this RC fails anyway.

On Fri, May 6, 2022 at 10:30 AM Gengliang Wang  wrote:

> Hi Maxim,
>
> Thanks for the work!
> There is a bug fix from Bruce merged on branch-3.3 right after the RC1 is
> cut:
> SPARK-39093: Dividing interval by integral can result in codegen
> compilation error
> <https://github.com/apache/spark/commit/fd998c8a6783c0c8aceed8dcde4017cd479e42c8>
>
> So -1 from me. We should have RC2 to include the fix.
>
> Thanks
> Gengliang
>
> On Fri, May 6, 2022 at 6:15 PM Maxim Gekk
>  wrote:
>
>> Hi Dongjoon,
>>
>>  > https://issues.apache.org/jira/projects/SPARK/versions/12350369
>> > Since RC1 is started, could you move them out from the 3.3.0 milestone?
>>
>> I have removed the 3.3.0 label from Fix version(s). Thank you, Dongjoon.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Fri, May 6, 2022 at 11:06 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, Sean.
>>> It's interesting. I didn't see those failures from my side.
>>>
>>> Hi, Maxim.
>>> In the following link, there are 17 in-progress and 6 to-do JIRA issues
>>> which look irrelevant to this RC1 vote.
>>>
>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>
>>> Since RC1 is started, could you move them out from the 3.3.0 milestone?
>>> Otherwise, we cannot distinguish new real blocker issues from those
>>> obsolete JIRA issues.
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Thu, May 5, 2022 at 11:46 AM Adam Binford  wrote:
>>>
>>>> I looked back at the first one (SPARK-37618): it expects/assumes a 0022
>>>> umask to test the behavior correctly. I'm not sure how to make it pass, or
>>>> be skipped, under a more permissive umask.
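
A rough way to check whether the umask really is the trigger, assuming the
suite lives in the core module, is to compare the current mask and re-run the
suite under the 0022 mask the test expects:

```
umask   # anything more permissive than 0022 (e.g. 0002) trips the assertion
(umask 0022 && build/sbt "core/testOnly org.apache.spark.storage.DiskBlockManagerSuite")
```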
>>>>
>>>> On Thu, May 5, 2022 at 1:56 PM Sean Owen  wrote:
>>>>
>>>>> I'm seeing test failures; is anyone seeing ones like this? This is
>>>>> Java 8 / Scala 2.12 / Ubuntu 22.04:
>>>>>
>>>>> - SPARK-37618: Sub dirs are group writable when removing from shuffle
>>>>> service enabled *** FAILED ***
>>>>>   [OWNER_WRITE, GROUP_READ, GROUP_WRITE, GROUP_EXECUTE, OTHERS_READ,
>>>>> OWNER_READ, OTHERS_EXECUTE, OWNER_EXECUTE] contained GROUP_WRITE
>>>>> (DiskBlockManagerSuite.scala:155)
>>>>>
>>>>> - Check schemas for expression examples *** FAILED ***
>>>>>   396 did not equal 398 Expected 396 blocks in result file but got
>>>>> 398. Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>>>>>
>>>>>  Function 'bloom_filter_agg', Expression class
>>>>> 'org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate'
>>>>> "" did not start with "
>>>>>   Examples:
>>>>>   " (ExpressionInfoSuite.scala:142)
>>>>>
>>>>> On Thu, May 5, 2022 at 6:01 AM Maxim Gekk
>>>>>  wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>  version 3.3.0.
>>>>>>
>>>>>> The vote is open until 11:59pm Pacific time May 10th and passes if a
>>>>>> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v3.3.0-rc1 (commit
>>>>>> 482b7d54b522c4d1e25f3e84eabbc78126f22a3d):
>>>>>> https://github.com/apache/spark/tree/v3.3.0-rc1
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-bin/
>>>>>>
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1402

Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-16 Thread Sean Owen
I'm still seeing failures related to the function registry, like:

ExpressionsSchemaSuite:
- Check schemas for expression examples *** FAILED ***
  396 did not equal 398 Expected 396 blocks in result file but got 398. Try
regenerating the result files. (ExpressionsSchemaSuite.scala:161)

- SPARK-14415: All functions should have own descriptions *** FAILED ***
  "Function: bloom_filter_aggClass:
org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
the result) (QueryTest.scala:54)

There consistently seems to be a difference of 2 between the expected and
actual function lists. I haven't looked closely and don't know this code. I'm
on Ubuntu 22.04. Is anyone else seeing something like this? I'm wondering if
it's something odd to do with case sensitivity, hidden files lurking
somewhere, etc.

I suspect it's not a 'real' error as the Linux-based testers work fine, but
I also can't think of why this is failing.
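
Purely as a diagnostic sketch, one way to narrow down which two entries differ
is to dump the registered function list from this RC on both a passing and a
failing machine (assuming a build or distribution with bin/spark-sql on hand)
and diff the output:

```
# dump the registered function names (run on both machines, then diff)
bin/spark-sql -e "SHOW FUNCTIONS" | sort > /tmp/functions-$(hostname).txt
wc -l /tmp/functions-$(hostname).txt
# diff /tmp/functions-hostA.txt /tmp/functions-hostB.txt
```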



On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
 wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time May 19th and passes if a
> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc2 (commit
> c8c657b922ac8fd8dcf9553113e11a80079db059):
> https://github.com/apache/spark/tree/v3.3.0-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1403
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc2.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
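
For the Java/Scala side of that, a build.sbt fragment along these lines should
pull the RC2 artifacts from the staging repository listed above (the plain
3.3.0 version string is an assumption about how the RC is published):

```
// point sbt at the RC2 staging repository and depend on the RC build
resolvers += "Apache Spark 3.3.0 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1403/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0"
```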
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted, please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>


Re: [VOTE] Release Spark 3.3.0 (RC3)

2022-05-25 Thread Sean Owen
+1 works for me as usual, with Java 8 + Scala 2.12, Java 11 + Scala 2.13.

On Tue, May 24, 2022 at 12:14 PM Maxim Gekk
 wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time May 27th and passes if a
> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc3 (commit
> a7259279d07b302a51456adb13dc1e41a6fd06ed):
> https://github.com/apache/spark/tree/v3.3.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1404
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc3.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted, please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>


Re: [VOTE] Release Spark 3.3.0 (RC4)

2022-06-03 Thread Sean Owen
In Scala 2.13, I'm getting errors like this:

 analyzer should replace current_timestamp with literals *** FAILED ***
  java.lang.ClassCastException: class scala.collection.mutable.ArrayBuffer
cannot be cast to class scala.collection.immutable.Seq
(scala.collection.mutable.ArrayBuffer and scala.collection.immutable.Seq
are in unnamed module of loader 'app')
  at
org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.literals(ComputeCurrentTimeSuite.scala:146)
...
- analyzer should replace current_date with literals *** FAILED ***
  java.lang.ClassCastException: class scala.collection.mutable.ArrayBuffer
cannot be cast to class scala.collection.immutable.Seq
(scala.collection.mutable.ArrayBuffer and scala.collection.immutable.Seq
are in unnamed module of loader 'app')
...

I haven't investigated yet; just flagging it in case anyone immediately knows
more about it.
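
The stack trace looks like the usual Scala 2.13 collection change: scala.Seq
now aliases scala.collection.immutable.Seq, so a cast that was fine on 2.12
blows up at runtime. An illustrative snippet (not the actual suite code) that
can be pasted into a plain Scala 2.13 REPL:

```
import scala.collection.mutable.ArrayBuffer
import scala.util.Try

val buf = ArrayBuffer(1, 2, 3)

// On 2.12, Seq is scala.collection.Seq, so an ArrayBuffer passes this cast;
// on 2.13, Seq aliases scala.collection.immutable.Seq, so the same cast
// throws the ClassCastException seen in the failures above.
println(Try(buf.asInstanceOf[Seq[Int]]))   // Failure(ClassCastException) on 2.13

// The usual fix is an explicit conversion rather than a cast:
val fixed: Seq[Int] = buf.toSeq
println(fixed)
```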


On Fri, Jun 3, 2022 at 9:54 AM Maxim Gekk 
wrote:

> Please vote on releasing the following candidate as
> Apache Spark version 3.3.0.
>
> The vote is open until 11:59pm Pacific time June 7th and passes if a
> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.3.0-rc4 (commit
> 4e3599bc11a1cb0ea9fc819e7f752d2228e54baf):
> https://github.com/apache/spark/tree/v3.3.0-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1405
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc4-docs/
>
> The list of bug fixes going into 3.3.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> This release is using the release script of the tag v3.3.0-rc4.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.3.0?
> ===
> The current list of open tickets targeted at 3.3.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted, please ping me or a committer to
> help target the issue.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>

