Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-02 Thread Tom Graves
 +1
Tom

On Sunday, March 31, 2024 at 10:09:28 PM CDT, Ruifeng Zheng 
 wrote:  
 
 +1

On Mon, Apr 1, 2024 at 10:06 AM Haejoon Lee 
 wrote:

+1

On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:

Hi all,

I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark 
Connect)

References:
   - JIRA
   - Prototype
   - SPIP doc

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks.



-- 
Ruifeng Zheng
E-mail: zrfli...@gmail.com
  

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-13 Thread Tom Graves
 Similar to others, I will be interested in working out the APIs and details, but 
overall I am in favor of it.

+1
Tom Graves
On Monday, March 11, 2024 at 11:25:38 AM CDT, Mridul Muralidharan 
 wrote:  
 
 
  I am supportive of the proposal - this is a step in the right direction!
Additional metadata (explicit and inferred) for log records, and exposing it 
for indexing, is extremely useful.
The specifics of the API still need some work IMO and do not need to be this 
disruptive, but I consider that orthogonal to this vote itself - and 
something we need to iterate upon during PR reviews.
+1

Regards,
Mridul

On Mon, Mar 11, 2024 at 11:09 AM Mich Talebzadeh  
wrote:

+1
Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom







 https://en.everybodywiki.com/Mich_Talebzadeh

 



Disclaimer: The information provided is correct to the best of my knowledge but 
of course cannot be guaranteed. It is essential to note that, as with any 
advice, quote "one test result is worth one-thousand expert opinions" (Werner 
Von Braun).


On Mon, 11 Mar 2024 at 09:27, Hyukjin Kwon  wrote:

+1

On Mon, 11 Mar 2024 at 18:11, yangjie01  wrote:


+1

 

Jie Yang

 

From: Haejoon Lee 
Date: Monday, March 11, 2024, 17:09
To: Gengliang Wang 
Cc: dev 
Subject: Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

 

+1

 

On Mon, Mar 11, 2024 at 10:36 AM Gengliang Wang  wrote:


Hi all,

I'd like to start the vote for SPIP: Structured Logging Framework for Apache 
Spark

 
References:
   
   - JIRA ticket
   - SPIP doc
   - Discussion thread

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!

Gengliang Wang




  

Re: [Spark-Core] Improving Reliability of spark when Executors OOM

2024-01-17 Thread Tom Graves
 It is interesting. I think there are definitely some discussion points around 
this. Reliability vs. performance is always a trade-off, and it's great that the job 
doesn't fail, but if it no longer meets someone's SLA that could be just as bad, 
especially if it's hard to figure out why. I think if something like this kicks in, it needs 
to be very obvious to the user so they can see that it occurred. Do you have 
something in place on the UI or elsewhere that indicates this? The nice thing is 
also that you aren't wasting memory by increasing it for all tasks when maybe you 
only need it for one or two. The downside is you are only finding out after 
failure.
I do also worry a little bit that in your blog post, the error you pointed out 
isn't a Java OOM but an off-heap memory issue (overhead + heap usage). You 
don't really address heap memory vs. off-heap memory in that article. The only thing I see 
mentioned is spark.executor.memory, which is heap memory. Obviously adjusting 
to only run one task is going to give that task more overall memory, but the 
reason it's running out in the first place could be different. If it were on-heap 
memory, for instance, with more tasks I would expect to see more GC and not 
executor OOM. If you are getting executor OOM you are likely using more off-heap 
memory/stack space, etc. than you allocated. Ultimately it would be nice 
to know why that is happening and see if we can address it to not fail in the 
first place. That could be extremely difficult though, especially if software 
outside Spark is using that memory.
As Holden said, we need to make sure this would play nicely with resource 
profiles, or potentially see if we can use the resource profile functionality. 
Theoretically you could extend this to try to get a new executor if using dynamic 
allocation, for instance.

I agree doing a SPIP would be a good place to start to have more discussions.
Tom
On Wednesday, January 17, 2024 at 12:47:51 AM CST, kalyan 
 wrote:  
 
 Hello All,
At Uber, we have recently done some work on improving the reliability of Spark 
applications in scenarios where fatter executors go out of memory and lead 
to application failure. Fatter executors are those that have more than one task 
running on them concurrently at a given time. This has significantly improved the 
reliability of many Spark applications for us at Uber. We recently wrote a blog post 
about this. Link: 
https://www.uber.com/en-US/blog/dynamic-executor-core-resizing-in-spark/
At a high level, we have made the changes below:
   - When a task fails with an executor OOM, we update the core 
requirement of the task to the max executor cores.
   - When the task is picked for rescheduling, the new attempt of the task 
lands on an executor where no other task can run concurrently. All 
cores get allocated to this task itself.
   - This way we ensure that the configured memory is completely at the 
disposal of a single task, thus eliminating memory contention.
The best part of this solution is that it's reactive. It kicks in only when the 
executors fail with the OOM exception.
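A minimal sketch of that reactive policy in code form (hypothetical type and field names, not the actual patch):

    // Hypothetical sketch only -- not Spark's real scheduler classes and not the
    // actual Uber implementation; it just restates the policy described above.
    case class TaskAttempt(taskId: Long, var coresRequired: Int)

    def onTaskFailure(task: TaskAttempt,
                      failedWithExecutorOom: Boolean,
                      maxExecutorCores: Int): Unit = {
      if (failedWithExecutorOom) {
        // The retry attempt asks for every core on an executor, so no other task
        // can be co-scheduled with it and the full executor memory is available to it.
        task.coresRequired = maxExecutorCores
      }
    }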
We understand that the problem statement is very common and we expect our 
solution to be effective in many cases. There could be more cases that can be 
covered. An executor failing with OOM is a hard signal. The framework (making 
the driver aware of what's happening with the executor) can be extended to 
handle scenarios of other forms of memory pressure, like excessive spilling to 
disk, etc.
While we had developed this on Spark 2.4.3 in-house, we would like to 
collaborate and contribute this work to the latest versions of Spark.
What is the best way forward here? Will an SPIP proposal to detail the changes 
help?
Regards,
Kalyan
Uber India.

Re: Apache Spark 3.3.4 EOL Release?

2023-12-04 Thread Tom Graves
 +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
Tom
On Friday, December 1, 2023 at 02:48:22 PM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, All.

Since the Apache Spark 3.3.0 RC6 vote passed on Jun 14, 2022, branch-3.3 has 
been maintained and served well until now.

- https://github.com/apache/spark/releases/tag/v3.3.0 (tagged on Jun 9th, 2022)
- https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm (vote result 
on June 14th, 2022)

As of today, branch-3.3 has 56 additional patches after v3.3.3 (tagged on Aug 
3rd, about 4 months ago) and reaches end-of-life this month according to the 
Apache Spark release cadence, https://spark.apache.org/versioning-policy.html .

$ git log --oneline v3.3.3..HEAD | wc -l
56

Along with the recent Apache Spark 3.4.2 release, I hope the users can get a 
chance to have these last bits of Apache Spark 3.3.x, and I'd like to propose 
to have Apache Spark 3.3.4 EOL Release vote on December 11th and volunteer as 
the release manager.

WDYT?

Please let us know if you need more patches on branch-3.3.

Thanks,
Dongjoon.
  

Re: Apache Spark 3.2.4 EOL Release?

2023-04-05 Thread Tom Graves
 +1
Tom
On Tuesday, April 4, 2023 at 12:25:13 PM CDT, Dongjoon Hyun 
 wrote:  
 
 Hi, All.

Since Apache Spark 3.2.0 passed RC7 vote on October 12, 2021, branch-3.2 has 
been maintained and served well until now.

- https://github.com/apache/spark/releases/tag/v3.2.0 (tagged on Oct 6, 2021)
- https://lists.apache.org/thread/jslhkh9sb5czvdsn7nz4t40xoyvznlc7

As of today, branch-3.2 has 62 additional patches after v3.2.3 and reaches the 
end-of-life this month according to the Apache Spark release cadence. 
(https://spark.apache.org/versioning-policy.html)

    $ git log --oneline v3.2.3..HEAD | wc -l
    62

With the upcoming Apache Spark 3.4, I hope the users can get a chance to have 
these last bits of Apache Spark 3.2.x, and I'd like to propose to have Apache 
Spark 3.2.4 EOL release next week and volunteer as the release manager. WDYT? 
Please let me know if you need more patches on branch-3.2.

Thanks,
Dongjoon.
  

Re: [VOTE] Release Apache Spark 3.4.0 (RC1)

2023-02-22 Thread Tom Graves
 It looks like there are still blockers open; we need to make sure they are 
addressed before doing a release:
https://issues.apache.org/jira/browse/SPARK-41793
https://issues.apache.org/jira/browse/SPARK-42444

Tom
On Tuesday, February 21, 2023 at 10:35:45 PM CST, Xinrong Meng 
 wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 3.4.0.

The vote is open until 11:59pm Pacific time February 27th and passes if a 
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.4.0-rc1 (commit 
e2484f626bb338274665a49078b528365ea18c3b):
https://github.com/apache/spark/tree/v3.4.0-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1435

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc1-docs/

The list of bug fixes going into 3.4.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12351465

This release is using the release script of the tag v3.4.0-rc1.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
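For the Java/Scala path above, a minimal sketch of the resolver step in an sbt build (the module is just an example; the staging URL is the one listed above):

    // build.sbt -- point the build at the 3.4.0 RC1 staging repository
    resolvers += "Apache Spark 3.4.0 RC1 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1435/"

    // depend on the RC version and run your test suite against it
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.0"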

===
What should happen to JIRA tickets still targeting 3.4.0?
===
The current list of open tickets targeted at 3.4.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.4.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

Thanks,
Xinrong Meng
  

Re: Deploying stage-level scheduling for Spark SQL

2022-10-03 Thread Tom Graves
 1) In my opinion this is too complex for the average user. In this case I'm 
assuming you have some sort of optimizer that would apply and do it 
automatically for the user? If it's just in the research stage of things, can 
you just modify Spark to do experiments?
2) I think the main thing is having the heuristics and logic for changing what 
the user requested. It sounds like you might be working on a component to do 
this, but I haven't read the paper you pointed to yet either.
Also note there are already plugin points in Spark to add rules to the optimizer 
and physical plan for columnar; it sounds to me like you might be working on 
something that might fit better as a plugin if it automatically figures out 
what it thinks the best thing is. If this is the case, I go back to number 1 
above: can you modify Spark to have the plugin point you need to do your 
experimentation to see if it makes sense?
Tom
On Friday, September 30, 2022, 11:31:35 AM CDT, Chenghao Lyu 
 wrote:  
 
 Thanks for the clarification Tom!

A bit more background on what we want to do: we have proposed a fine-grained 
(stage-level) resource optimization approach in VLDB 2022 
(https://www.vldb.org/pvldb/vol15/p3098-lyu.pdf) and would like to try it on 
Spark. Our approach can recommend the resource configuration for each stage 
automatically (by using ML and our optimization framework), and we would like 
to see how to embed it in Spark. Initially, we assume there is no AQE, to 
make it simpler.

Now I see the problem in two parts (in both cases, the stage-level 
configurations will be automatically configured by our algorithm, with the 
upper and lower bounds of each tunable resource given by a user):

(1) If AQE is disabled in Spark SQL, and hence the RDD DAG will not be changed 
after the physical plan is selected, do you think it is feasible and worth 
exposing the RDDs and reusing the existing stage-level scheduling API for 
optimization? 
(2) If AQE is enabled in Spark SQL, I would agree and prefer to add the 
stage-level resource optimization inside the AQE. Since I am not very 
experienced with the AQE part, would you list more potential challenges it may 
lead to? 

Thanks in advance and I would really appreciate it if you could give us more 
feedback!
Cheers,
Chenghao

On Sep 30, 2022, 4:22 PM +0200, Tom Graves wrote:

 See the original SPIP for why we only support RDDs: 
https://issues.apache.org/jira/browse/SPARK-27495

The main problem is exactly what you are referring to. The RDD level is not 
exposed to the user when using the SQL or DataFrame API. This is on purpose, and 
users shouldn't have to know anything about the underlying implementation using 
RDDs, especially with AQE and other optimizations that could change things. You 
may start out with one physical plan and AQE can change it along the way, so 
how does the user change the RDD at that point? It would be very difficult to expose 
this to the user and I don't think it should be. I think we would have to come 
up with some other way to apply stage-level scheduling to SQL/DataFrame, or, 
as mentioned in the original issue, if AQE gets smart enough it would just do it 
for the user, but there are lots of factors that come into play that make that 
difficult as well.
Tom On Friday, September 30, 2022, 04:15:36 AM CDT, Chenghao Lyu 
 wrote:

Thanks for the reply! 

To clarify, for issue 2, it could still break apart a query into multiple jobs 
without AQE — I have turned off the AQE in my posted example. 

For 1, an end user just needs to turn a knob on/off to use stage-level 
scheduling for Spark SQL — I am considering adding a component between the 
Spark SQL module and the Spark Core module to optimize the stage-level resources.

Yes, SQL is declarative. It uses a sequence of components (such as a logical 
planner, physical planner, and CBO) to get a selected physical plan. The RDDs 
(with the transformations) are generated based on the selected physical plan 
for execution. For now, we can only get the top-level RDD of the DAG of RDDs 
via `spark.sql(q1).queryExecution.toRdd`, but that is not enough to make 
stage-level scheduling decisions. The stage-level resources are profiled based 
on the RDDs. If we could expose all the RDDs instead of only the top-level RDD, it 
seems possible to apply stage-level scheduling here.
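For reference, the existing RDD-level API from SPARK-27495 looks roughly like the sketch below (the resource numbers are placeholders, and `spark`/`q1` are the objects from the discussion above); the open question is how to reach the per-stage RDDs of a SQL plan so that such profiles could be attached per stage:

    import org.apache.spark.resource.{ExecutorResourceRequests, TaskResourceRequests}
    import org.apache.spark.resource.ResourceProfileBuilder

    // Build a resource profile for one stage (placeholder values).
    val execReqs = new ExecutorResourceRequests().cores(8).memory("12g").memoryOverhead("2g")
    val taskReqs = new TaskResourceRequests().cpus(2)
    val profile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

    // Today the profile can only be attached at the RDD level, e.g. on the RDD behind a query:
    val rdd = spark.sql(q1).queryExecution.toRdd   // top-level RDD only
    rdd.withResources(profile).count()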


P.S. let me attach the link for the RDD regeneration explicitly in case it is 
not shown on the mail-list website: 
https://stackoverflow.com/questions/73895506/how-to-avoid-rdd-regeneration-in-spark-sql
Cheers,
Chenghao

On Sep 29, 2022, 5:22 PM +0200, Herman van Hovell wrote:

I think issue 2 is caused by adaptive query execution. This will break apart 
queries into multiple jobs; each subsequent job will generate an RDD that is 
based on previous ones.
As for 1, I am not sure how much you want to expose to an end user here. SQL is 
declarative, and it does not specify how a query should be executed. I can 
imagine that you might use different

Re: Deploying stage-level scheduling for Spark SQL

2022-09-30 Thread Tom Graves
 See the original SPIP for why we only support RDDs: 
https://issues.apache.org/jira/browse/SPARK-27495

The main problem is exactly what you are referring to. The RDD level is not 
exposed to the user when using the SQL or DataFrame API. This is on purpose, and 
users shouldn't have to know anything about the underlying implementation using 
RDDs, especially with AQE and other optimizations that could change things. You 
may start out with one physical plan and AQE can change it along the way, so 
how does the user change the RDD at that point? It would be very difficult to expose 
this to the user and I don't think it should be. I think we would have to come 
up with some other way to apply stage-level scheduling to SQL/DataFrame, or, 
as mentioned in the original issue, if AQE gets smart enough it would just do it 
for the user, but there are lots of factors that come into play that make that 
difficult as well.
Tom On Friday, September 30, 2022, 04:15:36 AM CDT, Chenghao Lyu 
 wrote:  
 
 Thanks for the reply! 

To clarify, for issue 2, it could still break apart a query into multiple jobs 
without AQE — I have turned off the AQE in my posted example. 

For 1, an end user just needs to turn a knob on/off to use stage-level 
scheduling for Spark SQL — I am considering adding a component between the 
Spark SQL module and the Spark Core module to optimize the stage-level resources.

Yes, SQL is declarative. It uses a sequence of components (such as a logical 
planner, physical planner, and CBO) to get a selected physical plan. The RDDs 
(with the transformations) are generated based on the selected physical plan 
for execution. For now, we can only get the top-level RDD of the DAG of RDDs 
via `spark.sql(q1).queryExecution.toRdd`, but that is not enough to make 
stage-level scheduling decisions. The stage-level resources are profiled based 
on the RDDs. If we could expose all the RDDs instead of only the top-level RDD, it 
seems possible to apply stage-level scheduling here.


P.S. let me attach the link for the RDD regeneration explicitly in case it is 
not shown on the mail-list website: 
https://stackoverflow.com/questions/73895506/how-to-avoid-rdd-regeneration-in-spark-sql
Cheers,
Chenghao

On Sep 29, 2022, 5:22 PM +0200, Herman van Hovell wrote:

I think issue 2 is caused by adaptive query execution. This will break apart 
queries into multiple jobs; each subsequent job will generate an RDD that is 
based on previous ones.
As for 1, I am not sure how much you want to expose to an end user here. SQL is 
declarative, and it does not specify how a query should be executed. I can 
imagine that you might use different resources for different types of stages, 
e.g. a scan stage and more compute-heavy stages. This, IMO, should be based on 
analyzing and costing the plan. For this, RDD-only stage-level scheduling should 
be sufficient.
On Thu, Sep 29, 2022 at 8:56 AM Chenghao Lyu  wrote:

Hi, 

I plan to deploy the stage-level scheduling for Spark SQL to apply some 
fine-grained optimizations over the DAG of stages. However, I am blocked by the 
following issues:   
   - The current stage-level scheduling supports RDD APIs only. So is there a 
way to reuse the stage-level scheduling for Spark SQL? E.g., how to expose the 
RDD code (the transformations and actions) from a Spark SQL query (with SQL syntax)?
   - We do not quite understand why a Spark SQL query could trigger multiple jobs, 
and have some RDDs regenerated, as posted here. Can anyone give us some 
insight into the reasons and whether we can avoid the RDD regeneration to save 
execution time?
Thanks in advance.
Cheers, Chenghao

  

Re: How to set platform-level defaults for array-like configs?

2022-08-11 Thread Tom Graves
 A few years ago, when I was doing more deployment management, I kicked around 
the idea of having different types of configs or different ways to specify the 
configs. Though one of the problems at the time was actually with users 
specifying a properties file and not picking up spark-defaults.conf. So 
I was thinking about creating something like a spark-admin.conf or something of that 
nature.
 I think there is benefit in it, it just comes down to how to implement it 
best. The other thing I don't think I saw addressed was the ability to 
prevent users from overriding configs. If you just do the defaults, I presume 
users could still override them. That gets a bit trickier, especially if they can 
override the entire spark-defaults.conf file.
Tom
On Thursday, August 11, 2022, 12:16:10 PM CDT, Mridul Muralidharan 
 wrote:  
 
 
Hi,
  Wenchen, it would be great if you could chime in with your thoughts, given the 
feedback you originally had on the PR. It would also be great to hear feedback from 
others on this, particularly folks managing Spark deployments: how is this 
mitigated/avoided in your case, and are there any other pain points with configs in this 
context?

Regards,
Mridul
On Wed, Jul 27, 2022 at 12:28 PM Erik Krogen  wrote:

I find there's substantial value in being able to set defaults, and I think we 
can see that the community finds value in it as well, given the handful of 
"default"-like configs that exist today as mentioned in Shardul's email. The 
mismatch of conventions used today (suffix with ".defaultList", change "extra" 
to "default", ...) is confusing and inconsistent, plus requires one-off 
additions for each config.
My proposal here would be:   
   - Define a clear convention, e.g. a suffix of ".default" that enables a 
default to be set and merged
   - Document this convention in configuration.md so that we can avoid 
separately documenting each default-config, and instead just add a note in the 
docs for the normal config.
   - Adjust the withPrepended method added in #24804 to leverage this 
convention instead of each usage instance re-defining the additional config name
   - Do a comprehensive review of applicable configs and enable them all to use 
the newly updated withPrepended method
Wenchen, you expressed some concerns with adding more default configs in 
#34856, would this proposal address those concerns?
Thanks,
Erik
On Wed, Jul 13, 2022 at 11:54 PM Shardul Mahadik  
wrote:

Hi Spark devs,

Spark contains a bunch of array-like configs (comma separated lists). Some 
examples include `spark.sql.extensions`, `spark.sql.queryExecutionListeners`, 
`spark.jars.repositories`, `spark.extraListeners`, 
`spark.driver.extraClassPath` and so on (there are a dozen or so more). As 
owners of the Spark platform in our organization, we would like to set 
platform-level defaults, e.g. custom SQL extension and listeners, and we use 
some of the above mentioned properties to do so. At the same time, we have 
power users writing their own listeners, setting the same Spark confs and thus 
unintentionally overriding our platform defaults. This leads to a loss of 
functionality within our platform.

Previously, Spark has introduced "default" confs for a few of these array-like 
configs, e.g. `spark.plugins.defaultList` for `spark.plugins`, 
`spark.driver.defaultJavaOptions` for `spark.driver.extraJavaOptions`. These 
properties are meant to only be set by cluster admins thus allowing separation 
between platform default and user configs. However, as discussed in 
https://github.com/apache/spark/pull/34856, these configs are still client-side 
and can still be overridden, while also not being a scalable solution as we 
cannot introduce 1 new "default" config for every array-like config.
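To make that concrete, here is a small illustrative sketch of how one of those existing pairs behaves today, as I understand it (the plugin class names are hypothetical): the admin sets the "default" config, the user sets the regular one, and Spark prepends the default when it reads the effective value.

    import org.apache.spark.SparkConf

    // Platform/admin side, e.g. shipped in spark-defaults.conf (hypothetical class name):
    //   spark.plugins.defaultList=com.example.platform.AuditPlugin
    // User application side (hypothetical class name):
    val conf = new SparkConf().set("spark.plugins", "com.example.user.MyPlugin")

    // When Spark reads spark.plugins through its config entry, the admin default list is
    // prepended to the user value, so both plugins load -- but nothing prevents the user
    // from also setting spark.plugins.defaultList and clobbering the platform default,
    // which is exactly the gap described in this thread.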

I wanted to know if others have experienced this issue and what systems have 
been implemented to tackle this. Are there any existing solutions for this; 
either client-side or server-side? (e.g. at job submission server). Even though 
we cannot easily enforce this at the client-side, the simplicity of a solution 
may make it more appealing. 

Thanks,
Shardul


  

Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Tom Graves
 +1
Tom
On Thursday, June 9, 2022, 11:27:50 PM CDT, Maxim Gekk 
 wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 3.3.0.

The vote is open until 11:59pm Pacific time June 14th and passes if a majority 
+1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.3.0-rc6 (commit 
f74867bddfbcdd4d08076db36851e88b15e66556):
https://github.com/apache/spark/tree/v3.3.0-rc6

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1407

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/

The list of bug fixes going into 3.3.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12350369
This release is using the release script of the tag v3.3.0-rc6.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 3.3.0?
===
The current list of open tickets targeted at 3.3.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.3.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.
Maxim Gekk

Software Engineer

Databricks, Inc.
  

Re: [VOTE] Release Spark 3.3.0 (RC3)

2022-05-27 Thread Tom Graves
 +1. Ran through internal tests.
Tom Graves
On Tuesday, May 24, 2022, 12:13:56 PM CDT, Maxim Gekk 
 wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 3.3.0.

The vote is open until 11:59pm Pacific time May 27th and passes if a majority 
+1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.3.0-rc3 (commit 
a7259279d07b302a51456adb13dc1e41a6fd06ed):
https://github.com/apache/spark/tree/v3.3.0-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1404

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc3-docs/

The list of bug fixes going into 3.3.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12350369
This release is using the release script of the tag v3.3.0-rc3.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 3.3.0?
===
The current list of open tickets targeted at 3.3.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.3.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.
Maxim Gekk

Software Engineer

Databricks, Inc.
  

Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-10 Thread Tom Graves
 Is there going to be an RC2? I thought there were a couple of issues mentioned 
in the thread.
Tom
On Tuesday, May 10, 2022, 11:53:36 AM CDT, Maxim Gekk 
 wrote:  
 
 Hi All,
Today is the last day for voting. Please, test the RC1 and vote.
Maxim Gekk

Software Engineer

Databricks, Inc.


On Sat, May 7, 2022 at 10:58 AM beliefer  wrote:





 @Maxim Gekk  Glad to hear that!

But there is a bug https://github.com/apache/spark/pull/36457
I think we should merge it into 3.3.0



At 2022-05-05 19:00:27, "Maxim Gekk"  wrote:

Please vote on releasing the following candidate as Apache Spark version 3.3.0.

The vote is open until 11:59pm Pacific time May 10th and passes if a majority 
+1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.3.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.3.0-rc1 (commit 
482b7d54b522c4d1e25f3e84eabbc78126f22a3d):
https://github.com/apache/spark/tree/v3.3.0-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1402

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-docs/

The list of bug fixes going into 3.3.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12350369
This release is using the release script of the tag v3.3.0-rc1.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 3.3.0?
===
The current list of open tickets targeted at 3.3.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.3.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.
Maxim Gekk

Software Engineer

Databricks, Inc.




 

  

Re: Apache Spark 3.3 Release

2022-03-21 Thread Tom Graves
 Maybe I'm misunderstanding what you are saying: according to those dates, 
code freeze, by which the majority of features should be merged, is March 15th. So 
if this list is all features that are not merged at this point, we should probably 
discuss whether we want them to go in or whether we need to change the dates. Major 
features going in during the QA period can destabilize things.
Tom
On Monday, March 21, 2022, 01:53:24 AM CDT, Wenchen Fan 
 wrote:  
 
Just checked the release calendar; the planned RC cut date is in April.
Let's revisit after 2 weeks then?
On Mon, Mar 21, 2022 at 2:47 PM Wenchen Fan  wrote:

Shall we revisit this list after a week? Ideally, they should be either merged 
or rejected for 3.3, so that we can cut rc1. We can still discuss them case by 
case at that time if there are exceptions.
On Sat, Mar 19, 2022 at 5:27 AM Dongjoon Hyun  wrote:

Thank you for your summarization.

I believe we need to have a discussion in order to evaluate each PR's readiness.

BTW, `branch-3.3` is still open for bug fixes including minor dependency 
changes like the following.

(Backported)[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4
Revert "[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.4"
[SPARK-38563][PYTHON] Upgrade to Py4J 0.10.9.5

(Upcoming)
[SPARK-38544][BUILD] Upgrade log4j2 to 2.17.2 from 2.17.1
[SPARK-38602][BUILD] Upgrade Kafka to 3.1.1 from 3.1.0
Dongjoon.


On Thu, Mar 17, 2022 at 11:22 PM Maxim Gekk  wrote:

Hi All,
Here is the allow list which I built based on your requests in this thread:   
   - SPARK-37396: Inline type hint files for files in python/pyspark/mllib
   - SPARK-37395: Inline type hint files for files in python/pyspark/ml
   - SPARK-37093: Inline type hints python/pyspark/streaming
   - SPARK-37377: Refactor V2 Partitioning interface and remove deprecated 
usage of Distribution
   - SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
   - SPARK-32268: Bloom Filter Join
   - SPARK-38548: New SQL function: try_sum
   - SPARK-37691: Support ANSI Aggregation Function: percentile_disc
   - SPARK-38063: Support SQL split_part function
   - SPARK-28516: Data Type Formatting Functions: `to_char`
   - SPARK-38432: Refactor framework so as JDBC dialect could compile filter by 
self way
   - SPARK-34863: Support nested column in Spark Parquet vectorized readers
   - SPARK-38194: Make Yarn memory overhead factor configurable
   - SPARK-37618: Support cleaning up shuffle blocks from external shuffle 
service
   - SPARK-37831: Add task partition id in metrics
   - SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and 
DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
   - SPARK-36664: Log time spent waiting for cluster resources
   - SPARK-34659: Web UI does not correctly get appId
   - SPARK-37650: Tell spark-env.sh the python interpreter
   - SPARK-38589: New SQL function: try_avg
   - SPARK-38590: New SQL function: try_to_binary
   - SPARK-34079: Improvement CTE table scan

Best regards,
Max Gekk

On Thu, Mar 17, 2022 at 4:59 PM Tom Graves  wrote:

 Is the feature freeze target date March 22nd then? I saw a few dates thrown 
around and want to confirm what we landed on.
I am trying to get the following improvements through review and merged; if there are 
concerns with either, let me know:
- [SPARK-34079][SQL] Merge non-correlated scalar subqueries
- [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service for released executors
Tom

On Thursday, March 17, 2022, 07:24:41 AM CDT, Gengliang Wang 
 wrote:  
 
 I'd like to add the following new SQL functions in the 3.3 release. These 
functions are useful when overflow or encoding errors occur:   
   - [SPARK-38548][SQL] New SQL function: try_sum    

   - [SPARK-38589][SQL] New SQL function: try_avg   

   - [SPARK-38590][SQL] New SQL function: try_to_binary    

Gengliang
On Thu, Mar 17, 2022 at 7:59 AM Andrew Melo  wrote:

Hello,

I've been trying for a bit to get the following two PRs merged and
into a release, and I'm having some difficulty moving them forward:

https://github.com/apache/spark/pull/34903 - This passes the current
python interpreter to spark-env.sh to allow some currently-unavailable
customization to happen
https://github.com/apache/spark/pull/31774 - This fixes a bug in the
SparkUI reverse proxy-handling code where it does a greedy match for
"proxy" in the URL, and will mistakenly replace the App-ID in the
wrong place.

I'm not exactly sure of how to get attention of PRs that have been
sitting around for a while, but these are really important to our
use-cases, and it would be nice to have them merged in.

Cheers
Andrew

On Wed, Mar 16, 2022 at 6:21 PM Holden Karau  wrote:
>
> I'd like to add/backport the logging in 
> https://github.com/apache/spark/pull/35881 PR so that when users submit 
> issues with dynamic allocation we can better debug what's going on.
>
> On Wed, Mar 16, 2022 at 3:45 PM Chao Sun  wrote:
>>
>> There is one item on our sid

Re: Apache Spark 3.3 Release

2022-03-17 Thread Tom Graves
 Is the feature freeze target date March 22nd then? I saw a few dates thrown 
around and want to confirm what we landed on.
I am trying to get the following improvements through review and merged; if there are 
concerns with either, let me know:
- [SPARK-34079][SQL] Merge non-correlated scalar subqueries
- [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service for released executors
Tom

On Thursday, March 17, 2022, 07:24:41 AM CDT, Gengliang Wang 
 wrote:  
 
 I'd like to add the following new SQL functions in the 3.3 release. These 
functions are useful when overflow or encoding errors occur:   
   - [SPARK-38548][SQL] New SQL function: try_sum    

   - [SPARK-38589][SQL] New SQL function: try_avg   

   - [SPARK-38590][SQL] New SQL function: try_to_binary    

Gengliang
On Thu, Mar 17, 2022 at 7:59 AM Andrew Melo  wrote:

Hello,

I've been trying for a bit to get the following two PRs merged and
into a release, and I'm having some difficulty moving them forward:

https://github.com/apache/spark/pull/34903 - This passes the current
python interpreter to spark-env.sh to allow some currently-unavailable
customization to happen
https://github.com/apache/spark/pull/31774 - This fixes a bug in the
SparkUI reverse proxy-handling code where it does a greedy match for
"proxy" in the URL, and will mistakenly replace the App-ID in the
wrong place.

I'm not exactly sure of how to get attention of PRs that have been
sitting around for a while, but these are really important to our
use-cases, and it would be nice to have them merged in.

Cheers
Andrew

On Wed, Mar 16, 2022 at 6:21 PM Holden Karau  wrote:
>
> I'd like to add/backport the logging in 
> https://github.com/apache/spark/pull/35881 PR so that when users submit 
> issues with dynamic allocation we can better debug what's going on.
>
> On Wed, Mar 16, 2022 at 3:45 PM Chao Sun  wrote:
>>
>> There is one item on our side that we want to backport to 3.3:
>> - vectorized DELTA_BYTE_ARRAY/DELTA_LENGTH_BYTE_ARRAY encodings for
>> Parquet V2 support (https://github.com/apache/spark/pull/35262)
>>
>> It's already reviewed and approved.
>>
>> On Wed, Mar 16, 2022 at 9:13 AM Tom Graves  
>> wrote:
>> >
>> > It looks like the version hasn't been updated on master and still shows 
>> > 3.3.0-SNAPSHOT, can you please update that.
>> >
>> > Tom
>> >
>> > On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk 
>> >  wrote:
>> >
>> >
>> > Hi All,
>> >
>> > I have created the branch for Spark 3.3:
>> > https://github.com/apache/spark/commits/branch-3.3
>> >
>> > Please, backport important fixes to it, and if you have some doubts, ping 
>> > me in the PR. Regarding new features, we are still building the allow list 
>> > for branch-3.3.
>> >
>> > Best regards,
>> > Max Gekk
>> >
>> >
>> > On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun  
>> > wrote:
>> >
>> > Yes, I agree with you for your whitelist approach for backporting. :)
>> > Thank you for summarizing.
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>> >
>> > On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:
>> >
>> > I think I finally got your point. What you want to keep unchanged is the 
>> > branch cut date of Spark 3.3. Today? or this Friday? This is not a big 
>> > deal.
>> >
>> > My major concern is whether we should keep merging the feature work or the 
>> > dependency upgrade after the branch cut. To make our release time more 
>> > predictable, I am suggesting we should finalize the exception PR list 
>> > first, instead of merging them in an ad hoc way. In the past, we spent a 
>> > lot of time on the revert of the PRs that were merged after the branch 
>> > cut. I hope we can minimize unnecessary arguments in this release. Do you 
>> > agree, Dongjoon?
>> >
>> >
>> >
>> > Dongjoon Hyun  wrote on Tue, Mar 15, 2022 at 15:55:
>> >
>> > That is not totally fine, Xiao. It sounds like you are asking a change of 
>> > plan without a proper reason.
>> >
>> > Although we cut the branch Today according our plan, you still can collect 
>> > the list and make a list of exceptions. I'm not blocking what you want to 
>> > do.
>> >
>> > Please let the community start to ramp down as we agreed before.
>> >
>> > Dongjoon
>> >
>> >
>> >
>> > On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:
>> >
>> > Please do not get me wrong. If we don't cut a branch, we are allowing all

Re: Apache Spark 3.3 Release

2022-03-16 Thread Tom Graves
 It looks like the version hasn't been updated on master and still shows 
3.3.0-SNAPSHOT; can you please update that?
Tom
On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk 
 wrote:  
 
 Hi All,

I have created the branch for Spark 3.3:
https://github.com/apache/spark/commits/branch-3.3

Please, backport important fixes to it, and if you have some doubts, ping me in 
the PR. Regarding new features, we are still building the allow list for 
branch-3.3.
Best regards,
Max Gekk

On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun  wrote:

Yes, I agree with your whitelist approach for backporting. :) Thank you 
for summarizing.

Thanks,
Dongjoon.

On Tue, Mar 15, 2022 at 4:20 PM Xiao Li  wrote:

I think I finally got your point. What you want to keep unchanged is the branch 
cut date of Spark 3.3. Today? or this Friday? This is not a big deal. 
My major concern is whether we should keep merging the feature work or the 
dependency upgrade after the branch cut. To make our release time more 
predictable, I am suggesting we should finalize the exception PR list first, 
instead of merging them in an ad hoc way. In the past, we spent a lot of time 
on the revert of the PRs that were merged after the branch cut. I hope we can 
minimize unnecessary arguments in this release. Do you agree, Dongjoon?


Dongjoon Hyun  wrote on Tue, Mar 15, 2022 at 15:55:

That is not totally fine, Xiao. It sounds like you are asking for a change of plan 
without a proper reason.
Although we cut the branch today according to our plan, you can still collect the 
list and make a list of exceptions. I'm not blocking what you want to do.
Please let the community start to ramp down as we agreed before.
Dongjoon


On Tue, Mar 15, 2022 at 3:07 PM Xiao Li  wrote:

Please do not get me wrong. If we don't cut a branch, we are allowing all 
patches to land in Apache Spark 3.3. That is totally fine. After we cut the 
branch, we should avoid merging feature work. In the next three days, let 
us collect the actively developed PRs that we want to make an exception for (i.e., 
merge to 3.3 after the upcoming branch cut). Does that make sense?
Dongjoon Hyun  wrote on Tue, Mar 15, 2022 at 14:54:

Xiao, you are working against what you are saying. If you don't cut a branch, it 
means you are allowing all patches to land in Apache Spark 3.3. No?

> we need to avoid backporting the feature work that are not being well 
> discussed.


On Tue, Mar 15, 2022 at 12:12 PM Xiao Li  wrote:

Cutting the branch is simple, but we need to avoid backporting feature work 
that has not been well discussed. Not all the members are actively following 
the dev list. I think we should wait 3 more days to collect the PR list 
before cutting the branch.
BTW, there are very few 3.4-only feature work that will be affected.

Xiao
Dongjoon Hyun  wrote on Tue, Mar 15, 2022 at 11:49:

Hi, Max, Chao, Xiao, Holden and all.
I have a different idea.
Given the situation and the small patch list, I don't think we need to postpone the 
branch cut for those patches. It's easier to cut branch-3.3 and allow 
backporting.
As of today, we already have an obvious Apache Spark 3.4 patch in the branch. 
This situation only becomes worse and worse because there is no way 
to block other patches from landing unintentionally if we don't cut a 
branch.
    [SPARK-38335][SQL] Implement parser support for DEFAULT column values

Let's cut `branch-3.3` Today for Apache Spark 3.3.0 preparation.
Best,
Dongjoon.

On Tue, Mar 15, 2022 at 10:17 AM Chao Sun  wrote:

Cool, thanks for clarifying!

On Tue, Mar 15, 2022 at 10:11 AM Xiao Li  wrote:
>>
>> For the following list:
>> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>> Do you mean we should include them, or exclude them from 3.3?
>
>
> If possible, I hope these features can be shipped with Spark 3.3.
>
>
>
> Chao Sun  wrote on Tue, Mar 15, 2022 at 10:06:
>>
>> Hi Xiao,
>>
>> For the following list:
>>
>> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
>> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>
>> Do you mean we should include them, or exclude them from 3.3?
>>
>> Thanks,
>> Chao
>>
>> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun  
>> wrote:
>> >
>> > The following was tested and merged a few minutes ago. So, we can remove 
>> > it from the list.
>> >
>> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>> >
>> > Thanks,
>> > Dongjoon.
>> >
>> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li  wrote:
>> >>
>> >> Let me clarify my above suggestion. Maybe we can wait 3 more days to 
>> >> collect the list of actively developed PRs that we want to merge to 3.3 
>> >> after the branch cut?
>> >>
>> >> Please do not rush to merge the PRs that are not fully reviewed. We can 
>> >> cut the branch this Friday and continue merging the PRs that have been 

Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2021-11-30 Thread Tom Graves
 Great to have other integrations and improved K8s support. Left some 
comments/questions in the design doc.
Tom
On Tuesday, November 30, 2021, 02:46:42 AM CST, Yikun Jiang 
 wrote:  
 
 Hey everyone,

I'd like to start a discussion on "Support Volcano/Alternative Schedulers 
Proposal".

This SPIP is proposed to make Spark K8s schedulers provide more YARN-like 
features (such as queues and minimum resources before scheduling jobs) that 
many folks want on Kubernetes.

The goal of this SPIP is to improve the current Spark K8s scheduler 
implementations, add the ability of batch scheduling, and support Volcano as one 
of the implementations.

Design doc: 
https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg
JIRA: https://issues.apache.org/jira/browse/SPARK-36057
Part of PRs:
Ability to create resources: https://github.com/apache/spark/pull/34599
Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456

Regards,
Yikun

Re: Update Spark 3.3 release window?

2021-10-28 Thread Tom Graves
 +1 for updating; mid-March sounds good. I'm also fine with EOL for 2.x.
Tom
On Thursday, October 28, 2021, 09:37:00 AM CDT, Mridul Muralidharan 
 wrote:  
 
 
+1 to EOL 2.x. Mid-March sounds like a good placeholder for 3.3.
Regards,
Mridul
On Wed, Oct 27, 2021 at 10:38 PM Sean Owen  wrote:

Seems fine to me - as good a placeholder as anything. Would that be about time 
to call 2.x end-of-life?
On Wed, Oct 27, 2021 at 9:36 PM Hyukjin Kwon  wrote:

Hi all,
Spark 3.2 is out. Shall we update the release window 
https://spark.apache.org/versioning-policy.html?
I am thinking of mid-March 2022 (5 months after the 3.2 release) for code 
freeze and onward.


  

Re: [VOTE] Release Spark 3.2.0 (RC2)

2021-09-17 Thread Tom Graves
 Thanks, I didn't see that one.
Tom
On Friday, September 17, 2021, 10:45:36 AM CDT, Gengliang Wang 
 wrote:  
 
 Hi Tom,
I will cut RC3 right after SPARK-36772 is resolved.
Thanks,
Gengliang
On Fri, Sep 17, 2021 at 10:03 PM Tom Graves  wrote:

 Hey folks,
Just curious what the status is on doing an RC3? I didn't see any blockers 
left since it looks like the parquet change got merged.
Thanks,
Tom
On Thursday, September 9, 2021, 12:27:58 PM CDT, Mridul Muralidharan 
 wrote:  
 
 
I have filed a blocker, SPARK-36705 which will need to be addressed. 
Regards,Mridul

On Sun, Sep 5, 2021 at 8:47 AM Gengliang Wang  wrote:

Hi all,
the vote failed. Liang-Chi reported a new blocker, SPARK-36669. We will have RC3 
when the existing issues are resolved.

On Thu, Sep 2, 2021 at 5:01 AM Sean Owen  wrote:

This RC looks OK to me too, understanding we may need to have RC3 for the 
outstanding issues though.
The issue with the Scala 2.13 POM is still there; I wasn't able to figure it 
out (anyone?), though it may not affect 'normal' usage (and is work-around-able 
in other uses, it seems), so may be sufficient if Scala 2.13 support is 
experimental as of 3.2.0 anyway.

On Wed, Sep 1, 2021 at 2:08 AM Gengliang Wang  wrote:

Please vote on releasing the following candidate as Apache Spark version 3.2.0.

The vote is open until 11:59pm Pacific time September 3 and passes if a 
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.2.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.2.0-rc2 (commit 
6bb3523d8e838bd2082fb90d7f3741339245c044):
https://github.com/apache/spark/tree/v3.2.0-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1389

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc2-docs/

The list of bug fixes going into 3.2.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12349407

This release is using the release script of the tag v3.2.0-rc2.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 3.2.0?
===
The current list of open tickets targeted at 3.2.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.2.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


  
  

Re: [VOTE] Release Spark 3.2.0 (RC2)

2021-09-17 Thread Tom Graves
 Hey folks,
Just curious what the status is on doing an RC3? I didn't see any blockers 
left since it looks like the parquet change got merged.
Thanks,
Tom
On Thursday, September 9, 2021, 12:27:58 PM CDT, Mridul Muralidharan 
 wrote:  
 
 
I have filed a blocker, SPARK-36705 which will need to be addressed. 
Regards,Mridul

On Sun, Sep 5, 2021 at 8:47 AM Gengliang Wang  wrote:

Hi all,
the vote failed. Liang-Chi reported a new blocker, SPARK-36669. We will have RC3 
when the existing issues are resolved.

On Thu, Sep 2, 2021 at 5:01 AM Sean Owen  wrote:

This RC looks OK to me too, understanding we may need to have RC3 for the 
outstanding issues though.
The issue with the Scala 2.13 POM is still there; I wasn't able to figure it 
out (anyone?), though it may not affect 'normal' usage (and is work-around-able 
in other uses, it seems), so may be sufficient if Scala 2.13 support is 
experimental as of 3.2.0 anyway.

On Wed, Sep 1, 2021 at 2:08 AM Gengliang Wang  wrote:

Please vote on releasing the following candidate as Apache Spark version 3.2.0.

The vote is open until 11:59pm Pacific time September 3 and passes if a 
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.2.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.2.0-rc2 (commit 
6bb3523d8e838bd2082fb90d7f3741339245c044):
https://github.com/apache/spark/tree/v3.2.0-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1389

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc2-docs/

The list of bug fixes going into 3.2.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12349407

This release is using the release script of the tag v3.2.0-rc2.


FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 3.2.0?
===
The current list of open tickets targeted at 3.2.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.2.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


  

Re: -1s on committed but not released code?

2021-08-20 Thread Tom Graves
 So personally I think it's fine to comment post-merge, but I think an issue should also be filed (that might just be me though). This change was reviewed and committed, so if someone found a problem with it, then it should be officially tracked as a bug.
I would think a -1 on an already committed change is very rare, and the person who gave it should give technical reasons for it. For that reason it should be fairly clear: if it's a functional bug, just fix it as a bug; if it's something else with the design, then I think it has to be discussed further. In my opinion it has been committed and is valid until that discussion comes to a conclusion. The one argument against that is if something is pushed in very quickly and people aren't given time to adequately review. I can see in that case where you might revert it more quickly.
Tom

On Thursday, August 19, 2021, 08:25:14 PM CDT, Hyukjin Kwon 
 wrote:  
 
 Yeah, I think we can discuss and revert it (or fix it) per the veto set. Often problems are found later, after code is merged.


On Fri, Aug 20, 2021 at 4:08 AM, Mridul Muralidharan wrote:

Hi Holden,
  In the past, I have seen discussions on the merged PR to thrash out the details. Usually it would be clear whether to revert and reformulate the change, or concerns get addressed and possibly result in follow-up work.
This is usually helped by the fact that we typically are conservative and don’t 
merge changes too quickly: giving folks sufficient time to review and opine.
Regards,Mridul 
On Thu, Aug 19, 2021 at 1:36 PM Holden Karau  wrote:

Hi Y'all,
This just recently came up but I'm not super sure on how we want to handle this 
in general. If code was committed under the lazy consensus model and then a 
committer or PMC -1s it post merge, what do we want to do?
I know we had some previous discussion around -1s, but that was largely focused 
on pre-commit -1s.
Cheers,
Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

  

Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-18 Thread Tom Graves
 +1  Ran through some internal tests.
Thanks,Tom
On Thursday, June 17, 2021, 05:11:21 AM CDT, Yi Wu  
wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 3.0.3.

The vote is open until Jun 21th 3AM (PST) and passes if a majority +1 PMC votes 
are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v3.0.3-rc1 (commit 
65ac1e75dc468f53fc778cd2ce1ba3f21067aab8):
https://github.com/apache/spark/tree/v3.0.3-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1386/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-docs/

The list of bug fixes going into 3.0.3 can be found at the following 
URL:https://issues.apache.org/jira/projects/SPARK/versions/12349723
This release is using the release script of the tag v3.0.3-rc1.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 3.0.3?
===

The current list of open tickets targeted at 3.0.3 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.0.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.
  

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-24 Thread Tom Graves
 So repartition() would look at some other config (spark.sql.adaptive.advisoryPartitionSizeInBytes) to decide the size to use to partition on, then? Does it require AQE? If so, what does a repartition() call do if AQE is not enabled? This is essentially a new API, so would repartitionBySize or something similar be less confusing to users who already use repartition(num_partitions)?
Tom
On Monday, May 24, 2021, 12:30:20 PM CDT, Wenchen Fan  
wrote:  
 
 Ideally this should be handled by the underlying data source to produce a 
reasonably partitioned RDD as the input data. However if we already have a 
poorly partitioned RDD at hand and want to repartition it properly, I think an 
extra shuffle is required so that we can know the partition size first.
That said, I think calling `.repartition()` with no args is indeed a good 
solution for this problem.
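For concreteness, a minimal Scala sketch of the no-arg repartition() path under AQE, steering it with the advisory partition size config mentioned earlier in the thread (the size value and output path are illustrative):

    // With AQE enabled, a no-arg repartition() lets Spark coalesce post-shuffle
    // partitions toward the advisory size instead of a fixed partition count.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("aqe-repartition-sketch").getOrCreate()
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")

    // Deliberately over-partitioned input.
    val df = spark.range(0L, 10000000L, 1L, 1000)

    // AQE decides the final number of partitions from the observed shuffle sizes.
    df.repartition().write.mode("overwrite").parquet("/tmp/aqe_repartition_sketch")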
On Sat, May 22, 2021 at 1:12 AM mhawes  wrote:

Adding /another/ update to say that I'm currently planning on using a
recently introduced feature whereby calling `.repartition()` with no args
will cause the dataset to be optimised by AQE. This actually suits our
use-case perfectly!

Example:

        sparkSession.conf().set("spark.sql.adaptive.enabled", "true");
        Dataset<Long> dataset = sparkSession.range(1, 4, 1, 4).repartition();

        assertThat(dataset.rdd().collectPartitions().length).isEqualTo(1); // true


Relevant PRs/Issues:
[SPARK-31220][SQL] repartition obeys initialPartitionNum when
adaptiveExecutionEnabled https://github.com/apache/spark/pull/27986
[SPARK-32056][SQL] Coalesce partitions for repartition by expressions when
AQE is enabled https://github.com/apache/spark/pull/28900
[SPARK-32056][SQL][Follow-up] Coalesce partitions for repartiotion hint and
sql when AQE is enabled https://github.com/apache/spark/pull/28952



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


  

Re: Please take a look at the draft of the Spark 3.1.1 release notes

2021-03-01 Thread Tom Graves
 Thanks Hyukjin, overall they look good to me.
Tom
On Saturday, February 27, 2021, 05:00:42 PM CST, Jungtaek Lim 
 wrote:  
 
 Thanks Hyukjin! I've only looked into the SS part, and added a comment. 
Otherwise it looks great! 
On Sat, Feb 27, 2021 at 7:12 PM Dongjoon Hyun  wrote:

Thank you for sharing, Hyukjin!
Dongjoon.
On Sat, Feb 27, 2021 at 12:36 AM Hyukjin Kwon  wrote:

Hi all,

I am preparing to publish and announce Spark 3.1.1.
This is the draft of the release note, and I plan to edit a bit more and use it 
as the final release note.
Please take a look and let me know if I missed any major changes or something.

https://docs.google.com/document/d/1x6zzgRsZ4u1DgUh1XpGzX914CZbsHeRYpbqZ-PV6wdQ/edit?usp=sharing

Thanks.

  

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-02-02 Thread Tom Graves
 OK, thanks for the update. That is marked as an improvement; if it's a blocker, can we mark it as such and describe why? I searched JIRAs and didn't see any critical issues or blockers open.
Tom
On Tuesday, February 2, 2021, 05:12:24 PM CST, Hyukjin Kwon 
 wrote:  
 
  There is one here: https://github.com/apache/spark/pull/31440. There look to be several issues being identified (to confirm that this is an issue in OSS too) and fixed in parallel.
There are a bit of unexpected delays here as several issues more were found. I 
will try to file and share relevant JIRAs as soon as I can confirm.

On Wed, Feb 3, 2021 at 2:36 AM, Tom Graves wrote:

 Just curious if we have an update on the next RC? Is there a JIRA for the TPCDS issue?
Thanks,
Tom
On Wednesday, January 27, 2021, 05:46:27 PM CST, Hyukjin Kwon 
 wrote:  
 
 Just to share the current status, most of the known issues were resolved. Let 
me know if there are some more.
One thing left is a performance regression in TPCDS being investigated. Once 
this is identified (and fixed if it should be), I will cut another RC right 
away.
I roughly expect to cut another RC next Monday.

Thanks guys.
2021년 1월 27일 (수) 오전 5:26, Terry Kim 님이 작성:

Hi,
Please check if the following regression should be included: 
https://github.com/apache/spark/pull/31352
Thanks,Terry
On Tue, Jan 26, 2021 at 7:54 AM Holden Karau  wrote:

If were ok waiting for it, I’d like to get 
https://github.com/apache/spark/pull/31298 in as well (it’s not a regression 
but it is a bug fix).
On Tue, Jan 26, 2021 at 6:38 AM Hyukjin Kwon  wrote:

It looks like a cool one but it's a pretty big one and affects the plans 
considerably ... maybe it's best to avoid adding it into 3.1.1 in particular 
during the RC period if this isn't a clear regression that affects many users.
On Tue, Jan 26, 2021 at 11:23 PM, Peter Toth wrote:

Hey,
Sorry for chiming in a bit late, but I would like to suggest my PR 
(https://github.com/apache/spark/pull/28885) for review and inclusion into 
3.1.1.

Currently, invalid reuse reference nodes appear in many queries, causing 
performance issues and incorrect explain plans. Now that 
https://github.com/apache/spark/pull/31243 got merged these invalid references 
can be easily found in many of our golden files on master: 
https://github.com/apache/spark/pull/28885#issuecomment-767530441.
But the issue isn't master (3.2) specific, actually it has been there since 3.0 
when Dynamic Partition Pruning was added. 
So it is not a regression from 3.0 to 3.1.1, but in some cases (like TPCDS 
q23b) it is causing performance regression from 2.4 to 3.x.

Thanks,Peter
On Tue, Jan 26, 2021 at 6:30 AM Hyukjin Kwon  wrote:

Guys, I plan to make an RC as soon as we have no visible issues. I have merged 
a few correctness issues. There look:
- https://github.com/apache/spark/pull/31319 waiting for a review (I will do it 
too soon).
- https://github.com/apache/spark/pull/31336
- I know Max's investigating the perf regression one which hopefully will be 
fixed soon.

Are there any more blockers or correctness issues? Please ping me or say it out 
here.
I would like to avoid making an RC when there are clearly some issues to be 
fixed.
If you're investigating something suspicious, that's fine too. It's better to 
make sure we're safe instead of rushing an RC without finishing the 
investigation.

Thanks all.


On Fri, Jan 22, 2021 at 6:19 PM, Hyukjin Kwon wrote:

Sure, thanks guys. I'll start another RC after the fixes. Looks like we're 
almost there.
On Fri, 22 Jan 2021, 17:47 Wenchen Fan,  wrote:

BTW, there is a correctness bug being fixed at 
https://github.com/apache/spark/pull/30788 . It's not a regression, but the fix 
is very simple and it would be better to start the next RC after merging that 
fix.
On Fri, Jan 22, 2021 at 3:54 PM Maxim Gekk  wrote:

Also I am investigating a performance regression in some TPC-DS queries (q88 
for instance) that is caused by a recent commit in 3.1, highly likely in the 
period from 19th November, 2020 to 18th December, 2020.
Maxim Gekk

Software Engineer

Databricks, Inc.


On Fri, Jan 22, 2021 at 10:45 AM Wenchen Fan  wrote:

-1 as I just found a regression in 3.1. A self-join query works well in 3.0 but 
fails in 3.1. It's being fixed at https://github.com/apache/spark/pull/31287
On Fri, Jan 22, 2021 at 4:34 AM Tom Graves  wrote:

 +1
built from tarball, verified sha and regular CI and tests all pass.
Tom
On Monday, January 18, 2021, 06:06:42 AM CST, Hyukjin Kwon 
 wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 3.1.1.
The vote is open until January 22nd 4PM PST and passes if a majority +1 PMC 
votes are cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.1.0[ ] -1 Do not release this 
package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v3.1.1-rc1 (commit 
53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):https://github.com/apache/spark/tree/v3.1.1-rc1

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-02-02 Thread Tom Graves
 Just curious if we have an update on the next RC? Is there a JIRA for the TPCDS issue?
Thanks,
Tom
On Wednesday, January 27, 2021, 05:46:27 PM CST, Hyukjin Kwon 
 wrote:  
 
 Just to share the current status, most of the known issues were resolved. Let 
me know if there are some more.
One thing left is a performance regression in TPCDS being investigated. Once 
this is identified (and fixed if it should be), I will cut another RC right 
away.
I roughly expect to cut another RC next Monday.

Thanks guys.
2021년 1월 27일 (수) 오전 5:26, Terry Kim 님이 작성:

Hi,
Please check if the following regression should be included: 
https://github.com/apache/spark/pull/31352
Thanks,Terry
On Tue, Jan 26, 2021 at 7:54 AM Holden Karau  wrote:

If were ok waiting for it, I’d like to get 
https://github.com/apache/spark/pull/31298 in as well (it’s not a regression 
but it is a bug fix).
On Tue, Jan 26, 2021 at 6:38 AM Hyukjin Kwon  wrote:

It looks like a cool one but it's a pretty big one and affects the plans 
considerably ... maybe it's best to avoid adding it into 3.1.1 in particular 
during the RC period if this isn't a clear regression that affects many users.
On Tue, Jan 26, 2021 at 11:23 PM, Peter Toth wrote:

Hey,
Sorry for chiming in a bit late, but I would like to suggest my PR 
(https://github.com/apache/spark/pull/28885) for review and inclusion into 
3.1.1.

Currently, invalid reuse reference nodes appear in many queries, causing 
performance issues and incorrect explain plans. Now that 
https://github.com/apache/spark/pull/31243 got merged these invalid references 
can be easily found in many of our golden files on master: 
https://github.com/apache/spark/pull/28885#issuecomment-767530441.
But the issue isn't master (3.2) specific, actually it has been there since 3.0 
when Dynamic Partition Pruning was added. 
So it is not a regression from 3.0 to 3.1.1, but in some cases (like TPCDS 
q23b) it is causing performance regression from 2.4 to 3.x.

Thanks,Peter
On Tue, Jan 26, 2021 at 6:30 AM Hyukjin Kwon  wrote:

Guys, I plan to make an RC as soon as we have no visible issues. I have merged 
a few correctness issues. There look:
- https://github.com/apache/spark/pull/31319 waiting for a review (I will do it 
too soon).
- https://github.com/apache/spark/pull/31336
- I know Max's investigating the perf regression one which hopefully will be 
fixed soon.

Are there any more blockers or correctness issues? Please ping me or say it out 
here.
I would like to avoid making an RC when there are clearly some issues to be 
fixed.
If you're investigating something suspicious, that's fine too. It's better to 
make sure we're safe instead of rushing an RC without finishing the 
investigation.

Thanks all.


On Fri, Jan 22, 2021 at 6:19 PM, Hyukjin Kwon wrote:

Sure, thanks guys. I'll start another RC after the fixes. Looks like we're 
almost there.
On Fri, 22 Jan 2021, 17:47 Wenchen Fan,  wrote:

BTW, there is a correctness bug being fixed at 
https://github.com/apache/spark/pull/30788 . It's not a regression, but the fix 
is very simple and it would be better to start the next RC after merging that 
fix.
On Fri, Jan 22, 2021 at 3:54 PM Maxim Gekk  wrote:

Also I am investigating a performance regression in some TPC-DS queries (q88 
for instance) that is caused by a recent commit in 3.1, highly likely in the 
period from 19th November, 2020 to 18th December, 2020.
Maxim Gekk

Software Engineer

Databricks, Inc.


On Fri, Jan 22, 2021 at 10:45 AM Wenchen Fan  wrote:

-1 as I just found a regression in 3.1. A self-join query works well in 3.0 but 
fails in 3.1. It's being fixed at https://github.com/apache/spark/pull/31287
On Fri, Jan 22, 2021 at 4:34 AM Tom Graves  wrote:

 +1
built from tarball, verified sha and regular CI and tests all pass.
Tom
On Monday, January 18, 2021, 06:06:42 AM CST, Hyukjin Kwon 
 wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 3.1.1.
The vote is open until January 22nd 4PM PST and passes if a majority +1 PMC 
votes are cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.1.0[ ] -1 Do not release this 
package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v3.1.1-rc1 (commit 
53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):https://github.com/apache/spark/tree/v3.1.1-rc1
The release files, including signatures, digests, etc. can be found 
at:https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/
Signatures used for Spark RCs can be found in this 
file:https://dist.apache.org/repos/dist/dev/spark/KEYS
The staging repository for this release can be found 
at:https://repository.apache.org/content/repositories/orgapachespark-1364
The documentation corresponding to this release can be found 
at:https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-docs/

The list of bug fixes going into 3.1.1 can be found at the following 
URL:https://s.apache.org/41kf2
This release is using the release script

Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-21 Thread Tom Graves
 +1
built from tarball, verified sha and regular CI and tests all pass.
Tom
On Monday, January 18, 2021, 06:06:42 AM CST, Hyukjin Kwon 
 wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 3.1.1.
The vote is open until January 22nd 4PM PST and passes if a majority +1 PMC 
votes are cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.1.0[ ] -1 Do not release this 
package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v3.1.1-rc1 (commit 
53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):https://github.com/apache/spark/tree/v3.1.1-rc1
The release files, including signatures, digests, etc. can be found 
at:https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/
Signatures used for Spark RCs can be found in this 
file:https://dist.apache.org/repos/dist/dev/spark/KEYS
The staging repository for this release can be found 
at:https://repository.apache.org/content/repositories/orgapachespark-1364
The documentation corresponding to this release can be found 
at:https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-docs/

The list of bug fixes going into 3.1.1 can be found at the following 
URL:https://s.apache.org/41kf2
This release is using the release script of the tag v3.1.1-rc1.
FAQ
===
What happened to 3.1.0?
===

There was a technical issue during Apache Spark 3.1.0 preparation, and it was discussed and decided to skip 3.1.0.
Please see https://spark.apache.org/news/next-official-release-spark-3.1.1.html for more details.

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
If you're working in PySpark you can set up a virtual env and install the current RC via "pip install https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/pyspark-3.1.1.tar.gz" and see if anything important breaks. In the Java/Scala, you can add the staging repository to your projects resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with an out of date RC going forward).
===
What should happen to JIRA tickets still targeting 3.1.1?
===
The current list of open tickets targeted at 3.1.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 3.1.1
Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to an appropriate release.
==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is something which is a regression that has not been correctly targeted please ping me or a committer to help target the issue.
  

Re: Removing references to Master

2021-01-19 Thread Tom Graves
 Thanks for the interest. I haven't had time to work on replacing Master; hopefully for the next release, but it's time dependent. If you follow the jira - https://issues.apache.org/jira/browse/SPARK-32333 - I will post there when I start, or if someone else picks it up you should see activity there.
Tom
On Saturday, January 16, 2021, 07:56:14 AM CST, João Paulo Leonidas 
Fernandes Dias da Silva  wrote:  
 
 So, it looks like slave was already replaced in the docs. Waiting for a 
definition on the replacement(s) for master so I can create a PR for the docs 
only.
On Sat, Jan 16, 2021 at 8:30 AM jpaulorio  wrote:

What about updating the documentation as well? Does it depend on the codebase
changes or can we treat it as a separate issue? I volunteer to update both
Master and Slave terms when there's an agreement on what should be used as
replacement. Since [SPARK-32004] was already resolved, can I start replacing slave with worker?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


  

Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-06 Thread Tom Graves
 I think it makes sense to wait and see what they say on INFRA-21266.  
In the meantime, hopefully people can start testing it, and if no other problems are found and the vote passes, it can stay published. It seems like the two issues above wouldn't be blockers in my opinion and could be handled in a 3.1.1, but others can chime in too.
If we find other issues with it in testing and they can't revert it in INFRA-21266, I assume we handle it by putting some documentation out there telling people not to use it and we go to 3.1.1.
One thing I didn't follow was the comment: "release 3.1.1 fast that exceptionally allows a bit of breaking changes" - what do you mean by that?
If there is anything we can add to our release process documentation to prevent this in the future, that would be great as well.
Tom
On Wednesday, January 6, 2021, 03:07:26 PM CST, Hyukjin Kwon 
 wrote:  
 
 
Yes, it was my mistake. I faced the same issue as INFRA-20651, and it is worse 
in my case because I misunderstood that RC and releases are separately released 
out.
Right after this, I filed an INFRA JIRA to revert this at INFRA-21266. We can 
wait and see how it goes.

Though, I know it’s impossible to remove by right. It is possible to overwrite 
but it will affect people who already have it in their cache.
I am thinking of two options:
   
   - Skip 3.1.0 and release 3.1.1 right away since the release isn’t officially 
out to the main Apache repo/mirrors but only one of the downstream channels. We 
can just say that there was something wrong during the 3.1.0 release so it 
became 3.1.1 right away.
   
   - Release 3.1.0 out, of course, based on the vote results here. We could 
release 3.1.1 fast that exceptionally allows a bit of breaking changes with 
properly documenting it in a release note and migration guide.
I would appreciate it if I could hear other people's opinions.

Thanks.



  

Re: [build system] WE'RE LIVE!

2020-12-04 Thread Tom Graves
 Thanks, Shane and folks, for the great work.
Not sure if this is at all related, but I noticed the Spark master deploy job hasn't been running, and the last one, on Dec 2nd, failed:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/3186/

Not sure if this is a result of the upgrade?
Thanks,
Tom
On Tuesday, December 1, 2020, 06:55:27 PM CST, shane knapp ☠ 
 wrote:  
 
 https://amplab.cs.berkeley.edu/jenkins/

i cleared the build queue, so you'll need to retrigger your PRs.  there will be 
occasional downtime over the next few days and weeks as we uncover system-level 
errors and more reimaging happens...  but for now, we're building.
a big thanks goes out to jon for his work on the project!  we couldn't have 
done it w/o him.
shane-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu  

Re: Spark branch-3.1

2020-12-04 Thread Tom Graves
 Can we update the version number on the master branch? It's still 3.1.0-SNAPSHOT.
Thanks,
Tom
On Friday, December 4, 2020, 04:54:12 AM CST, Hyukjin Kwon 
 wrote:  
 
 
Hi all,

It’s 4th PDT and branch-3.1 is cut out now as planned.


Mid Dec 2020 QA period. Focus on bug fixes, tests, stability and docs. 
Generally, no new features merged


Now we’re in the QA period. Please focus on testing, polishing, stability and 
docs
for Spark 3.1.0, and hope we can have a nice Spark 3.1.0 on time :-).

In addition, I retargeted several JIRAs to 3.2.0. Please un-target or re-target 
issues
if they don’t make sense for 3.1.

Thank you all in advance.
  

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-10 Thread Tom Graves
 +1. Since it's a correctness issue, I think it's OK to change the behavior to make sure the user is aware of it and let them decide.
Tom
On Saturday, November 7, 2020, 01:00:11 AM CST, Liang-Chi Hsieh 
 wrote:  
 
 Hi devs,

In Spark structured streaming, chained stateful operators can possibly produce
incorrect results under the global watermark. SPARK-33259
(https://issues.apache.org/jira/browse/SPARK-33259) has an example
demonstrating what the correctness issue could be.

Currently we don't prevent users from running such queries, because the possible
correctness issue with chained stateful operators in a streaming query is not
straightforward for users. From a user's perspective, it will possibly be
considered a Spark bug, like SPARK-33259. In the worse case, users are not
aware of the correctness issue and use the wrong results.

IMO, it is better to disable such queries and let users choose to run the
query if they understand there is such a risk, instead of implicitly running
the query and letting users find out the correctness issue by themselves.

I would like to propose to disable the streaming query with possible
correctness issue in chained stateful operators. The behavior can be
controlled by a SQL config, so if users understand the risk and still want
to run the query, they can disable the check.
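To make the shape of such a query concrete, a Scala sketch of one chained stateful pipeline (a stream-stream join followed by a windowed aggregation); the sources, column names, thresholds, and the opt-out config name are illustrative assumptions, not taken from the PR:

    // Two rate sources standing in for real event streams; names are made up.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("chained-stateful-sketch").getOrCreate()

    val impressions = spark.readStream.format("rate").load()
      .select(col("value").as("adId"), col("timestamp").as("impressionTime"))
      .withWatermark("impressionTime", "10 minutes")
      .as("imp")

    val clicks = spark.readStream.format("rate").load()
      .select(col("value").as("adId"), col("timestamp").as("clickTime"))
      .withWatermark("clickTime", "10 minutes")
      .as("clk")

    // First stateful operator: stream-stream join bounded by event time.
    val joined = impressions.join(clicks, expr(
      "imp.adId = clk.adId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"))

    // Second stateful operator chained after the join: windowed aggregation.
    // Under the proposal, a query shaped like this would be rejected by default,
    // and users who accept the risk could opt back in via a SQL config
    // (name illustrative): spark.sql.streaming.statefulOperator.checkCorrectness.enabled=false
    val counts = joined.groupBy(window(col("impressionTime"), "15 minutes")).count()

    counts.writeStream.outputMode("append").format("console").start()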

In the PR (https://github.com/apache/spark/pull/30210), the concern I got
for now is, this changes current behavior and by default it will break some
existing streaming queries. But I think it is pretty easy to disable the
check with the new config. In the PR currently there is no objection but
suggestion to hear more voices. Please let me know if you have some
thoughts.

Thanks.
Liang-Chi Hsieh



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

  

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Tom Graves
 +1
Tom
On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul Muralidharan 
 wrote:  
 
 Hi,
I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based shuffle 
to improve shuffle efficiency.Please take a look at:   
   - SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
   - SPIP doc: 
https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
   - POC against master and results summary : 
https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit
Active discussions on the jira and SPIP document have settled.
I will leave the vote open until Friday (the 18th September 2020), 5pm CST.

[ ] +1: Accept the proposal as an official SPIP[ ] +0[ ] -1: I don't think this 
is a good idea because ...

Thanks,Mridul  

Re: [VOTE] Release Spark 3.0.1 (RC3)

2020-08-31 Thread Tom Graves
 +1
Tom
On Friday, August 28, 2020, 09:02:31 AM CDT, 郑瑞峰  
wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 3.0.1.
The vote is open until Sep 2nd at 9AM PST and passes if a majority +1 PMC votes 
are cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 3.0.1[ ] -1 Do not release this 
package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
There are currently no issues targeting 3.0.1 (try project = SPARK AND "Target 
Version/s" = "3.0.1" AND status in (Open, Reopened, "In Progress"))
The tag to be voted on is v3.0.1-rc3 (commit 
dc04bf53fe821b7a07f817966c6c173f3b3788c6):https://github.com/apache/spark/tree/v3.0.1-rc3
The release files, including signatures, digests, etc. can be found 
at:https://dist.apache.org/repos/dist/dev/spark/v3.0.1-rc3-bin/
Signatures used for Spark RCs can be found in this 
file:https://dist.apache.org/repos/dist/dev/spark/KEYS
The staging repository for this release can be found 
at:https://repository.apache.org/content/repositories/orgapachespark-1357/
The documentation corresponding to this release can be found 
at:https://dist.apache.org/repos/dist/dev/spark/v3.0.1-rc3-docs/
The list of bug fixes going into 3.0.1 can be found at the following 
URL:https://s.apache.org/q9g2d
This release is using the release script of the tag v3.0.1-rc3.
FAQ

=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
If you're working in PySpark you can set up a virtual env and install the current RC and see if anything important breaks, in the Java/Scala you can add the staging repository to your projects resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with an out of date RC going forward).
===
What should happen to JIRA tickets still targeting 3.0.1?
===
The current list of open tickets targeted at 3.0.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 3.0.1
Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to an appropriate release.
==
But my bug isn't fixed?
==
In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is something which is a regression that has not been correctly targeted please ping me or a committer to help target the issue.

  

Re: Renaming blacklisting feature input

2020-08-25 Thread Tom Graves
 Any other feedback here? The couple I've heard preferred in various conversations are excludeList and blockList. If not, I'll just make a proposal on the jira and continue the discussion there, and anyone interested can watch that jira.
Thanks,
Tom
On Tuesday, August 4, 2020, 09:19:01 AM CDT, Tom Graves 
 wrote:  
 
 Hey Folks,
We have jira https://issues.apache.org/jira/browse/SPARK-32037 to rename the blacklisting feature. It would be nice to come to a consensus on what we want to call it. It doesn't look like we have any references to whitelist other than from other components. There is some discussion on the jira, and I linked to what some other projects have done, so please take a look at that.
A few options:
- blocklist
- denylist
- healthy / HealthTracker
- quarantined
- benched
- exiled
- banlist
Please let me know your thoughts and suggestions.

Thanks,
Tom

Re: Removing references to Master

2020-08-25 Thread Tom Graves
 Thanks for the replies so far; is there any other feedback here? Of the replies so far, I think Leader has been mentioned the most.
Tom
On Tuesday, August 4, 2020, 09:33:14 AM CDT, Russell Spitzer 
 wrote:  
 
 I think we should use Scheduler or Comptroller or Leader; something that better evokes and describes the purpose as a resource management service. I would 
rather we didn't use controller, coordinator, application manager, primary 
because I feel that those terms make it seem like the process is central to an 
Application's function when in reality it does nothing other than turn off or 
on containers and processes. The key example here for me would be, if the 
StandaloneResourceManager goes down, a running app is basically unaffected . 
The initial usage of "master" was misleading even in context of previous CS 
usage of the term imho and we should choose a much more limited term to 
describe it now that we have a chance for a rename. Of course, ymmv and really 
anything would be better than the current status quo which is both misleading 
and insensitive.
On Tue, Aug 4, 2020 at 9:08 AM Holden Karau  wrote:

I think this is a good idea, and yes keeping it backwards compatible initially 
is important since we missed the boat on Spark 3. I like the Controller/Leader 
one since I think that does a good job of reflecting the codes role.
On Tue, Aug 4, 2020 at 7:01 AM Tom Graves  wrote:

Hey everyone,
I filed jira https://issues.apache.org/jira/browse/SPARK-32333 to remove references to Master. I realize this is a bigger change than the slave jira, but I wanted to get folks' input on whether they are OK with making the change, and if so, we would need to pick a name to use instead. I think we should keep it backwards compatible at first so as not to break anyone, and depending on what we find, might break it up into multiple smaller jiras.
A few name possibilities:
- ApplicationManager
- StandaloneClusterManager
- Coordinator
- Primary
- Controller
Thoughts or suggestions?
Thanks,
Tom




-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
  

Re: 回复: [DISCUSS] Apache Spark 3.0.1 Release

2020-08-25 Thread Tom Graves
 Hey,
I'm just curious what the status of the 3.0.1 release is?  Do we have some 
blockers we are waiting on?
Thanks,Tom
On Sunday, August 16, 2020, 09:07:44 PM CDT, ruifengz 
 wrote:  
 
  
Thanks for letting us know this issue.
 

 
 On 8/16/20 11:31 PM, Takeshi Yamamuro wrote:
  
 
I've checked the Jenkins log and It seems the commit from 
https://github.com/apache/spark/pull/29404 caused the failure. 
   
  On Sat, Aug 15, 2020 at 10:43 PM Koert Kuipers  wrote:
  
  i noticed commit today that seems to prepare for 3.0.1-rc1: commit 
05144a5c10cd37ebdbb55fde37d677def49af11f
 Author: Ruifeng Zheng 
 Date:   Sat Aug 15 01:37:47 2020 +
 
     Preparing Spark release v3.0.1-rc1 
  so i tried to build spark on that commit and i get failure in sql: 
  09:36:57.371 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 
77.0 failed 1 times; aborting job
 [info] - SPARK-28224: Aggregate sum big decimal overflow *** FAILED *** (306 
milliseconds)
 [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 77.0 failed 1 times, most recent failure: Lost task 0.0 in 
stage 77.0 (TID 197, 192.168.11.17, executor driver): 
java.lang.ArithmeticException: 
Decimal(expanded,0.246000,39,18}) cannot be 
represented as Decimal(38, 18).
 [info] at org.apache.spark.sql.types.Decimal.toPrecision(Decimal.scala:369)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregate_sum_0$(Unknown Source)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown Source)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown Source)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
[info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227)
[info] at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2138)
[info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[info] at org.apache.spark.scheduler.Task.run(Task.scala:127)
[info] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 [info] at java.lang.Thread.run(Thread.java:748) 
  [error] Failed tests:
 [error] org.apache.spark.sql.DataFrameSuite  
  On Thu, Aug 13, 2020 at 8:19 PM Jason Moore 
 wrote:
  
  Thank you so much!  Any update on getting the RC1 up for vote? 
  Jason. 

 
   From: 郑瑞峰 
 Sent: Wednesday, 5 August 2020 12:54 PM
 To: Jason Moore ; Spark dev list 

 Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release

Hi all, I am going to prepare the release of 3.0.1 RC1, with the help of Wenchen.

-- Original Message --
From: "Jason Moore"; Sent: Thursday, July 30, 2020, 10:35 AM; To: "dev"; Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
   
Hi all,
 
 
 
Discussion around 3.0.1 seems to have trickled away.  What was blocking the 
release process kicking off?  I can see some unresolved bugs raised against 
3.0.0, but conversely there were quite a few critical correctness fixes waiting 
to be released.
 
 
 
Cheers,
 
Jason.
 
 
  
From:  Takeshi Yamamuro 
 Date: Wednesday, 15 July 2020 at 9:00 am
 To: Shivaram Venkataraman 
 Cc: "dev@spark.apache.org" 
 Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
   
 

> Just wanted to check if there are any blockers that we are still waiting for 
> to start the new release process.
  
I don't see any on-going blocker in my area.
  
Thanks for the notification.
   
 
   
Bests,
   
Tkaeshi
   
 
   
On Wed, Jul 15, 2020 at 4:03 AM Dongjoon Hyun  wrote:
  

Hi, Yi.
  
 
   
Could you explain why you think that is a blocker? For the given example from 
the JIRA description,
   
 

spark.udf.register("key", udf((m: Map[String, String]) => 
m.keys.head.toInt))   Seq(Map("1" -> "one", "2" -> 
"two")).toDF("a").createOrReplaceTempView("t")   checkAnswer(sql("SELECT 
key(a) AS k FROM t GROUP BY 

Re: [VOTE] Release Spark 2.4.7 (RC1)

2020-08-21 Thread Tom Graves
 There is a correctness issue with caching that should go into this if 
possible: https://github.com/apache/spark/pull/29506
Tom
On Wednesday, August 19, 2020, 11:18:37 AM CDT, Wenchen Fan 
 wrote:  
 
 I think so. I don't see other bug reports for 2.4.
On Thu, Aug 20, 2020 at 12:11 AM Nicholas Marion  wrote:


It appears all 3 issues slated for Spark 2.4.7 have been merged. Should we be 
looking at getting RC2 ready?




Regards,
NICHOLAS T. MARION
IBM Open Data Analytics for z/OS - CPO and Service Team Lead
Phone: 1-845-433-5010 | Tie-Line: 293-5010
E-mail: nmar...@us.ibm.com
2455 South Rd
Poughkeepsie, New York 12601-5400
United States




Xiao Li ---08/17/2020 11:33:30 
AM---https://issues.apache.org/jira/browse/SPARK-32609

From: Xiao Li 
To: Prashant Sharma 
Cc: Takeshi Yamamuro , dev 
Date: 08/17/2020 11:33 AM
Subject: [EXTERNAL] Re: [VOTE] Release Spark 2.4.7 (RC1)





https://issues.apache.org/jira/browse/SPARK-32609 got merged. This is to fix a 
correctness bug in DSV2 of Spark 2.4. Please include it in the upcoming Spark 
2.4.7 release. 

Thanks,

Xiao

On Sun, Aug 9, 2020 at 10:26 PM Prashant Sharma  wrote:   
Thanks for letting us know. So this vote is cancelled in favor of RC2.



On Sun, Aug 9, 2020 at 8:31 AM Takeshi Yamamuro  wrote:
Thanks for letting us know about the two issues above, Dongjoon.


I've checked the release materials (signatures, tag, ...) and it looks fine, 
too.
Also, I run the tests on my local Mac (java 1.8.0) with the options
`-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes -Psparkr`
and they passed.

Bests,
Takeshi



On Sun, Aug 9, 2020 at 11:06 AM Dongjoon Hyun  wrote:  
 
Another instance is SPARK-31703 which filed on May 13th and the PR arrived two 
days ago.

    [SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on big 
endian platforms
    https://github.com/apache/spark/pull/29383

It seems that the patch is already ready in this case.
I raised the priority of SPARK-31703 to `Blocker` for both Apache Spark 2.4.7 
and 3.0.1.

Bests,
Dongjoon.


On Sat, Aug 8, 2020 at 6:10 AM Holden Karau  wrote:
I'm going to go ahead and vote -0 then based on that then.

On Fri, Aug 7, 2020 at 11:36 PM Dongjoon Hyun  wrote:  
 
Hi, All.

Unfortunately, there is an on-going discussion about the new decimal 
correctness.

Although we fixed one correctness issue at master and backported it partially 
to 3.0/2.4, it turns out that it needs more patched to be complete.

Please see https://github.com/apache/spark/pull/29125 for on-going discussion 
for both 3.0/2.4.

    [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with 
overflowed value

I also confirmed that 2.4.7 RC1 is affected.

Bests,
Dongjoon.


On Thu, Aug 6, 2020 at 2:48 PM Sean Owen  wrote:
+1 from me. The same as usual. Licenses and sigs look OK, builds and
passes tests on a standard selection of profiles.

On Thu, Aug 6, 2020 at 7:07 AM Prashant Sharma  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.7.
>
> The vote is open until Aug 9th at 9AM PST and passes if a majority +1 PMC 
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.7
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.7 (try project = SPARK AND 
> "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.7-rc1 (commit 
> dc04bf53fe821b7a07f817966c6c173f3b3788c6):
> https://github.com/apache/spark/tree/v2.4.7-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1352/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-docs/
>
> The list of bug fixes going into 2.4.7 can be found at the following URL:
> https://s.apache.org/spark-v2.4.7-rc1
>
> This release is using the release script of the tag v2.4.7-rc1.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up 

Renaming blacklisting feature input

2020-08-04 Thread Tom Graves
Hey Folks,
We have jira https://issues.apache.org/jira/browse/SPARK-32037 to rename the blacklisting feature. It would be nice to come to a consensus on what we want to call it. It doesn't look like we have any references to whitelist other than from other components. There is some discussion on the jira, and I linked to what some other projects have done, so please take a look at that.
A few options:
- blocklist
- denylist
- healthy / HealthTracker
- quarantined
- benched
- exiled
- banlist
Please let me know your thoughts and suggestions.

Thanks,
Tom

Removing references to Master

2020-08-04 Thread Tom Graves
Hey everyone,
I filed jira https://issues.apache.org/jira/browse/SPARK-32333 to remove references to Master. I realize this is a bigger change than the slave jira, but I wanted to get folks' input on whether they are OK with making the change, and if so, we would need to pick a name to use instead. I think we should keep it backwards compatible at first so as not to break anyone, and depending on what we find, might break it up into multiple smaller jiras.
A few name possibilities:
- ApplicationManager
- StandaloneClusterManager
- Coordinator
- Primary
- Controller
Thoughts or suggestions?
Thanks,
Tom



Re: [DISCUSS] Amend the commiter guidelines on the subject of -1s & how we expect PR discussion to be treated.

2020-07-24 Thread Tom Graves
 +1
Tom
On Tuesday, July 21, 2020, 03:35:18 PM CDT, Holden Karau 
 wrote:  
 
 Hi Spark Developers,
There has been a rather active discussion regarding the specific vetoes that occurred during Spark 3. From that, I believe we are now mostly in agreement that it would be best to clarify our rules around code vetoes & merging in general. 
Personally I believe this change is important to help improve the appearance of 
a level playing field in the project.
Once discussion settles I'll run this by a copy editor, my grammar isn't 
amazing, and bring forward for a vote.
The current Spark committer guide is at 
https://spark.apache.org/committers.html. I am proposing we add a section on 
when it is OK to merge PRs directly above the section on how to merge PRs. The 
text I am proposing to amend our committer guidelines with is:

PRs shall not be merged during active on topic discussion except for issues 
like critical security fixes of a public vulnerability. Under extenuating 
circumstances PRs may be merged during active off topic discussion and the 
discussion directed to a more appropriate venue. Time should be given prior to 
merging for those involved with the conversation to explain if they believe 
they are on topic.


Lazy consensus requires giving time for discussion to settle, while 
understanding that people may not be working on Spark as their full time job 
and may take holidays. It is believed that by doing this we can limit how often 
people feel the need to exercise their veto.


For the purposes of a -1 on code changes, a qualified voter includes all PMC 
members and committers in the project. For a -1 to be a valid veto it must 
include a technical reason. The reason can include things like the change may 
introduce a maintenance burden or is not the direction of Spark.


If there is a -1 from a non-committer, multiple committers or the PMC should be 
consulted before moving forward.




If the original person who cast the veto can not be reached in a reasonable 
time frame given likely holidays, it is up to the PMC to decide the next steps 
within the guidelines of the ASF. This must be decided by a consensus vote 
under the ASF voting rules.


These policies serve to reiterate the core principle that code must not be 
merged with a pending veto or before a consensus has been reached (lazy or 
otherwise).


It is the PMC’s hope that vetoes continue to be infrequent, and when they occur 
all parties take the time to build consensus prior to additional feature work.




Being a committer means exercising your judgement, while working in a community 
with diverse views. There is nothing wrong in getting a second (or 3rd or 4th) 
opinion when you are uncertain. Thank you for your dedication to the Spark 
project, it is appreciated by the developers and users of Spark.




It is hoped that these guidelines do not slow down development, rather by 
removing some of the uncertainty that makes it easier for us to reach 
consensus. If you have ideas on how to improve these guidelines, or other parts 
of how the Spark project operates you should reach out on the dev@ list to 
start the discussion.





Kind Regards,
Holden
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
YouTube Live Streams: https://www.youtube.com/user/holdenkarau  

Re: [VOTE] Decommissioning SPIP

2020-07-06 Thread Tom Graves
 +1
Tom
On Wednesday, July 1, 2020, 08:05:47 PM CDT, Holden Karau 
 wrote:  
 
 Hi Spark Devs,
I think discussion has settled on the SPIP doc at 
https://docs.google.com/document/d/1EOei24ZpVvR7_w0BwBjOnrWRy4k-qTdIlx60FsHZSHA/edit?usp=sharing
 , design doc at 
https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit,
 or JIRA https://issues.apache.org/jira/browse/SPARK-20624, and I've received a 
request to put the SPIP up for a VOTE quickly. The discussion thread on the 
mailing list is at 
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Graceful-Decommissioning-td29650.html.
Normally this vote would be open for 72 hours, however since it's a long 
weekend in the US where many of the PMC members are, this vote will not close 
before July 6th at noon pacific time.
The SPIP procedures are documented at: 
https://spark.apache.org/improvement-proposals.html. The ASF's voting guide is 
at https://www.apache.org/foundation/voting.html.

Please vote before July 6th at noon:
[ ] +1: Accept the proposal as an official SPIP[ ] +0[ ] -1: I don't think this 
is a good idea because ...
I will start the voting off with a +1 from myself.
Cheers,
Holden  

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-30 Thread Tom Graves
 Stage Level Scheduling -  https://issues.apache.org/jira/browse/SPARK-27495
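For anyone not following that JIRA, a minimal sketch of the stage-level scheduling RDD API it tracks (class and method names reflect the in-progress SPARK-27495 work and are illustrative rather than final):

    // Request different executor/task resources (e.g. GPUs) for just one stage.
    import org.apache.spark.SparkContext
    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

    def gpuStageExample(sc: SparkContext): Long = {
      val execReqs = new ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
      val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
      val profile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

      sc.parallelize(1 to 1000, 10)
        .withResources(profile) // stages computing this RDD use the new profile
        .map(_ * 2)
        .count()
    }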

Tom
On Monday, June 29, 2020, 11:07:18 AM CDT, Dongjoon Hyun 
 wrote:  
 
 Hi, All.
After a short celebration of Apache Spark 3.0, I'd like to ask you the 
community opinion on Apache Spark 3.1 feature expectations.
First of all, Apache Spark 3.1 is scheduled for December 2020.
- https://spark.apache.org/versioning-policy.html
I'm expecting the following items:
1. Support Scala 2.13
2. Use Apache Hadoop 3.2 by default for better cloud support
3. Declaring Kubernetes Scheduler GA
   In my perspective, the last main missing piece was Dynamic allocation, and
   - Dynamic allocation with shuffle tracking is already shipped in 3.0.
   - Dynamic allocation with worker decommission/data migration is targeting 3.1. (Thanks, Holden)
4. DSv2 Stabilization
I'm aware of some more features which are on the way currently, but I would love to hear the opinions from the main developers and, even more, the main users who need those features.
Thank you in advance. Welcome for any comments.
Bests,Dongjoon.  

Re: [vote] Apache Spark 3.0 RC3

2020-06-17 Thread Tom Graves
 Reynold, 
What's the plan on pushing the official release binaries and source tar?  It 
would be nice to have the release artifacts now that it's available on maven.
thanks,Tom
On Monday, June 15, 2020, 01:52:12 PM CDT, Reynold Xin 
 wrote:  
 
 Thanks for the reminder, Dongjoon.

I created the official release tag the past weekend and been working on the 
release notes (a lot of interesting changes!). I've created a google docs so 
it's easier for everybody to give comment on things that I've missed: 
https://docs.google.com/document/d/1NrTqxf2f39AXDF8VTIch6kwD8VKPaIlLW1QvuqEcwR4/edit

Plan to publish to maven et al today or tomorrow and give a day or two for dev@ 
to comment on the release notes before finalizing.

PS: There are two critical problems I've seen with the release (Spark UI is 
virtually unusable in some cases, and streaming issues). I will highlight them 
in the release notes and link to the JIRA tickets. But I think we should make 
3.0.1 ASAP to follow up.



On Sun, Jun 14, 2020 at 11:46 AM, Dongjoon Hyun  wrote:

Hi, Reynold.
Is there any progress on 3.0.0 release since the vote was finalized 5 days ago?
Apparently, tag `v3.0.0` is not created yet, the binary and docs are still 
sitting on the voting location, Maven Central doesn't have it, and 
PySpark/SparkR uploading is not started yet.
    https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
    https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/

Like Apache Spark 2.0.1 had 316 fixes after 2.0.0, we already have 35 patches 
on top of `v3.0.0-rc3` and are expecting more.
Although we can have Apache Spark 3.0.1 very soon before Spark+AI Summit, 
Apache Spark 3.0.0 should be available in Apache Spark distribution channel 
because it passed the vote.

Apache Spark 3.0.0 release itself helps the community use 3.0-line codebase and 
makes the codebase healthy.
Please let us know if you need any help from the community for 3.0.0 release.
Thanks,Dongjoon.

On Tue, Jun 9, 2020 at 9:41 PM Matei Zaharia  wrote:

Congrats! Excited to see the release posted soon.

On Jun 9, 2020, at 6:39 PM, Reynold Xin  wrote:



I waited another day to account for the weekend. This vote passes with the 
following +1 votes and no -1 votes!

I'll start the release prep later this week.

+1:
Reynold Xin (binding)
Prashant Sharma (binding)
Gengliang Wang
Sean Owen (binding)
Mridul Muralidharan (binding)
Takeshi Yamamuro
Maxim Gekk
Matei Zaharia (binding)
Jungtaek Lim
Denny Lee
Russell Spitzer
Dongjoon Hyun (binding)
DB Tsai (binding)
Michael Armbrust (binding)
Tom Graves (binding)
Bryan Cutler
Huaxin Gao
Jiaxin Shan
Xingbo Jiang
Xiao Li (binding)
Hyukjin Kwon (binding)
Kent Yao
Wenchen Fan (binding)
Shixiong Zhu (binding)
Burak Yavuz
Tathagata Das (binding)
Ryan Blue

-1: None



On Sat, Jun 06, 2020 at 1:08 PM, Reynold Xin  wrote:

Please vote on releasing the following candidate as Apache Spark version 3.0.0.

The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are 
cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-rc3 (commit 
3fdfce3120f307147244e5eaf46d61419a723d50):
https://github.com/apache/spark/tree/v3.0.0-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1350/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12339177

This release is using the release script of the tag v3.0.0-rc3.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 3.0.0?
===

The current list of open tickets targeted at 3.0.0 can be found at:
https://issues.apache.org/jira

Re: [vote] Apache Spark 3.0 RC3

2020-06-08 Thread Tom Graves
 +1
Tom
On Saturday, June 6, 2020, 03:09:09 PM CDT, Reynold Xin 
 wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 3.0.0.

The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are 
cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-rc3 (commit 
3fdfce3120f307147244e5eaf46d61419a723d50):
https://github.com/apache/spark/tree/v3.0.0-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1350/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12339177

This release is using the release script of the tag v3.0.0-rc3.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 3.0.0?
===

The current list of open tickets targeted at 3.0.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 3.0.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


  

Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-06-03 Thread Tom Graves
  +1
Tom
On Sunday, May 31, 2020, 06:47:09 PM CDT, Holden Karau 
 wrote:  
 
 Please vote on releasing the following candidate as Apache Spark version 2.4.6.

The vote is open until June 5th at 9AM PST and passes if a majority +1 PMC 
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.6
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.6 (try project = SPARK AND "Target 
Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
The tag to be voted on is v2.4.6-rc8 (commit 
807e0a484d1de767d1f02bd8a622da6450bdf940):
https://github.com/apache/spark/tree/v2.4.6-rc8
The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-bin/
Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1349/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-docs/
The list of bug fixes going into 2.4.6 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12346781
This release is using the release script of the tag v2.4.6-rc8.

FAQ

=
What happened to the other RCs?
=

The parallel maven build caused some flakiness, so I wasn't comfortable 
releasing them. I backported the fix from the 3.0 branch for this release. I've 
got a proposed change to the build script so that in the future we only push 
tags once the build is a success, but it does not block this release.
=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.6?
===

The current list of open tickets targeted at 2.4.6 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.4.6

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
YouTube Live Streams: https://www.youtube.com/user/holdenkarau  

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Tom Graves
 +1.
Tom
On Monday, May 18, 2020, 08:05:24 AM CDT, Wenchen Fan  
wrote:  
 
 +1, no known blockers.

On Mon, May 18, 2020 at 12:49 AM DB Tsai  wrote:

+1 as well. Thanks.
On Sun, May 17, 2020 at 7:39 AM Sean Owen  wrote:

+1 , same response as to the last RC.
This looks like it includes the fix discussed last time, as well as a
few more small good fixes.

On Sat, May 16, 2020 at 12:08 AM Holden Karau  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.6.
>
> The vote is open until May 22nd at 9AM PST and passes if a majority +1 PMC 
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.6
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.6 (try project = SPARK AND 
> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.6-rc3 (commit 
> 570848da7c48ba0cb827ada997e51677ff672a39):
> https://github.com/apache/spark/tree/v2.4.6-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1344/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc3-docs/
>
> The list of bug fixes going into 2.4.6 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12346781
>
> This release is using the release script of the tag v2.4.6-rc3.
>
> FAQ
>
> =
> What happened to RC2?
> =
>
> My computer crashed part of the way through RC2, so I rolled RC3.
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.6?
> ===
>
> The current list of open tickets targeted at 2.4.6 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.6
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


-- 
- DB
Sent from my iPhone
  

Re: [DISCUSS] Java specific APIs design concern and choice

2020-05-11 Thread Tom Graves
So, as I've already stated, and it looks like two others have issues with number 4 
as written as well, I'm against posting this as is. I do not think we 
should recommend 4 for a public user-facing Scala API.
Also note that the page you linked is a Databricks page. While I know we reference 
it as a style guide, I do not believe we should be putting API policy on that 
page; it should live on an Apache Spark page.
I think if you want to implement an API policy like this, it should go through 
an official vote thread, not just a discuss thread where we have not had a lot 
of feedback on it.
Tom


On Monday, May 11, 2020, 06:44:31 AM CDT, Hyukjin Kwon 
 wrote:  
 
I will wait a couple more days, and if I hear no objection, I will document 
this at https://github.com/databricks/scala-style-guide#java-interoperability.
On Thu, May 7, 2020 at 9:18 PM Hyukjin Kwon wrote:

Hi all, I would like to proceed this. Are there more thoughts on this? If not, 
I would like to go ahead with the proposal here.

On Thu, Apr 30, 2020 at 10:54 PM Hyukjin Kwon wrote:
Nothing is urgent. I just don't want to leave it undecided and just keep adding 
Java APIs inconsistently as it's currently happening.
We should have a set of coherent APIs. It's very difficult to change APIs once 
they are out in releases. I guess I have seen people here agree with having a 
general guidance for the same reason at least - please let me know if I'm 
taking it wrong.
I don't think we should assume Java programmers know how Scala works with Java 
types. Fewer assumptions might be better.
I feel like we have things on the table to consider at this moment and not much 
point of waiting indefinitely.
But sure maybe I am wrong. We can wait for more feedback for a couple of days.

On Thu, 30 Apr 2020, 18:59 ZHANG Wei,  wrote:

I feel a little pushed... :-) I still don't get the point of why it's
urgent to make the decision now. AFAIK, it's a common practice to handle
Scala types conversions by self when Java programmers prepare to
invoke Scala libraries. I'm not sure which one is the Java programmers'
root complaint, Scala type instance or Scala Jar file.

My 2 cents.

-- 
Cheers,
-z

On Thu, 30 Apr 2020 09:17:37 +0900
Hyukjin Kwon  wrote:

> There was a typo in the previous email. I am re-sending:
> 
> Hm, I thought you meant you prefer 3. over 4 but don't mind particularly.
> I don't mean to wait for more feedback. It looks likely just a deadlock
> which will be the worst case.
> I was suggesting to pick one way first, and stick to it. If we find out
> something later, we can discuss
> more about changing it later.
> 
> Having separate Java specific API (3. way)
>   - causes maintenance cost
>   - makes users to search which API for Java every time
>   - this looks the opposite why against the unified API set Spark targeted
> so far.
> 
> I don't completely buy the argument about Scala/Java friendly because using
> Java instance is already documented in the official Scala documentation.
> Users still need to search if we have Java specific methods for *some* APIs.
> 
> On Thu, Apr 30, 2020 at 8:58 AM Hyukjin Kwon wrote:
> 
> > Hm, I thought you meant you prefer 3. over 4 but don't mind particularly.
> > I don't mean to wait for more feedback. It looks likely just a deadlock
> > which will be the worst case.
> > I was suggesting to pick one way first, and stick to it. If we find out
> > something later, we can discuss
> > more about changing it later.
> >
> > Having separate Java specific API (4. way)
> >   - causes maintenance cost
> >   - makes users to search which API for Java every time
> >   - this looks the opposite why against the unified API set Spark targeted
> > so far.
> >
> > I don't completely buy the argument about Scala/Java friendly because
> > using Java instance is already documented in the official Scala
> > documentation.
> > Users still need to search if we have Java specific methods for *some*
> > APIs.
> >
> >
> >
> > On Thu, 30 Apr 2020, 00:06 Tom Graves,  wrote:
> >
> >> Sorry I'm not sure what your last email means. Does it mean you are
> >> putting it up for a vote or just waiting to get more feedback?  I disagree
> >> with saying option 4 is the rule but agree having a general rule makes
> >> sense.  I think we need a lot more input to make the rule as it affects the
> >> api's.
> >>
> >> Tom
> >>
> >> On Wednesday, April 29, 2020, 09:53:22 AM CDT, Hyukjin Kwon <
> >> gurwls...@gmail.com> wrote:
> >>
> >>
> >> I think I am not seeing explicit objection here but rather see people
> >> tend to agree with the proposal in general.
> >> I would like to step forward rather than leaving it as a deadlock.

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-29 Thread Tom Graves
.
> > > >>>
> > > >>> On Mon, 27 Apr 2020, 23:15 Wenchen Fan,  wrote:
> > > >>>
> > > >>>> IIUC We are moving away from having 2 classes for Java and Scala,
> > like
> > > >>>> JavaRDD and RDD. It's much simpler to maintain and use with a
> > single class.
> > > >>>>
> > > >>>> I don't have a strong preference over option 3 or 4. We may need to
> > > >>>> collect more data points from actual users.
> > > >>>>
> > > >>>> On Mon, Apr 27, 2020 at 9:50 PM Hyukjin Kwon 
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Scala users are arguably more prevailing compared to Java users,
> > yes.
> > > >>>>> Using the Java instances in Scala side is legitimate, and they are
> > > >>>>> already being used in multiple please. I don't believe Scala
> > > >>>>> users find this not Scala friendly as it's legitimate and already
> > > >>>>> being used. I personally find it's more trouble some to let Java
> > > >>>>> users to search which APIs to call. Yes, I understand the pros and
> > > >>>>> cons - we should also find the balance considering the actual
> > usage.
> > > >>>>>
> > > >>>>> One more argument from me is, though, I think one of the goals in
> > > >>>>> Spark APIs is the unified API set up to my knowledge
> > > >>>>>  e.g., JavaRDD <> RDD vs DataFrame.
> > > >>>>> If either way is not particularly preferred over the other, I would
> > > >>>>> just choose the one to have the unified API set.
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> On Mon, Apr 27, 2020 at 10:37 PM Tom Graves wrote:
> > > >>>>>
> > > >>>>>> I agree a general guidance is good so we keep consistent in the
> > apis.
> > > >>>>>> I don't necessarily agree that 4 is the best solution though.  I
> > agree its
> > > >>>>>> nice to have one api, but it is less friendly for the scala side.
> > > >>>>>> Searching for the equivalent Java api shouldn't be hard as it
> > should be
> > > >>>>>> very close in the name and if we make it a general rule users
> > should
> > > >>>>>> understand it.   I guess one good question is what API do most of
> > our users
> > > >>>>>> use between Java and Scala and what is the ratio?  I don't know
> > the answer
> > > >>>>>> to that. I've seen more using Scala over Java.  If the majority
> > use Scala
> > > >>>>>> then I think the API should be more friendly to that.
> > > >>>>>>
> > > >>>>>> Tom
> > > >>>>>>
> > > >>>>>> On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon <
> > > >>>>>> gurwls...@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Hi all,
> > > >>>>>>
> > > >>>>>> I would like to discuss Java specific APIs and which design we
> > will
> > > >>>>>> choose.
> > > >>>>>> This has been discussed in multiple places so far, for example, at
> > > >>>>>> https://github.com/apache/spark/pull/28085#discussion_r407334754
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> *The problem:*
> > > >>>>>>
> > > >>>>>> In short, I would like us to have clear guidance on how we support
> > > >>>>>> Java specific APIs when
> > > >>>>>> it requires to return a Java instance. The problem is simple:
> > > >>>>>>
> > > >>>>>> def requests: Map[String, ExecutorResourceRequest] = ...
> > > >>>>>> def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread Tom Graves
I agree that general guidance is good so we keep the APIs consistent. I don't 
necessarily agree that 4 is the best solution, though. I agree it's nice to have 
one API, but it is less friendly for the Scala side. Searching for the 
equivalent Java API shouldn't be hard, as it should be very close in name, and 
if we make it a general rule users should understand it. I guess one good 
question is which API most of our users use between Java and Scala, and what 
the ratio is? I don't know the answer to that; I've seen more using Scala than 
Java. If the majority use Scala, then I think the API should be more friendly 
to that.
Tom
On Monday, April 27, 2020, 04:04:28 AM CDT, Hyukjin Kwon 
 wrote:  
 
 
Hi all,

I would like to discuss Java specific APIs and which design we will choose.
This has been discussed in multiple places so far, for example, at
https://github.com/apache/spark/pull/28085#discussion_r407334754


The problem:

In short, I would like us to have clear guidance on how we support Java 
specific APIs when
it requires to return a Java instance. The problem is simple:
def requests: Map[String, ExecutorResourceRequest] = ...
def requestsJMap: java.util.Map[String, ExecutorResourceRequest] = ...

vs
def requests: java.util.Map[String, ExecutorResourceRequest] = ...


Current codebase:

My understanding so far was that the latter is preferred and more consistent 
and prevailing in the
existing codebase, for example, see StateOperatorProgress and 
StreamingQueryProgress in Structured Streaming.
However, I realised that we also have other approaches in the current codebase. 
There appear to be four approaches for dealing with Java specifics in general:
   
   - Java specific classes such as JavaRDD and JavaSparkContext.
   - Java specific methods with the same name that overload its parameters, see 
functions.scala.
   - Java specific methods with a different name that needs to return a 
different type such as TaskContext.resourcesJMap vs  TaskContext.resources.
   - One method that returns a Java instance for both Scala and Java sides. see 
StateOperatorProgress and StreamingQueryProgress.   



Analysis on the current codebase:

I agree with 2. approach because the corresponding cases give you a consistent 
API usage across
other language APIs in general. Approach 1. is from the old world when we 
didn't have unified APIs.
This might be the worst approach.

3. and 4. are controversial.

For 3., if you have to use Java APIs, then, you should search if there is a 
variant of that API
every time specifically for Java APIs. But yes, it gives you Java/Scala 
friendly instances.

For 4., having one API that returns a Java instance makes you able to use it in 
both Scala and Java APIs
sides although it makes you call asScala in Scala side specifically. But you 
don’t
have to search if there’s a variant of this API and it gives you a consistent 
API usage across languages.

Also, note that calling Java in Scala is legitimate but the opposite case is 
not, up to my best knowledge.
In addition, you should have a method that returns a Java instance for PySpark 
or SparkR to support.
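
A minimal, self-contained Scala sketch of the approach-4 pattern (a single 
method returning a java.util.Map, with Scala callers converting via .asScala); 
the class names below are invented for illustration and are not the actual 
Spark API:

import java.util.{Map => JMap}
import scala.collection.JavaConverters._

class ExecutorResourceRequestExample(val resourceName: String, val amount: Long)

class ResourceProfileExample(reqs: Map[String, ExecutorResourceRequestExample]) {
  // Single API surface: return java.util.Map for both Java and Scala callers.
  def requests: JMap[String, ExecutorResourceRequestExample] = reqs.asJava
}

object ApproachFourDemo {
  def main(args: Array[String]): Unit = {
    val profile = new ResourceProfileExample(
      Map("gpu" -> new ExecutorResourceRequestExample("gpu", 2)))

    // Java callers use the returned java.util.Map directly.
    // Scala callers convert back with .asScala when a Scala collection is wanted.
    profile.requests.asScala.foreach { case (name, req) =>
      println(s"$name -> ${req.amount}")
    }
  }
}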


Proposal:

I would like to have a general guidance on this that the Spark dev agrees upon: 
Do 4. approach. If not possible, do 3. Avoid 1 almost at all cost.

Note that this isn't a hard requirement but a general guidance; therefore, the 
decision might be up to
the specific context. For example, when there are some strong arguments to have 
a separate Java specific API, that’s fine.
Of course, we won’t change the existing methods given Michael’s rubric added 
before. I am talking about new methods in unreleased branches.

Any concern or opinion on this?
  

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-10 Thread Tom Graves
Overall this makes sense to me, but I have the same questions as others on the thread.
Is this only applying to stable APIs? How are we going to apply it to 3.0?
The way I read it, this proposal isn't really saying we can't break APIs on major 
releases; it's just saying to spend more time making sure it's worth it.
Tom
On Friday, March 6, 2020, 08:59:03 PM CST, Michael Armbrust 
 wrote:  
 
 
I propose to add the following text to Spark's Semantic Versioning policy and 
adopt it as the rubric that should be used when deciding to break APIs (even at 
major versions such as 3.0).




I'll leave the vote open until Tuesday, March 10th at 2pm. As this is a 
procedural vote, the measure will pass if there are more favourable votes than 
unfavourable ones. PMC votes are binding, but the community is encouraged to 
add their voice to the discussion.




[ ] +1 - Spark should adopt this policy.

[ ] -1  - Spark should not adopt this policy.









Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing behavior, 
even at major versions. While this is not always possible, the balance of the 
following factors should be considered before choosing to break an API.


Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark. A 
broken API means that Spark programs need to be rewritten before they can be 
upgraded. However, there are a few considerations when thinking about what the 
cost will be:
   
   -
Usage - an API that is actively used in many different places, is always very 
costly to break. While it is hard to know usage for sure, there are a bunch of 
ways that we can estimate: 

   
   -
How long has the API been in Spark?

   -
Is the API common even for basic programs?

   -
How often do we see recent questions in JIRA or mailing lists?

   -
How often does it appear in StackOverflow or blogs?

   
   -
Behavior after the break - How will a program that works today, work after the 
break? The following are listed roughly in order of increasing severity:

   
   -
Will there be a compiler or linker error?

   -
Will there be a runtime exception?

   -
Will that exception happen after significant processing has been done?

   -
Will we silently return different answers? (very hard to debug, might not even 
notice!)





Cost of Maintaining an API

Of course, the above does not mean that we will never break any APIs. We must 
also consider the cost both to the project and to our users of keeping the API 
in question.
   
   -
Project Costs - Every API we have needs to be tested and needs to keep working 
as other parts of the project changes. These costs are significantly 
exacerbated when external dependencies change (the JVM, Scala, etc). In some 
cases, while not completely technically infeasible, the cost of maintaining a 
particular API can become too high.

   -
User Costs - APIs also have a cognitive cost to users learning Spark or trying 
to understand Spark programs. This cost becomes even higher when the API in 
question has confusing or undefined semantics.



Alternatives to Breaking an API

In cases where there is a "Bad API", but where the cost of removal is also 
high, there are alternatives that should be considered that do not hurt 
existing users but do address some of the maintenance costs.

   
   -
Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime 
we are adding a new interface to Spark we should consider that we might be 
stuck with this API forever. Think deeply about how new APIs relate to existing 
ones, as well as how you expect them to evolve over time.

   -
Deprecation Warnings - All deprecation warnings should point to a clear 
alternative and should never just say that an API is deprecated (a short sketch 
follows this list).

   -
Updated Docs - Documentation should point to the "best" recommended way of 
performing a given task. In the cases where we maintain legacy documentation, 
we should clearly point to newer APIs and suggest to users the "right" way.

   -
Community Work - Many people learn Spark by reading blogs and other sites such 
as StackOverflow. However, many of these resources are out of date. Update 
them, to reduce the cost of eventually removing deprecated APIs.
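
As a small illustration of the deprecation-warning point above, here is a Scala 
sketch; the method names are invented for the example and are not real Spark APIs.

object ExampleApi {
  // Bad: a bare "this is deprecated" gives users nowhere to go.
  // Good: name the replacement and the removal plan in the message.
  @deprecated("Use repartitionByRange(n) instead; planned for removal in a future major release", "3.0.0")
  def legacyRepartition(n: Int): Unit = repartitionByRange(n)

  def repartitionByRange(n: Int): Unit =
    println(s"repartitioning into $n range partitions")
}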


  

Re: GitHub action permissions

2020-02-28 Thread Tom Graves
No, I couldn't see that button; it looks like the process of syncing in gitbox 
didn't finish with my accounts. I finished that and it's working now.
Thanks,
Tom
On Friday, February 28, 2020, 09:39:12 AM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, Thomas.
If you log-in with a GitHub account registered Apache project member, it will 
be enough.
On some PRs of Apache Spark, can you see 'Squash and merge'  button?
Bests,Dongjoon
On Fri, Feb 28, 2020 at 07:15 Thomas graves  wrote:

Does anyone know how the GitHub action permissions are setup?

I see a lot of random failures and want to be able to rerun them, but
I don't seem to have a "rerun" button like some folks do.

Thanks,
Tom

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


  

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Tom Graves
In general +1. I think these are good guidelines, and making it easier to 
upgrade is beneficial to everyone. The decision needs to happen at API/config 
change time; otherwise the deprecation warning has no purpose if we are never 
going to remove them. That said, we still need to be able to remove deprecated 
things and change APIs in major releases, otherwise why do a major release in 
the first place? Is it purely to support newer Scala/Python/Java versions?
I think the hardest part listed here is determining what the impact is. Whose 
call is that? It's hard to know how everyone is using things, and I think it's 
been harder to get feedback on SPIPs and API changes in general as people are 
busy with other things. Like you mention, I think StackOverflow is unreliable; 
the posts could be many years old and no longer relevant.
Tom
On Monday, February 24, 2020, 05:03:44 PM CST, Michael Armbrust 
 wrote:  
 
 
Hello Everyone,


As more users have started upgrading to Spark 3.0 preview (including myself), 
there have been many discussions around APIs that have been broken compared 
with Spark 2.x. In many of these discussions, one of the rationales for 
breaking an API seems to be "Spark follows semantic versioning, so this major 
release is our chance to get it right [by breaking APIs]". Similarly, in many 
cases the response to questions about why an API was completely removed has 
been, "this API has been deprecated since x.x, so we have to remove it".


As a long time contributor to and user of Spark this interpretation of the 
policy is concerning to me. This reasoning misses the intention of the original 
policy, and I am worried that it will hurt the long-term success of the project.


I definitely understand that these are hard decisions, and I'm not proposing 
that we never remove anything from Spark. However, I would like to give some 
additional context and also propose a different rubric for thinking about API 
breakage moving forward.


Spark adopted semantic versioning back in 2014 during the preparations for the 
1.0 release. As this was the first major release -- and as, up until fairly 
recently, Spark had only been an academic project -- no real promises had been 
made about API stability ever.


During the discussion, some committers suggested that this was an opportunity 
to clean up cruft and give the Spark APIs a once-over, making cosmetic changes 
to improve consistency. However, in the end, it was decided that in many cases 
it was not in the best interests of the Spark community to break things just 
because we could. Matei actually said it pretty forcefully:


I know that some names are suboptimal, but I absolutely detest breaking APIs, 
config names, etc. I’ve seen it happen way too often in other projects (even 
things we depend on that are officially post-1.0, like Akka or Protobuf or 
Hadoop), and it’s very painful. I think that we as fairly cutting-edge users 
are okay with libraries occasionally changing, but many others will consider it 
a show-stopper. Given this, I think that any cosmetic change now, even though 
it might improve clarity slightly, is not worth the tradeoff in terms of 
creating an update barrier for existing users.


In the end, while some changes were made, most APIs remained the same and users 
of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this served 
the project very well, as compatibility means users are able to upgrade and we 
keep as many people on the latest versions of Spark (though maybe not the 
latest APIs of Spark) as possible.


As Spark grows, I think compatibility actually becomes more important and we 
should be more conservative rather than less. Today, there are very likely more 
Spark programs running than there were at any other time in the past. Spark is 
no longer a tool only used by advanced hackers, it is now also running 
"traditional enterprise workloads.'' In many cases these jobs are powering 
important processes long after the original author leaves.


Broken APIs can also affect libraries that extend Spark. This dependency can be 
even harder for users, as if the library has not been upgraded to use new APIs 
and they need that library, they are stuck.


Given all of this, I'd like to propose the following rubric as an addition to 
our semantic versioning policy. After discussion and if people agree this is a 
good idea, I'll call a vote of the PMC to ratify its inclusion in the official 
policy.


Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing behavior, 
even at major versions. While this is not always possible, the balance of the 
following factors should be considered before choosing to break an API.


Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark. A 
broken API means that Spark programs need to be rewritten before they can be 
upgraded. However, there are a few considerations when thinking about 

Re: Apache Spark Docker image repository

2020-02-06 Thread Tom Graves
When discussions of Docker have occurred in the past - mostly related to k8s - 
there has been a lot of discussion about what the right image to publish is, as 
well as making sure Apache is OK with it. The official Apache release is the 
source code, so we may need to make sure to have a disclaimer, and we need to 
make sure it doesn't contain anything licensed that it shouldn't. What happens 
when one of the Docker images we publish needs a security update? We would need 
to make sure all the legal bases are covered first.
Then the discussion comes to what is in the Docker images and how useful they 
are. People run different OSes, different Python versions, etc., and like Sean 
mentioned, how useful is it really other than for a few examples? Some 
discussion is on https://issues.apache.org/jira/browse/SPARK-24655
Tom


On Wednesday, February 5, 2020, 02:16:37 PM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, All.
From 2020, shall we have an official Docker image repository as an additional 
distribution channel?
I'm considering the following images.
    - Public binary release (no snapshot image)
    - Public non-Spark base image (OS + R + Python)
      (This can be used in GitHub Action Jobs and Jenkins K8s Integration Tests 
      to speed up jobs and to have more stable environments)
Bests,
Dongjoon.

Re: `Target Version` management on correctness/data-loss Issues

2020-01-28 Thread Tom Graves
I was just thinking of an info email (perhaps tagged with correctness/dataloss) 
to dev rather than an official vote; that way it's more visible, and if anyone 
sees it and disagrees with the targeting it can be discussed on that thread. 
It might also just bring more visibility to those important issues and get 
people interested in working on them sooner.
Tom
On Monday, January 27, 2020, 02:31:03 PM CST, Dongjoon Hyun 
 wrote:  
 
Yes. That is what I pointed out in `Unfortunately, we didn't build a consensus on 
what is really blocked by that.` If you are suggesting a vote, do you mean a 
majority-win vote or a unanimous decision? Will it be a permanent decision?
> I think the other interesting thing here is how exactly to come to agreement 
> on whether it needs to be fixed in a particular release. Like we have been 
> discussing on SPARK-29701. This could be a matter of opinion, so should we do 
> something like mail the dev list whenever one of these issues is tagged if 
> its not going to be back ported to an affected release?

The following seems to happen when the committers initially think like "Seems 
behavioral to me and its been consistent so seems ok to skip for 2.4.5"
For example, SPARK-27619 MapType should be prohibited in hash expressions.

> A) I'm not clear on this one as to why affected and target would be different 
> initially, 
BTW, in this email thread, I'm focusing on the `Target Version` management.That 
is the only way to detect the community decision change.
Bests,Dongjoon.
On Mon, Jan 27, 2020 at 11:12 AM Tom Graves  wrote:

 thanks for bringing this up.
A) I'm not clear on this one as to why affected and target would be different 
initially, other then the reasons target versions != fixed versions.  Is the 
intention here just to say, if its already been discussed and came to consensus 
not needed in certain release?  The only other obvious time is in spark 
releases that are no longer maintained.
I think the other interesting thing here is how exactly to come to agreement on 
whether it needs to be fixed in a particular release. Like we have been 
discussing on SPARK-29701. This could be a matter of opinion, so should we do 
something like mail the dev list whenever one of these issues is tagged if its 
not going to be back ported to an affected release?
Tom
On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, All.
After 2.4.5 RC1 vote failure, I asked your opinions about correctness/dataloss 
issues (at mailing lists/JIRAs/PRs) in order to collect the current status and 
public opinion widely in the community to build a consensus on this at this 
time.
Before talking about those issues, please remind that
    - Apache Spark 2.4.x is the only live version because 2.3.x is EOL and 
3.0.0 is not released.    - Apache Spark community has the following rule: 
"Correctness and data loss issues should be considered Blockers."
Unfortunately, we didn't build a consensus on what is really blocked by that. 
In reality, it was just our resolution for the quality and it works a little 
differently.
In this email, I want to talk about correctness/dataloss issues and observed 
public opinions. They fall into the following categories roughly.
1. Resolved in both 3.0.0 and 2.4.x   - ex) SPARK-30447 Constant propagation 
nullability issue   - No problem. However, this case sometimes goes to (2)
2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.   - ex) 
SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match Hive   - "We 
don't want to change the behavior in the maintenence release"
3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.   - ex) 
SPARK-29906 Reading of csv file fails with adaptive execution turned on   - No 
problem.
4. Resolved in 3.0.0 and not backported due to technical difficulty.   - ex) 
SPARK-26154 Stream-stream joins - left outer join gives inconsistent output   - 
"This is not backported due to the technical difficulty"
5. Resolved in 3.0.0 and not backported because this is not public API.   - ex) 
SPARK-29503 MapObjects doesn't copy Unsafe data when nested under Safe data   - 
"Since `catalyst` is not public, it's less worth backporting this."
6. Resolved in 3.0.0 and not backported because we forget since there was a no 
Target Version.   - ex) SPARK-28375 Make pullupCorrelatedPredicate idempotent   
- "Adding the 'correctness' label so we remember to backport this fix to 
2.4.x."   - "This is possible, if users add the rule into 
postHocOptimizationBatches"
7. Open with Target Version 3.0.0.   - ex) SPARK-29701 Correct behaviours of 
group analytical queries when empty input given   - "We aren't fully SQL 
compliant there and I think that has been true since the beginning of spark 
sql"   - "This is not a regression"
8. Open without Target Version.   - I removed this case last week to give more visibility on them.

Re: `Target Version` management on correctness/data-loss Issues

2020-01-27 Thread Tom Graves
Thanks for bringing this up.
A) I'm not clear on this one as to why affected and target would be different 
initially, other than the reasons target versions != fixed versions. Is the 
intention here just to say, if it's already been discussed and a consensus was 
reached that it's not needed in a certain release? The only other obvious time 
is for Spark releases that are no longer maintained.
I think the other interesting thing here is how exactly to come to agreement on 
whether an issue needs to be fixed in a particular release, like we have been 
discussing on SPARK-29701. This could be a matter of opinion, so should we do 
something like mail the dev list whenever one of these issues is tagged if it's 
not going to be backported to an affected release?
Tom
On Sunday, January 26, 2020, 11:22:13 PM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, All.
After 2.4.5 RC1 vote failure, I asked your opinions about correctness/dataloss 
issues (at mailing lists/JIRAs/PRs) in order to collect the current status and 
public opinion widely in the community to build a consensus on this at this 
time.
Before talking about those issues, please remind that
    - Apache Spark 2.4.x is the only live version because 2.3.x is EOL and 
3.0.0 is not released.    - Apache Spark community has the following rule: 
"Correctness and data loss issues should be considered Blockers."
Unfortunately, we didn't build a consensus on what is really blocked by that. 
In reality, it was just our resolution for the quality and it works a little 
differently.
In this email, I want to talk about correctness/dataloss issues and observed 
public opinions. They fall into the following categories roughly.
1. Resolved in both 3.0.0 and 2.4.x   - ex) SPARK-30447 Constant propagation 
nullability issue   - No problem. However, this case sometimes goes to (2)
2. Resolved in both 3.0.0 and 2.4.x. But, reverted in 2.4.x later.   - ex) 
SPARK-26021 -0.0 and 0.0 not treated consistently, doesn't match Hive   - "We 
don't want to change the behavior in the maintenence release"
3. Resolved in 3.0.0 and not backported because this is 3.0.0-specific.   - ex) 
SPARK-29906 Reading of csv file fails with adaptive execution turned on   - No 
problem.
4. Resolved in 3.0.0 and not backported due to technical difficulty.   - ex) 
SPARK-26154 Stream-stream joins - left outer join gives inconsistent output   - 
"This is not backported due to the technical difficulty"
5. Resolved in 3.0.0 and not backported because this is not public API.   - ex) 
SPARK-29503 MapObjects doesn't copy Unsafe data when nested under Safe data   - 
"Since `catalyst` is not public, it's less worth backporting this."
6. Resolved in 3.0.0 and not backported because we forget since there was a no 
Target Version.   - ex) SPARK-28375 Make pullupCorrelatedPredicate idempotent   
- "Adding the 'correctness' label so we remember to backport this fix to 
2.4.x."   - "This is possible, if users add the rule into 
postHocOptimizationBatches"
7. Open with Target Version 3.0.0.   - ex) SPARK-29701 Correct behaviours of 
group analytical queries when empty input given   - "We aren't fully SQL 
compliant there and I think that has been true since the beginning of spark 
sql"   - "This is not a regression"
8. Open without Target Version.   - I removed this case last week to give more 
visibility on them.
Here, I want to focus that Apache Spark is a very healthy community because we 
have diverse opinions and reevaluating JIRA issues are the results of the 
community decision based on the discusson. I believe that it will go well 
eventually. In the above, I added those example JIRA IDs and the collected 
reasons just to give some colors to illustrate all cases are the real cases. 
There is no case to be blamed in the above.
  
Although some JIRA issues will jump from one category into another category 
time to time, the categories will remain there. I want to propose a small 
additional work on `Target Version` to distinguish the above categories easily 
to communicate clearly in the community. This should be done by committers 
because we have the following policy on `Target Version`.
    "Target Version. This is assigned by committers to indicate a PR has been 
accepted for possible fix by the target version."
Proposed Idea:    A. To reduce the mismatch between `Target Version` vs 
`Affected Version`:       When a committer set `correctness` or `data-loss` 
label, `Target Version` should be set together according to the `Affected 
Versions`.       In case of the insufficient `Target Version` (e.g. `Target 
Version`=`3.0.0` for `Affected Version`=`2.4.4,3.0.0`), he/she need to add a 
comment on the JIRA.       For example, "This is 3.0.0-specific issue"
    B. To reduce the mismatch between `Target Version` vs `Fixed Version`:      
 When a committer resolve `correctness` or `data-loss` labeled issue, `Target 
Version` should be compared with `Fixed Version`.       In case of the 
insufficient `Fixed Version` (e.g. `Target 

Re: Correctness and data loss issues

2020-01-22 Thread Tom Graves
My thoughts on your list are below; it would be good to get input from the 
people who worked on these issues. Obviously we can weigh the importance of 
these against getting 2.4.5 out, which has a bunch of other correctness fixes 
you mention as well. I think you have already pinged most of the JIRAs to get 
feedback.

    SPARK-30218 Columns used in inequality conditions for joins not resolved 
correctly in case of common lineage
    You already linked to SPARK-28344 and asked the question about a backport.
    SPARK-29701 Different answers when empty input given in GROUPING SETS
    This seems like a Postgres compatibility thing again, not a correctness issue.
    SPARK-29699 Different answers in nested aggregates with window functions
    This seems like a Postgres compatibility thing again, not a correctness issue.
    SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
    This is currently listed as an improvement, and I can see an argument that 
the user has to explicitly do this in separate threads, so it seems less 
critical to me, though definitely nice to fix. I personally think it's OK to 
not have it in 2.4.5.
    SPARK-28125 dataframes created by randomSplit have overlapping rows
    Seems like something we should fix.
    SPARK-28067 Incorrect results in decimal aggregation with whole-stage code 
gen enabled
    Seems like we should fix.
    SPARK-28024 Incorrect numeric values when out of range
    Seems like we could skip for 2.4.5; some overflow exceptions are fixed in 3.0.
    SPARK-27784 Alias ID reuse can break correctness when substituting foldable 
expressions
    Would be good to understand what fixed it in 3.0 to see if we can back port.
    SPARK-27619 MapType should be prohibited in hash expressions
    Seems behavioral to me and it's been consistent, so seems OK to skip for 2.4.5.
    SPARK-27298 Dataset except operation gives different results (dataset count) 
on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
    Seems to be a Windows vs Linux issue; seems like we should investigate.
    SPARK-27282 Spark incorrect results when using UNION with GROUP BY clause
    Similar; seems to be fixed in Spark 3.0, so we need to see if we can back 
port it if we can find what fixed it.
    SPARK-27213 Unexpected results when filter is used after distinct
    Need to try to reproduce on 2.4.x.
    SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive table 
if schema evolves
    Seems like we should investigate further for a 2.4.x fix.
    SPARK-25150 Joining DataFrames derived from the same source yields 
confusing/incorrect results
    Seems like we should investigate further for a 2.4.x fix.
    SPARK-21774 The rule PromoteStrings cast string to a wrong data type
    Seems like we should investigate further for a 2.4.x fix.
    SPARK-19248 Regex_replace works in 1.6 but not in 2.0
    Seems wrong, but if it's been consistent for the entire 2.0 line it may be 
OK to skip for 2.4.x.
Tom
On Wednesday, January 22, 2020, 11:43:30 AM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, Tom.
Then, along with the following, do you think we need to hold the 2.4.5 release, 
too?
> If it's really a correctness issue we should hold 3.0 for it.
Recently,
    (1) 2.4.4 delivered 9 correctness patches.
    (2) 2.4.5 RC1 aimed to deliver the following 9 correctness patches, too.
        SPARK-29101 CSV datasource returns incorrect .count() from file with 
malformed records
        SPARK-30447 Constant propagation nullability issue
        SPARK-29708 Different answers in aggregates of duplicate grouping sets
        SPARK-29651 Incorrect parsing of interval seconds fraction
        SPARK-29918 RecordBinaryComparator should check endianness when 
compared by long
        SPARK-29042 Sampling-based RDD with unordered input should be 
INDETERMINATE
        SPARK-30082 Zeros are being treated as NaNs
        SPARK-29743 sample should set needCopyResult to true if its child is
        SPARK-26985 Test "access only some column of the all of columns " fails 
on big endian

Without the official Apache Spark 2.4.5 binaries, there is no official way to 
deliver the 9 correctness fixes in (2) to the users.
In addition, usually, the correctness fixes are independent to each other.
Bests,
Dongjoon.

On Wed, Jan 22, 2020 at 7:02 AM Tom Graves  wrote:

 I agree, I think we just need to go through all of them and individual assess 
each one. If it's really a correctness issue we should hold 3.0 for it.
On the 2.4 release I didn't see an explanation on  
https://issues.apache.org/jira/browse/SPARK-26154 why it can't be back ported, 
I think in the very least we need that in each jira comment.
spark-29701 looks more like compatibility with Postgres then a purely wrong 
answer to me, if Spark has been consistent about that it feels like it can wait 
for 3.0 but would be good to get others input and I'm not an expert on SQL 
standard and what do the other sql engines do in this case.
Tom
On Monday, January 20, 2020, 12:07:54 AM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, All.
According to our policy, "Correctness and data loss issues should be considered Blockers".

Re: Adding Maven Central mirror from Google to the build?

2020-01-22 Thread Tom Graves
 +1 for proposal.
Tom
On Tuesday, January 21, 2020, 04:37:04 PM CST, Sean Owen  
wrote:  
 
 See https://github.com/apache/spark/pull/27307 for some context. We've
had to add, in at least one place, some settings to resolve artifacts
from a mirror besides Maven Central to work around some build
problems.

Now, we find it might be simpler to just use this mirror as the
primary repo in the build, falling back to Central if needed.

The question is: any objections to that?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

  

Re: Correctness and data loss issues

2020-01-22 Thread Tom Graves
I agree; I think we just need to go through all of them and individually assess 
each one. If it's really a correctness issue, we should hold 3.0 for it.
On the 2.4 release, I didn't see an explanation on 
https://issues.apache.org/jira/browse/SPARK-26154 of why it can't be backported; 
I think at the very least we need that in each JIRA comment.
SPARK-29701 looks more like compatibility with Postgres than a purely wrong 
answer to me. If Spark has been consistent about that, it feels like it can wait 
for 3.0, but it would be good to get others' input; I'm not an expert on the SQL 
standard or on what the other SQL engines do in this case.
Tom
On Monday, January 20, 2020, 12:07:54 AM CST, Dongjoon Hyun 
 wrote:  
 
 Hi, All.
According to our policy, "Correctness and data loss issues should be considered 
Blockers".

    - http://spark.apache.org/contributing.html
Since we are close to branch-3.0 cut,
I want to ask your opinions on the following correctness and data loss issues.

    SPARK-30218 Columns used in inequality conditions for joins not resolved 
correctly in case of common lineage
    SPARK-29701 Different answers when empty input given in GROUPING SETS
    SPARK-29699 Different answers in nested aggregates with window functions
    SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
    SPARK-28125 dataframes created by randomSplit have overlapping rows
    SPARK-28067 Incorrect results in decimal aggregation with whole-stage code 
gen enabled
    SPARK-28024 Incorrect numeric values when out of range
    SPARK-27784 Alias ID reuse can break correctness when substituting foldable 
expressions
    SPARK-27619 MapType should be prohibited in hash expressions
    SPARK-27298 Dataset except operation gives different results(dataset count) 
on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
    SPARK-27282 Spark incorrect results when using UNION with GROUP BY clause
    SPARK-27213 Unexpected results when filter is used after distinct
    SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive table 
if schema evolves
    SPARK-25150 Joining DataFrames derived from the same source yields 
confusing/incorrect results
    SPARK-21774 The rule PromoteStrings cast string to a wrong data type
    SPARK-19248 Regex_replace works in 1.6 but not in 2.0

Some of them are targeted on 3.0.0, but the others are not.
Although we will work on them until 3.0.0,I'm not sure we can reach a status 
with no known correctness and data loss issue.
How do you think about the above issues?
Bests,Dongjoon.  

Re: PR lint-scala jobs failing with http error

2020-01-16 Thread Tom Graves
Sorry, I should have included the link. It shows up in the pre-check failures, 
but the tests still run and pass. For instance:
https://github.com/apache/spark/pull/26682

More:
https://github.com/apache/spark/pull/27240/checks?check_run_id=393888081
https://github.com/apache/spark/pull/27233/checks?check_run_id=393123209
https://github.com/apache/spark/pull/27239/checks?check_run_id=393884643

Tom
On Thursday, January 16, 2020, 03:17:03 PM CST, Shane Knapp 
 wrote:  
 
 i'm seeing a lot of green builds currently...  if you think this is
still happening, please include links to the failed jobs.  thanks!

shane (at a conference)

On Thu, Jan 16, 2020 at 11:16 AM Tom Graves  wrote:
>
> I'm seeing the scala-lint jobs fail on the pull request builds with:
>
> [error] [FATAL] Non-resolvable parent POM: Could not transfer artifact 
> org.apache:apache:pom:18 from/to central ( 
> http://repo.maven.apache.org/maven2): Error transferring file: Server 
> returned HTTP response code: 501 for URL: 
> http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom from 
> http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom and 
> 'parent.relativePath' points at wrong local POM @ line 22, column 11
>
> It seems we are hitting the http endpoint vs the https one. Our pom file 
> already has the repo as the https version though.
>
> Anyone know why its trying to go to http version?
>
>
> Tom



-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
  

PR lint-scala jobs failing with http error

2020-01-16 Thread Tom Graves
I'm seeing the scala-lint jobs fail on the pull request builds with:
[error] [FATAL] Non-resolvable parent POM: Could not transfer artifact 
org.apache:apache:pom:18 from/to central ( 
http://repo.maven.apache.org/maven2): Error transferring file: Server returned 
HTTP response code: 501 for URL: 
http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom from 
http://repo.maven.apache.org/maven2/org/apache/apache/18/apache-18.pom and 
'parent.relativePath' points at wrong local POM @ line 22, column 11

It seems we are hitting the http endpoint vs the https one. Our pom file 
already has the repo as the https version, though.
Does anyone know why it's trying to go to the http version?

Tom

Reviewers for Stage level Scheduling prs

2020-01-08 Thread Tom Graves
Hey everyone,
I'm trying to get reviewers for the Stage Level Scheduling pull requests. I was 
hoping to get this into Spark 3.0. The code is mostly complete - it's just 
missing the web UI and final doc changes.
If anyone has time, reviews from committers would be appreciated.
You can find information about the overall feature and design here: 
[SPARK-27495] SPIP: Support Stage level resource configuration and 
scheduling - ASF JIRA; it has the Stage Level Scheduling SPIP Appendices 
API/Design doc attached with a high-level overview.

I have a reference PR with most of the code implemented: 
[WIP][SPARK-27495][Core][YARN][k8s] Stage Level Scheduling code for reference 
by tgravescs · Pull Request #27053 · apache/spark

I've been trying to break that into smaller pieces for easier review - this is 
the current one: [SPARK-29306][CORE] Stage Level Sched: Executors need to 
track what ResourceProfile they are created with by tgravescs · Pull Request 
#26682 · apache/spark

Regards,
Tom Graves

Re: Spark 3.0 preview release 2?

2019-12-10 Thread Tom Graves
 +1 for another preview
Tom
On Monday, December 9, 2019, 12:32:29 AM CST, Xiao Li 
 wrote:  
 
 
I got a lot of great feedback from the community about the recent 3.0 preview 
release. Since the last 3.0 preview release, we already have 353 commits 
[https://github.com/apache/spark/compare/v3.0.0-preview...master]. There are 
various important features and behavior changes we want the community to try 
before entering the official release candidates of Spark 3.0. 





Below is my selected items that are not part of the last 3.0 preview but 
already available in the upstream master branch: 


   
   - Support JDK 11 with Hadoop 2.7
   - Spark SQL will respect its own default format (i.e., parquet) when users 
do CREATE TABLE without USING or STORED AS clauses
   - Enable Parquet nested schema pruning and nested pruning on expressions by 
default
   - Add observable Metrics for Streaming queries
   - Column pruning through nondeterministic expressions
   - RecordBinaryComparator should check endianness when compared by long 
   - Improve parallelism for local shuffle reader in adaptive query execution
   - Upgrade Apache Arrow to version 0.15.1
   - Various interval-related SQL support
   - Add a mode to pin Python thread into JVM's
   - Provide option to clean up completed files in streaming query



I am wondering if we can have another preview release for Spark 3.0? This can 
help us find the design/API defects as early as possible and avoid the 
significant delay of the upcoming Spark 3.0 release




Also, any committer is willing to volunteer as the release manager of the next 
preview release of Spark 3.0, if we have such a release? 




Cheers,




Xiao
  

Re: Build customized resource manager

2019-11-08 Thread Tom Graves
I don't know if it all works, but some work was done to make the cluster 
manager pluggable; see SPARK-13904.
Tom
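
For anyone exploring that route, here is a rough sketch of the SPARK-13904 plugin 
point; treat the trait methods and the TaskSchedulerImpl usage as assumptions to 
verify against the Spark version you build with. Because ExternalClusterManager is 
private[spark], the implementation has to live under the org.apache.spark namespace 
and be listed in META-INF/services/org.apache.spark.scheduler.ExternalClusterManager.

  // Sketch only: a custom cluster manager selected by a "myclustermanager://" master URL.
  package org.apache.spark.scheduler

  import org.apache.spark.SparkContext

  private[spark] class MyClusterManager extends ExternalClusterManager {
    override def canCreate(masterURL: String): Boolean =
      masterURL.startsWith("myclustermanager")

    override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
      new TaskSchedulerImpl(sc)  // reuse Spark's scheduler, or plug in your own

    override def createSchedulerBackend(sc: SparkContext, masterURL: String,
        scheduler: TaskScheduler): SchedulerBackend =
      ???  // your backend, e.g. one that requests resources through Volcano

    override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
      scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
  }
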
On Wednesday, November 6, 2019, 07:22:59 PM CST, Klaus Ma 
 wrote:  
 
 Any suggestions?

- Klaus

On Mon, Nov 4, 2019 at 5:04 PM Klaus Ma  wrote:

Hi team,
AFAIK, we build k8s/yarn/mesos as resource managers, but I'd like to make some 
enhancements to them, e.g. integrate with Volcano in k8s. Is it possible to do 
that without forking the whole Spark project? For example, enable a customized 
resource manager via configuration, e.g. replace 
`org.apache.spark.deploy.k8s.submit.KubernetesClientApplication` with 
`MyK8SClient`, so I only maintain the resource manager instead of the whole 
project.
-- Klaus
  

maven 3.6.1 removed from apache maven repo

2019-09-03 Thread Tom Graves
It looks like maven 3.6.1 was removed from the repo - see SPARK-28960.  It 
looks like they pushed 3.6.2, but I don't see any release notes for 3.6.2 on 
the maven page.
Seems like we had this happen before; I can't remember if it was maven or 
something else. Does anyone remember, or know if they are about to release 3.6.2?
Tom

Re: DISCUSS [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-08-26 Thread Tom Graves
 Bumping this up. I'm guessing people haven't had time to review, it would be 
great to get feedback on this.
Thanks,Tom
On Tuesday, August 6, 2019, 2:27:49 PM CDT, Tom Graves 
 wrote:  
 
 Hey everyone,
I have been working on coming up with a proposal for supporting stage level 
resource configuration and scheduling.  The basic idea is to allow the user to 
specify executor and task resource requirements for each stage, so the user can 
control the resources required at a finer grain. One good example here 
is doing some ETL to preprocess your data in one stage and then feed that data 
into an ML algorithm (like tensorflow) that would run as a separate stage.  The 
ETL could need totally different resource requirements for the executors/tasks 
than the ML stage does.  
If you are interested please take a look at the SPIP and give me feedback.  The 
text for the SPIP is in the jira description:
https://issues.apache.org/jira/browse/SPARK-27495

I split the API and Design parts into a google doc that is linked to from the 
jira.
Thanks,Tom  

DISCUSS [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-08-06 Thread Tom Graves
Hey everyone,
I have been working on coming up with a proposal for supporting stage level 
resource configuration and scheduling.  The basic idea is to allow the user to 
specify executor and task resource requirements for each stage, so the user can 
control the resources required at a finer grain. One good example here 
is doing some ETL to preprocess your data in one stage and then feed that data 
into an ML algorithm (like tensorflow) that would run as a separate stage.  The 
ETL could need totally different resource requirements for the executors/tasks 
than the ML stage does.  
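
To make the idea concrete, here is a rough sketch of what per-stage requests might 
look like from the RDD API; the names ResourceProfileBuilder, ExecutorResourceRequests, 
TaskResourceRequests and withResources are illustrative assumptions, not a finalized API:

  import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

  // Assumes an existing SparkContext `sc`. The ETL stage runs on the default executors...
  val prepped = sc.textFile("hdfs://data/raw").map(_.split(","))

  // ...while the ML stage asks for GPU-sized executors and 1 GPU per task.
  val mlProfile = new ResourceProfileBuilder()
    .require(new ExecutorResourceRequests().cores(8).memory("24g").resource("gpu", 2))
    .require(new TaskResourceRequests().cpus(1).resource("gpu", 1))
    .build()

  // Stages computing this RDD would be scheduled against mlProfile.
  val mlInput = prepped.withResources(mlProfile)
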
If you are interested please take a look at the SPIP and give me feedback.  The 
text for the SPIP is in the jira description:
https://issues.apache.org/jira/browse/SPARK-27495

I split the API and Design parts into a google doc that is linked to from the 
jira.
Thanks, Tom

Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-21 Thread Tom Graves
 +1 (binding)
I haven't looked at the low level api, but like the idea and approach to get it 
started.
Tom
On Tuesday, June 18, 2019, 10:40:34 PM CDT, Guo, Chenzhao 
 wrote:  
 
Cool : )
 
  
 
+1 (non-binding)
 
  
 
Chenzhao
 
  
 
From: dhruve ashar [mailto:dhruveas...@gmail.com]
Sent: Wednesday, June 19, 2019 2:58 AM
To: John Zhuge 
Cc: Vinoo Ganesh ; Felix Cheung 
; Yinan Li ; 
rb...@netflix.com; Dongjoon Hyun ; Saisai Shao 
; Imran Rashid ; Ilan Filonenko 
; bo yang ; Matt Cheah 
; Spark Dev List ; Yifei Huang (PD) 
; Imran Rashid 
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API
 
  
 
+1 (non-binding)
 
  
 
On Tue, Jun 18, 2019 at 12:12 PM John Zhuge  wrote:
 

+1 (non-binding)  Great work!
 
  
 
On Tue, Jun 18, 2019 at 6:22 AM Vinoo Ganesh  wrote:
 

+1 (non-binding).
 
 
 
Thanks for pushing this forward, Matt and Yifei.
 
 
 
From:Felix Cheung 
Date: Tuesday, June 18, 2019 at 00:01
To: Yinan Li , "rb...@netflix.com" 
Cc: Dongjoon Hyun , Saisai Shao 
, Imran Rashid , Ilan Filonenko 
, bo yang , Matt Cheah 
, Spark Dev List , "Yifei Huang 
(PD)" , Vinoo Ganesh , Imran Rashid 

Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API
 
 
 
+1
 
 
 
Glad to see the progress in this space - it’s been more than a year since the 
original discussion and effort started.
 
 
 
From: Yinan Li 
Sent: Monday, June 17, 2019 7:14:42 PM
To: rb...@netflix.com
Cc: Dongjoon Hyun; Saisai Shao; Imran Rashid; Ilan Filonenko; bo yang; Matt 
Cheah; Spark Dev List; Yifei Huang (PD); Vinoo Ganesh; Imran Rashid
Subject: Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API 
 
 
 
+1 (non-binding) 
 
 
 
On Mon, Jun 17, 2019 at 1:58 PM Ryan Blue  wrote:
 

+1 (non-binding)
 
 
 
On Sun, Jun 16, 2019 at 11:11 PM Dongjoon Hyun  wrote:
 

+1
 
 
 
Bests,
 
Dongjoon.
 
 
 
 
 
On Sun, Jun 16, 2019 at 9:41 PM Saisai Shao  wrote:
 

+1 (binding)
 
 
 
Thanks
 
Saisai
 
 
 
Imran Rashid 于2019年6月15日周六上午3:46写道:
 

+1 (binding)

I think this is a really important feature for spark.

First, there is already a lot of interest in alternative shuffle storage in the 
community.  There is already a lot of interest in alternative shuffle storage, 
from dynamic allocation in kubernetes, to even just improving stability in 
standard on-premise use of Spark.  However, they're often stuck doing this in 
forks of Spark, and in ways that are not maintainable (because they copy-paste 
many spark internals) or are incorrect (for not correctly handling speculative 
execution & stage retries).

Second, I think the specific proposal is good for finding the right balance 
between flexibility and too much complexity, to allow incremental improvements. 
 A lot of work has been put into this already to try to figure out which pieces 
are essential to make alternative shuffle storage implementations feasible.

Of course, that means it doesn't include everything imaginable; some things 
still aren't supported, and some will still choose to use the older 
ShuffleManager api to give total control over all of shuffle.  But we know 
there are a reasonable set of things which can be implemented behind the api as 
the first step, and it can continue to evolve.
 
 
 
On Fri, Jun 14, 2019 at 12:13 PM Ilan Filonenko  wrote:
 

+1 (non-binding). This API is versatile and flexible enough to handle 
Bloomberg's internal use-cases. The ability for us to vary implementation 
strategies is quite appealing. It is also worth to note the minimal changes to 
Spark core in order to make it work. This is a very much needed addition within 
the Spark shuffle story. 
 
 
 
On Fri, Jun 14, 2019 at 9:59 AM bo yang  wrote:
 

+1 This is great work, allowing plugin of different sort shuffle write/read 
implementation! Also great to see it retain the current Spark configuration 

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-29 Thread Tom Graves
 Ok, I'm going to call this vote and send the result email. We had 9 +1's (4 
binding) and 1 +0 and no -1's.
Tom
On Monday, May 27, 2019, 3:25:14 PM CDT, Felix Cheung 
 wrote:  
 
+1
I’d prefer to see more of the end goal and how that could be achieved (such as 
ETL or SPARK-24579). However given the rounds and months of discussions we have 
come down to just the public API.
If the community thinks a new set of public API is maintainable, I don’t see 
any problem with that.
From: Tom Graves 
Sent: Sunday, May 26, 2019 8:22:59 AM
To: hol...@pigscanfly.ca; Reynold Xin
Cc: Bobby Evans; DB Tsai; Dongjoon Hyun; Imran Rashid; Jason Lowe; Matei 
Zaharia; Thomas graves; Xiangrui Meng; Xiangrui Meng; dev
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
Processing Support More feedback would be great, this has been open a long time 
though, let's extend til Wednesday the 29th and see where we are at.
Tom



Sent from Yahoo Mail on Android

On Sat, May 25, 2019 at 6:28 PM, Holden Karau wrote:
Same, 
I meant to catch up after kubecon but had some unexpected travels.
On Sat, May 25, 2019 at 10:56 PM Reynold Xin  wrote:

Can we push this to June 1st? I have been meaning to read it but unfortunately 
keeps traveling...
On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun  wrote:

+1
Thanks,Dongjoon.
On Fri, May 24, 2019 at 17:03 DB Tsai  wrote:

+1 on exposing the APIs for columnar processing support.

I understand that the scope of this SPIP doesn't cover AI / ML
use-cases. But I saw a good performance gain when I converted data
from rows to columns to leverage on SIMD architectures in a POC ML
application.

With the exposed columnar processing support, I can imagine that the
heavy lifting parts of ML applications (such as computing the
objective functions) can be written as columnar expressions that
leverage on SIMD architectures to get a good speedup.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
>
> It would allow for the columnar processing to be extended through the 
> shuffle.  So if I were doing say an FPGA accelerated extension it could 
> replace the ShuffleExechangeExec with one that can take a ColumnarBatch as 
> input instead of a Row. The extended version of the ShuffleExchangeExec could 
> then do the partitioning on the incoming batch and instead of producing a 
> ShuffleRowRDD for the exchange they could produce something like a 
> ShuffleBatchRDD that would let the serializing and deserializing happen in a 
> column based format for a faster exchange, assuming that columnar processing 
> is also happening after the exchange. This is just like providing a columnar 
> version of any other catalyst operator, except in this case it is a bit more 
> complex of an operator.
>
> On Wed, May 15, 2019 at 12:15 PM Imran Rashid  
> wrote:
>>
>> sorry I am late to the discussion here -- the jira mentions using this 
>> extensions for dealing with shuffles, can you explain that part?  I don't 
>> see how you would use this to change shuffle behavior at all.
>>
>> On Tue, May 14, 2019 at 10:59 AM Thomas graves  wrote:
>>>
>>> Thanks for replying, I'll extend the vote til May 26th to allow your
>>> and other people feedback who haven't had time to look at it.
>>>
>>> Tom
>>>
>>> On Mon, May 13, 2019 at 4:43 PM Holden Karau  wrote:
>>> >
>>> > I’d like to ask this vote period to be extended, I’m interested but I 
>>> > don’t have the cycles to review it in detail and make an informed vote 
>>> > until the 25th.
>>>

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Tom Graves
 Ok, I'm cancelling the vote for now then and we will make some updates to the 
SPIP to try to clarify.
Tom
On Monday, April 22, 2019, 1:07:25 PM CDT, Reynold Xin 
 wrote:  
 
 "if others think it would be helpful, we can cancel this vote, update the SPIP 
to clarify exactly what I am proposing, and then restart the vote after we have 
gotten more agreement on what APIs should be exposed"

That'd be very useful. At least I was confused by what the SPIP was about. No 
point voting on something when there is still a lot of confusion about what it 
is.

On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans  wrote:


Xiangrui Meng,

I provided some examples in the original discussion thread.

https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E

But the concrete use case that we have is GPU accelerated ETL on Spark. 
Primarily as data preparation and feature engineering for ML tools like 
XGBoost, which by the way exposes a Spark specific scala API, not just a 
python one. We built a proof of concept and saw decent performance gains. 
Enough gains to more than pay for the added cost of a GPU, with the potential 
for even better performance in the future. With that proof of concept, we were 
able to make all of the processing columnar end-to-end for many queries so 
there really wasn't any data conversion costs to overcome, but we did want the 
design flexible enough to include a cost-based optimizer.

It looks like there is some confusion around this SPIP, especially in how it 
relates to features in other SPIPs around data exchange between different 
systems. I didn't want to update the text of this SPIP while it was under an 
active vote, but if others think it would be helpful, we can cancel this vote, 
update the SPIP to clarify exactly what I am proposing, and then restart the 
vote after we have gotten more agreement on what APIs should be exposed.

Thanks,

Bobby

On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng  wrote:


Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I think 
the SPIP should list a concrete ETL use case (from the POC?) that can benefit from 
this *public Java/Scala API*, does *vectorization*, and significantly *boosts 
the performance* even with data conversion overhead.

The current mid-term success (Pandas UDF) doesn't match the purpose of the SPIP and 
it can be done without exposing any public APIs.

Depending on how much benefit it brings, we might agree that a public Java/Scala 
API is needed. Then we might want to step slightly into how. I saw the options 
mentioned in the JIRA and discussion threads:

1. Expose `Array[Byte]` in Arrow format. Let the user decode it using an 
Arrow library.
2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also used by 
Spark internals. It makes it hard for us to change Spark internals in the future.
4. Expose something like `SparkRecordBatch` that is Arrow-compatible 
and maintain conversion between the internal `ColumnarBatch` and
`SparkRecordBatch`. It might cause conversion overhead in the future if 
our internal format becomes different from Arrow.

Note that both 3 and 4 will make many APIs public to be Arrow compatible. So we 
should really give concrete ETL cases to prove that it is important for us to do 
so.

On Mon, Apr 22, 2019 at 8:27 AM Tom Graves  wrote:


Given that there is still discussion and Spark Summit is this week, I'm going to 
extend the vote til Friday the 26th.

Tom
On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans wrote:

Yes, it is technically possible for the layout to change. No, it is not going to 
happen. It is already baked into several different official libraries which are 
widely used, not just for holding and processing the data, but also for transfer 
of the data between the various implementations. There would have to be a really 
serious reason to force an incompatible change at this point. So in the worst 
case, we can version the layout and bake that into the API that exposes the 
internal layout of the data. That way code that wants to program against a JAVA 
API can do so using the API that Spark provides; those who want to interface 
with something that expects the data in arrow format will already have to 
know what version of the format it was programmed against, and in the worst 
case if the layout does change we can support the new layout if needed.

On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler  wrote:

The Arrow data format is not yet stable, meaning there are no guarantees on 
backwards/forwards compatibility. Once version 1.0 is released, it will have 
those guarantees but it's hard to say when that will be. The remaining work to 
get there can be seen at
https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone. 
So yes, it is a risk that exposing Spark data as Arrow could cause 
an issue if handled by a different version that is not 

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Tom Graves
 not that hard to keep a storage format 
> backward-compatible: just document the format and extend it only in ways that 
> don’t break the meaning of old data (for example, add new version numbers or 
> field types that are read in a different way). It’s a bit harder for a Java 
> API, but maybe Spark could just expose byte arrays directly and work on those 
> if the API is not guaranteed to stay stable (that is, we’d still use our own 
> classes to manipulate the data internally, and end users could use the Arrow 
> library if they want it).
> 
> Matei
> 
> > On Apr 20, 2019, at 8:38 AM, Bobby Evans  wrote:
> > 
> > I think you misunderstood the point of this SPIP. I responded to your 
> > comments in the SPIP JIRA.
> > 
> > On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng  wrote:
> > I posted my comment in the JIRA. Main concerns here:
> > 
> > 1. Exposing third-party Java APIs in Spark is risky. Arrow might have 1.0 
> > release someday.
> > 2. ML/DL systems that can benefits from columnar format are mostly in 
> > Python.
> > 3. Simple operations, though benefits vectorization, might not be worth the 
> > data exchange overhead.
> > 
> > So would an improved Pandas UDF API would be good enough? For example, 
> > SPARK-26412 (UDF that takes an iterator of of Arrow batches).
> > 
> > Sorry that I should join the discussion earlier! Hope it is not too late:)
> > 
> > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > +1 (non-binding) for better columnar data processing support.
> > 
> >  
> > 
> > From: Jules Damji  
> > Sent: Friday, April 19, 2019 12:21 PM
> > To: Bryan Cutler 
> > Cc: Dev 
> > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
> > Processing Support
> > 
> >  
> > 
> > + (non-binding)
> > 
> > Sent from my iPhone
> > 
> > Pardon the dumb thumb typos :)
> > 
> > 
> > On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> > 
> > +1 (non-binding)
> > 
> >  
> > 
> > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> > 
> > +1 (non-binding).  Looking forward to seeing better support for processing 
> > columnar data.
> > 
> >  
> > 
> > Jason
> > 
> >  
> > 
> > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves  
> > wrote:
> > 
> > Hi everyone,
> > 
> >  
> > 
> > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended 
> > Columnar Processing Support.  The proposal is to extend the support to 
> > allow for more columnar processing.
> > 
> >  
> > 
> > You can find the full proposal in the jira at: 
> > https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS 
> > thread in the dev mailing list.
> > 
> >  
> > 
> > Please vote as early as you can, I will leave the vote open until next 
> > Monday (the 22nd), 2pm CST to give people plenty of time.
> > 
> >  
> > 
> > [ ] +1: Accept the proposal as an official SPIP
> > 
> > [ ] +0
> > 
> > [ ] -1: I don't think this is a good idea because ...
> > 
> >  
> > 
> >  
> > 
> > Thanks!
> > 
> > Tom Graves
> > 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



  

[VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-16 Thread Tom Graves
Hi everyone,
I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended 
Columnar Processing Support.  The proposal is to extend the support to allow 
for more columnar processing.
You can find the full proposal in the jira at: 
https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS 
thread in the dev mailing list.
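For readers new to the area, here is a minimal sketch of the kind of extension point 
being proposed; the injectColumnar / ColumnarRule names below are assumptions for 
illustration only, not the text being voted on:

  import org.apache.spark.sql.SparkSessionExtensions
  import org.apache.spark.sql.catalyst.rules.Rule
  import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

  // A no-op placeholder where a plugin would swap row-based operators for
  // columnar (e.g. GPU-backed) implementations around the row/columnar transitions.
  case class MyColumnarRule() extends ColumnarRule {
    override def preColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
      override def apply(plan: SparkPlan): SparkPlan = plan
    }
    override def postColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
      override def apply(plan: SparkPlan): SparkPlan = plan
    }
  }

  // Registered via the spark.sql.extensions config.
  class MyExtensions extends (SparkSessionExtensions => Unit) {
    override def apply(ext: SparkSessionExtensions): Unit =
      ext.injectColumnar(session => MyColumnarRule())
  }
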
Please vote as early as you can, I will leave the vote open until next Monday 
(the 22nd), 2pm CST to give people plenty of time.
[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks! Tom Graves

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Tom Graves
 +1 on the updated SPIP.
Tom
On Monday, March 18, 2019, 12:56:22 PM CDT, Xingbo Jiang 
 wrote:  
 
 Hi all,
I updated the SPIP doc and stories, I hope it now contains clear scope of the 
changes and enough details for the SPIP vote. Please review the updated docs, thanks!
Xiangrui Meng  于2019年3月6日周三 上午8:35写道:

How about letting Xingbo make a major revision to the SPIP doc to make it clear 
what is proposed? I like Felix's suggestion to switch to the new Heilmeier 
template, which helps clarify what are proposed and what are not. Then let's 
review the new SPIP and resume the vote.
On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid  wrote:

OK, I suppose then we are getting bogged down into what a vote on an SPIP means 
then anyway, which I guess we can set aside for now.  With the level of detail 
in this proposal, I feel like there is a reasonable chance I'd still -1 the 
design or implementation.
And the other thing you're implicitly asking the community for is to prioritize 
this feature for continued review and maintenance.  There is already work to be 
done in things like making barrier mode support dynamic allocation 
(SPARK-24942), bugs in failure handling (eg. SPARK-25250), and general 
efficiency of failure handling (eg. SPARK-25341, SPARK-20178).  I'm very 
concerned about getting spread too thin.


But if this is really just a vote on (1) is better gpu support important for 
spark, in some form, in some release? and (2) is it *possible* to do this in a 
safe way?  then I will vote +0.
On Tue, Mar 5, 2019 at 8:25 AM Tom Graves  wrote:

 So to me most of the questions here are implementation/design questions, I've 
had this issue in the past with SPIP's where I expected to have more high level 
design details but was basically told that belongs in the design jira follow 
on. This makes me think we need to revisit what a SPIP really needs to contain, 
which should be done in a separate thread.  Note that personally I would be for 
having more high level details in it. But the way I read our documentation on a 
SPIP right now, that detail is all optional; now maybe we could argue it's based 
on what reviewers request, but really perhaps we should make the wording of 
that more required.  Thoughts?  We should probably separate that discussion if 
people want to talk about that.
For this SPIP in particular the reason I +1 it is because it came down to 2 
questions:
1) do I think spark should support this -> my answer is yes, I think this would 
improve spark, users have been requesting both better GPUs support and support 
for controlling container requests at a finer granularity for a while.  If 
spark doesn't support this then users may go to something else, so I think 
we should support it.
2) do I think its possible to design and implement it without causing large 
instabilities?   My opinion here again is yes. I agree with Imran and others 
that the scheduler piece needs to be looked at very closely as we have had a 
lot of issues there and that is why I was asking for more details in the design 
jira:  https://issues.apache.org/jira/browse/SPARK-27005.  But I do believe its 
possible to do.
If others have reservations on similar questions then I think we should resolve 
here or take the discussion of what a SPIP is to a different thread and then 
come back to this, thoughts?    
Note there is already a high level design for at least the core piece, which is 
what people seem concerned with, so including it in the SPIP should be 
straightforward.
Tom
On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid 
 wrote:  
 
 On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:

On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung  wrote:
IMO upfront allocation is less useful. Specifically too expensive for large 
jobs.

This is also an API/design discussion.

I agree with Felix -- this is more than just an API question.  It has a huge 
impact on the complexity of what you're proposing.  You might be proposing big 
changes to a core and brittle part of spark, which is already short of experts.
I don't see any value in having a vote on "does feature X sound cool?"  We have 
to evaluate the potential benefit against the risks the feature brings and the 
continued maintenance cost.  We don't need super low-level details, but we have 
to have a sketch of the design to be able to make that tradeoff.  


  

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Tom Graves
 While I agree with you that it would be ideal to have the task level resources 
and do a deeper redesign for the scheduler, I think that can be a separate 
enhancement like was discussed earlier in the thread. That feature is useful 
without GPU's.  I do realize that they overlap some but I think the changes for 
this will be minimal to the scheduler, follow existing conventions, and it is 
an improvement over what we have now. I know many users will be happy to have 
this even without the task level scheduling as many of the conventions used now 
to schedule GPUs can easily be broken by one bad user.  I think from the 
user point of view this gives many users an improvement and we can extend it 
later to cover more use cases. 
Tom
On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra 
 wrote:  
 
 I understand the application-level, static, global nature of 
spark.task.accelerator.gpu.count and its similarity to the existing 
spark.task.cpus, but to me this feels like extending a weakness of Spark's 
scheduler, not building on its strengths. That is because I consider binding 
the number of cores for each task to an application configuration to be far 
from optimal. This is already far from the desired behavior when an application 
is running a wide range of jobs (as in a generic job-runner style of Spark 
application), some of which require or can benefit from multi-core tasks, 
others of which will just waste the extra cores allocated to their tasks. 
Ideally, the number of cores allocated to tasks would get pushed to an even 
finer granularity than jobs, and instead be a per-stage property.
Now, of course, making allocation of general-purpose cores and domain-specific 
resources work in this finer-grained fashion is a lot more work than just 
trying to extend the existing resource allocation mechanisms to handle 
domain-specific resources, but it does feel to me like we should at least be 
considering doing that deeper redesign.  
On Thu, Mar 21, 2019 at 7:33 AM Tom Graves  wrote:

 The proposal here is that all your resources are static and the gpu per task 
config is global per application, meaning you ask for a certain amount of memory, 
cpu, and GPUs for every executor up front just like you do today, and every executor 
you get is that size.  This means that both static and dynamic allocation still 
work without explicitly adding more logic at this point. Since the config for 
gpu per task is global it means every task you want will need a certain ratio 
of cpu to gpu.  Since that is a global you can't really have the scenario you 
mentioned; all tasks are assumed to need a GPU.  For instance, I request 5 
cores, 2 GPUs, and set 1 gpu per task for each executor.  That means that I could 
only run 2 tasks and 3 cores would be wasted.  The stage/task level 
configuration of resources was removed and is something we can do in a separate 
SPIP. We thought erroring would make it more obvious to the user.  We could 
change this to a warning if everyone thinks that is better, but I personally 
like the error until we can implement the lower-level per-stage configuration. 
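
As a concrete illustration of the static, application-level shape described above 
(spark.task.accelerator.gpu.count is the name used in the SPIP; the executor-level 
config name here is only an assumption):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.executor.cores", "5")
    .set("spark.executor.accelerator.gpu.count", "2")  // assumed config name
    .set("spark.task.accelerator.gpu.count", "1")
  // Each executor: 5 cores, 2 GPUs, 1 GPU required per task, so at most 2 tasks
  // run concurrently and 3 of the 5 cores sit idle -- the case described above.
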
Tom
On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido 
 wrote:  
 
 Thanks for this SPIP. I cannot comment on the docs, but just wanted to 
highlight one thing. In page 5 of the SPIP, when we talk about DRA, I see:
"For instance, if each executor consists 4 CPUs and 2 GPUs, and each task 
requires 1 CPU and 1GPU, then we shall throw an error on application start 
because we shall always have at least 2 idle CPUs per executor"
I am not sure this is a correct behavior. We might have tasks requiring only 
CPU running in parallel as well, hence that may make sense. I'd rather emit a 
WARN or something similar. Anyway we just said we will keep GPU scheduling on 
task level out of scope for the moment, right?
Thanks,Marco
Il giorno gio 21 mar 2019 alle ore 01:26 Xiangrui Meng  ha 
scritto:

Steve, the initial work would focus on GPUs, but we will keep the interfaces 
general to support other accelerators in the future. This was mentioned in the 
SPIP and draft design. 
Imran, you should have comment permission now. Thanks for making a pass! I 
don't think the proposed 3.0 features should block Spark 3.0 release either. It 
is just an estimate of what we could deliver. I will update the doc to make it 
clear.
Felix, it would be great if you can review the updated docs and let us know 
your feedback.
** How about setting a tentative vote closing time to next Tue (Mar 26)?
On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid  wrote:

Thanks for sending the updated docs.  Can you please give everyone the ability 
to comment?  I have some comments, but overall I think this is a good proposal 
and addresses my prior concerns.
My only real concern is that I notice some mention of "must dos" for spark 3.0. 
 I don't want to make any commitment to holding spark 3.0 for parts of this, I 
think that is an entirely separat

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Tom Graves
 The proposal here is that all your resources are static and the gpu per task 
config is global per application, meaning you ask for a certain amount of memory, 
cpu, and GPUs for every executor up front just like you do today, and every executor 
you get is that size.  This means that both static and dynamic allocation still 
work without explicitly adding more logic at this point. Since the config for 
gpu per task is global it means every task you want will need a certain ratio 
of cpu to gpu.  Since that is a global you can't really have the scenario you 
mentioned; all tasks are assumed to need a GPU.  For instance, I request 5 
cores, 2 GPUs, and set 1 gpu per task for each executor.  That means that I could 
only run 2 tasks and 3 cores would be wasted.  The stage/task level 
configuration of resources was removed and is something we can do in a separate 
SPIP. We thought erroring would make it more obvious to the user.  We could 
change this to a warning if everyone thinks that is better, but I personally 
like the error until we can implement the lower-level per-stage configuration. 
Tom
On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido 
 wrote:  
 
 Thanks for this SPIP. I cannot comment on the docs, but just wanted to 
highlight one thing. In page 5 of the SPIP, when we talk about DRA, I see:
"For instance, if each executor consists 4 CPUs and 2 GPUs, and each task 
requires 1 CPU and 1GPU, then we shall throw an error on application start 
because we shall always have at least 2 idle CPUs per executor"
I am not sure this is a correct behavior. We might have tasks requiring only 
CPU running in parallel as well, hence that may make sense. I'd rather emit a 
WARN or something similar. Anyway we just said we will keep GPU scheduling on 
task level out of scope for the moment, right?
Thanks,Marco
Il giorno gio 21 mar 2019 alle ore 01:26 Xiangrui Meng  ha 
scritto:

Steve, the initial work would focus on GPUs, but we will keep the interfaces 
general to support other accelerators in the future. This was mentioned in the 
SPIP and draft design. 
Imran, you should have comment permission now. Thanks for making a pass! I 
don't think the proposed 3.0 features should block Spark 3.0 release either. It 
is just an estimate of what we could deliver. I will update the doc to make it 
clear.
Felix, it would be great if you can review the updated docs and let us know 
your feedback.
** How about setting a tentative vote closing time to next Tue (Mar 26)?
On Wed, Mar 20, 2019 at 11:01 AM Imran Rashid  wrote:

Thanks for sending the updated docs.  Can you please give everyone the ability 
to comment?  I have some comments, but overall I think this is a good proposal 
and addresses my prior concerns.
My only real concern is that I notice some mention of "must dos" for spark 3.0. 
 I don't want to make any commitment to holding spark 3.0 for parts of this, I 
think that is an entirely separate decision.  However I'm guessing this is just 
a minor wording issue, and you really mean that's a minimal set of features you 
are aiming for, which is reasonable.
On Mon, Mar 18, 2019 at 12:56 PM Xingbo Jiang  wrote:

Hi all,
I updated the SPIP doc and stories, I hope it now contains clear scope of the 
changes and enough details for the SPIP vote. Please review the updated docs, thanks!
Xiangrui Meng  于2019年3月6日周三 上午8:35写道:

How about letting Xingbo make a major revision to the SPIP doc to make it clear 
what is proposed? I like Felix's suggestion to switch to the new Heilmeier 
template, which helps clarify what are proposed and what are not. Then let's 
review the new SPIP and resume the vote.
On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid  wrote:

OK, I suppose then we are getting bogged down into what a vote on an SPIP means 
then anyway, which I guess we can set aside for now.  With the level of detail 
in this proposal, I feel like there is a reasonable chance I'd still -1 the 
design or implementation.
And the other thing you're implicitly asking the community for is to prioritize 
this feature for continued review and maintenance.  There is already work to be 
done in things like making barrier mode support dynamic allocation 
(SPARK-24942), bugs in failure handling (eg. SPARK-25250), and general 
efficiency of failure handling (eg. SPARK-25341, SPARK-20178).  I'm very 
concerned about getting spread too thin.


But if this is really just a vote on (1) is better gpu support important for 
spark, in some form, in some release? and (2) is it *possible* to do this in a 
safe way?  then I will vote +0.
On Tue, Mar 5, 2019 at 8:25 AM Tom Graves  wrote:

 So to me most of the questions here are implementation/design questions, I've 
had this issue in the past with SPIP's where I expected to have more high level 
design details but was basically told that belongs in the design jira follow 
on. This makes me think we need to revisit what a SPIP really needs to contain, 
which should be done in a se

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-05 Thread Tom Graves
 So to me most of the questions here are implementation/design questions, I've 
had this issue in the past with SPIP's where I expected to have more high level 
design details but was basically told that belongs in the design jira follow 
on. This makes me think we need to revisit what a SPIP really needs to contain, 
which should be done in a separate thread.  Note that personally I would be for 
having more high level details in it. But the way I read our documentation on a 
SPIP right now, that detail is all optional; now maybe we could argue it's based 
on what reviewers request, but really perhaps we should make the wording of 
that more required.  Thoughts?  We should probably separate that discussion if 
people want to talk about that.
For this SPIP in particular the reason I +1 it is because it came down to 2 
questions:
1) do I think spark should support this -> my answer is yes, I think this would 
improve spark, users have been requesting both better GPUs support and support 
for controlling container requests at a finer granularity for a while.  If 
spark doesn't support this then users may go to something else, so I think 
we should support it.
2) do I think its possible to design and implement it without causing large 
instabilities?   My opinion here again is yes. I agree with Imran and others 
that the scheduler piece needs to be looked at very closely as we have had a 
lot of issues there and that is why I was asking for more details in the design 
jira:  https://issues.apache.org/jira/browse/SPARK-27005.  But I do believe its 
possible to do.
If others have reservations on similar questions then I think we should resolve 
here or take the discussion of what a SPIP is to a different thread and then 
come back to this, thoughts?    
Note there is already a high level design for at least the core piece, which is 
what people seem concerned with, so including it in the SPIP should be 
straightforward.
Tom
On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid 
 wrote:  
 
 On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:

On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung  wrote:
IMO upfront allocation is less useful. Specifically too expensive for large 
jobs.

This is also an API/design discussion.

I agree with Felix -- this is more than just an API question.  It has a huge 
impact on the complexity of what you're proposing.  You might be proposing big 
changes to a core and brittle part of spark, which is already short of experts.
I don't see any value in having a vote on "does feature X sound cool?"  We have 
to evaluate the potential benefit against the risks the feature brings and the 
continued maintenance cost.  We don't need super low-level details, but we have 
to have a sketch of the design to be able to make that tradeoff.  

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Tom Graves
 +1 for the SPIP.
Tom
On Friday, March 1, 2019, 8:14:43 AM CST, Xingbo Jiang 
 wrote:  
 
 Hi all,
I want to call for a vote of SPARK-24615. It improves Spark by making it aware 
of GPUs exposed by cluster managers, and hence Spark can match GPU resources 
with user task requests properly. The proposal and production doc was made 
available on dev@ to collect input. You can also find a design sketch at 
SPARK-27005.
The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical 
reasons.

Thank you!
Xingbo  

Re: Jenkins commands?

2019-02-07 Thread Tom Graves
 Thanks, that is exactly what I was looking for.
Tom
On Wednesday, February 6, 2019, 10:50:14 PM CST, shane knapp 
 wrote:  
 
the PRB executes the following scripts:

  ./dev/run-tests-jenkins
  ./build/sbt unsafe/test

SBT QA tests:

  ./dev/run-tests

maven QA tests:

  ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
  MVN="build/mvn --force -DzincPort=$ZINC_PORT"
  $MVN -DskipTests -P"hadoop-2.7" -Pyarn -Phive -Phive-thriftserver \
    -Pkinesis-asl -Pmesos clean package
  $MVN -P"hadoop-2.7" -Pyarn -Phive -Phive-thriftserver \
    -Pkinesis-asl -Pmesos --fail-at-end test

there are some specific rise/amp-lab variables involved (grep -r AMPLAB 
spark/*) for the build system, but this should cover it.
On Wed, Feb 6, 2019 at 3:55 PM Tom Graves  wrote:

I'm curious if we have it documented anywhere or if there is a good place to 
look, what exact commands Spark runs in the pull request builds and the QA 
builds?  

Thanks,Tom


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
  

Jenkins commands?

2019-02-06 Thread Tom Graves
I'm curious if we have it documented anywhere or if there is a good place to 
look, what exact commands Spark runs in the pull request builds and the QA 
builds?  

Thanks,Tom

Re: [Structured Streaming] Kafka group.id is fixed

2018-11-19 Thread Tom Graves
 This makes sense to me and I was going to propose something similar in order to 
be able to use the kafka acls more effectively as well, can you file a jira for 
it?
Tom
On Friday, November 9, 2018, 2:26:12 AM CST, Anastasios Zouzias 
 wrote:  
 
 Hi all,
I ran into the following situation with Spark Structured Streaming (SS) using 
Kafka.
In a project that I work on, there is already a secured Kafka setup where ops 
can issue an SSL certificate per "group.id", which should be predefined (or 
hopefully its prefix to be predefined).
On the other hand, Spark SS fixes the group.id to 
val uniqueGroupId = 
s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
see, i.e.,

https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124
I guess Spark developers had a good reason to fix it, but is it possible to 
make the prefix of the above uniqueGroupId ("spark-kafka-source") configurable? 
If so, I could prepare a PR on it.
The rationale is that we do not want all spark-jobs to use the same certificate 
on group-ids of the form (spark-kafka-source-*).
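
A minimal sketch of what such an option could look like from the user side; 
"groupIdPrefix" is a hypothetical option name and an active SparkSession named 
spark is assumed:

  // Hypothetical: a configurable prefix, so the generated group.id becomes
  // s"$prefix-${UUID.randomUUID}-${metadataPath.hashCode}" and ops can issue
  // certificates/ACLs per team prefix instead of per spark-kafka-source-*.
  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("groupIdPrefix", "team-a-spark")  // assumed option name
    .load()
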

Best regards, Anastasios Zouzias

Re: Test and support only LTS JDK release?

2018-11-07 Thread Tom Graves
 +1 seems reasonable at this point.
Tom
On Tuesday, November 6, 2018, 1:24:16 PM CST, DB Tsai  
wrote:  
 
  Given Oracle's new 6-month release model, I feel the only realistic option is 
to only test and support LTS JDK releases such as JDK 11 and future LTS releases. 
I would like to have a discussion on this in the Spark community.  
Thanks,

DB Tsai  |  Siri Open Source Technologies [not a contribution]  |   Apple, Inc

  

Re: What's a blocker?

2018-10-25 Thread Tom Graves
Ignoring everything else in this thread to put a sharper point on one issue. In 
the PR multiple people argued that it's not a blocker because it was also a 
bug/dropped feature in the previous release (note one was phrased slightly 
differently as it was stated to not be a regression, which I read as not a 
regression from the previous feature release).  My thought on this is that if 
multiple people think this then others may as well, so I think we need a 
discuss thread on it.
My reason for disagreeing with that is that it specifically goes against our 
documented versioning policy.  The jira claims we essentially broke proper 
support for Hive UDAFs; we specifically state in our docs that we support Hive 
UDAFs, I consider that an API, and our versioning docs say we won't break API 
compatibility in feature releases. It shouldn't matter whether that was 1 feature 
release ago or 10; until we do a major release we shouldn't break or drop that 
compatibility.
So we should not be using that as a reason to decide if a jira is a blocker or 
not.

Tom 
 
  On Thu, Oct 25, 2018 at 9:39 AM, Sean Owen wrote:   What 
does "PMC members aren't saying its a block for reasons other then the actual 
impact the jira has" mean that isn't already widely agreed? Likewise 
"Committers and PMC members should not be saying its not a blocker because they 
personally or their company doesn't care about this feature or api". It sounds 
like insinuation, and I'd rather make it explicit -- call out the bad actions 
-- or keep it to observable technical issues.
Likewise one could say there's a problem just because A thinks X should be a 
blocker and B disagrees. I see no bad faith, process problem, or obvious 
errors. Do you? I see disagreement, and it's tempting to suspect motives. I 
have seen what I think are actual bad-faith decisions in the past in this 
project, too. I don't see it here though and want to stick to 'now'.

(Aside: the implication is that those representing vendors are steam-rolling a 
release. Actually, the cynical incentives cut the other way here. Blessing the 
latest changes as OSS Apache Spark is predominantly beneficial to users of OSS, 
not distros. In fact, it forces distros to make changes. And broadly, vendors 
have much more accountability for quality of releases, because they're paid to.)

I'm still not sure what specifically the objection is to what here? I 
understand a lot is in flight and nobody agrees with every decision made, but, 
what else is new? Concretely: the release is held again to fix a few issues, in 
the end. For the map_filter issue, that seems like the right call, and there 
are a few other important issues that could be quickly fixed too. All is well 
there, yes?
This has surfaced some implicit reasoning about releases that we could make 
explicit, like:
(Sure, if you want to write down things like, release blockers should be 
decided in the interests of the project by the PMC, OK)
We have a time-based release schedule, so time matters. There is an opportunity 
cost to not releasing. The bar for blockers goes up over time.
Not all regressions are blockers. Would you hold a release over a trivial 
regression? but then which must or should block? There's no objective answer, 
but a reasonable rule is: non-trivial regressions from minor release x.y to 
x.{y+1} block releases. Regressions from x.{y-1} to x.{y+1} should, but not 
necessarily, block the release. We try hard to avoid regressions in x.y.0 
releases because these are generally consumed by aggressive upgraders, on 
x.{y-1}.z now. If a bug exists in x.{y-1}, they're not affected or worked 
around it. The cautious upgrader goes from maybe x.{y-2}.z to x.y.1 later. 
They're affected, but not before, maybe, a maintenance release. A crude 
argument, and it's not an argument that regressions are OK. It's an argument 
that 'old' regressions matter less. And maybe it's reasonable to draw the 
"must" vs "should" line between them.



On Thu, Oct 25, 2018 at 8:51 AM Tom Graves  wrote:

 So just to clarify a few things in case people didn't read the entire thread 
in the PR, the discussion is what is the criteria for a blocker and really my 
concerns are what people are using as criteria for not marking a jira as a 
blocker.
The only thing we have documented to mark a jira as a blocker is for 
correctness issues: http://spark.apache.org/contributing.html.  And really I 
think that is initially mark it as a blocker to bring attention to it.The final 
decision as to whether something is a blocker is up to the PMC who votes on 
whether a release passes.  I think it would be impossible to properly define 
what a blocker is with strict rules.
Personally from this thread I would like to make sure committers and PMC 
members aren't saying its a block for reasons other then the actual impact the 
jira has and if its at all in question it should be brought to the PMC's 
attention for a vote.  I agree with others that if its

Re: What's a blocker?

2018-10-25 Thread Tom Graves
 So just to clarify a few things in case people didn't read the entire thread 
in the PR, the discussion is what is the criteria for a blocker and really my 
concerns are what people are using as criteria for not marking a jira as a 
blocker.
The only thing we have documented to mark a jira as a blocker is for 
correctness issues: http://spark.apache.org/contributing.html.  And really I 
think that is initially mark it as a blocker to bring attention to it.The final 
decision as to whether something is a blocker is up to the PMC who votes on 
whether a release passes.  I think it would be impossible to properly define 
what a blocker is with strict rules.
Personally from this thread I would like to make sure committers and PMC 
members aren't saying its a block for reasons other then the actual impact the 
jira has and if its at all in question it should be brought to the PMC's 
attention for a vote.  I agree with others that if its during an RC it should 
be talked about on the RC thread.
A few specific things that were said that I disagree with are:
- It's not a blocker because it was also an issue in the last release (meaning 
feature release), i.e. the bug was introduced in 2.2 and now we are doing 2.4 so 
it's automatically not a blocker.  This to me is just wrong.  Lots of things are 
not found immediately, or aren't reported immediately.  Now I do believe the 
timeframe it's been in there does affect the decision on the impact, but making 
the decision on this alone is too strict.
- Committers and PMC members should not be saying it's not a blocker because 
they personally or their company doesn't care about this feature or api, or 
state that the Spark project as a whole doesn't care about this feature unless 
that was specifically voted on at the project level. They need to follow the api 
compatibility we have documented. This is really a broader issue than just 
marking a jira; it goes to anything checked in and perhaps needs to be a 
separate thread.

For the verbiage of what a regression is, it seems like that should be defined 
by our versioning documents. It states what we do in maintenance, feature, and 
major releases (http://spark.apache.org/versioning-policy.html); if it's not 
defined by that we probably need to clarify.   There was a good example we 
might want to clarify about things like scala or java compatibility in feature 
releases.  
Obviously this is my opinion and its here for everyone to discuss and come to a 
consensus on.   
Tom
On Wednesday, October 24, 2018, 2:09:49 PM CDT, Sean Owen 
 wrote:  
 
 Shifting this to dev@. See the PR https://github.com/apache/spark/pull/22144 
for more context.
There will be no objective, complete definition of blocker, or even regression 
or correctness issue. Many cases are clear, some are not. We can draw up more 
guidelines, and feel free to open PRs against the 'contributing' doc. But in 
general these are the same consensus-driven decisions we negotiate all the time.
What isn't said that should be is that there is a cost to not releasing. Keep 
in mind we have, also, decided on a 'release train' cadence. That does properly 
change the calculus about what's a blocker; the right decision could change 
within even a week.

I wouldn't mind some verbiage around what a regression is. Since the last minor 
release?
We can VOTE on anything we like, but we already VOTE on the release. Weirdly, 
technically, the release vote criteria is simple majority, FWIW: 
http://www.apache.org/legal/release-policy.html#release-approval 
Yes, actually, it is only the PMC's votes that literally matter. Those votes 
are, surely, based on input from others too. But that is actually working as 
intended.

Let's understand statements like "X is not a blocker" to mean "I don't think 
that X is a blocker". Interpretations not proclamations, backed up by reasons, 
not all of which are appeals to policy and precedent.
I find it hard to argue about these in the abstract, because I believe it's 
already widely agreed, and written down in ASF policy, that nobody makes 
decisions unilaterally. Done, yes. 
Practically speaking, the urgent issue is the 2.4 release. I don't see process 
failures here that need fixing or debate. I do think those outstanding issues 
merit technical discussion. The outcome will be a tradeoff of some subjective 
issues, not read off of a policy sheet, and will entail tradeoffs. Let's speak 
freely about those technical issues and try to find the consensus position.

On Wed, Oct 24, 2018 at 12:21 PM Mark Hamstra  wrote:


Thanks @tgravescs for your latest posts -- they've saved me from posting 
something similar in many respects but more strongly worded.

What is bothering me (not just in the discussion of this PR, but more broadly) 
is that we have individuals making declarative statements about whether 
something can or can't block a release, or that something "is not that 
important to Spark at this point", etc. -- things for which 

Re: [DISCUSS] Handling correctness/data loss jiras

2018-08-17 Thread Tom Graves
 Since we haven't heard any objections to this, the documentation has been 
updated (Thanks to Sean).
All devs please make sure to re-read: http://spark.apache.org/contributing.html 
.
Note the set of labels used in Jira has been documented and correctness or data 
loss issues should be marked as blocker by default.  There is also a label to 
mark the jira as having something needing to go into the release-notes.

Tom
On Tuesday, August 14, 2018, 3:32:27 PM CDT, Imran Rashid 
 wrote:  
 
 +1 on what we should do.

On Mon, Aug 13, 2018 at 3:06 PM, Tom Graves  
wrote:

 
> I mean, what are concrete steps beyond saying this is a problem? That's the 
>important thing to discuss.
Sorry I'm a bit confused by your statement but also think I agree.  I started 
this thread for this reason. I pointed out that I thought it was a problem and 
also brought up things I thought we could do to help fix it.  
Maybe I wasn't clear in the first email, the list of things I had were 
proposals on what we do for a jira that is for a correctness/data loss issue. 
Its the committers and developers that are involved in this though so if people 
don't agree or aren't going to do them, then it doesn't work.
Just to restate what I think we should do:
- label any correctness/data loss jira with "correctness"
- jira should be marked as a blocker by default if someone suspects a corruption/loss issue
- Make sure the description is clear about when it occurs and impact to the user.
- ensure it's back ported to all active branches
- See if we can have a separate section in the release notes for these
The last one I guess is more a one time thing that I can file a jira for.  The 
first 4 would be done for each jira filed.
I'm proposing we do these things and as such if people agree we would also 
document those things in the committers or developers guide and send email to 
the list. 
 
TomOn Monday, August 13, 2018, 11:17:22 AM CDT, Sean Owen 
 wrote:  
 
 Generally: if someone thinks correctness fix X should be backported further, 
I'd say just do it, if it's to an active release branch (see below). Anything 
that important has to outweigh most any other concern, like behavior changes.

On Mon, Aug 13, 2018 at 11:08 AM Tom Graves  wrote:
I'm not really sure what you mean by this, this proposal is to introduce a 
process for this type of issue so it's at least brought to people's attention. We 
can't do anything to make people work on certain things.  If they aren't raised 
as important issues then its really easy to miss these things.  If its a 
blocker we should also not be doing any new releases without a fix for it which 
may motivate people to look at it.

I mean, what are concrete steps beyond saying this is a problem? That's the 
important thing to discuss.
There's a good one here: let's say anything that's likely to be a correctness 
or data loss issue should automatically be labeled 'correctness' as such and 
set to Blocker. 
That can go into the how-to-contribute manual in the docs and in a note to 
dev@.  
I agree it would be good for us to make it more official about which branches 
are being maintained.  I think at this point its still 2.1.x, 2.2.x, and 2.3.x 
since we recently did releases of all of these.  Since 2.4 will be coming out 
we should definitely think about stopping maintenance of 2.1.x.  Perhaps we need a 
table on our release page about this.  But this should be a separate thread.


I propose writing something like this in the 'versioning' doc page, to at least 
establish a policy:
Minor release branches will, generally, be maintained with bug fix releases 
for a period of 18 months. For example, branch 2.1.x is no longer considered 
maintained as of July 2018, 18 months after the release of 2.1.0 in December 
2016.
This gives us -- and more importantly users -- some understanding of what to 
expect for backporting and fixes.

I am going to revive the thread about adding PMC members / committers, as it's overdue. 
That may not do much, but more hands to do the work ought to free up 
people to focus on deeper, harder issues.  

  

Re: [DISCUSS] Handling correctness/data loss jiras

2018-08-13 Thread Tom Graves
 
> I mean, what are concrete steps beyond saying this is a problem? That's the 
>important thing to discuss.
Sorry, I'm a bit confused by your statement, but I also think I agree.  I started 
this thread for this reason: I pointed out that I thought it was a problem and 
also brought up things I thought we could do to help fix it.  
Maybe I wasn't clear in the first email: the list of things I had were 
proposals on what we do for a jira that is for a correctness/data loss issue. 
It's the committers and developers that are involved in this, though, so if people 
don't agree or aren't going to do them, then it doesn't work.
Just to restate what I think we should do:
- label any correctness/data loss jira with "correctness"
- jira should be marked as a blocker by default if someone suspects a corruption/loss issue
- make sure the description is clear about when it occurs and the impact to the user
- ensure it's backported to all active branches
- see if we can have a separate section in the release notes for these
The last one I guess is more a one-time thing that I can file a jira for.  The 
first four would be done for each jira filed.
I'm proposing we do these things and as such if people agree we would also 
document those things in the committers or developers guide and send email to 
the list. 
 
Tom
On Monday, August 13, 2018, 11:17:22 AM CDT, Sean Owen 
 wrote:  
 
 Generally: if someone thinks correctness fix X should be backported further, 
I'd say just do it, if it's to an active release branch (see below). Anything 
that important has to outweigh most any other concern, like behavior changes.

On Mon, Aug 13, 2018 at 11:08 AM Tom Graves  wrote:
I'm not really sure what you mean by this; this proposal is to introduce a 
process for this type of issue so it's at least brought to people's attention. We 
can't do anything to make people work on certain things.  If they aren't raised 
as important issues then it's really easy to miss these things.  If it's a 
blocker we should also not be doing any new releases without a fix for it, which 
may motivate people to look at it.

I mean, what are concrete steps beyond saying this is a problem? That's the 
important thing to discuss.
There's a good one here: let's say anything that's likely to be a correctness 
or data loss issue should automatically be labeled 'correctness' as such and 
set to Blocker. 
That can go into the how-to-contribute manual in the docs and in a note to 
dev@.  
I agree it would be good for us to make it more official which branches 
are being maintained.  I think at this point it's still 2.1.x, 2.2.x, and 2.3.x, 
since we recently did releases of all of these.  Since 2.4 will be coming out, 
we should definitely think about stopping maintenance of 2.1.x.  Perhaps we need a 
table on our release page about this.  But this should be a separate thread.


I propose writing something like this in the 'versioning' doc page, to at least 
establish a policy:
Minor release branches will, generally, be maintained with bug fix releases 
for a period of 18 months. For example, branch 2.1.x is no longer considered 
maintained as of July 2018, 18 months after the release of 2.1.0 in December 
2016.
This gives us -- and more importantly users -- some understanding of what to 
expect for backporting and fixes.

I am going to revive the thread about adding PMC members / committers, as it's overdue. 
That may not do much, but more hands to do the work ought to free up 
people to focus on deeper, harder issues.  

Re: [DISCUSS] Handling correctness/data loss jiras

2018-08-13 Thread Tom Graves
 
Not a specific jira, but I was looking at all the recent jiras with the 
"correctness" label and things are definitely being handled inconsistently, in 
my opinion (https://issues.apache.org/jira/issues/?jql=labels+%3D+correctness). 
The inconsistencies are in the things I've mentioned above: priority is not 
set high enough, the description is not clear, some are backported to 2.2, some not.  
Obviously there could be ones without the "correctness" label as well, since 
until recently I was also not aware that this label should be applied to this 
type of issue.
We have no real guidelines in this area for developers and committers to follow 
so I think defining some would help everyone. 

I realize everyone's time is important and everyone has different priorities, 
but I think this sort of issue is one we as a community should take care 
of above everything else.  If I'm a business using Apache Spark for business-critical 
things and I find that there are data loss or corruption issues 
consistently in the releases and it's not our highest priority to fix them, I'm going 
to be very hesitant to adopt and stay with Spark. 
One specific example of priority is in the 2.4 code freeze/release thread, where 
it was brought up to release without SPARK-23243. And really, we have done a 
bunch of releases without this fix, but until recently it wasn't marked as a 
blocker either.  I'll admit that I missed this jira when it was filed and only 
recently became aware of it.  I have changed the priority on it.   
|  I share frustration that Somebody should be working on Important Things, but 
don't think the difference between getting those done and not done is reminding 
people that Important Things need doing. What's the cause that leads to 
concrete corrective action?
I'm not really sure what you mean by this; this proposal is to introduce a 
process for this type of issue so it's at least brought to people's attention. We 
can't do anything to make people work on certain things.  If they aren't raised 
as important issues then it's really easy to miss these things.  If it's a 
blocker we should also not be doing any new releases without a fix for it, which 
may motivate people to look at it.
I agree it would be good for us to make it more official which branches 
are being maintained.  I think at this point it's still 2.1.x, 2.2.x, and 2.3.x, 
since we recently did releases of all of these.  Since 2.4 will be coming out, 
we should definitely think about stopping maintenance of 2.1.x.  Perhaps we need a 
table on our release page about this.  But this should be a separate thread.

Tom
On Monday, August 13, 2018, 9:03:42 AM CDT, Sean Owen  
wrote:  
 
 I doubt the question is whether people want to take such issues seriously -- 
all else equal, of course everyone does. 
A JIRA label plus place in the release notes sounds like a good concrete step 
that isn't happening consistently now. That's a clear flag that at least one 
person believes issue X is a blocker. 

Is this about specific JIRAs? I think it's more useful to illustrate in the 
context of specific issues. For example I haven't been following JIRAs well, 
and don't know what is being contested here.
I share frustration that Somebody should be working on Important Things, but 
don't think the difference between getting those done and not done is reminding 
people that Important Things need doing. What's the cause that leads to 
concrete corrective action?
Do we need more committers? Fewer new features? More conservative releases? 
Less work on X to work on this?
Lastly, you raise an important question as an aside, one we haven't answered: 
when does a branch go inactive? I am sure 2.0.x is inactive, de facto, along 
with all 1.x. I think 2.1.x is inactive too. Should we put any rough guidance 
in place? A branch is maintained for 12-18 months?



On Mon, Aug 13, 2018 at 8:45 AM Tom Graves  wrote:

Hello all,
I've noticed some inconsistencies in the way we are handling data 
loss/correctness issues.  I think we need to take these very seriously, as they 
could be costing businesses real money and impacting real decisions and 
business logic.   I would like to discuss how we can make sure these are 
handled consistently and with urgency going forward.  
A few things I would like to propose are below.  Most of these are up to the 
developers and committers to ensure happen, so I want to know what everyone thinks 
and whether people have other ideas.
- label any correctness/data loss jira with "correctness"
- jira marked as blocker by default if someone suspects a corruption/loss issue
- make sure the description is clear about when it occurs and the impact to the user
- ensure it's backported to all active branches
- see if we can have a separate section in the release notes for these

Thanks,
Tom Graves
  

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-13 Thread Tom Graves
 I agree with Imran, we need to fix SPARK-23243 and any correctness issues for 
that matter.
Tom
On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid 
 wrote:  
 
 On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
It turns out to be a very complicated issue, there is no consensus about what 
is the right fix yet. Likely to miss it in Spark 2.4 because it's a 
long-standing issue, not a regression.

This is a really serious data loss bug.  Yes, it's very complex, but we 
absolutely have to fix this; I really think it should be in 2.4. Has work on 
it stopped?  

[DISCUSS] Handling correctness/data loss jiras

2018-08-13 Thread Tom Graves
Hello all,
I've noticed some inconsistencies in the way we are handling data 
loss/correctness issues.  I think we need to take these very seriously, as they 
could be costing businesses real money and impacting real decisions and 
business logic.   I would like to discuss how we can make sure these are 
handled consistently and with urgency going forward.  
A few things I would like to propose are below.  Most of these are up to the 
developers and committers to ensure happen, so I want to know what everyone thinks 
and whether people have other ideas.
- label any correctness/data loss jira with "correctness"
- jira marked as blocker by default if someone suspects a corruption/loss issue
- make sure the description is clear about when it occurs and the impact to the user
- ensure it's backported to all active branches
- see if we can have a separate section in the release notes for these

Thanks,
Tom Graves

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Tom Graves
I would like to get clarification on our Avro compatibility story before the 
release.  Anyone interested, please look at 
https://issues.apache.org/jira/browse/SPARK-24924 . I probably should have 
filed a separate jira, and I can if we don't resolve it via discussion there.
Tom 
On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp 
 wrote:  
 
 
According to the status, I think we should wait a few more days. Any objections?


none here.
i'm also pretty certain that waiting until after the code freeze to start 
testing the GHPRB on ubuntu is the wisest course of action for us.
shane -- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
  

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Tom Graves
 Shouldn't this be a discuss thread?  
I'm also happy to see more release managers and agree the time is getting 
close, but we should see what features are in progress, see how close things 
are, and propose a date based on that.  Cutting a branch too soon just creates 
more work for committers, who have to push to more branches. 
http://spark.apache.org/versioning-policy.html mentions the code freeze and 
release branch cut mid-August.

Tom
On Friday, July 6, 2018, 11:47:35 AM CDT, Reynold Xin  
wrote:  
 
 FYI 6 mo is coming up soon since the last release. We will cut the branch and 
code freeze on Aug 1st in order to get 2.4 out on time.
  

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-20 Thread Tom Graves
FYI, I merged in a couple of jiras that were critical (and that I thought would be good 
to include in the next release), so if we spin another RC they will get included. 
We should update the jiras, SPARK-24755 and SPARK-24677. If anyone disagrees we 
could back those out, but I think they would be good to include.
Tom
On Thursday, July 19, 2018, 8:13:23 PM CDT, Saisai Shao 
 wrote:  
 
 Sure, I can wait for this and create another RC then.
Thanks,
Saisai
On Fri, Jul 20, 2018 at 9:11 AM, Xiao Li wrote:

Yes. https://issues.apache.org/jira/browse/SPARK-24867 is the one I created. 
The PR has been created. Since this is not rare, let us merge it to 2.3.2? 
Reynold's PR is to get rid of AnalysisBarrier. That is better than the multiple 
patches we added for AnalysisBarrier after the 2.3.0 release. We can target it to 
2.4. 
Thanks, 
Xiao
2018-07-19 17:48 GMT-07:00 Saisai Shao :

I see, thanks Reynold.
On Fri, Jul 20, 2018 at 8:46 AM, Reynold Xin wrote:

Looking at the list of pull requests it looks like this is the ticket: 
https://issues.apache.org/jira/browse/SPARK-24867


On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin  wrote:

I don't think my ticket should block this release. It's a big general 
refactoring.
Xiao do you have a ticket for the bug you found?

On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao  wrote:

Hi Xiao,
Are you referring to this JIRA 
(https://issues.apache.org/jira/browse/SPARK-24865)?
On Fri, Jul 20, 2018 at 2:41 AM, Xiao Li wrote:

dfWithUDF.cache()
dfWithUDF.write.saveAsTable("t")
dfWithUDF.write.saveAsTable("t1")
Cached data is not being used. It causes a big performance regression. 
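For anyone reproducing this, a minimal sketch of one way to check whether the cached plan is actually picked up, by looking for an InMemoryTableScan / InMemoryRelation node in the physical plan (the DataFrame below is a hypothetical stand-in, not the real dfWithUDF from this report):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-reuse-check").getOrCreate()
// Hypothetical stand-in for the real UDF-based DataFrame.
val dfWithUDF = spark.range(100).selectExpr("id", "id * 2 AS doubled")

dfWithUDF.cache()
// If the cache is used, the plan printed here contains an InMemoryTableScan /
// InMemoryRelation node; if not, the original scan shows up instead.
dfWithUDF.explain()

dfWithUDF.write.saveAsTable("t")
dfWithUDF.write.saveAsTable("t1")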



2018-07-19 11:32 GMT-07:00 Sean Owen :

What regression are you referring to here? A -1 vote really needs a rationale.

On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:

I would first vote -1. 
I might have found another regression caused by the analysis barrier. Will keep you 
posted. 










  

[ANNOUNCE] Apache Spark 2.2.2

2018-07-10 Thread Tom Graves
We are happy to announce the availability of Spark 2.2.2!
Apache Spark 2.2.2 is a maintenance release, based on the branch-2.2 
maintenance branch of Spark. We strongly recommend all 2.2.x users to upgrade 
to this stable release. The release notes are available at 
http://spark.apache.org/releases/spark-release-2-2-2.html

To download Apache Spark 2.2.2 visit http://spark.apache.org/downloads.html. 
This version of Spark is also available on Maven and PyPI.
We would like to acknowledge all community members for contributing patches to 
this release.



[RESULT] [VOTE] Spark 2.2.2 (RC2)

2018-07-02 Thread Tom Graves
The vote passes. Thanks to all who helped with the release!

I'll start publishing everything tomorrow, and an announcement will
be sent when artifacts have propagated to the mirrors (probably
early next week).

+1 (* = binding):
- Marcelo Vanzin *
- Sean Owen *
- Tom Graves *
- Holden Karau *
- Dongjoon Hyun
- Takeshi Yamamuro
- Wenchen Fan *
- Zhenya Sun

+0: None

-1: None


Thanks,
Tom Graves

Re: [VOTE] Spark 2.2.2 (RC2)

2018-07-02 Thread Tom Graves
 I forgot to post it, I'm +1.
Tom
On Monday, July 2, 2018, 12:19:08 AM CDT, Holden Karau 
 wrote:  
 
Leaving documentation aside (I think we should maybe have a thread on dev@ about how we 
want to handle doc changes to existing releases), I'm +1. The PySpark venv checks 
out.
On Sun, Jul 1, 2018 at 9:40 PM, Hyukjin Kwon  wrote:

Let me leave a note about https://issues.apache.org/jira/browse/SPARK-24530.

The Python documentation should be built against Python 3's Sphinx for now as a 
workaround.
An issue was found, SPARK-24530, and I am now trying to update the 
documentation, the release process, and probably the Makefile script to add 
control of the Python executable version (https://github.com/apache/spark/pull/21659). 
There are more reasons for this proposal, described in the PR. Please refer to 
the PR description there for more information.

Strictly speaking, I believe it doesn't block the release, in my humble opinion. I will 
try to update the documentation very soon with the change proposed in 
https://github.com/apache/spark/pull/21659.
For clarification, please proceed with this vote and release orthogonally if that 
sounds fine to you all, or let me know if you think differently. To me, it 
sounds fine.


On Fri, Jun 29, 2018 at 1:42 AM, Dongjoon Hyun wrote:

+1
Tested on CentOS 7.4 and Oracle JDK 1.8.0_171.

Bests,Dongjoon.

On Thu, Jun 28, 2018 at 7:24 AM Takeshi Yamamuro  wrote:

+1

I ran tests on an EC2 m4.2xlarge instance:
[ec2-user]$ java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)



On Thu, Jun 28, 2018 at 11:38 AM Wenchen Fan  wrote:

+1

On Thu, Jun 28, 2018 at 10:19 AM zhenya Sun  wrote:

+1

On Jun 28, 2018, at 10:15 AM, Hyukjin Kwon wrote:
+1

On Thu, Jun 28, 2018 at 8:42 AM, Sean Owen wrote:

+1 from me too.

On Wed, Jun 27, 2018 at 3:31 PM Tom Graves  wrote:

 Please vote on releasing the following candidate as Apache Spark version 2.2.2.

The vote is open until Mon, July 2nd @ 9PM UTC (2PM PDT) and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.2.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.2-rc2 (commit fc28ba3db7185e84b6dbd02ad8ef8f1d06b9e3c6):
https://github.com/apache/spark/tree/v2.2.2-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.2.2-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1276/
The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.2.2-rc2-docs/
The list of bug fixes going into 2.2.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12342171


Notes:

- RC1 was not sent for a vote. I had trouble building it, and by the time I got
  things fixed, there was a blocker bug filed. It was already tagged in git
  at that time.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.2.2?
===

The current list of open tickets targeted at 2.2.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target 
Version/s" = 2.2.2


Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


-- Tom Graves







-- 
---
Takeshi Yamamuro






-- 
Twitter: https://twitter.com/holdenkarau
  

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Tom Graves
Right, we say we support R 3.1+ but we never actually did, so I agree it's a bug, 
but it's not a regression since we never really supported or tested with 
those versions, and it's not a logic or security bug that ends in corruption or bad 
behavior, so in my opinion it's not a blocker.   Again, I'm fine with adding it 
though if others agree.   Maybe we should really change our documentation to 
state more clearly which versions we know it works with and have tested with, 
since someone could read R 3.1+ as meaning it works with R 4 (once released), which 
very well might not be the case.   

I'm +1 on the release.
Tom
On Thursday, June 28, 2018, 10:28:21 AM CDT, Felix Cheung 
 wrote:  
 
Not pushing back, but our support message has always been R 3.1+, so it's a bit 
off to say we don't support newer releases.
https://spark.apache.org/docs/2.1.2/
But looking back, this was found during 2.1.2 RC2 and wasn't fixed (in time) for 
2.1.2?
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555
Since it isn't a regression, I'd say +1 from me.

From: Tom Graves 
Sent: Thursday, June 28, 2018 6:56:16 AM
To: Marcelo Vanzin; Felix Cheung
Cc: dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

If this is just supporting newer versions of R that 2.1 never supported, then I would 
say it's not a blocker. But if you feel it's useful enough, then I would say it's up to 
Marcelo whether he wants to pull it in and spin another RC.
Tom 
On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung 
 wrote:

Yes, this is broken with newer versions of R.
We check explicitly for warnings in the R check, which should fail the test run.
From: Marcelo Vanzin 
Sent: Wednesday, June 27, 2018 6:55 PM
To: Felix Cheung
Cc: Marcelo Vanzin; Tom Graves; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Not sure I understand that bug. Is it a compatibility issue with new
versions of R?

It's at least marked as fixed in 2.2(.1).

We do run jenkins on these branches, but that seems like just a
warning, which would not fail those builds...

On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung  wrote:
> (I don’t want to block the release(s) per se...)
>
> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>
> This is fixed in 2.3 back in Nov 2017
> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> Perhaps we don't get Jenkins run on these branches? It should have been
> detected.
>
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
> Code: function(what, pos = 2L, name = deparse(substitute(what),
> backtick = FALSE), warn.conflicts = TRUE)
> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
> warn.conflicts = TRUE)
> Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
> deparse(substitute(what))
>
> Codoc mismatches from documentation object 'glm':
> glm
> Code: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
> NULL, ...)
> Docs: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, contrasts = NULL, ...)
> Argument names in code not in docs:
> singular.ok
> Mismatches in argument names:
> Position: 16 Code: singular.ok Docs: contrasts
> Position: 17 Code: contrasts Docs: ...
>
> 
> From: Sean Owen 
> Sent: Wednesday, June 27, 2018 5:02:37 AM
> To: Marcelo Vanzin
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> +1 from me too for the usual reasons.
>
> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin 
> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.3.
>>
>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> https://github.com/apache/spark/tree/v2.1.3-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging r

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-28 Thread Tom Graves
If this is just supporting newer versions of R that 2.1 never supported, then I 
would say it's not a blocker. But if you feel it's useful enough, then I would say 
it's up to Marcelo whether he wants to pull it in and spin another RC.
Tom 
On Wednesday, June 27, 2018, 8:57:25 PM CDT, Felix Cheung 
 wrote:  
 
Yes, this is broken with newer versions of R.
We check explicitly for warnings in the R check, which should fail the test run.
From: Marcelo Vanzin 
Sent: Wednesday, June 27, 2018 6:55 PM
To: Felix Cheung
Cc: Marcelo Vanzin; Tom Graves; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Not sure I understand that bug. Is it a compatibility issue with new
versions of R?

It's at least marked as fixed in 2.2(.1).

We do run jenkins on these branches, but that seems like just a
warning, which would not fail those builds...

On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung  wrote:
> (I don’t want to block the release(s) per se...)
>
> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>
> This is fixed in 2.3 back in Nov 2017
> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> Perhaps we don't get Jenkins run on these branches? It should have been
> detected.
>
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
> Code: function(what, pos = 2L, name = deparse(substitute(what),
> backtick = FALSE), warn.conflicts = TRUE)
> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
> warn.conflicts = TRUE)
> Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
> deparse(substitute(what))
>
> Codoc mismatches from documentation object 'glm':
> glm
> Code: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
> NULL, ...)
> Docs: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, contrasts = NULL, ...)
> Argument names in code not in docs:
> singular.ok
> Mismatches in argument names:
> Position: 16 Code: singular.ok Docs: contrasts
> Position: 17 Code: contrasts Docs: ...
>
> 
> From: Sean Owen 
> Sent: Wednesday, June 27, 2018 5:02:37 AM
> To: Marcelo Vanzin
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> +1 from me too for the usual reasons.
>
> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin 
> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.3.
>>
>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> https://github.com/apache/spark/tree/v2.1.3-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1275/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>>
>> The list of bug fixes going into 2.1.3 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12341660
>>
>> Notes:
>>
>> - RC1 was not sent for a vote. I had trouble building it, and by the time
>> I got
>> things fixed, there was a blocker bug filed. It was already tagged in
>> git
>> at that time.
>>
>> - If testing the source package, I recommend using Java 8, even though 2.1
>> supports Java 7 (and the RC was built with JDK 7). This is because Maven
>> Central has updated some configuration that makes the default Java 7 SSL
>> config not work.
>>
>> - There are Maven artifacts published for Scala 2.10, but binary
>> releases are only
>> available for Scala 2.11. This matches the previous release (2.1.2),
>> but if there's
>> a ne

[VOTE] Spark 2.2.2 (RC2)

2018-06-27 Thread Tom Graves
 Please vote on releasing the following candidate as Apache Spark version 2.2.2.

The vote is open until Mon, July 2nd @ 9PM UTC (2PM PDT) and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.2.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.2-rc2 (commit 
fc28ba3db7185e84b6dbd02ad8ef8f1d06b9e3c6):
https://github.com/apache/spark/tree/v2.2.2-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.2.2-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1276/
The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.2.2-rc2-docs/
The list of bug fixes going into 2.2.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12342171


Notes:

- RC1 was not sent for a vote. I had trouble building it, and by the time I got
  things fixed, there was a blocker bug filed. It was already tagged in git
  at that time.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
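
For the Java/Scala route above, a minimal build.sbt sketch of adding the staging repository as a resolver (the resolver name is arbitrary, and using 2.2.2 as the artifact version is an assumption about how the RC artifacts are staged):

// build.sbt
resolvers += "Apache Spark 2.2.2 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1276/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.2" % "provided"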

===
What should happen to JIRA tickets still targeting 2.2.2?
===

The current list of open tickets targeted at 2.2.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" 
= 2.2.2


Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


-- Tom Graves

Re: Time for 2.1.3

2018-06-15 Thread Tom Graves
 +1 for doing a 2.1.3 release.  
Tom
On Wednesday, June 13, 2018, 7:28:26 AM CDT, Marco Gaido 
 wrote:  
 
 Yes, you're right Herman. Sorry, my bad.
Thanks.
Marco
2018-06-13 14:01 GMT+02:00 Herman van Hövell tot Westerflier 
:

Isn't this only a problem with Spark 2.3.x?
On Wed, Jun 13, 2018 at 1:57 PM Marco Gaido  wrote:

Hi Marcelo,
Thanks for bringing this up. Maybe we should consider including SPARK-24495, 
as it is causing some queries to return an incorrect result. What do you think?
Thanks,
Marco
2018-06-13 1:27 GMT+02:00 Marcelo Vanzin :

Hey all,

There are some fixes that went into 2.1.3 recently that probably
deserve a release. So as usual, please take a look if there's anything
else you'd like on that release, otherwise I'd like to start with the
process by early next week.

I'll go through jira to see what's the status of things targeted at
that release, but last I checked there wasn't anything on the radar.

Thanks!

-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org






  

Time for 2.2.2 release

2018-06-06 Thread Tom Graves
Hello all,

I think it's time for another 2.2 release.  I took a look at Jira and I don't 
see anything explicitly targeted for 2.2.2 that is not yet complete.
So I'd like to propose releasing 2.2.2 soon. If there are important
fixes that should go into the release, please let those be known (by
replying here or updating the bug in Jira); otherwise I'm volunteering
to prepare the first RC soon-ish (early next week, since Spark Summit is this 
week).

Thanks!
Tom Graves


Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Tom Graves
 
Testing with Spark 2.3, I see a difference in the SQL coalesce when talking to 
Hive vs. Spark 2.2. It seems Spark 2.3 ignores the coalesce.

Query:
spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL").coalesce(16).show()

In Spark 2.2 the coalesce works here, but in Spark 2.3 it doesn't. Does anyone 
know about this issue, or is there some weird config change? Otherwise I'll 
file a jira.
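
A minimal sketch (same hypothetical table/column names as the query above) of how one might compare the two versions, by printing the physical plan and the partition count after the coalesce:

val agg = spark.sql(
  "SELECT COUNT(DISTINCT(something)) FROM sometable " +
  "WHERE dt >= '20170301' AND dt <= '20170331' AND something IS NOT NULL")
val coalesced = agg.coalesce(16)

// Compare these between Spark 2.2 and Spark 2.3: does the coalesce still
// show up in the plan, and does the partition count change?
coalesced.explain()
println(s"partitions = ${coalesced.rdd.getNumPartitions}")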
Note I also see a performance difference when reading cached data in Spark 2.3: 
for a small query on 19 GB of cached data, Spark 2.3 is 30% worse. This is only 
13 seconds on Spark 2.2 vs. 17 seconds on Spark 2.3.  Reading straight from Hive 
(ORC) seems better, though.
Tom


On Thursday, February 1, 2018, 11:23:45 AM CST, Michael Heuer 
 wrote:  
 
 We found two classes new to Spark 2.3.0 that must be registered in Kryo for 
our tests to pass on RC2

org.apache.spark.sql.execution.datasources.BasicWriteTaskStats
org.apache.spark.sql.execution.datasources.ExecutedWriteSummary

https://github.com/bigdatagenomics/adam/pull/1897
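
For reference, a minimal sketch (not from the PR above) of registering the two classes by name via the string-based Kryo config; it assumes a test setup that enables Kryo with spark.kryo.registrationRequired=true, which is what makes unregistered classes fail:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // Register by fully-qualified name; these classes are internal to Spark SQL.
  .set("spark.kryo.classesToRegister",
    "org.apache.spark.sql.execution.datasources.BasicWriteTaskStats," +
    "org.apache.spark.sql.execution.datasources.ExecutedWriteSummary")

val spark = SparkSession.builder().config(conf).getOrCreate()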

Perhaps a mention in release notes?

   michael


On Thu, Feb 1, 2018 at 3:29 AM, Nick Pentreath  wrote:

All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side that 
should be everything outstanding.

On Thu, 1 Feb 2018 at 06:21 Yin Huai  wrote:

Seems we are not running tests related to pandas in the PySpark tests (see my email 
"python tests related to pandas are skipped in jenkins"). I think we should fix 
this test issue and make sure all tests are good before cutting RC3.
On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal  wrote:

Just a quick status update on RC3 -- SPARK-23274 was resolved yesterday and 
tests have been quite healthy throughout this week and the last. I'll cut the 
new RC as soon as the remaining blocker (SPARK-23202) is resolved.

On 30 January 2018 at 10:12, Andrew Ash  wrote:

I'd like to nominate SPARK-23274 as a potential blocker for the 2.3.0 release 
as well, due to being a regression from 2.2.0.  The ticket has a simple repro 
included, showing a query that works in prior releases but now fails with an 
exception in the catalyst optimizer.
On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal  wrote:

This vote has failed due to a number of aforementioned blockers. I'll follow up 
with RC3 as soon as the 2 remaining (non-QA) blockers are resolved: 
https://s.apache.org/oXKi


On 25 January 2018 at 12:59, Sameer Agarwal  wrote:



Most tests pass on RC2, except I'm still seeing the timeout caused by 
https://issues.apache.org/jira/browse/SPARK-23055 ; the tests never finish. I 
followed the thread a bit further and wasn't clear whether it was subsequently 
re-fixed for 2.3.0 or not. It says it's resolved along with 
https://issues.apache.org/jira/browse/SPARK-22908 for 2.3.0 though I am still 
seeing these tests fail or hang:
- subscribing topic by name from earliest offsets (failOnDataLoss: false)
- subscribing topic by name from earliest offsets (failOnDataLoss: true)

Sean, while some of these tests were timing out on RC1, we're not aware of any 
known issues in RC2. Both maven 
(https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/146/testReport/org.apache.spark.sql.kafka010/history/) 
and sbt 
(https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/123/testReport/org.apache.spark.sql.kafka010/history/) 
historical builds on jenkins for org.apache.spark.sql.kafka010 look fairly 
healthy. If you're still seeing timeouts in RC2, can you create a JIRA with any 
applicable build/env info?
 
On Tue, Jan 23, 2018 at 9:01 AM Sean Owen  wrote:

I'm not seeing that same problem on OS X and /usr/bin/tar. I tried unpacking it 
with 'xvzf' and also unzipping it first, and it untarred without warnings in 
either case.
I am encountering errors while running the tests, different ones each time, so 
am still figuring out whether there is a real problem or just flaky tests.
These issues look like blockers, as they inherently need to be completed before 
the 2.3 release. They are mostly not done. I suppose I'd -1 on behalf of those 
who say this needs to be done first; we can keep testing, though.
SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella
SPARK-23114 Spark R 2.3 QA umbrella
Here are the remaining items targeted for 2.3:
SPARK-15689 Data source API v2
SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
SPARK-21646 Add new type coercion rules to compatible with Hive
SPARK-22386 Data Source V2 improvements
SPARK-22731 Add a test for ROWID type to OracleIntegrationSuite
SPARK-22735 Add VectorSizeHint to ML features documentation
SPARK-22739 Additional Expression 

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-06 Thread Tom Graves
+1 for the idea and feature, but I think the design is definitely lacking 
detail on the internal changes needed and on how the execution pieces and the 
communication work.  Are you planning on posting more of those details, or were you 
just planning on discussing them in the PR?
Tom
On Wednesday, November 1, 2017, 11:29:21 AM CDT, Debasish Das 
 wrote:  
 
 +1
Is there any design doc related to API/internal changes? Will CP be the 
default in Structured Streaming, or is it a mode in conjunction with the existing 
behavior?
Thanks.
Deb
On Nov 1, 2017 8:37 AM, "Reynold Xin"  wrote:

Earlier I sent out a discussion thread for CP in Structured Streaming:
https://issues.apache.org/jira/browse/SPARK-20928
It is meant to be a very small, surgical change to Structured Streaming to 
enable ultra-low latency. This is great timing because we are also designing 
and implementing data source API v2. If designed properly, we can have the same 
data source API working for both streaming and batch.

Following the SPIP process, I'm putting this SPIP up for a vote.
+1: Let's go ahead and design / implement the SPIP.
+0: Don't really care.
-1: I do not think this is a good idea for the following reasons.



  
