RE: SparkR read.df Option type doesn't match

2015-11-27 Thread Felix Cheung
Yes - please see the code example on the SparkR API doc: 
http://spark.apache.org/docs/latest/api/R/read.df.html
Suggestion or contribution to improve the doc is welcome!
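
For example, with Spark 1.x and the spark-csv package, extra reader options are 
passed as additional named arguments to read.df. A minimal sketch - the package, 
path and option values below are illustrative only:

# assumes the spark-csv package is on the classpath and sqlContext is initialized
df <- read.df(sqlContext, path = "cars.csv", source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true")
head(df)
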

 
> Date: Thu, 26 Nov 2015 15:08:31 -0700
> From: s...@phemi.com
> To: dev@spark.apache.org
> Subject: Re: SparkR read.df Option type doesn't match
> 
> I found the answer myself.
> options should be added like:
> read.df(sqlContext, path = NULL, source = "***", option1 = "", option2 = "")
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-read-df-Option-type-doesn-t-match-tp15365p15370.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
  

RE: Are we running SparkR tests in Jenkins?

2016-01-17 Thread Felix Cheung
I think that breaks sparkR, the command-line script, and Jenkins, in which 
run-tests.sh is calling sparkR.
I'll work on this - since this also affects my PR #10652...
 
Date: Fri, 15 Jan 2016 15:33:13 -0800
Subject: Re: Are we running SparkR tests in Jenkins?
From: zjf...@gmail.com
To: shiva...@eecs.berkeley.edu
CC: r...@databricks.com; hvanhov...@questtec.nl; dev@spark.apache.org; 
shivaram.venkatara...@gmail.com

Created https://issues.apache.org/jira/browse/SPARK-12846

On Fri, Jan 15, 2016 at 3:29 PM, Jeff Zhang  wrote:
Right, I forgot the documentation; will create a follow-up JIRA.
On Fri, Jan 15, 2016 at 3:23 PM, Shivaram Venkataraman 
 wrote:
Ah I see. I wasn't aware of that PR. We should do a find and replace
in all the documentation and rest of the repository as well.

Shivaram



On Fri, Jan 15, 2016 at 3:20 PM, Reynold Xin wrote:

> +Shivaram
>
> Ah damn - we should fix it.
>
> This was broken by https://github.com/apache/spark/pull/10658 - which
> removed a functionality that has been deprecated since Spark 1.0.
>
> On Fri, Jan 15, 2016 at 3:19 PM, Herman van Hövell tot Westerflier wrote:
>>
>> Hi all,
>>
>> I just noticed the following log entry in Jenkins:
>>
>>> Running SparkR tests
>>> Running R applications through 'sparkR' is not supported as of Spark 2.0.
>>> Use ./bin/spark-submit 
>>
>> Are we still running R tests? Or just saying that this will be deprecated?
>>
>> Kind regards,
>>
>> Herman van Hövell tot Westerflier
>>
>








-- 
Best Regards

Jeff Zhang



-- 
Best Regards

Jeff Zhang
  

RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-07 Thread Felix Cheung
I mean not exposed from the SparkR API.
Calling it from R without a SparkR API would require either a serializer change 
or a JVM wrapper function.
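
For reference, such a JVM wrapper call from R could look roughly like the sketch 
below. It relies on SparkR internals (callJMethod, newJObject, and the df@sdf slot) 
that are not public API, and is untested - the connection values are illustrative:

# build a java.util.Properties with the connection settings
props <- SparkR:::newJObject("java.util.Properties")
invisible(SparkR:::callJMethod(props, "setProperty", "user", "myuser"))
invisible(SparkR:::callJMethod(props, "setProperty", "password", "mypassword"))
# reach into the underlying Java DataFrame and call DataFrameWriter.jdbc directly
writer <- SparkR:::callJMethod(df@sdf, "write")
writer <- SparkR:::callJMethod(writer, "mode", "append")
invisible(SparkR:::callJMethod(writer, "jdbc",
                               "jdbc:mysql://host:3306/mydb", "mydb.mytable", props))
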



On Sun, Feb 7, 2016 at 4:47 AM -0800, "Felix Cheung" 
<felixcheun...@hotmail.com> wrote:





That does but it's a bit hard to call from R since it is not exposed.






On Sat, Feb 6, 2016 at 11:57 PM -0800, "Sun, Rui" <rui@intel.com> wrote:





DataFrameWriter.jdbc() does not work?

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Sunday, February 7, 2016 9:54 AM
To: Andrew Holway <andrew.hol...@otternetworks.de>; dev@spark.apache.org
Subject: Re: Fwd: Writing to jdbc database from SparkR (1.5.2)

Unfortunately I couldn't find a simple workaround. It seems to be an issue with 
DataFrameWriter.save(), which does not work with the jdbc source/format.

For instance, this does not work in Scala either
df1.write.format("jdbc").mode("overwrite").option("url", 
"jdbc:mysql://something.rds.amazonaws.com:3306?user=user=password").option("dbtable",
 "table").save()

For Spark 1.5.x, it seems the best option would be to write a JVM wrapper and 
call it from R.

_
From: Andrew Holway 
<andrew.hol...@otternetworks.de<mailto:andrew.hol...@otternetworks.de>>
Sent: Saturday, February 6, 2016 11:22 AM
Subject: Fwd: Writing to jdbc database from SparkR (1.5.2)
To: <dev@spark.apache.org<mailto:dev@spark.apache.org>>

Hi,

I have a thread on u...@spark.apache.org<mailto:u...@spark.apache.org> but I 
think this might require developer attention.

I'm reading data from a database: This is working well.

> df <- read.df(sqlContext, source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass")

When I try and write something back to the DB I see this following error:


> write.df(fooframe, path="NULL", source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass",
>  dbtable="db.table", mode="append")



16/02/06 19:05:43 ERROR RBackendHandler: save on 2 failed

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :

  java.lang.RuntimeException: 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow 
create table as select.

at scala.sys.package$.error(package.scala:27)

at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)

at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)

at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1855)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)

at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)

at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)

at io.netty.channel.SimpleChannelIn



Any ideas on a workaround?



Thanks,



Andrew



RE: Fwd: Writing to jdbc database from SparkR (1.5.2)

2016-02-07 Thread Felix Cheung
Correct :)



_
From: Sun, Rui <rui@intel.com>
Sent: Sunday, February 7, 2016 5:19 AM
Subject: RE: Fwd: Writing to jdbc database from SparkR (1.5.2)
To:  <dev@spark.apache.org>, Felix Cheung <felixcheun...@hotmail.com>, Andrew 
Holway <andrew.hol...@otternetworks.de>


 

This should be solved by your pending PR 
https://github.com/apache/spark/pull/10480, right?

    

From: Felix Cheung [mailto:felixcheun...@hotmail.com] 
 Sent: Sunday, February 7, 2016 8:50 PM
 To: Sun, Rui <rui@intel.com>; Andrew Holway 
<andrew.hol...@otternetworks.de>; dev@spark.apache.org
 Subject: RE: Fwd: Writing to jdbc database from SparkR (1.5.2) 

 

I mean not exposed from the SparkR API.
 Calling it from R without a SparkR API would require either a serializer 
change or a JVM wrapper function.
 
  

On Sun, Feb 7, 2016 at 4:47 AM -0800, "Felix Cheung" 
<felixcheun...@hotmail.com> wrote:   

That does but it's a bit hard to call from R since it is not exposed.  

   


 


On Sat, Feb 6, 2016 at 11:57 PM -0800, "Sun, Rui" <rui@intel.com> wrote:
   

DataFrameWriter.jdbc() does not work?

    

From: Felix Cheung [mailto:felixcheun...@hotmail.com] 
 Sent: Sunday, February 7, 2016 9:54 AM
 To: Andrew Holway <andrew.hol...@otternetworks.de>; dev@spark.apache.org
 Subject: Re: Fwd: Writing to jdbc database from SparkR (1.5.2)   

 

Unfortunately I couldn't find a simple workaround. It seems to be an issue with 
DataFrameWriter.save() that does not work with jdbc source/format   
  

  

For instance, this does not work in Scala either 

df1.write.format("jdbc").mode("overwrite").option("url", 
"jdbc:mysql://something.rds.amazonaws.com:3306?user=user=password").option("dbtable",
 "table").save()             

  

For Spark 1.5.x, it seems the best option would be to write a JVM wrapper and 
call it from R. 

   

_
 From: Andrew Holway <andrew.hol...@otternetworks.de>
 Sent: Saturday, February 6, 2016 11:22 AM
 Subject: Fwd: Writing to jdbc database from SparkR (1.5.2)
 To: <dev@spark.apache.org> 

Hi,

 

I have a thread on  u...@spark.apache.org but I think this might require 
developer attention.   

    

I'm reading data from a database: This is working well. 
 

> df <- read.df(sqlContext, source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass")
>

    

When I try and write something back to the DB I see this following error:   
 

 

> write.df(fooframe, path="NULL", source="jdbc", 
> url="jdbc:mysql://database.foo.eu-west-1.rds.amazonaws.com:3306?user=user=pass",
>  dbtable="db.table", mode="append") 

  

16/02/06 19:05:43 ERROR RBackendHandler: save on 2 failed 

Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :  

  java.lang.RuntimeException: 
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not allow 
create table as select. 

at scala.sys.package$.error(package.scala:27) 

at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:200)
 

at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) 


at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1855) 

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)   
  

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 

at java.lang.reflect.Method.invoke(Method.java:497) 

at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)
 

at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)   
  

at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)   
  

at io.netty.channel.SimpleChannelIn 

  

Any ideas on a workaround? 

  

Thanks, 

  

Andrew  

  


  

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Felix Cheung
+1

Tested on Ubuntu, ran a bunch of SparkR tests, found a broken link in doc but 
not a blocker.


_
From: Michael Armbrust >
Sent: Friday, July 22, 2016 3:18 PM
Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC5)
To: >
Cc: Reynold Xin >


+1

On Fri, Jul 22, 2016 at 2:42 PM, Holden Karau 
> wrote:
+1 (non-binding)

Built locally on Ubuntu 14.04, basic pyspark sanity checking & tested with a 
simple structured streaming project (spark-structured-streaming-ml) & 
spark-testing-base & high-performance-spark-examples (minor changes required 
from preview version but seem intentional & jetty conflicts with out of date 
testing library - but not a Spark problem).

On Fri, Jul 22, 2016 at 12:45 PM, Luciano Resende 
> wrote:
+ 1 (non-binding)

Found a minor issue when trying to run some of the docker tests, but nothing 
blocking the release. Will create a JIRA for that.

On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.0.0. 
The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.0
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.0-rc5 (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).

This release candidate resolves ~2500 issues: 
https://s.apache.org/spark-2.0.0-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1195/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/


=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions from 1.x.

==
What justifies a -1 vote for this release?
==
Critical bugs impacting major functionalities.

Bugs already present in 1.x, missing features, or bugs related to new features 
will not necessarily block this release. Note that historically Spark 
documentation has been published on the website separately from the main 
release so we do not need to block the release due to documentation errors 
either.




--
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/



--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau





Re: PSA: Java 8 unidoc build

2017-02-07 Thread Felix Cheung
+1 for all the great work going in for this, HyukjinKwon, and +1 on what Sean 
says about "Jenkins builds with Java 8" - we should catch these nasty 
javadoc 8 issues quickly.

I think that would be a great first step to move away from Java 7.


_
From: Reynold Xin >
Sent: Tuesday, February 7, 2017 4:48 AM
Subject: Re: PSA: Java 8 unidoc build
To: Sean Owen >
Cc: Josh Rosen >, 
Joseph Bradley >, 
>


I don't know if this would help but I think we can also officially stop 
supporting Java 7 ...


On Tue, Feb 7, 2017 at 1:06 PM, Sean Owen 
> wrote:
I believe that if we ran the Jenkins builds with Java 8 we would catch these? 
this doesn't require dropping Java 7 support or anything.

@joshrosen I know we are just now talking about modifying the Jenkins jobs to 
remove old Hadoop configs. Is it possible to change the master jobs to use Java 
8? can't hurt really in any event.

Or maybe I'm mistaken and they already run Java 8 and it doesn't catch this 
until Java 8 is the target.

Yeah this is going to keep breaking as javadoc 8 is pretty strict. Thanks 
Hyukjin. It has forced us to clean up a lot of sloppy bits of doc though.


On Tue, Feb 7, 2017 at 12:13 AM Joseph Bradley 
> wrote:
Public service announcement: Our doc build has worked with Java 8 for brief 
time periods, but new changes keep breaking the Java 8 unidoc build.  Please be 
aware of this, and try to test doc changes with Java 8!  In general, it is 
stricter than Java 7 for docs.

A shout out to @HyukjinKwon and others who have made many fixes for this!  See 
these sample PRs for some issues causing failures (especially around links):
https://github.com/apache/spark/pull/16741
https://github.com/apache/spark/pull/16604

Thanks,
Joseph

--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]





Re: welcoming Burak and Holden as committers

2017-01-24 Thread Felix Cheung
Congrats and welcome!!



From: Reynold Xin 
Sent: Tuesday, January 24, 2017 10:13:16 AM
To: dev@spark.apache.org
Cc: Burak Yavuz; Holden Karau
Subject: welcoming Burak and Holden as committers

Hi all,

Burak and Holden have recently been elected as Apache Spark committers.

Burak has been very active in a large number of areas in Spark, including 
linear algebra, stats/maths functions in DataFrames, Python/R APIs for 
DataFrames, dstream, and most recently Structured Streaming.

Holden has been a long time Spark contributor and evangelist. She has written a 
few books on Spark, as well as frequent contributions to the Python API to 
improve its usability and performance.

Please join me in welcoming the two!




Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Felix Cheung
Congratulations!



From: Xuefu Zhang 
Sent: Monday, February 13, 2017 11:29:12 AM
To: Xiao Li
Cc: Holden Karau; Reynold Xin; dev@spark.apache.org
Subject: Re: welcoming Takuya Ueshin as a new Apache Spark committer

Congratulations, Takuya!

--Xuefu

On Mon, Feb 13, 2017 at 11:25 AM, Xiao Li 
> wrote:
Congratulations, Takuya!

Xiao

2017-02-13 11:24 GMT-08:00 Holden Karau 
>:
Congratulations Takuya-san :D!

On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin 
> wrote:
Hi all,

Takuya-san has recently been elected an Apache Spark committer. He's been 
active in the SQL area and writes very small, surgical patches that are high 
quality. Please join me in congratulating Takuya-san!





--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau




Re: Feedback on MLlib roadmap process proposal

2017-01-19 Thread Felix Cheung
Hi Seth

Re: "The most important thing we can do, given that MLlib currently has a very 
limited committer review bandwidth, is to make clear issues that, if worked on, 
will definitely get reviewed. "

We are adopting a shepherd model, as described in Joseph's JIRA: when assigned, 
the shepherd will see the issue through with the contributor to make sure it 
lands in the target release.

I'm sure Joseph can explain it better than I do ;)


_
From: Mingjie Tang >
Sent: Thursday, January 19, 2017 10:30 AM
Subject: Re: Feedback on MLlib roadmap process proposal
To: Seth Hendrickson 
>
Cc: Joseph Bradley >, 
>


+1 general abstractions like distributed linear algebra.

On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson 
> wrote:
I think the proposal laid out in SPARK-18813 is well done, and I do think it is 
going to improve the process going forward. I also really like the idea of 
getting the community to vote on JIRAs to give some of them priority - provided 
that we listen to those votes, of course. The biggest problem I see is that we 
do have several active contributors and those who want to help implement these 
changes, but PRs are reviewed rather sporadically and I imagine it is very 
difficult for contributors to understand why some get reviewed and some do not. 
The most important thing we can do, given that MLlib currently has a very 
limited committer review bandwidth, is to make clear issues that, if worked on, 
will definitely get reviewed. A hard thing to do in open source, no doubt, but 
even if we have to limit the scope of such issues to a very small subset, it's 
a gain for all I think.

On a related note, I would love to hear some discussion on the higher level 
goal of Spark MLlib (if this derails the original discussion, please let me 
know and we can discuss in another thread). The roadmap does contain specific 
items that help to convey some of this (ML parity with MLlib, model 
persistence, etc...), but I'm interested in what the "mission" of Spark MLlib 
is. We often see PRs for brand new algorithms which are sometimes rejected and 
sometimes not. Do we aim to keep implementing more and more algorithms? Or is 
our focus really, now that we have a reasonable library of algorithms, to 
simply make the existing ones faster/better/more robust? Should we aim to make 
interfaces that are easily extended for developers to easily implement their 
own custom code (e.g. custom optimization libraries), or do we want to restrict 
things to out-of-the box algorithms? Should we focus on more flexible, general 
abstractions like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this 
discussion may have happened, but I think it would be useful to either revisit 
it or restate it here for some of the newer developers.

On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley 
> wrote:
Hi all,

This is a general call for thoughts about the process for the MLlib roadmap 
proposed in SPARK-18813.  See the section called "Roadmap process."

Summary:
* This process is about committers indicating intention to shepherd and review.
* The goal is to improve visibility and communication.
* This is fairly orthogonal to the SIP discussion since this proposal is more 
about setting release targets than about proposing future plans.

Thanks!
Joseph

--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]






Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Felix Cheung
+1 on this proposal and everyone can contribute to updates and discussions on 
JIRAs

Will be great if this could be put on the Spark wiki.





On Sat, Oct 8, 2016 at 9:05 AM -0700, "Ted Yu" 
> wrote:

Makes sense.

I trust Hyukjin, Holden and Cody's judgement in respective areas.

I just wish to see more participation from the committers.

Thanks

> On Oct 8, 2016, at 8:27 AM, Sean Owen  wrote:
>
> Hyukjin




Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Felix Cheung
Should we just link to

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark




On Sun, Oct 9, 2016 at 10:09 AM -0700, "Hyukjin Kwon" 
> wrote:

Thanks for confirming this, Sean. I filed this in 
https://issues.apache.org/jira/browse/SPARK-17840

I would appreciate it if anyone who has better writing skills than me tries to 
fix this.

I don't want to let reviewers make an effort to correct the grammar.


On 10 Oct 2016 1:34 a.m., "Sean Owen" 
> wrote:
Yes, it's really CONTRIBUTING.md that's more relevant, because github displays 
a link to it when opening pull requests. 
https://github.com/apache/spark/blob/master/CONTRIBUTING.md  There is also the 
pull request template: 
https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE

I wouldn't want to duplicate info too much, but more pointers to a single 
source of information seems OK. Although I don't know if it will help much, 
sure, pointers from README.md are OK.

On Sun, Oct 9, 2016 at 3:47 PM Hyukjin Kwon 
> wrote:
Hi all,


I just noticed the README.md (https://github.com/apache/spark) does not 
describe the steps or links to follow for creating a PR or JIRA directly. I 
know it is probably sensible to search Google for the contribution guides 
first before trying to make a PR/JIRA, but that seems not to be enough when I 
see some inappropriate PRs/JIRAs from time to time.

I guess flooding JIRAs and PRs is problematic (assuming from the emails in dev 
mailing list) and I think we should explicitly mention and describe this in the 
README.md and pull request template[1].

(I know we have CONTRIBUTING.md[2] and the wiki[3] but it seems pretty clear that we 
still have some PRs or JIRAs not following the documentation.)

So, my suggestions are as below:

- Create a section maybe "Contributing To Apache Spark" describing the Wiki and 
CONTRIBUTING.md[2] in the README.md.

- Describe an explicit warning in pull request template[1], for example, 
"Please double check if your pull request is from a branch to a branch. In most 
cases, this change is not appropriate. Please ask to mailing list 
(http://spark.apache.org/community.html) if you are not sure."

[1]https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
[2]https://github.com/apache/spark/blob/master/CONTRIBUTING.md
[3]https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage


Thank you all.


Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Felix Cheung
+1 on a longer release cycle kept on schedule, and more maintenance releases.


_
From: Mark Hamstra >
Sent: Tuesday, September 27, 2016 2:01 PM
Subject: Re: [discuss] Spark 2.x release cadence
To: Reynold Xin >
Cc: >


+1

And I'll dare say that for those with Spark in production, what is more 
important is that maintenance releases come out in a timely fashion than that 
new features are released one month sooner or later.

On Tue, Sep 27, 2016 at 12:06 PM, Reynold Xin 
> wrote:
We are 2 months past releasing Spark 2.0.0, an important milestone for the 
project. Spark 2.0.0 deviated (took 6 months) from the regular release cadence we 
had for the 1.x line, and we never explicitly discussed what the release 
cadence should look like for 2.x. Thus this email.

During Spark 1.x, roughly every three months we make a new 1.x feature release 
(e.g. 1.5.0 comes out three months after 1.4.0). Development happened primarily 
in the first two months, and then a release branch was cut at the end of month 
2, and the last month was reserved for QA and release preparation.

During 2.0.0 development, I really enjoyed the longer release cycle because 
there was a lot of major changes happening and the longer time was critical for 
thinking through architectural changes as well as API design. While I don't 
expect the same degree of drastic changes in a 2.x feature release, I do think 
it'd make sense to increase the length of release cycle so we can make better 
designs.

My strawman proposal is to maintain a regular release cadence, as we did in 
Spark 1.x, and increase the cycle from 3 months to 4 months. This effectively 
gives us ~50% more time to develop (in reality it'd be slightly less than 50% 
since longer dev time also means longer QA time). As for maintenance releases, 
I think those should still be cut on-demand, similar to Spark 1.x, but more 
aggressively.

To put this into perspective, 4-month cycle means we will release Spark 2.1.0 
at the end of Nov or early Dec (and branch cut / code freeze at the end of Oct).

I am curious what others think.







Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-10-01 Thread Felix Cheung
+1

Tested and didn't find any blocker - found a few minor R doc issues to follow 
up.


_
From: Reynold Xin >
Sent: Wednesday, September 28, 2016 7:15 PM
Subject: [VOTE] Release Apache Spark 2.0.1 (RC4)
To: >


Please vote on releasing the following candidate as Apache Spark version 2.0.1. 
The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a majority 
of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.1
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa4577ba4be38)

This release candidate resolves 301 issues: 
https://s.apache.org/spark-2.0.1-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1203/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/


Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions from 2.0.0.

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series.  Bugs already present in 
2.0.0, missing features, or bugs related to new features will not necessarily 
block this release.

Q: What fix version should I use for patches merging into branch-2.0 from now 
on?
A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC (i.e. 
RC5) is cut, I will change the fix version of those patches to 2.0.1.






Re: welcoming Xiao Li as a committer

2016-10-04 Thread Felix Cheung
Congrats and welcome, Xiao!


_
From: Reynold Xin >
Sent: Monday, October 3, 2016 10:47 PM
Subject: welcoming Xiao Li as a committer
To: Xiao Li >, 
>


Hi all,

Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark 
committer. Xiao has been a super active contributor to Spark SQL. Congrats and 
welcome, Xiao!

- Reynold





Re: SparkR issue with array types in gapply()

2016-10-27 Thread Felix Cheung
This is a R native data.frame behavior.

While arr is a character vector of length = 2,
> arr
[1] "rows= 50" "cols= 2"
> length(arr)
[1] 2


when it is set as R data.frame the character vector is splitted into 2 rows


> data.frame(key, strings = arr, stringsAsFactors = F)
  key strings
1 a rows= 50
2 a cols= 2


> b <- data.frame(key, strings = arr, stringsAsFactors = F)
> sapply(b, class)
key strings
"character" "character"
> b[1,1]
[1] "a"
> b[1,2]
[1] "rows= 50"
> b[2,2]
[1] "cols= 2"


And each value is a separate entry in the character column. This causes a schema 
mismatch, because a string array (not just a string) is expected when you set the 
schema to have structField('strings', 'array<string>').
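
Given that, one way to make the gapply example below line up is to declare the 
column with the type the R data.frame actually produces - a sketch based on the 
explanation above, untested:

# 'strings' is a plain character column in the returned data.frame,
# so declare it as a string rather than an array type
outSchema <- structType(structField("key", "integer"),
                        structField("strings", "string"))
result <- SparkR::gapply(irisdf, "flag", foo, outSchema)
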


_
From: shirisht >
Sent: Tuesday, October 25, 2016 11:51 PM
Subject: SparkR issue with array types in gapply()
To: >


Hello,

I am getting an exception from catalyst when array types are used in the
return schema of gapply() function.

Following is a (made-up) example:


iris$flag = base::sample(1:2, nrow(iris), T, prob = c(0.5,0.5))
irisdf = createDataFrame(iris)

foo = function(key, x) {
nr = nrow(x)
nc = ncol(x)
arr = c( paste("rows=", nr), paste("cols=",nc) )
data.frame(key, strings = arr, stringsAsFactors = F)
}

outSchema = structType( structField('key', 'integer'),
structField('strings', 'array<string>') )
result = SparkR::gapply(irisdf, "flag", foo, outSchema)
d = SparkR::collect(result)


This code throws up the following error:

java.lang.RuntimeException: java.lang.String is not a valid external type
for schema of array<string>
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Any thoughts?

Thank you,
Shirish



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-issue-with-array-types-in-gapply-tp19568.html
Sent from the Apache Spark Developers List mailing list archive at 
Nabble.com.






Re: Question about SPARK-11374 (skip.header.line.count)

2016-12-10 Thread Felix Cheung
+1. I think it's useful to always have a pure SQL way to skip headers for the plain 
text / CSV files that lots of companies have.
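
For comparison, the DataFrame-reader option is also reachable from SparkR - a 
minimal sketch assuming the Spark 2.x built-in csv source, with path and options 
illustrative:

df <- read.df("/data", source = "csv", header = "true", inferSchema = "true")
head(df)
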



From: Dongjoon Hyun 
Sent: Friday, December 9, 2016 9:42:58 AM
To: Dongjin Lee; dev@spark.apache.org
Subject: Re: Question about SPARK-11374 (skip.header.line.count)

Thank you for the opinion, Dongjin!


On Thu, Dec 8, 2016 at 21:56 Dongjin Lee 
> wrote:
+1 For this idea. I need it also.

Regards,
Dongjin

On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun 
> wrote:
Hi, All.

Could you give me some opinion?

There is an old SPARK issue, SPARK-11374, about removing header lines from text 
files. Currently, Spark supports removing CSV header lines in the following way.

```
scala> spark.read.option("header","true").csv("/data").show
+---+---+
| c1| c2|
+---+---+
|  1|  a|
|  2|  b|
+---+---+
```

In the SQL world, we could support that the Hive way, with `skip.header.line.count`.

```
scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data' 
TBLPROPERTIES('skip.header.line.count'='1')")

scala> sql("SELECT * FROM t1").show
+---+-----+
| id|value|
+---+-----+
|  1|    a|
|  2|    b|
+---+-----+
```

Although I made a PR for this based on the JIRA issue, I want to know whether this 
is a really needed feature. Is it needed for your use cases, or is it enough to 
remove the headers in a preprocessing stage? If this is too old and no longer 
appropriate these days, I'll close the PR and JIRA issue as WON'T FIX.

Thank you all in advance!

Bests,

Dongjoon.













--
Dongjin Lee

Software developer in Line+.
So interested in massive-scale machine learning.

facebook: 
www.facebook.com/dongjin.lee.kr
linkedin: 
kr.linkedin.com/in/dongjinleekr
github:  
github.com/dongjinleekr
twitter: www.twitter.com/dongjinleekr




Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
This is likely a factor of your Hadoop config and Spark rather than anything 
specific to GraphFrames.

You might have better luck getting assistance if you could isolate the code to 
a simple case that manifests the problem (without GraphFrames), and repost.



From: Ankur Srivastava <ankur.srivast...@gmail.com>
Sent: Thursday, January 5, 2017 3:45:59 PM
To: Felix Cheung; dev@spark.apache.org
Cc: u...@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents

Adding DEV mailing list to see if this is a defect with ConnectedComponent or 
if they can recommend any solution.

Thanks
Ankur

On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Yes I did try it out and it choses the local file system as my checkpoint 
location starts with s3n://

I am not sure how can I make it load the S3FileSystem.

On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Right, I'd agree, it seems to be only with delete.

Could you by chance run just the delete to see if it fails

FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(somepath), true)

From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Thursday, January 5, 2017 10:05:03 AM
To: Felix Cheung
Cc: u...@spark.apache.org<mailto:u...@spark.apache.org>

Subject: Re: Spark GraphFrame ConnectedComponents

Yes it works to read the vertices and edges data from S3 location and is also 
able to write the checkpoint files to S3. It only fails when deleting the data 
and that is because it tries to use the default file system. I tried looking up 
how to update the default file system but could not find anything in that 
regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
From the stack it looks to be an error from the explicit call to 
hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <u...@spark.apache.org<mailto:u...@spark.apache.org>>



This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.d

Re: ml word2vec finSynonyms return type

2016-12-30 Thread Felix Cheung
Could you link to the JIRA here?

What you suggest makes sense to me. Though we might want to maintain 
compatibility and add a new method instead of changing the return type of the 
existing one.


_
From: Asher Krim >
Sent: Wednesday, December 28, 2016 11:52 AM
Subject: ml word2vec finSynonyms return type
To: >
Cc: >, 
Joseph Bradley >


Hey all,

I would like to propose changing the return type of `findSynonyms` in ml's 
Word2Vec:

def findSynonyms(word: String, num: Int): DataFrame = {
  val spark = SparkSession.builder().getOrCreate()
  spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", 
"similarity")
}


I find it very strange that the results are parallelized before being returned 
to the user. The results are already on the driver to begin with, and I can 
imagine that for most usecases (and definitely for ours) the synonyms are 
collected right back to the driver. This incurs both an added cost of shipping 
data to and from the cluster, as well as a more cumbersome interface than 
needed.

Can we change it to just the following?

def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
  wordVectors.findSynonyms(word, num)
}

If the user wants the results parallelized, they can still do so on their own.

(I had brought this up a while back in Jira. It was suggested that the mailing 
list would be a better forum to discuss it, so here we are.)

Thanks,
--
Asher Krim
Senior Software Engineer




Re: [ML] [GraphFrames] : Bayesian Network framework

2016-12-30 Thread Felix Cheung
GraphFrames has a Belief Propagation example. Have you checked it out?

graphframes.github.io/api/scala/index.html#org.graphframes.examples.BeliefPropagation$



From: Brian Cajes 
Sent: Friday, December 30, 2016 3:27:13 PM
To: spark-dev
Subject: [ML] [GraphFrames] : Bayesian Network framework

Hi, I'm interested in using (or contributing to an implementation) of a 
Bayesian Network framework within Spark.  Similar to 
https://github.com/jmschrei/pomegranate/blob/master/examples/bayesnet_monty_hall_train.ipynb
 .  I've found a related library for spark: 
https://github.com/HewlettPackard/sandpiper , but it's not quite what I'm 
looking for.  It would be nice if this framework integrated with ML or 
GraphFrames.  Anyone know of any other Bayesian Network frameworks using Spark? 
 If not, would this sort of framework be a worthwhile addition to ml, 
graphframes or spark-packages?


Spark checkpointing

2017-01-07 Thread Felix Cheung
Thanks Steve.

As you have pointed out, we have seen some issues related to cloud storage as 
"file system". I'm looking at checkpointing recently. What do you think would 
be the improvement we could make for "non local" (== reliable?) checkpointing?



From: Steve Loughran <ste...@hortonworks.com>
Sent: Friday, January 6, 2017 9:57:05 AM
To: Ankur Srivastava
Cc: Felix Cheung; u...@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents


On 5 Jan 2017, at 21:10, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:

Yes I did try it out and it choses the local file system as my checkpoint 
location starts with s3n://

I am not sure how can I make it load the S3FileSystem.

set fs.default.name to s3n://whatever , or, in spark context, 
spark.hadoop.fs.default.name

However

1. you should really use s3a, if you have the hadoop 2.7 JARs on your classpath.
2. neither s3n or s3a are real filesystems, and certain assumptions that 
checkpointing code tends to make "renames being O(1) atomic calls" do not hold. 
It may be that checkpointing to s3 isn't as robust as you'd like
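
As a side note for SparkR users, the default-FS override Steve mentions could 
presumably be passed at session start - a hedged, untested sketch with an 
illustrative bucket name:

sparkR.session(sparkConfig = list("spark.hadoop.fs.default.name" = "s3n://my-bucket"))
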




On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Right, I'd agree, it seems to be only with delete.

Could you by chance run just the delete to see if it fails

FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(somepath), true)

From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Thursday, January 5, 2017 10:05:03 AM
To: Felix Cheung
Cc: u...@spark.apache.org<mailto:u...@spark.apache.org>

Subject: Re: Spark GraphFrame ConnectedComponents

Yes it works to read the vertices and edges data from S3 location and is also 
able to write the checkpoint files to S3. It only fails when deleting the data 
and that is because it tries to use the default file system. I tried looking up 
how to update the default file system but could not find anything in that 
regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
From the stack it looks to be an error from the explicit call to 
hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <u...@spark.apache.org<mailto:u...@spark.apache.org>>



This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, ex

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread Felix Cheung
0/+1

Tested a bunch of R package/install cases.
Unfortunately we are still working on SPARK-18817, which looks to be a change 
when going from Spark 1.6 to 2.0. In that case it won't be a blocker.


_
From: vaquar khan <vaquar.k...@gmail.com<mailto:vaquar.k...@gmail.com>>
Sent: Sunday, December 18, 2016 2:33 PM
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)
To: Adam Roberts <arobe...@uk.ibm.com<mailto:arobe...@uk.ibm.com>>
Cc: Denny Lee <denny.g@gmail.com<mailto:denny.g@gmail.com>>, Holden 
Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>, Liwei Lin 
<lwl...@gmail.com<mailto:lwl...@gmail.com>>, 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>


+1 (non-binding)

Regards,
vaquar khan

On Sun, Dec 18, 2016 at 2:33 PM, Adam Roberts 
<arobe...@uk.ibm.com<mailto:arobe...@uk.ibm.com>> wrote:
+1 (non-binding)

Functional: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's latest SDK 
for Java (8 SR3 FP21).

Tests run clean on Ubuntu 16 04, 14 04, SUSE 12, CentOS 7.2 on x86 and IBM 
specific platforms including big-endian. On slower machines I see these failing 
but nothing to be concerned over (timeouts):

org.apache.spark.DistributedSuite.caching on disk
org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with 
informative message
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_time, complete mode
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_date, complete mode
org.apache.spark.sql.hive.HiveSparkSubmitSuite.set hive.metastore.warehouse.dir

Performance vs 2.0.2: lots of improvements seen using the HiBench and 
SparkSqlPerf benchmarks, tested with a 48 core Intel machine using the Kryo 
serializer, controlled test environment. These are all open source benchmarks 
anyone can use and experiment with. Elapsed times measured, + scores are an 
improvement (so it's that much percent faster) and- scores are used for 
regressions I'm seeing.

  *   K-means: Java API +22% (100 sec to 78 sec), Scala API+30% (34 seconds to 
24 seconds), Python API unchanged
  *   PageRank: minor improvement from 40 seconds to 38 seconds,+5%
  *   Sort: minor improvement, 10.8 seconds to 9.8 seconds,+10%
  *   WordCount: unchanged
  *   Bayes: mixed bag, sometimes much slower (95 sec to 140 sec) which is-47%, 
other times marginally faster by 15%, something to keep an eye on
  *   Terasort: +18% (39 seconds to 32 seconds) with the Java/Scala APIs

For TPC-DS SQL queries the results are a mixed bag again, I see > 10% boosts 
for q9,  q68, q75, q96 and > 10% slowdowns for q7, q39a, q43, q52, q57, q89. 
Five iterations, average times compared, only changing which version of Spark 
we're using



From:Holden Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>
To:Denny Lee <denny.g@gmail.com<mailto:denny.g@gmail.com>>, 
Liwei Lin <lwl...@gmail.com<mailto:lwl...@gmail.com>>, 
"dev@spark.apache.org<mailto:dev@spark.apache.org>" 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>
Date:18/12/2016 20:05
Subject:Re: [VOTE] Apache Spark 2.1.0 (RC5)




+1 (non-binding) - checked Python artifacts with virtual env.

On Sun, Dec 18, 2016 at 11:42 AM Denny Lee 
<denny.g@gmail.com<mailto:denny.g@gmail.com>> wrote:
+1 (non-binding)


On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin 
<lwl...@gmail.com<mailto:lwl...@gmail.com>> wrote:
+1

Cheers,
Liwei



On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang 
<wgy...@gmail.com<mailto:wgy...@gmail.com>> wrote:
I hope https://github.com/apache/spark/pull/16252 can be fixed until release 
2.1.0. It's a fix for broadcast cannot fit in memory.

On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
+1

On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier 
<hvanhov...@databricks.com<mailto:hvanhov...@databricks.com>> wrote:
+1

On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>> wrote:
+1

Xiao Li

2016-12-16 12:19 GMT-08:00 Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>:

For R we have a license field in the DESCRIPTION, and this is standard practice 
(and requirement) for R packages.

https://cran.r-project.org/doc/manuals/R-exts.html#Licensing


From: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>
Sent: Friday, December 16, 2016 9:57:15 AM
To: Reynold Xin; dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)

(If you have a template for these emails, maybe update it to use https link

Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Felix Cheung
I've been scrubbing R and think we are tracking 2 issues


https://issues.apache.org/jira/browse/SPARK-19237


https://issues.apache.org/jira/browse/SPARK-19925




From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Monday, March 20, 2017 3:12:35 PM
To: dev@spark.apache.org
Subject: Outstanding Spark 2.1.1 issues

Hi Spark Developers!

As we start working on the Spark 2.1.1 release I've been looking at our 
outstanding issues still targeted for it. I've tried to break it down by 
component so that people in charge of each component can take a quick look and 
see if any of these things can/should be re-targeted to 2.2 or 2.1.2 & the 
overall list is pretty short (only 9 items - 5 if we only look at explicitly 
tagged) :)

If your working on something for Spark 2.1.1 and it doesn't show up in this 
list please speak up now :) We have a lot of issues (including "in progress") 
that are listed as impacting 2.1.0, but they aren't targeted for 2.1.1 - if 
there is something you are working in their which should be targeted for 2.1.1 
please let us know so it doesn't slip through the cracks.

The query string I used for looking at the 2.1.1 open issues is:

((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1 OR 
cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved ORDER 
BY priority DESC

None of the open issues appear to be a regression from 2.1.0, but those seem 
more likely to show up during the RC process (thanks in advance to everyone 
testing their workloads :)) & generally none of them seem to be

(Note: the cfs are for Target Version/s field)

Critical Issues:
 SQL:
  SPARK-19690 - Join a 
streaming DataFrame with a batch DataFrame may not work - PR 
https://github.com/apache/spark/pull/17052 (review in progress by zsxwing, 
currently failing Jenkins)*

Major Issues:
 SQL:
  SPARK-19035 - rand() 
function in case when cause failed - no outstanding PR (consensus on JIRA seems 
to be leaning towards it being a real issue but not necessarily everyone agrees 
just yet - maybe we should slip this?)*
 Deploy:
  SPARK-19522 - 
--executor-memory flag doesn't work in local-cluster mode - 
https://github.com/apache/spark/pull/16975 (review in progress by vanzin, but 
PR currently stalled waiting on response) *
 Core:
  SPARK-20025 - Driver fail 
over will not work, if SPARK_LOCAL* env is set. - 
https://github.com/apache/spark/pull/17357 (waiting on review) *
 PySpark:
 SPARK-19955 - Update 
run-tests to support conda [ Part of Dropping 2.6 support -- which we shouldn't 
do in a minor release -- but also fixes pip installability tests to run in 
Jenkins ]-  PR failing Jenkins (I need to poke this some more, but seems like 
2.7 support works but some other issues. Maybe slip to 2.2?)

Minor issues:
 Tests:
  SPARK-19612 - Tests 
failing with timeout - No PR per-se but it seems unrelated to the 2.1.1 
release. It's not targetted for 2.1.1 but listed as affecting 2.1.1 - I'd 
consider explicitly targeting this for 2.2?
 PySpark:
  SPARK-19570 - Allow to 
disable hive in pyspark shell - https://github.com/apache/spark/pull/16906 PR 
exists but its difficult to add automated tests for this (although if 
SPARK-19955 gets in would 
make testing this easier) - no reviewers yet. Possible re-target?*
 Structured Streaming:
  SPARK-19613 - Flaky test: 
StateStoreRDDSuite.versioning and immutability - It's not targetted for 2.1.1 
but listed as affecting 2.1.1 - I'd consider explicitly targeting this for 2.2?
 ML:
  SPARK-19759 - 
ALSModel.predict on Dataframes : potential optimization by not using blas - No 
PR consider re-targeting unless someone has a PR waiting in the wings?

Explicitly targeted issues are marked with a *, the remaining issues are listed 
as impacting 2.1.1 and don't have a specific target version set.

Since 2.1.1 continues the 2.1.0 branch, looking at 2.1.0 shows 1 open blocker 
in SQL( SPARK-19983 ),

Query string is:

affectedVersion = 2.1.0 AND cf[12310320] is EMPTY AND project = spark AND 
resolution = Unresolved AND priority = targetPriority

Continuing on for unresolved 2.1.0 issues in Major there are 163 (76 of them in 
progress), 65 Minor (26 in progress), and 9 trivial (6 in progress).

I'll be going through the 2.1.0 major issues with open PRs that impact the 
PySpark component and seeing if any of them should be 

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-02 Thread Felix Cheung
-1
Sorry, found an issue with the SparkR CRAN check.
Opened SPARK-20197 and working on a fix.


From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Friday, March 31, 2017 6:25:20 PM
To: Xiao Li
Cc: Michael Armbrust; dev@spark.apache.org
Subject: Re: [VOTE] Apache Spark 2.1.1 (RC2)

-1 (non-binding)

Python packaging doesn't seem to have quite worked out (looking at PKG-INFO the 
description is "Description: ! missing pandoc do not upload to PyPI "), 
ideally it would be nice to have this as a version we upgrade to PyPi.
Building this on my own machine results in a longer description.

My guess is that whichever machine was used to package this is missing the 
pandoc executable (or possibly pypandoc library).

On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li 
> wrote:
+1

Xiao

2017-03-30 16:09 GMT-07:00 Michael Armbrust 
>:
Please vote on releasing the following candidate as Apache Spark version 2.1.0. 
The vote is open until Sun, April 2nd, 2018 at 16:30 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is 
v2.1.1-rc2 
(02b165dcc2ee5245d1293a375a31660c9d4e1fa6)

List of JIRA tickets resolved can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1227/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

What should happen to JIRA tickets still targeting 2.1.1?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.1.2 or 2.2.0.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.0.

What happened to RC1?

There were issues with the release packaging and as a result was skipped.




--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [SparkR] - options around setting up SparkSession / SparkContext

2017-04-22 Thread Felix Cheung
This seems somewhat unique. Most notebook environments that I know of have a 
preset processing engine tied to the notebook; in other words, when Spark is 
selected as the engine it is always initialized, not lazily as you 
describe.

What is this notebook platform you use?

_
From: Vin J <winjos...@gmail.com<mailto:winjos...@gmail.com>>
Sent: Saturday, April 22, 2017 12:33 AM
Subject: Re: [SparkR] - options around setting up SparkSession / SparkContext
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <dev@spark.apache.org<mailto:dev@spark.apache.org>>


This is for a notebook env that has the spark session/context bootstrapped for 
the user. There are settings that are user specific so not all of those can go 
into the spark-defaults.conf - such settings need to be dynamically applied 
when creating the session/context.

In Scala/Python, I would bootstrap a "spark" handle similar to what spark-shell 
/ pyspark-shell startup scripts do. In my case the bootstrapped object could 
be of a wrapper class that took care of whatever customization I needed while 
exposing the regular  SparkSession scala/python API. The user uses this object 
as he/she would use a regular SparkSession to submit work to the Spark cluster. 
Since I am certain there is no other way for users to perform Spark work except 
to go via the bootstrapped object, I can achieve my objective of delaying 
creation of SparkSession/Context until a call comes to my custom spark object.

If I want to do the same in R, and let users write SparkR code as they normally 
would, but bootstrapping a SparkContext/Session for them, then I hit the issues 
as I explained earlier. There is no single entry point for SparkContext/Session 
in SparkR API and so to achieve lazy creation of SparkContext/session, it looks 
like the only  option is to do some trickery with the 
SparkR:::.sparkREnv$.sparkRjsc and SparkR:::.sparkREnv$.sparkRsession vars.

Regards,
Vin.

On Sat, Apr 22, 2017 at 3:33 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
How would you handle this in Scala?

If you are adding a wrapper func like getSparkSession for Scala, and have your 
users call it, can't you do the same in SparkR? After all, while it's true you 
don't need a SparkSession object to call the R API, someone still needs to call 
sparkR.session() to initialize the current session?

Also, what Spark environment settings do you want to customize?

Can these be set in environment variables or via spark-defaults.conf 
spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties<http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties>


_
From: Vin J <winjos...@gmail.com<mailto:winjos...@gmail.com>>
Sent: Friday, April 21, 2017 2:22 PM
Subject: [SparkR] - options around setting up SparkSession / SparkContext
To: <dev@spark.apache.org<mailto:dev@spark.apache.org>>




I need to make an R environment available where the SparkSession/SparkContext 
needs to be setup a specific way. The user simply accesses this environment and 
executes his/her code. If the user code does not access any Spark functions, I 
do not want to create a SparkContext unnecessarily.

In Scala/Python environments, the user can't access spark without first 
referencing SparkContext / SparkSession classes. So the above (lazy and/or 
custom SparkSession/Context creation) is easily met by offering 
sparkContext/sparkSession handles to the user that are either wrappers on 
Spark's classes or have lazy evaluation semantics. This way only when the user 
accesses these handles to sparkContext/Session will the SparkSession/Context 
actually get set up without the user needing to know all the details about 
initializing the SparkContext/Session.

However, achieving the same doesn't appear to be so straightforward in R. From 
what I see, executing sparkR.session(...) sets up private variables in 
SparkR:::.sparkREnv (.sparkRjsc , .sparkRsession). The way SparkR api works, a 
user doesn't need a handle to the spark session as such. Executing functions 
like so:  "df <- as.DataFrame(..)" implicitly access the private vars in 
SparkR:::.sparkREnv to get access to the sparkContext etc that are expected to 
have been created by a prior call to sparkR.session()/sparkR.init() etc.

Therefore, to inject any custom/lazy behavior into this I don't see a way 
except through having my code (that sits outside of Spark) apply a 
delayedAssign() or a makeActiveBinding( ) on SparkR:::.sparkRsession / 
.sparkRjsc  variables. This way when spark code internally references them, my 
wrapper/lazy code gets executed to do whatever I need done.

However, I am seeing some limitations of applying even this approach to SparkR 
- it will not work unless some minor changes are made i

Re: Should we consider a Spark 2.1.1 release?

2017-03-13 Thread Felix Cheung
+1
there are a lot of good fixes overall and we need a release for the Python and R 
packages.



From: Holden Karau <hol...@pigscanfly.ca>
Sent: Monday, March 13, 2017 12:06:47 PM
To: Felix Cheung; Shivaram Venkataraman; dev@spark.apache.org
Subject: Should we consider a Spark 2.1.1 release?

Hi Spark Devs,

Spark 2.1 has been out since end of 
December<http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Announcing-Apache-Spark-2-1-0-td20390.html>
 and we've got quite a few fixes merged for 
2.1.1<https://issues.apache.org/jira/browse/SPARK-18281?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC>.

On the Python side one of the things I'd like to see us get out into a patch 
release is a packaging fix (now merged) before we upload to PyPI & Conda, and 
we also have the normal batch of fixes like toLocalIterator for large 
DataFrames in PySpark.

I've chatted with Felix & Shivaram, who seem to think the R side is looking 
close to being in good shape for a 2.1.1 release to submit to CRAN (if I've 
mis-spoken, my apologies). The two outstanding issues being tracked 
for R are SPARK-18817 and SPARK-19237.

Looking at the other components quickly it seems like structured streaming 
could also benefit from a patch release.

What do others think - are there any issues people are actively targeting for 
2.1.1? Is this too early to be considering a patch release?

Cheers,

Holden
--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-05 Thread Felix Cheung
+1 (non binding)
Tested R, R package on Ubuntu and Windows, CRAN checks, manual tests with 
steaming & udf.


_
From: Denny Lee >
Sent: Monday, July 3, 2017 9:30 PM
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC6)
To: Liang-Chi Hsieh >, 
>


+1 (non-binding)

On Mon, Jul 3, 2017 at 6:45 PM Liang-Chi Hsieh 
> wrote:
+1


Sameer Agarwal wrote
> +1
>
> On Mon, Jul 3, 2017 at 6:08 AM, Wenchen Fan 

> cloud0fan@

>  wrote:
>
>> +1
>>
>> On 3 Jul 2017, at 8:22 PM, Nick Pentreath 

> nick.pentreath@

> 
>> wrote:
>>
>> +1 (binding)
>>
>> On Mon, 3 Jul 2017 at 11:53 Yanbo Liang 

> ybliang8@

>  wrote:
>>
>>> +1
>>>
>>> On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier <
>>>

> hvanhovell@

>> wrote:
>>>
 +1

 On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida <


> ricardo.almeida@

>> wrote:

> +1 (non-binding)
>
> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn
> -Phive -Phive-thriftserver -Pscala-2.11 on
>
>- macOS 10.12.5 Java 8 (build 1.8.0_131)
>- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>
>
>
>
>
> On 1 Jul 2017 02:45, "Michael Armbrust" 

> michael@

>  wrote:
>
> Please vote on releasing the following candidate as Apache Spark
> version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00
> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc6
> https://github.com/apache/spark/tree/v2.2.0-rc6;
> (a2c7b2133cfee7f
> a9abfaa2bfbfb637155466783)
>
> List of JIRA tickets resolved can be found with this filter
> https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.2.0;
> .
>
> The release files, including signatures, digests, etc. can be found
> at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1245/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-
> 2.2.0-rc6-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking
> an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be
> worked on immediately. Everything else please retarget to 2.3.0 or
> 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from 2.1.1.
>
>
>


>>>
>>
>
>
> --
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag





-
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-2-0-RC6-tp21902p21914.html
Sent from the Apache Spark Developers List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org





Re: [SparkR] - options around setting up SparkSession / SparkContext

2017-04-21 Thread Felix Cheung
How would you handle this in Scala?

If you are adding a wrapper func like getSparkSession for Scala, and have your 
users call it, can't you do the same in SparkR? After all, while it's true you 
don't need a SparkSession object to call the R API, someone still needs to call 
sparkR.session() to initialize the current session?
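
For example, a minimal sketch of such a lazy wrapper in SparkR could look like 
the following (the helper name getSparkSession and the config values are only 
illustrative, not an existing API):

library(SparkR)

# hypothetical lazy wrapper: the session is only created on first use,
# with per-user settings applied at that point
.lazySparkEnv <- new.env()

getSparkSession <- function(userConf = list()) {
  if (is.null(.lazySparkEnv$started)) {
    sparkR.session(appName = "notebook", sparkConfig = userConf)
    .lazySparkEnv$started <- TRUE
  }
  invisible(TRUE)
}

# usage (the settings below are just examples):
# getSparkSession(list(spark.executor.memory = "2g"))
# df <- as.DataFrame(iris)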

Also, what Spark environment settings do you want to customize?

Can these be set in environment variables or via spark-defaults.conf 
spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties


_
From: Vin J >
Sent: Friday, April 21, 2017 2:22 PM
Subject: [SparkR] - options around setting up SparkSession / SparkContext
To: >



I need to make an R environment available where the SparkSession/SparkContext 
needs to be setup a specific way. The user simply accesses this environment and 
executes his/her code. If the user code does not access any Spark functions, I 
do not want to create a SparkContext unnecessarily.

In Scala/Python environments, the user can't access spark without first 
referencing SparkContext / SparkSession classes. So the above (lazy and/or 
custom SparkSession/Context creation) is easily met by offering 
sparkContext/sparkSession handles to the user that are either wrappers on 
Spark's classes or have lazy evaluation semantics. This way only when the user 
accesses these handles to sparkContext/Session will the SparkSession/Context 
actually get set up without the user needing to know all the details about 
initializing the SparkContext/Session.

However, achieving the same doesn't appear to be so straightforward in R. From 
what I see, executing sparkR.session(...) sets up private variables in 
SparkR:::.sparkREnv (.sparkRjsc , .sparkRsession). The way SparkR api works, a 
user doesn't need a handle to the spark session as such. Executing functions 
like so:  "df <- as.DataFrame(..)" implicitly access the private vars in 
SparkR:::.sparkREnv to get access to the sparkContext etc that are expected to 
have been created by a prior call to sparkR.session()/sparkR.init() etc.

Therefore, to inject any custom/lazy behavior into this I don't see a way 
except through having my code (that sits outside of Spark) apply a 
delayedAssign() or a makeActiveBinding( ) on SparkR:::.sparkRsession / 
.sparkRjsc  variables. This way when spark code internally references them, my 
wrapper/lazy code gets executed to do whatever I need done.
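
To make the mechanism concrete, a minimal sketch of these two base-R tools on a 
plain environment looks like the following (applying them to SparkR's private 
environment is where I hit the limitations described below; the placeholder 
handle is illustrative only):

env <- new.env()

# delayedAssign: the expression runs only when "jsc" is first accessed
delayedAssign("jsc", {
  message("lazily initializing the Spark context ...")
  "spark-context-handle"  # placeholder for the real initialization call
}, assign.env = env)

# makeActiveBinding: every read of "session" goes through this function
makeActiveBinding("session", function() {
  get("jsc", envir = env)  # forces the delayed assignment on first use
}, env)

env$session  # triggers the lazy initialization
env$session  # later reads reuse the same handle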

However, I am seeing some limitations of applying even this approach to SparkR 
- it will not work unless some minor changes are made in the SparkR code. But, 
before I opened a PR that would do these changes in SparkR I wanted to check if 
there was a better way to achieve this? I am far less than an R expert, and 
could be missing something here.

If you'd rather see this in a JIRA and a PR, let me know and I'll go ahead and 
open one.

Regards,
Vin.






Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-04-28 Thread Felix Cheung
+1

Tested R on linux and windows

The previous issue with building vignettes on Windows (a stack overflow in ALS) 
still reproduces, but as confirmed the issue was already in 2.1.0 so this isn't a 
regression (and hope for the best on CRAN..)
https://issues.apache.org/jira/browse/SPARK-20402


From: Denny Lee 
Sent: Friday, April 28, 2017 10:13:41 AM
To: Kazuaki Ishizaki; Michael Armbrust
Cc: dev@spark.apache.org
Subject: Re: [VOTE] Apache Spark 2.1.1 (RC4)

+1

On Fri, Apr 28, 2017 at 9:17 AM Kazuaki Ishizaki 
> wrote:
+1 (non-binding)

I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests for core 
have passed.

$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
$ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 package 
install
$ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl core
...
Total number of tests run: 1788
Suites: completed 198, aborted 0
Tests: succeeded 1788, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 16:30 min
[INFO] Finished at: 2017-04-29T01:02:29+09:00
[INFO] Final Memory: 54M/576M
[INFO] 

Regards,
Kazuaki Ishizaki,



From:Michael Armbrust 
>
To:"dev@spark.apache.org" 
>
Date:2017/04/27 09:30
Subject:[VOTE] Apache Spark 2.1.1 (RC4)




Please vote on releasing the following candidate as Apache Spark version 2.1.1. 
The vote is open until Sat, April 29th, 2017 at 18:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is 
v2.1.1-rc4 
(267aca5bd5042303a718d10635bc0d1a1596853f)

List of JIRA tickets resolved can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1232/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

What should happen to JIRA tickets still targeting 2.1.1?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.1.2 or 2.2.0.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.0.

What happened to RC1?

There were issues with the release packaging and as a result it was skipped.



Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Felix Cheung
Congrats!!


From: Kevin Kim (Sangwoo) 
Sent: Monday, August 7, 2017 7:30:01 PM
To: Hyukjin Kwon; dev
Cc: Bryan Cutler; Mridul Muralidharan; Matei Zaharia; Holden Karau
Subject: Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

Thanks for all of your hard work, Hyukjin and Sameer. Congratulations!!


2017년 8월 8일 (화) 오전 9:44, Hyukjin Kwon 
>님이 작성:
Thank you all. Will do my best!

2017-08-08 8:53 GMT+09:00 Holden Karau 
>:
Congrats!

On Mon, Aug 7, 2017 at 3:54 PM Bryan Cutler 
> wrote:
Great work Hyukjin and Sameer!

On Mon, Aug 7, 2017 at 10:22 AM, Mridul Muralidharan 
> wrote:
Congratulations Hyukjin, Sameer !

Regards,
Mridul

On Mon, Aug 7, 2017 at 8:53 AM, Matei Zaharia 
> wrote:
> Hi everyone,
>
> The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as 
> committers. Join me in congratulating both of them and thanking them for 
> their contributions to the project!
>
> Matei
> -
> To unsubscribe e-mail: 
> dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org


--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau



Re: SBT / PR builder builds failing on "include an external JAR in SparkR"

2017-06-12 Thread Felix Cheung
Facepalm

I broke them - I was making changes to test files and of course Jenkins was 
only running the R tests since I was only changing R files, and everything 
passed there.

Fix is
Seq(sparkHome, "R", "pkg", "inst", "tests",

To
Seq(sparkHome, "R", "pkg", "tests", "fulltests",

And 2 instances of this.

I'm AFK right now and will push a fix as soon as I can. Sorry for the miss.

_
From: Sean Owen >
Sent: Monday, June 12, 2017 5:56 AM
Subject: SBT / PR builder builds failing on "include an external JAR in SparkR"
To: dev >


I noticed the PR builder builds are all failing with:

[info] - correctly builds R packages included in a jar with --packages !!! 
IGNORED !!!
[info] - include an external JAR in SparkR *** FAILED *** (32 milliseconds)
[info]   new java.io.File(rScriptDir).exists() was false 
(SparkSubmitSuite.scala:531)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
[info]   at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
[info]   at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$23.apply$mcV$sp(SparkSubmitSuite.scala:531)
...

It seems to only affect the SBT builds; the Maven builds show this test is 
cancelled because R isn't installed:

- correctly builds R packages included in a jar with --packages !!! IGNORED !!!
- include an external JAR in SparkR !!! CANCELED !!!
  org.apache.spark.api.r.RUtils.isSparkRInstalled was false SparkR is not 
installed in this build. (SparkSubmitSuite.scala:528)

It seems to have started after:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/3081/

but I don't see how those changes relate.

Did anything happen to change w.r.t. R tests or the env in the last day?





Re: [build system] rolling back R to working version

2017-06-20 Thread Felix Cheung
Thanks Shane!


From: shane knapp 
Sent: Tuesday, June 20, 2017 9:23:57 PM
To: dev
Subject: Re: [build system] rolling back R to working version

this is done...  i backported R to 3.1.1 and reinstalled all the R
packages so we're starting w/a clean slate.  the workers are all
restarted, and i re-triggered as many PRBs as i could find.

i'll check in first thing in the morning (PDT) and see how things are going.

shane

On Tue, Jun 20, 2017 at 8:31 PM, shane knapp  wrote:
> i accidentally updated R during the system update, and will be rolling
> everything back to the known working versions.
>
> again, i'm really sorry about this.  our jenkins is old, and the new
> ubuntu one is almost ready to go.  i really can't wait to shut down
> the centos boxes...  they're old and crusty.
>
> i'll give an update when i'm done -- this shouldn't take long.
>
> shane

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Felix Cheung
All tasks on the R QA umbrella are completed
SPARK-20512

We can close this.



_
From: Sean Owen >
Sent: Tuesday, June 6, 2017 1:16 AM
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
To: Michael Armbrust >
Cc: >


On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
> wrote:
Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2 knowing 
that they were unlikely to pass.  That said, I still think these early RCs are 
valuable. I know several users that wanted to test new features in 2.2 that 
have used them.  Now, if we would prefer to call them preview or RC0 or 
something I'd be okay with that as well.

They are valuable, I only suggest it's better to note explicitly when there are 
blockers or must-do tasks that will fail the RC. It makes a big difference to 
whether one would like to +1.

I meant more than just calling them something different. An early RC could be 
voted as a released 'preview' artifact, at the start of the notional QA period, 
with a lower bar to passing, and releasable with known issues. This encourages 
more testing. It also resolves the controversy about whether it's OK to include 
an RC in a product (separate thread).


Regarding doc updates, I don't think it is a requirement that they be voted on 
as part of the release.  Even if they are something version specific.  I think 
we have regularly updated the website with documentation that was merged after 
the release.

They're part of the source release too, as markdown, and should be voted on. 
I've never understood otherwise. Have we actually released docs and then later 
changed them, so that they don't match the release? I don't recall that, but I 
do recall updating the non-version-specific website.

Aside from the oddity of having docs generated from x.y source not match docs 
published for x.y, you want the same protections for doc source that the 
project distributes as anything else. It's not just correctness, but liability. 
The hypothetical is always that someone included copyrighted text or something 
without permission and now the project can't rely on the argument that it made 
a good-faith effort to review what it released on the site. Someone becomes 
personally liable.

These are pretty technical reasons though. More practically, what's the hurry 
to release if docs aren't done (_if_ they're not done)? It's being presented as 
normal practice, but seems quite exceptional.


I personally don't think the QA umbrella JIRAs are particularly effective, but 
I also wouldn't ban their use if others think they are.  However, I do think 
that real QA needs an RC to test, so I think it is fine that there is still 
outstanding QA to be done when an RC is cut.  For example, I plan to run a 
bunch of streaming workloads on RC4 and will vote accordingly.

QA on RCs is great (see above). The problem is, I can't distinguish between a 
JIRA that means "we must test in general", which sounds like something you too 
would ignore, and one that means "there is specific functionality we have to 
check before a release that we haven't looked at yet", which is a committer 
waving a flag that they implicitly do not want a release until resolved. I 
wouldn't +1 a release that had a Blocker software defect one of us reported.

I know I'm harping on this, but this is the one mechanism we do use 
consistently (Blocker JIRAs) to clearly communicate about issues vital to a go 
/ no-go release decision, and I think this interferes. The rest of JIRA noise 
doesn't matter much. You can see we're already resorting to secondary 
communications as a result ("anyone have any issues that need to be fixed 
before I cut another RC?" emails) because this is kind of ignored, and think 
we're swapping out a decent mechanism for worse one.

I suspect, as you do, that there's no to-do here in which case they should be 
resolved and we're still on track for release. I'd wait on +1 until then.





Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-09 Thread Felix Cheung
Hmm, that's odd. This test would be in Jenkins too - let me double check

_
From: Nick Pentreath >
Sent: Friday, June 9, 2017 1:12 AM
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
To: dev >


All Scala, Python tests pass. ML QA and doc issues are resolved (as well as R 
it seems).

However, I'm seeing the following test failure on R consistently: 
https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72


On Thu, 8 Jun 2017 at 08:48 Denny Lee 
> wrote:
+1 non-binding

Tested on macOS Sierra, Ubuntu 16.04
test suite includes various test cases including Spark SQL, ML, GraphFrames, 
Structured Streaming


On Wed, Jun 7, 2017 at 9:40 PM vaquar khan 
> wrote:
+1 non-binding

Regards,
vaquar khan

On Jun 7, 2017 4:32 PM, "Ricardo Almeida" 
> wrote:
+1 (non-binding)

Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive 
-Phive-thriftserver -Pscala-2.11 on

  *   Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
  *   macOS 10.12.5 Java 8 (build 1.8.0_131)

On 5 June 2017 at 21:14, Michael Armbrust 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.2.0. 
The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is 
v2.2.0-rc4 
(377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)

List of JIRA tickets resolved can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1241/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1.






Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-13 Thread Felix Cheung
Thanks
This was with an external package and unrelated

  >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning 
(https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)

As for CentOS - would it be possible to test against R older than 3.4.0? This 
is the same error reported by Nick below.

_
From: Hyukjin Kwon <gurwls...@gmail.com<mailto:gurwls...@gmail.com>>
Sent: Tuesday, June 13, 2017 8:02 PM
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
To: dev <dev@spark.apache.org<mailto:dev@spark.apache.org>>
Cc: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>, Nick Pentreath 
<nick.pentre...@gmail.com<mailto:nick.pentre...@gmail.com>>, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>


For the test failure on R, I checked:


Per https://github.com/apache/spark/tree/v2.2.0-rc4,

1. Windows Server 2012 R2 / R 3.3.1 - passed 
(https://ci.appveyor.com/project/spark-test/spark/build/755-r-test-v2.2.0-rc4)
2. macOS Sierra 10.12.3 / R 3.4.0 - passed
3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning 
(https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
4. CentOS 7.2.1511 / R 3.4.0 - reproduced 
(https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)


Per https://github.com/apache/spark/tree/v2.1.1,

1. CentOS 7.2.1511 / R 3.4.0 - reproduced 
(https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)


This looks being failed only in CentOS 7.2.1511 / R 3.4.0 given my tests and 
observations.

This is failed in Spark 2.1.1. So, it sounds not a regression although it is a 
bug that should be fixed (whether in Spark or R).


2017-06-14 8:28 GMT+09:00 Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>>:
-1

Spark 2.2 is unable to read the partitioned table created by Spark 2.1 or 
earlier.

Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085

Will fix it soon.

Thanks,

Xiao Li



2017-06-13 9:39 GMT-07:00 Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>>:
Re: the QA JIRAs:
Thanks for discussing them.  I still feel they are very helpful; I particularly 
notice not having to spend a solid 2-3 weeks of time QAing (unlike in earlier 
Spark releases).  One other point not mentioned above: I think they serve as a 
very helpful reminder/training for the community for rigor in development.  
Since we instituted QA JIRAs, contributors have been a lot better about adding 
in docs early, rather than waiting until the end of the cycle (though I know 
this is drawing conclusions from correlations).

I would vote in favor of the RC...but I'll wait to see about the reported 
failures.

On Fri, Jun 9, 2017 at 3:30 PM, Sean Owen 
<so...@cloudera.com<mailto:so...@cloudera.com>> wrote:
Different errors as in https://issues.apache.org/jira/browse/SPARK-20520 but 
that's also reporting R test failures.

I went back and tried to run the R tests and they passed, at least on Ubuntu 17 
/ R 3.3.


On Fri, Jun 9, 2017 at 9:12 AM Nick Pentreath 
<nick.pentre...@gmail.com<mailto:nick.pentre...@gmail.com>> wrote:
All Scala, Python tests pass. ML QA and doc issues are resolved (as well as R 
it seems).

However, I'm seeing the following test failure on R consistently: 
https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72


On Thu, 8 Jun 2017 at 08:48 Denny Lee 
<denny.g@gmail.com<mailto:denny.g@gmail.com>> wrote:
+1 non-binding

Tested on macOS Sierra, Ubuntu 16.04
test suite includes various test cases including Spark SQL, ML, GraphFrames, 
Structured Streaming


On Wed, Jun 7, 2017 at 9:40 PM vaquar khan 
<vaquar.k...@gmail.com<mailto:vaquar.k...@gmail.com>> wrote:
+1 non-binding

Regards,
vaquar khan

On Jun 7, 2017 4:32 PM, "Ricardo Almeida" 
<ricardo.alme...@actnowib.com<mailto:ricardo.alme...@actnowib.com>> wrote:
+1 (non-binding)

Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive 
-Phive-thriftserver -Pscala-2.11 on

  *   Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
  *   macOS 10.12.5 Java 8 (build 1.8.0_131)

On 5 June 2017 at 21:14, Michael Armbrust 
<mich...@databricks.com<mailto:mich...@databricks.com>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.2.0. 
The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is 
v2.2.0-rc4<https://github.com/apache/spark/tree/v2.2.0-rc4> 
(377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)

List of JIRA tickets resolved can be found with this 
filter<https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AN

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-14 Thread Felix Cheung
Thanks! Will try to setup RHEL/CentOS to test it out

_
From: Nick Pentreath <nick.pentre...@gmail.com<mailto:nick.pentre...@gmail.com>>
Sent: Tuesday, June 13, 2017 11:38 PM
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>, 
Hyukjin Kwon <gurwls...@gmail.com<mailto:gurwls...@gmail.com>>, dev 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>
Cc: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>


Hi yeah sorry for slow response - I was RHEL and OpenJDK but will have to 
report back later with the versions as am AFK.

R version not totally sure but again will revert asap
On Wed, 14 Jun 2017 at 05:09, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Thanks
This was with an external package and unrelated

  >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning 
(https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)

As for CentOS - would it be possible to test against R older than 3.4.0? This 
is the same error reported by Nick below.

_
From: Hyukjin Kwon <gurwls...@gmail.com<mailto:gurwls...@gmail.com>>
Sent: Tuesday, June 13, 2017 8:02 PM

Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
To: dev <dev@spark.apache.org<mailto:dev@spark.apache.org>>
Cc: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>, Nick Pentreath 
<nick.pentre...@gmail.com<mailto:nick.pentre...@gmail.com>>, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>



For the test failure on R, I checked:


Per https://github.com/apache/spark/tree/v2.2.0-rc4,

1. Windows Server 2012 R2 / R 3.3.1 - passed 
(https://ci.appveyor.com/project/spark-test/spark/build/755-r-test-v2.2.0-rc4)
2. macOS Sierra 10.12.3 / R 3.4.0 - passed
3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning 
(https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
4. CentOS 7.2.1511 / R 3.4.0 - reproduced 
(https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)


Per https://github.com/apache/spark/tree/v2.1.1,

1. CentOS 7.2.1511 / R 3.4.0 - reproduced 
(https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)


This looks being failed only in CentOS 7.2.1511 / R 3.4.0 given my tests and 
observations.

This is failed in Spark 2.1.1. So, it sounds not a regression although it is a 
bug that should be fixed (whether in Spark or R).


2017-06-14 8:28 GMT+09:00 Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>>:
-1

Spark 2.2 is unable to read the partitioned table created by Spark 2.1 or 
earlier.

Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085

Will fix it soon.

Thanks,

Xiao Li



2017-06-13 9:39 GMT-07:00 Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>>:
Re: the QA JIRAs:
Thanks for discussing them.  I still feel they are very helpful; I particularly 
notice not having to spend a solid 2-3 weeks of time QAing (unlike in earlier 
Spark releases).  One other point not mentioned above: I think they serve as a 
very helpful reminder/training for the community for rigor in development.  
Since we instituted QA JIRAs, contributors have been a lot better about adding 
in docs early, rather than waiting until the end of the cycle (though I know 
this is drawing conclusions from correlations).

I would vote in favor of the RC...but I'll wait to see about the reported 
failures.

On Fri, Jun 9, 2017 at 3:30 PM, Sean Owen 
<so...@cloudera.com<mailto:so...@cloudera.com>> wrote:
Different errors as in https://issues.apache.org/jira/browse/SPARK-20520 but 
that's also reporting R test failures.

I went back and tried to run the R tests and they passed, at least on Ubuntu 17 
/ R 3.3.


On Fri, Jun 9, 2017 at 9:12 AM Nick Pentreath 
<nick.pentre...@gmail.com<mailto:nick.pentre...@gmail.com>> wrote:
All Scala, Python tests pass. ML QA and doc issues are resolved (as well as R 
it seems).

However, I'm seeing the following test failure on R consistently: 
https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72


On Thu, 8 Jun 2017 at 08:48 Denny Lee 
<denny.g@gmail.com<mailto:denny.g@gmail.com>> wrote:
+1 non-binding

Tested on macOS Sierra, Ubuntu 16.04
test suite includes various test cases including Spark SQL, ML, GraphFrames, 
Structured Streaming


On Wed, Jun 7, 2017 at 9:40 PM vaquar khan 
<vaquar.k...@gmail.com<mailto:vaquar.k...@gmail.com>> wrote:
+1 non-binding

Regards,
vaquar khan

On Jun 7, 2017 4:32 PM, "Ricardo Almeida" 
<ricardo.alme...@actnowib.com<mailto:ricardo.alme...@actnowib.com>> wrote:
+1 (non-binding)

Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive 
-Phive-thriftserver -Pscala-

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-15 Thread Felix Cheung
Sounds good.

Think we checked and should be good to go. Appreciated.


From: Michael Armbrust <mich...@databricks.com>
Sent: Wednesday, June 14, 2017 4:51:48 PM
To: Hyukjin Kwon
Cc: Felix Cheung; Nick Pentreath; dev; Sean Owen
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)

So, it looks like 
SPARK-21085<https://issues.apache.org/jira/browse/SPARK-21085> has been fixed 
and SPARK-21093<https://issues.apache.org/jira/browse/SPARK-21093> is not a 
regression.  Last call before I cut RC5.

On Wed, Jun 14, 2017 at 2:28 AM, Hyukjin Kwon 
<gurwls...@gmail.com<mailto:gurwls...@gmail.com>> wrote:
Actually, I opened - https://issues.apache.org/jira/browse/SPARK-21093.

2017-06-14 17:08 GMT+09:00 Hyukjin Kwon 
<gurwls...@gmail.com<mailto:gurwls...@gmail.com>>:
For a shorter reproducer ...


df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
collect(gapply(df, "a", function(key, x) { x }, schema(df)))

And running the below multiple times (5~7):

collect(gapply(df, "a", function(key, x) { x }, schema(df)))

occasionally throws an error.
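
A quick way to repeat that call is a small loop over the same df as above (the 
loop is only for convenience):

# re-run the same gapply call several times to hit the intermittent error
for (i in 1:7) {
  print(collect(gapply(df, "a", function(key, x) { x }, schema(df))))
}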


I will leave it here and probably add more information if a JIRA is opened. This 
does not look like a regression anyway.



2017-06-14 16:22 GMT+09:00 Hyukjin Kwon 
<gurwls...@gmail.com<mailto:gurwls...@gmail.com>>:

Per https://github.com/apache/spark/tree/v2.1.1,

1. CentOS 7.2.1511 / R 3.3.3 - this test hangs.

I messed it up a bit while downgrading R to 3.3.3 (it was an actual machine, 
not a VM), so it took me a while to re-try this.
I re-built this again and checked that the R version is at least 3.3.3. I hope 
this one could be double checked.

Here is the self-reproducer:

irisDF <- suppressWarnings(createDataFrame(iris))
schema <- structType(structField("Sepal_Length", "double"),
                     structField("Avg", "double"))
df4 <- gapply(
  cols = "Sepal_Length",
  irisDF,
  function(key, x) {
    y <- data.frame(key, mean(x$Sepal_Width), stringsAsFactors = FALSE)
  },
  schema)
collect(df4)



2017-06-14 16:07 GMT+09:00 Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>:
Thanks! Will try to setup RHEL/CentOS to test it out

_____
From: Nick Pentreath <nick.pentre...@gmail.com<mailto:nick.pentre...@gmail.com>>
Sent: Tuesday, June 13, 2017 11:38 PM
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>, 
Hyukjin Kwon <gurwls...@gmail.com<mailto:gurwls...@gmail.com>>, dev 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>

Cc: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>


Hi yeah sorry for slow response - I was RHEL and OpenJDK but will have to 
report back later with the versions as am AFK.

R version not totally sure but again will revert asap
On Wed, 14 Jun 2017 at 05:09, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Thanks
This was with an external package and unrelated

  >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning 
(https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)

As for CentOS - would it be possible to test against R older than 3.4.0? This 
is the same error reported by Nick below.

_
From: Hyukjin Kwon <gurwls...@gmail.com<mailto:gurwls...@gmail.com>>
Sent: Tuesday, June 13, 2017 8:02 PM

Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
To: dev <dev@spark.apache.org<mailto:dev@spark.apache.org>>
Cc: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>, Nick Pentreath 
<nick.pentre...@gmail.com<mailto:nick.pentre...@gmail.com>>, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>



For the test failure on R, I checked:


Per https://github.com/apache/spark/tree/v2.2.0-rc4,

1. Windows Server 2012 R2 / R 3.3.1 - passed 
(https://ci.appveyor.com/project/spark-test/spark/build/755-r-test-v2.2.0-rc4)
2. macOS Sierra 10.12.3 / R 3.4.0 - passed
3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning 
(https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
4. CentOS 7.2.1511 / R 3.4.0 - reproduced 
(https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)


Per https://github.com/apache/spark/tree/v2.1.1,

1. CentOS 7.2.1511 / R 3.4.0 - reproduced 
(https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)


This looks being failed only in CentOS 7.2.1511 / R 3.4.0 given my tests and 
observations.

This is failed in Spark 2.1.1. So, it sounds not a regression although it is a 
bug that should be fixed (whether in Spark or R).


2017-06-14 8:28 GMT+09:00 Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm.

Re: Spark 2.2.0 or Spark 2.3.0?

2017-05-02 Thread Felix Cheung
Yes 2.2.0


From: kant kodali 
Sent: Monday, May 1, 2017 10:43:44 PM
To: dev
Subject: Spark 2.2.0 or Spark 2.3.0?

Hi All,

If I understand the Spark standard release process correctly, it looks like the 
official release is going to be sometime at the end of this month and it is going 
to be 2.2.0, right (not 2.3.0)? I am eagerly looking forward to Spark 2.2.0 
because of the "update mode" option in Spark Streaming. Please correct me if I am wrong.

Thanks!


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-14 Thread Felix Cheung
+1 tested SparkR package on Windows, r-hub, Ubuntu.

_
From: Sean Owen >
Sent: Thursday, September 14, 2017 3:12 PM
Subject: Re: [VOTE] Spark 2.1.2 (RC1)
To: Holden Karau >, 
>


+1
Very nice. The sigs and hashes look fine, it builds fine for me on Debian 
Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes tests.

Yes as you say, no outstanding issues except for this which doesn't look 
critical, as it's not a regression.

SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs


On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.1.2. 
The vote is open until Friday September 22nd at 18:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is 
v2.1.2-rc1 
(6f470323a0363656999dd36cb33f528afe627c12)

List of JIRA tickets resolved in this release can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1248/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks, in the Java/Scala you can add 
the staging repository to your projects resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with a out of date RC going forward).

What should happen to JIRA tickets still targeting 2.1.2?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.1.3.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1. That being said if there is 
something which is a regression from 2.1.1 that has not been correctly targeted 
please ping a committer to help target the issue (you can see the open issues 
listed as impacting Spark 2.1.1 & 
2.1.2)

What are the unresolved issues targeted for 
2.1.2?

At the time of the writing, there is one in progress major issue 
SPARK-21985, I believe 
Andrew Ray & HyukjinKwon are looking into this one.

--
Twitter: https://twitter.com/holdenkarau




Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-15 Thread Felix Cheung
Yes ;)


From: Xiao Li <gatorsm...@gmail.com>
Sent: Friday, September 15, 2017 2:22:03 PM
To: Holden Karau
Cc: Ryan Blue; Denny Lee; Felix Cheung; Sean Owen; dev@spark.apache.org
Subject: Re: [VOTE] Spark 2.1.2 (RC1)

Sorry, this release candidate is 2.1.2. The issue is in 2.2.1.

2017-09-15 14:21 GMT-07:00 Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>>:
-1

See the discussion in https://github.com/apache/spark/pull/19074

Xiao



2017-09-15 12:28 GMT-07:00 Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>:
That's a good question. I built the release candidate; however, the Jenkins 
scripts don't take a parameter for configuring who signs the artifacts, so they 
are always signed with Patrick's key. You can see this from previous releases 
which were managed by other folks but still signed by Patrick.

On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue 
<rb...@netflix.com<mailto:rb...@netflix.com>> wrote:
The signature is valid, but why was the release signed with Patrick Wendell's 
private key? Did Patrick build the release candidate?

rb

On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee 
<denny.g@gmail.com<mailto:denny.g@gmail.com>> wrote:
+1 (non-binding)

On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
+1 tested SparkR package on Windows, r-hub, Ubuntu.

_
From: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>
Sent: Thursday, September 14, 2017 3:12 PM
Subject: Re: [VOTE] Spark 2.1.2 (RC1)
To: Holden Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>, 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>



+1
Very nice. The sigs and hashes look fine, it builds fine for me on Debian 
Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and passes tests.

Yes as you say, no outstanding issues except for this which doesn't look 
critical, as it's not a regression.

SPARK-21985 PySpark PairDeserializer is broken for double-zipped RDDs


On Thu, Sep 14, 2017 at 7:47 PM Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.1.2. 
The vote is open until Friday September 22nd at 18:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is 
v2.1.2-rc1<https://github.com/apache/spark/tree/v2.1.2-rc1> 
(6f470323a0363656999dd36cb33f528afe627c12)

List of JIRA tickets resolved in this release can be found with this 
filter.<https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1248/

The documentation corresponding to this release can be found at:
https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks, in the Java/Scala you can add 
the staging repository to your projects resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with a out of date RC going forward).

What should happen to JIRA tickets still targeting 2.1.2?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.1.3.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1. That being said if there is 
something which is a regression from 2.1.1 that has not been correctly targeted 
please ping a committer to help target the issue (you can see the open issues 
listed as impacting Spark 2.1.1 & 
2.1.2<https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.1.2%20OR%20affectedVersion%20%3D%202.1.1)>)

What are the unresolved issues targeted for 
2.1.2<https://issues.apache.org/jira/browse/SPARK-21985?jql

Re: Nightly builds for master branch failed

2017-10-04 Thread Felix Cheung
Hmm, sounds like some sort of corruption of the maven directory on the Jenkins 
box...



From: Liwei Lin 
Sent: Wednesday, October 4, 2017 6:52:54 PM
To: Spark dev list
Subject: Nightly builds for master branch failed

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/

Nightly builds for master branch failed due to:

[error] error: error reading 
/home/jenkins/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.6.7.1/jackson-databind-2.6.7.1.jar;
 zip file is empty

Can we get it fixed please? Thanks!



Cheers,
Liwei


Re: Disabling Closed -> Reopened transition for non-committers

2017-10-04 Thread Felix Cheung
To be sure, this is only for JIRA and not for GitHub PRs, right?

If so, +1, but I think the access control on JIRA does not necessarily match 
the committer list, and is manually maintained, last I heard.


From: Sean Owen 
Sent: Wednesday, October 4, 2017 7:51:37 PM
To: Dongjoon Hyun
Cc: dev
Subject: Re: Disabling Closed -> Reopened transition for non-committers

Although I assume we could get an account suspended if it started opening spam 
issues, yes we default to letting anyone open issues, and potentially abusing 
it. That much is the right default and I don't see any policy tweak that stops 
that.

I see several INFRA tickets asking to *allow* the Closed -> Reopened 
transition, which suggests it's not the default. 
https://issues.apache.org/jira/browse/INFRA-11857?jql=project%20%3D%20INFRA%20AND%20text%20~%20%22reopen%20JIRA%22

I'm accustomed to Closed being a final state that nobody can reopen as a matter 
of workflow -- the idea being that anything else should be a new discussion if 
the current issue was deemed formally done.

Spark pretty much leaves all issues in "Resolved" status which can still be 
reopened, and I think that's right. Although I'd like to limit all reopening to 
committers, it isn't that important.

Being able to move a JIRA to Closed permanently seems useful, as it doesn't 
interfere with any normal workflow, doesn't actually prevent a new issue from 
succeeding it in normal usage, and gives another tool to limit a specific kind 
of abuse.

On Thu, Oct 5, 2017 at 3:28 AM Dongjoon Hyun 
> wrote:
It can stop reopening, but new JIRA issues with duplicate content will be 
created intentionally instead.

Is that policy (privileged reopening) used in other Apache communities for that 
purpose?


On Wed, Oct 4, 2017 at 7:06 PM, Sean Owen 
> wrote:
We have this problem occasionally, where a disgruntled user continually reopens 
an issue after it's closed.

https://issues.apache.org/jira/browse/SPARK-21999

(Feel free to comment on this one if anyone disagrees)

Regardless of that particular JIRA, I'd like to disable to Closed -> Reopened 
transition for non-committers: https://issues.apache.org/jira/browse/INFRA-15221




Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-04 Thread Felix Cheung
+1

Tested SparkR package manually on multiple platforms and checked different 
Hadoop release jar.

And previously tested the last RC on different R releases (see the last RC vote 
thread)

I found some differences in the bin release jars created by the different options 
when running the make-release script, and created this JIRA to track it:
https://issues.apache.org/jira/browse/SPARK-22202

I've checked to confirm these exist in the 2.1.1 release so this isn't a 
regression, hence my +1.

btw, I think we need to update this file for the new keys used in signing this 
release https://www.apache.org/dist/spark/KEYS


_
From: Liwei Lin >
Sent: Wednesday, October 4, 2017 6:51 PM
Subject: Re: [VOTE] Spark 2.1.2 (RC4)
To: Spark dev list >


+1 (non-binding)


Cheers,
Liwei

On Wed, Oct 4, 2017 at 4:03 PM, Nick Pentreath 
> wrote:
Ah right! Was using a new cloud instance and didn't realize I was logged in as 
root! thanks

On Tue, 3 Oct 2017 at 21:13 Marcelo Vanzin 
> wrote:
Maybe you're running as root (or the admin account on your OS)?

On Tue, Oct 3, 2017 at 12:12 PM, Nick Pentreath
> wrote:
> Hmm I'm consistently getting this error in core tests:
>
> - SPARK-3697: ignore directories that cannot be read. *** FAILED ***
>   2 was not equal to 1 (FsHistoryProviderSuite.scala:146)
>
>
> Anyone else? Any insight? Perhaps it's my set up.
>
>>>
>>>
>>> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau 
>>> > wrote:

 Please vote on releasing the following candidate as Apache Spark version
 2.1.2. The vote is open until Saturday October 7th at 9:00 PST and passes 
 if
 a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 2.1.2
 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see https://spark.apache.org/

 The tag to be voted on is v2.1.2-rc4
 (2abaea9e40fce81cd4626498e0f5c28a70917499)

 List of JIRA tickets resolved in this release can be found with this
 filter.

 The release files, including signatures, digests, etc. can be found at:
 https://home.apache.org/~holden/spark-2.1.2-rc4-bin/

 Release artifacts are signed with a key from:
 https://people.apache.org/~holden/holdens_keys.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1252

 The documentation corresponding to this release can be found at:
 https://people.apache.org/~holden/spark-2.1.2-rc4-docs/


 FAQ

 How can I help test this release?

 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks, in the Java/Scala you
 can add the staging repository to your projects resolvers and test with the
 RC (make sure to clean up the artifact cache before/after so you don't end
 up building with a out of date RC going forward).

 What should happen to JIRA tickets still targeting 2.1.2?

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should be
 worked on immediately. Everything else please retarget to 2.1.3.

 But my bug isn't fixed!??!

 In order to make timely releases, we will typically not hold the release
 unless the bug in question is a regression from 2.1.1. That being said if
 there is something which is a regression from 2.1.1 that has not been
 correctly targeted please ping a committer to help target the issue (you 
 can
 see the open issues listed as impacting Spark 2.1.1 & 2.1.2)

 What are the unresolved issues targeted for 2.1.2?

 At this time there are no open unresolved issues.

 Is there anything different about this release?

 This is the first release in awhile not built on the AMPLAB Jenkins.
 This is good because it means future releases can more easily be built and
 signed securely (and I've been updating the documentation in
 https://github.com/apache/spark-website/pull/66 as I progress), however the
 chances of a mistake are higher with any change like this. If there
 something you normally take for granted as correct when checking a release,
 please double check this time :)

 Should I be committing code to branch-2.1?

 Thanks for asking! 

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Felix Cheung
Thanks Nick, Hyukjin. Yes, this seems to be a longer-standing issue on RHEL with 
respect to forking.


From: Nick Pentreath 
Sent: Friday, October 6, 2017 6:16:53 AM
To: Hyukjin Kwon
Cc: dev
Subject: Re: [VOTE] Spark 2.1.2 (RC4)

Ah yes - I recall that it was fixed. Forgot it was for 2.3.0

My +1 vote stands.

On Fri, 6 Oct 2017 at 15:15 Hyukjin Kwon 
> wrote:
Hi Nick,

I believe that R test failure is due to SPARK-21093; at least the error message 
looks the same, and that is fixed as of 2.3.0. This was not backported because the 
reviewers and I were worried, as it touched a very core part of SparkR (it was even 
reverted once after a very close look by some reviewers).

I asked Michael to note this as a known issue in 
https://spark.apache.org/releases/spark-release-2-2-0.html#known-issues before, 
for this reason.
I believe it should be fine, and we should probably note it if possible. This 
should not be a regression anyway since, if I understood correctly, it has been 
there from the very beginning.

Thanks.




2017-10-06 21:20 GMT+09:00 Nick Pentreath 
>:
Checked sigs & hashes.

Tested on RHEL
build/mvn -Phadoop-2.7 -Phive -Pyarn test passed
Python tests passed

I ran R tests and am getting some failures: 
https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem to 
recall similar issues on a previous release but I thought it was fixed).

I re-ran R tests on an Ubuntu box to double check and they passed there.

So I'd still +1 the release

Perhaps someone can take a look at the R failures on RHEL just in case though.


On Fri, 6 Oct 2017 at 05:58 vaquar khan 
> wrote:
+1 (non-binding). Tested on Ubuntu; all test cases passed.

Regards,
Vaquar khan

On Thu, Oct 5, 2017 at 10:46 PM, Hyukjin Kwon 
> wrote:
+1 too.


On 6 Oct 2017 10:49 am, "Reynold Xin" 
> wrote:
+1


On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.1.2. 
The vote is open until Saturday October 7th at 9:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is 
v2.1.2-rc4 
(2abaea9e40fce81cd4626498e0f5c28a70917499)

List of JIRA tickets resolved in this release can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~holden/spark-2.1.2-rc4-bin/

Release artifacts are signed with a key from:
https://people.apache.org/~holden/holdens_keys.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1252

The documentation corresponding to this release can be found at:
https://people.apache.org/~holden/spark-2.1.2-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks, in the Java/Scala you can add 
the staging repository to your projects resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with a out of date RC going forward).

What should happen to JIRA tickets still targeting 2.1.2?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.1.3.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1. That being said if there is 
something which is a regression form 2.1.1 that has not been correctly targeted 
please ping a committer to help target the issue (you can see the open issues 
listed as impacting Spark 2.1.1 & 
2.1.2)

What are the unresolved issues targeted for 

Re: Nightly builds for master branch failed

2017-10-05 Thread Felix Cheung
Thanks Shane!


From: shane knapp <skn...@berkeley.edu>
Sent: Thursday, October 5, 2017 9:14:54 AM
To: Felix Cheung
Cc: Liwei Lin; Spark dev list
Subject: Re: Nightly builds for master branch failed

yep, it was a corrupted jar on amp-jenkins-worker-01. i grabbed a new one from 
maven.org and kicked off a fresh build.

On Thu, Oct 5, 2017 at 9:03 AM, shane knapp 
<skn...@berkeley.edu<mailto:skn...@berkeley.edu>> wrote:
yep, looking now.

On Wed, Oct 4, 2017 at 10:04 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Hmm, sounds like some sort of corruption of the maven directory on the Jenkins 
box...



From: Liwei Lin <lwl...@gmail.com<mailto:lwl...@gmail.com>>
Sent: Wednesday, October 4, 2017 6:52:54 PM
To: Spark dev list
Subject: Nightly builds for master branch failed

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/

Nightly builds for master branch failed due to:

[error] error: error reading 
/home/jenkins/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.6.7.1/jackson-databind-2.6.7.1.jar;
 zip file is empty

Can we get it fixed please? Thanks!



Cheers,
Liwei




Re: Putting Kafka 0.8 behind an (opt-in) profile

2017-09-05 Thread Felix Cheung
+1


From: Cody Koeninger 
Sent: Tuesday, September 5, 2017 8:12:07 AM
To: Sean Owen
Cc: dev
Subject: Re: Putting Kafka 0.8 behind an (opt-in) profile

+1 to going ahead and giving a deprecation warning now

On Tue, Sep 5, 2017 at 6:39 AM, Sean Owen  wrote:
> On the road to Scala 2.12, we'll need to make Kafka 0.8 support optional in
> the build, because it is not available for Scala 2.12.
>
> https://github.com/apache/spark/pull/19134  adds that profile. I mention it
> because this means that Kafka 0.8 becomes "opt-in" and has to be explicitly
> enabled, and that may have implications for downstream builds.
>
> Yes, we can add <activeByDefault>true</activeByDefault>. It however only has
> effect when no other profiles are set, which makes it more deceptive than
> useful IMHO. (We don't use it otherwise.)
>
> Reviewers may want to check my work especially as regards the Python test
> support and SBT build.
>
>
> Another related question is: when is 0.8 support deprecated, removed? It
> seems sudden to remove it in 2.3.0. Maybe deprecation is in order. The
> driver is that Kafka 0.11 and 1.0 will possibly require yet another variant
> of streaming support (not sure yet), and 3 versions is too many. Deprecating
> now opens more options sooner.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: 2.1.2 maintenance release?

2017-09-08 Thread Felix Cheung
+1 on both 2.1.2 and 2.2.1

And would try to help and/or wrangle the release if needed.

(Note: trying to backport a few changes to branch-2.1 right now)


From: Sean Owen 
Sent: Friday, September 8, 2017 12:05:28 AM
To: Holden Karau; dev
Subject: Re: 2.1.2 maintenance release?

Let's look at the standard ASF guidance, which actually surprised me when I 
first read it:

https://www.apache.org/foundation/voting.html

VOTES ON PACKAGE RELEASES
Votes on whether a package is ready to be released use majority approval -- 
i.e. at least three PMC members must vote affirmatively for release, and there 
must be more positive than negative votes. Releases may not be vetoed. 
Generally the community will cancel the release vote if anyone identifies 
serious problems, but in most cases the ultimate decision, lies with the 
individual serving as release manager. The specifics of the process may vary 
from project to project, but the 'minimum quorum of three +1 votes' rule is 
universal.


PMC votes on it, but no vetoes allowed, and the release manager makes the final 
call. Not your usual vote! doesn't say the release manager has to be part of 
the PMC though it's the role with most decision power. In practice I can't 
imagine it's a problem, but we could also just have someone on the PMC 
technically be the release manager even as someone else is really operating the 
release.

The goal is, really, to be able to put out maintenance releases with important 
fixes. Secondly, to ramp up one or more additional people to perform the 
release steps. Maintenance releases ought to be the least controversial 
releases to decide.

Thoughts on kicking off a release for 2.1.2 to see how it goes?

Although someone can just start following the steps, I think it will certainly 
require some help from Michael, who's run the last release, to clarify parts of 
the process or possibly provide an essential credential to upload artifacts.


On Thu, Sep 7, 2017 at 11:59 PM Holden Karau 
> wrote:
I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after that) if 
people are ok with a committer / me running the release process rather than a 
full PMC member.


Re: 2.1.2 maintenance release?

2017-09-11 Thread Felix Cheung
Hi - what are the next steps?
Pending changes are pushed, and I’ve checked that there are no open JIRAs targeting 
2.1.2 or 2.2.1.

_
From: Reynold Xin <r...@databricks.com<mailto:r...@databricks.com>>
Sent: Friday, September 8, 2017 9:27 AM
Subject: Re: 2.1.2 maintenance release?
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>, 
Holden Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>, Sean Owen 
<so...@cloudera.com<mailto:so...@cloudera.com>>, dev 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>


+1 as well. We should make a few maintenance releases.

On Fri, Sep 8, 2017 at 6:46 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
+1 on both 2.1.2 and 2.2.1

And would try to help and/or wrangle the release if needed.

(Note: trying to backport a few changes to branch-2.1 right now)


From: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>
Sent: Friday, September 8, 2017 12:05:28 AM
To: Holden Karau; dev
Subject: Re: 2.1.2 maintenance release?

Let's look at the standard ASF guidance, which actually surprised me when I 
first read it:

https://www.apache.org/foundation/voting.html

VOTES ON PACKAGE RELEASES
Votes on whether a package is ready to be released use majority approval -- 
i.e. at least three PMC members must vote affirmatively for release, and there 
must be more positive than negative votes. Releases may not be vetoed. 
Generally the community will cancel the release vote if anyone identifies 
serious problems, but in most cases the ultimate decision, lies with the 
individual serving as release manager. The specifics of the process may vary 
from project to project, but the 'minimum quorum of three +1 votes' rule is 
universal.


PMC votes on it, but no vetoes allowed, and the release manager makes the final 
call. Not your usual vote! doesn't say the release manager has to be part of 
the PMC though it's the role with most decision power. In practice I can't 
imagine it's a problem, but we could also just have someone on the PMC 
technically be the release manager even as someone else is really operating the 
release.

The goal is, really, to be able to put out maintenance releases with important 
fixes. Secondly, to ramp up one or more additional people to perform the 
release steps. Maintenance releases ought to be the least controversial 
releases to decide.

Thoughts on kicking off a release for 2.1.2 to see how it goes?

Although someone can just start following the steps, I think it will certainly 
require some help from Michael, who's run the last release, to clarify parts of 
the process or possibly provide an essential credential to upload artifacts.


On Thu, Sep 7, 2017 at 11:59 PM Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after that) if 
people are ok with a committer / me running the release process rather than a 
full PMC member.




Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-29 Thread Felix Cheung
-1

(Sorry) spark-2.1.2-bin-hadoop2.7.tgz is missing the R directory, not sure why 
yet.

Tested on multiple platforms as a source package (against the 2.1.1 jar); it seemed 
fine except for this WARNING on R-devel:

* checking for code/documentation mismatches ... WARNING
Codoc mismatches from documentation object 'attach':
attach
  Code: function(what, pos = 2L, name = deparse(substitute(what),
 backtick = FALSE), warn.conflicts = TRUE)
  Docs: function(what, pos = 2L, name = deparse(substitute(what)),
 warn.conflicts = TRUE)
  Mismatches in argument default values:
Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: 
deparse(substitute(what))

I checked the latest R release, 3.4.1, and the signature change wasn't there. This 
likely indicates an upcoming change in the next R release that could incur this 
new warning when we attempt to publish the package.

Not sure what we can do now, since we work with multiple versions of R and they 
will then have different signatures.

From: Luciano Resende 
Sent: Thursday, September 28, 2017 10:29:18 PM
To: Holden Karau
Cc: dev@spark.apache.org
Subject: Re: [VOTE] Spark 2.1.2 (RC2)

+1 (non-binding)

Minor comments:
The apache infra has a staging repository to add release candidates, and it 
might be better/simpler to use that instead of home.a.o. See 
https://dist.apache.org/repos/dist/dev/spark/.



On Tue, Sep 26, 2017 at 9:47 PM, Holden Karau 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.1.2. 
The vote is open until Wednesday October 4th at 23:59 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.2
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is 
v2.1.2-rc2 
(fabbb7f59e47590114366d14e15fbbff8c88593c)

List of JIRA tickets resolved in this release can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
https://home.apache.org/~holden/spark-2.1.2-rc2-bin/

Release artifacts are signed with a key from:
https://people.apache.org/~holden/holdens_keys.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1251

The documentation corresponding to this release can be found at:
https://people.apache.org/~holden/spark-2.1.2-rc2-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks, in the Java/Scala you can add 
the staging repository to your projects resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with a out of date RC going forward).

What should happen to JIRA tickets still targeting 2.1.2?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.1.3.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1. That being said if there is 
something which is a regression form 2.1.1 that has not been correctly targeted 
please ping a committer to help target the issue (you can see the open issues 
listed as impacting Spark 2.1.1 & 
2.1.2)

What are the unresolved issues targeted for 
2.1.2?

At this time there are no open unresolved issues.

Is there anything different about this release?

This is the first release in awhile not built on the AMPLAB Jenkins. This is 
good because it means future releases can more easily be built and signed 
securely (and I've been updating the documentation in 
https://github.com/apache/spark-website/pull/66 as I progress), however the 
chances of a mistake are higher with any change like this. If there something 
you normally take for granted as correct when checking a release, please double 
check this time :)

Should I be committing code to branch-2.1?

Thanks for asking! 

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Felix Cheung
+1 on this, and I like the suggestion of specifying the type in string form.

Would it be correct to assume there will be a data type check, for example that the 
returned pandas DataFrame column data types match what is specified? We have 
seen quite a few issues/confusions with that in R.

Would it make sense to have a more generic decorator name so that it could also 
be usable for other efficient vectorized formats in the future? Or do we 
anticipate the decorator to be format-specific, with more added in the future?


From: Reynold Xin 
Sent: Friday, September 1, 2017 5:16:11 AM
To: Takuya UESHIN
Cc: spark-dev
Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

Ok, thanks.

+1 on the SPIP for scope etc


On API details (will deal with in code reviews as well but leaving a note here 
in case I forget)

1. I would suggest having the API also accept the data type specification in string 
form. It is usually simpler to say "long" than "LongType()" (see the sketch after 
this list).

2. Think about what error message to show when the row counts don't match at 
runtime.
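
(A minimal sketch of the string-form shorthand suggested in point 1, assuming the 
decorator would accept a plain type name such as "double"; the exact accepted 
spellings would be up to the implementation.)

  from pyspark.sql.functions import pandas_udf   # assumed location of the decorator
  from pyspark.sql.types import DoubleType

  # Explicit DataType object, as written in the proposal:
  plus_explicit = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

  # Hypothetical string shorthand for the same return type:
  plus_shorthand = pandas_udf(lambda v1, v2: v1 + v2, "double")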


On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
> wrote:
Yes, the aggregation is out of scope for now.
I think we should continue discussing the aggregation at JIRA and we will be 
adding those later separately.

Thanks.


On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin 
> wrote:
Is the idea aggregate is out of scope for the current effort and we will be 
adding those later?

On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
> wrote:
Hi all,

We've been discussing to support vectorized UDFs in Python and we almost got a 
consensus about the APIs, so I'd like to summarize and call for a vote.

Note that this vote should focus on APIs for vectorized UDFs, not APIs for 
vectorized UDAFs or Window operations.

https://issues.apache.org/jira/browse/SPARK-21190


Proposed API

We introduce a @pandas_udf decorator (or annotation) to define vectorized UDFs 
which takes one or more pandas.Series or one integer value meaning the length 
of the input value for 0-parameter UDFs. The return value should be 
pandas.Series of the specified type and the length of the returned value should 
be the same as input value.

We can define vectorized UDFs as:

  @pandas_udf(DoubleType())
  def plus(v1, v2):
  return v1 + v2

or we can define as:

  plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())

We can use it similar to row-by-row UDFs:

  df.withColumn('sum', plus(df.v1, df.v2))

As for 0-parameter UDFs, we can define and use as:

  @pandas_udf(LongType())
  def f0(size):
  return pd.Series(1).repeat(size)

  df.select(f0())
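
(For reference, a self-contained sketch of how the proposed API could be exercised 
end to end; the import location pyspark.sql.functions.pandas_udf, the local-mode 
session, and the pyarrow runtime requirement are assumptions of this sketch rather 
than details stated in the proposal.)

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import pandas_udf   # assumed home of the decorator
  from pyspark.sql.types import DoubleType

  # Local session just to exercise the UDF; pyarrow is assumed to be installed.
  spark = SparkSession.builder.master("local[2]").appName("pandas-udf-sketch").getOrCreate()
  df = spark.createDataFrame([(1.0, 4.0), (2.0, 5.0), (3.0, 6.0)], ["v1", "v2"])

  @pandas_udf(DoubleType())
  def plus(v1, v2):
      # v1 and v2 arrive as pandas.Series of equal length; the returned
      # pandas.Series must have the same length and the declared element type.
      return v1 + v2

  df.withColumn("sum", plus(df.v1, df.v2)).show()
  spark.stop()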



The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical 
reasons.

Thanks!

--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin



--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: Updates on migration guides

2017-08-31 Thread Felix Cheung
+1

I think we do migration guide changes for ML and R in separate JIRAs/PRs/commits, but 
we definitely should have them updated before the release.


From: linguin@gmail.com 
Sent: Wednesday, August 30, 2017 8:27:17 AM
To: Dongjoon Hyun
Cc: Xiao Li; u...@spark.apache.org
Subject: Re: Updates on migration guides

+1

2017/08/31 0:02、Dongjoon Hyun 
> のメッセージ:

+1

On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li 
> wrote:
Hi, Devs,

Many questions from the open source community are actually caused by the 
behavior changes we made in each release. So far, the migration guides (e.g., 
https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide)
 were not being properly updated. In the last few releases, multiple behavior 
changes were not documented in the migration guides or even the release notes. I 
propose doing the documentation updates in the same PRs that introduce the behavior 
changes. If the contributors can't make it, the committers who merge the PRs 
need to do it instead. We can also create a dedicated page for the migration guides 
of all the components. Hopefully, this can assist the migration efforts.

Thanks,

Xiao Li



Re: Cutting the RC for Spark 2.2.1 release

2017-11-13 Thread Felix Cheung
Quick update:

We merged 6 fixes Friday and 7 fixes today (thanks!); since some were 
hand-merged, I’m waiting for clean builds from Jenkins and passing tests. As of 
now it looks like we need to take one more fix for Scala 2.10.

With any luck we should be tagging for build tomorrow morning (PT).

There should not be any issue targeting 2.2.1 except for SPARK-22042. As it is 
not a regression and it seems it might take a while, we won’t be blocking the 
release.

_
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Wednesday, November 8, 2017 3:57 PM
Subject: Cutting the RC for Spark 2.2.1 release
To: <dev@spark.apache.org<mailto:dev@spark.apache.org>>


Hi!

As we are closing in on the few known issues, I think we are ready to tag and 
cut the 2.2.1 release.

If you are aware of any issue that you think should go into this release please 
feel free to ping me and mark the JIRA as targeting 2.2.1. I will be scrubbing 
JIRA in the next few days.

So unless we hear otherwise, I’m going to tag and build the RC starting 
Saturday EOD (PT). Please be patient since I’m going to be new at this :) but 
will keep the dev@ posted for any update.

Yours
RM for 2.2.1






[VOTE] Spark 2.2.1 (RC1)

2017-11-14 Thread Felix Cheung
Please vote on releasing the following candidate as Apache Spark version
2.2.1. The vote is open until Monday November 20, 2017 at 23:00 UTC and
passes if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.2.1

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/


The tag to be voted on is v2.2.1-rc1
https://github.com/apache/spark/tree/v2.2.1-rc1
(41116ab7fca46db7255b01e8727e2e5d571a3e35)

List of JIRA tickets resolved in this release can be found here
https://issues.apache.org/jira/projects/SPARK/versions/12340470


The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc1-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1256/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc1-docs/_site/index.html


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install the
current RC and see if anything important breaks; in Java/Scala you can
add the staging repository to your project's resolvers and test with the RC
(make sure to clean up the artifact cache before/after so you don't end up
building with an out-of-date RC going forward).
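
(For a concrete illustration, here is a minimal PySpark smoke test one might run 
after pip-installing this RC into a virtualenv; the trivial job and expected values 
below are just an example, not an official test.)

  from pyspark.sql import SparkSession

  # Quick sanity check of the pip-installed RC: start a local session,
  # run a trivial DataFrame job, and print the version that was picked up.
  spark = SparkSession.builder.master("local[2]").appName("rc-smoke-test").getOrCreate()
  df = spark.range(100).selectExpr("id", "id * 2 AS doubled")
  assert df.count() == 100
  assert df.agg({"doubled": "max"}).first()[0] == 198
  print("Tested against Spark", spark.version)
  spark.stop()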

*What should happen to JIRA tickets still targeting 2.2.1?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.2.2.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.2.0. That being said, if
there is something which is a regression from 2.2.0 that has not been
correctly targeted, please ping a committer to help target the issue (you
can see the open issues listed as impacting Spark 2.2.1 / 2.2.2 here).

What are the unresolved issues targeted for 2.2.1?

At the time of writing, there is one resolved issue, SPARK-22471, which would help
stability, and one in progress on joins, SPARK-22042.



Re: [VOTE] Spark 2.2.1 (RC1)

2017-11-15 Thread Felix Cheung
Thanks Xiao. Please continue to merge them to branch-2.2 and tag them with
Target Version 2.2.2.

They look to be fairly isolated; please continue to test this RC1 as much
as possible, and I think we should hold off on rolling another RC till Sunday.


On Wed, Nov 15, 2017 at 2:15 PM Xiao Li <gatorsm...@gmail.com> wrote:

> Another issue https://issues.apache.org/jira/browse/SPARK-22479 is also
> critical for security. We should also merge it to 2.2.1?
>
> 2017-11-15 9:12 GMT-08:00 Xiao Li <gatorsm...@gmail.com>:
>
>> Hi, Felix,
>>
>> https://issues.apache.org/jira/browse/SPARK-22469
>>
>> Maybe also include this regression of 2.2? It works in 2.1
>>
>> Thanks,
>>
>> Xiao
>>
>>
>>
>> 2017-11-14 22:25 GMT-08:00 Felix Cheung <felixche...@apache.org>:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.2.1. The vote is open until Monday November 20, 2017 at 23:00 UTC and
>>> passes if a majority of at least 3 PMC +1 votes are cast.
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.1
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>>
>>> The tag to be voted on is v2.2.1-rc1
>>> https://github.com/apache/spark/tree/v2.2.1-rc1
>>> (41116ab7fca46db7255b01e8727e2e5d571a3e35)
>>>
>>> List of JIRA tickets resolved in this release can be found here
>>> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>>>
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1256/
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc1-docs/_site/index.html
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala you
>>> can add the staging repository to your projects resolvers and test with the
>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>> up building with a out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.1?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.2.2.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.2.0. That being said if
>>> there is something which is a regression form 2.2.0 that has not been
>>> correctly targeted please ping a committer to help target the issue (you
>>> can see the open issues listed as impacting Spark 2.2.1 / 2.2.2 here
>>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.2.1%20OR%20affectedVersion%20%3D%202.2.2)>
>>> .
>>>
>>> What are the unresolved issues targeted for 2.2.1
>>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.1>
>>> ?
>>>
>>> At the time of the writing, there is one resolved SPARK-22471
>>> <https://issues.apache.org/jira/browse/SPARK-22471> would help
>>> stability, and one in progress on joins SPARK-22042
>>> <https://issues.apache.org/jira/browse/SPARK-22042>
>>>
>>>
>>>
>>>
>>
>


Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-28 Thread Felix Cheung
+1

Thanks Sean. Please vote!

Tested various scenarios with the R package on Ubuntu, Debian, and Windows
(both r-devel and release), and on r-hub. Verified CRAN checks are clean (only 1 NOTE!)
and no leaked files (.cache removed, /tmp clean).


On Sun, Nov 26, 2017 at 11:55 AM Sean Owen <so...@cloudera.com> wrote:

> Yes it downloads recent releases. The test worked for me on a second try,
> so I suspect a bad mirror. If this comes up frequently we can just add
> retry logic, as the closer.lua script will return different mirrors each
> time.
>
> The tests all pass for me on the latest Debian, so +1 for this release.
>
> (I committed the change to set -Xss4m for tests consistently, but this
> shouldn't block a release.)
>
>
> On Sat, Nov 25, 2017 at 12:47 PM Felix Cheung <felixche...@apache.org>
> wrote:
>
>> Ah sorry digging through the history it looks like this is changed
>> relatively recently and should only download previous releases.
>>
>> Perhaps we are intermittently hitting a mirror that doesn’t have the
>> files?
>>
>>
>>
>> https://github.com/apache/spark/commit/daa838b8886496e64700b55d1301d348f1d5c9ae
>>
>>
>> On Sat, Nov 25, 2017 at 10:36 AM Felix Cheung <felixche...@apache.org>
>> wrote:
>>
>>> Thanks Sean.
>>>
>>> For the second one, it looks like the
>>>  HiveExternalCatalogVersionsSuite is trying to download the release tgz
>>> from the official Apache mirror, which won’t work unless the release is
>>> actually, released?
>>>
>>> val preferredMirror =
>>>   Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true", "-q", "-O", "-").!!.trim
>>> val url = s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
>>>
>>> It’s probably getting an error page instead.
>>>
>>>
>>> On Sat, Nov 25, 2017 at 10:28 AM Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> I hit the same StackOverflowError as in the previous RC test, but,
>>>> pretty sure this is just because the increased thread stack size JVM flag
>>>> isn't applied consistently. This seems to resolve it:
>>>>
>>>> https://github.com/apache/spark/pull/19820
>>>>
>>>> This wouldn't block release IMHO.
>>>>
>>>>
>>>> I am currently investigating this failure though -- seems like the
>>>> mechanism that downloads Spark tarballs needs fixing, or updating, in the
>>>> 2.2 branch?
>>>>
>>>> HiveExternalCatalogVersionsSuite:
>>>>
>>>> gzip: stdin: not in gzip format
>>>>
>>>> tar: Child returned status 1
>>>>
>>>> tar: Error is not recoverable: exiting now
>>>>
>>>> *** RUN ABORTED ***
>>>>
>>>>   java.io.IOException: Cannot run program "./bin/spark-submit" (in
>>>> directory "/tmp/test-spark/spark-2.0.2"): error=2, No such file or 
>>>> directory
>>>>
>>>> On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung <felixche...@apache.org>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.2.1. The vote is open until Friday December 1, 2017 at
>>>>> 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 votes
>>>>> are cast.
>>>>>
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.2.1
>>>>>
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>>
>>>>> The tag to be voted on is v2.2.1-rc2
>>>>> https://github.com/apache/spark/tree/v2.2.1-rc2  (
>>>>> e30e2698a2193f0bbdcd4edb884710819ab6397c)
>>>>>
>>>>> List of JIRA tickets resolved in this release can be found here
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>>>>>
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>

Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-25 Thread Felix Cheung
Thanks Sean.

For the second one, it looks like the  HiveExternalCatalogVersionsSuite is
trying to download the release tgz from the official Apache mirror, which
won’t work unless the release is actually, released?

val preferredMirror =
  Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true", "-q", "-O", "-").!!.trim
val url = s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"

It’s probably getting an error page instead.
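
(Purely illustrative: a rough sketch, in Python for brevity, of the kind of 
mirror-retry-and-archive-fallback download logic being discussed in this thread; 
the actual suite is Scala, and the archive.apache.org fallback location is an 
assumption of this sketch.)

  import subprocess

  def download_spark(version, dest, attempts=3):
      # Ask closer.lua for a (possibly different) mirror on each attempt,
      # falling back to the Apache archive if the mirrors keep failing.
      mirror_query = "https://www.apache.org/dyn/closer.lua?preferred=true"
      tgz = f"spark-{version}/spark-{version}-bin-hadoop2.7.tgz"
      for _ in range(attempts):
          mirror = subprocess.check_output(
              ["wget", mirror_query, "-q", "-O", "-"]).decode().strip()
          if subprocess.call(["wget", "-q", "-O", dest, f"{mirror}/spark/{tgz}"]) == 0:
              return dest
      # Last resort: the archive keeps every released version.
      subprocess.check_call(
          ["wget", "-q", "-O", dest, f"https://archive.apache.org/dist/spark/{tgz}"])
      return dest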


On Sat, Nov 25, 2017 at 10:28 AM Sean Owen <so...@cloudera.com> wrote:

> I hit the same StackOverflowError as in the previous RC test, but, pretty
> sure this is just because the increased thread stack size JVM flag isn't
> applied consistently. This seems to resolve it:
>
> https://github.com/apache/spark/pull/19820
>
> This wouldn't block release IMHO.
>
>
> I am currently investigating this failure though -- seems like the
> mechanism that downloads Spark tarballs needs fixing, or updating, in the
> 2.2 branch?
>
> HiveExternalCatalogVersionsSuite:
>
> gzip: stdin: not in gzip format
>
> tar: Child returned status 1
>
> tar: Error is not recoverable: exiting now
>
> *** RUN ABORTED ***
>
>   java.io.IOException: Cannot run program "./bin/spark-submit" (in
> directory "/tmp/test-spark/spark-2.0.2"): error=2, No such file or directory
>
> On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung <felixche...@apache.org>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.1. The vote is open until Friday December 1, 2017 at 8:00:00 am UTC
>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.2.1
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>>
>> The tag to be voted on is v2.2.1-rc2
>> https://github.com/apache/spark/tree/v2.2.1-rc2  (
>> e30e2698a2193f0bbdcd4edb884710819ab6397c)
>>
>> List of JIRA tickets resolved in this release can be found here
>> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>>
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1257/
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-docs/_site/index.html
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install the
>> current RC and see if anything important breaks, in the Java/Scala you can
>> add the staging repository to your projects resolvers and test with the RC
>> (make sure to clean up the artifact cache before/after so you don't end up
>> building with a out of date RC going forward).
>>
>> *What should happen to JIRA tickets still targeting 2.2.1?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.2.2.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.2.0. That being said if
>> there is something which is a regression form 2.2.0 that has not been
>> correctly targeted please ping a committer to help target the issue (you
>> can see the open issues listed as impacting Spark 2.2.1 / 2.2.2 here
>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.2.1%20OR%20affectedVersion%20%3D%202.2.2)>
>> .
>>
>> *What are the unresolved issues targeted for 2.2.1
>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.1>?*
>>
>> At the time of the writing, there is one intermited failure SPARK-20201
>> <https://issues.apache.org/jira/browse/SPARK-20201> which we are
>> tracking since 2.2.0.
>>
>>


Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-25 Thread Felix Cheung
Ah, sorry; digging through the history, it looks like this was changed
relatively recently and should only download previous releases.

Perhaps we are intermittently hitting a mirror that doesn’t have the files?


https://github.com/apache/spark/commit/daa838b8886496e64700b55d1301d348f1d5c9ae


On Sat, Nov 25, 2017 at 10:36 AM Felix Cheung <felixche...@apache.org>
wrote:

> Thanks Sean.
>
> For the second one, it looks like the  HiveExternalCatalogVersionsSuite is
> trying to download the release tgz from the official Apache mirror, which
> won’t work unless the release is actually, released?
>
> val preferredMirror =
>   Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true", "-q", "-O", "-").!!.trim
> val url = s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
>
> It’s probably getting an error page instead.
>
>
> On Sat, Nov 25, 2017 at 10:28 AM Sean Owen <so...@cloudera.com> wrote:
>
>> I hit the same StackOverflowError as in the previous RC test, but, pretty
>> sure this is just because the increased thread stack size JVM flag isn't
>> applied consistently. This seems to resolve it:
>>
>> https://github.com/apache/spark/pull/19820
>>
>> This wouldn't block release IMHO.
>>
>>
>> I am currently investigating this failure though -- seems like the
>> mechanism that downloads Spark tarballs needs fixing, or updating, in the
>> 2.2 branch?
>>
>> HiveExternalCatalogVersionsSuite:
>>
>> gzip: stdin: not in gzip format
>>
>> tar: Child returned status 1
>>
>> tar: Error is not recoverable: exiting now
>>
>> *** RUN ABORTED ***
>>
>>   java.io.IOException: Cannot run program "./bin/spark-submit" (in
>> directory "/tmp/test-spark/spark-2.0.2"): error=2, No such file or directory
>>
>> On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung <felixche...@apache.org>
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.2.1. The vote is open until Friday December 1, 2017 at 8:00:00 am UTC
>>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.1
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>>
>>> The tag to be voted on is v2.2.1-rc2
>>> https://github.com/apache/spark/tree/v2.2.1-rc2  (
>>> e30e2698a2193f0bbdcd4edb884710819ab6397c)
>>>
>>> List of JIRA tickets resolved in this release can be found here
>>> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>>>
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1257/
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-docs/_site/index.html
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala you
>>> can add the staging repository to your projects resolvers and test with the
>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>> up building with a out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.1?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.2.2.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless t

[VOTE] Spark 2.2.1 (RC2)

2017-11-24 Thread Felix Cheung
Please vote on releasing the following candidate as Apache Spark version
2.2.1. The vote is open until Friday December 1, 2017 at 8:00:00 am UTC and
passes if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.2.1

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/


The tag to be voted on is v2.2.1-rc2 https://github.com/apache/spark/tree/v2.2.1-rc2  (e30e2698a2193f0bbdcd4edb884710819ab6397c)

List of JIRA tickets resolved in this release can be found here
https://issues.apache.org/jira/projects/SPARK/versions/12340470


The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1257/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-docs/_site/index.html


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install the
current RC and see if anything important breaks; in Java/Scala you can
add the staging repository to your project's resolvers and test with the RC
(make sure to clean up the artifact cache before/after so you don't end up
building with an out-of-date RC going forward).

*What should happen to JIRA tickets still targeting 2.2.1?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.2.2.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.2.0. That being said, if
there is something which is a regression from 2.2.0 that has not been
correctly targeted, please ping a committer to help target the issue (you
can see the open issues listed as impacting Spark 2.2.1 / 2.2.2 here).

*What are the unresolved issues targeted for 2.2.1?*

At the time of writing, there is one intermittent failure, SPARK-20201, which we
have been tracking since 2.2.0.


[RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-01 Thread Felix Cheung
This vote passes. Thanks everyone for testing this release.


+1:

Sean Owen (binding)

Herman van Hövell tot Westerflier (binding)

Wenchen Fan (binding)

Shivaram Venkataraman (binding)

Felix Cheung

Henry Robinson

Hyukjin Kwon

Dongjoon Hyun

Kazuaki Ishizaki

Holden Karau

Weichen Xu


0: None

-1: None




On Wed, Nov 29, 2017 at 3:21 PM Weichen Xu <weichen...@databricks.com>
wrote:

> +1
>
> On Thu, Nov 30, 2017 at 6:27 AM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> +1
>>
>> SHA, MD5 and signatures look fine. Built and ran Maven tests on my
>> Macbook.
>>
>> Thanks
>> Shivaram
>>
>> On Wed, Nov 29, 2017 at 10:43 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> PySpark install into a virtualenv works, PKG-INFO looks correctly
>>> populated (mostly checking for the pypandoc conversion there).
>>>
>>> Thanks for your hard work Felix (and all of the testers :)) :)
>>>
>>> On Wed, Nov 29, 2017 at 9:33 AM, Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Nov 30, 2017 at 1:28 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests
>>>>> for core/sql-core/sql-catalyst/mllib/mllib-local have passed.
>>>>>
>>>>> $ java -version
>>>>> openjdk version "1.8.0_131"
>>>>> OpenJDK Runtime Environment (build
>>>>> 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
>>>>> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>>>>>
>>>>> % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7
>>>>> -T 24 clean package install
>>>>> % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl
>>>>> core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
>>>>> ...
>>>>> Run completed in 13 minutes, 54 seconds.
>>>>> Total number of tests run: 1118
>>>>> Suites: completed 170, aborted 0
>>>>> Tests: succeeded 1118, failed 0, canceled 0, ignored 6, pending 0
>>>>> All tests passed.
>>>>> [INFO]
>>>>> 
>>>>> [INFO] Reactor Summary:
>>>>> [INFO]
>>>>> [INFO] Spark Project Core . SUCCESS
>>>>> [17:13 min]
>>>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>>>>  6.065 s]
>>>>> [INFO] Spark Project Catalyst . SUCCESS
>>>>> [11:51 min]
>>>>> [INFO] Spark Project SQL .. SUCCESS
>>>>> [17:55 min]
>>>>> [INFO] Spark Project ML Library ....... SUCCESS
>>>>> [17:05 min]
>>>>> [INFO]
>>>>> 
>>>>> [INFO] BUILD SUCCESS
>>>>> [INFO]
>>>>> 
>>>>> [INFO] Total time: 01:04 h
>>>>> [INFO] Finished at: 2017-11-30T01:48:15+09:00
>>>>> [INFO] Final Memory: 128M/329M
>>>>> [INFO]
>>>>> 
>>>>> [WARNING] The requested profile "hive" could not be activated because
>>>>> it does not exist.
>>>>>
>>>>> Kazuaki Ishizaki
>>>>>
>>>>>
>>>>>
>>>>> From:Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> To:Hyukjin Kwon <gurwls...@gmail.com>
>>>>> Cc:Spark dev list <dev@spark.apache.org>, Felix Cheung <
>>>>> felixche...@apache.org>, Sean Owen <so...@cloudera.com>
>>>>> Date:2017/11/29 12:56
>>>>> Subject:Re: [VOTE] Spark 2.2.1 (RC2)
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> RC2 is tested on CentOS, too.
>>>>>
>>>>> Bests,
>>>>> Dongjoon

Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-14 Thread Felix Cheung
;)
The credentials for the user that publishes to PyPI are PMC-only.

+Holden

Had discussed this in the other thread I sent to private@ last week.


On Thu, Dec 14, 2017 at 4:34 AM Sean Owen <so...@cloudera.com> wrote:

> On the various access questions here -- what do you need to have that
> access? We definitely need to give you all necessary access if you're the
> release manager!
>
>
> On Thu, Dec 14, 2017 at 6:32 AM Felix Cheung <felixche...@apache.org>
> wrote:
>
>> And I don’t have access to publish python.
>>
>> On Wed, Dec 13, 2017 at 9:55 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> The R artifacts have some issue that Felix and I are debugging. Let's not
>>> block the announcement for that.
>>>
>>>
>>>


Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-14 Thread Felix Cheung
And I don’t have access to publish python.

On Wed, Dec 13, 2017 at 9:55 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> The R artifacts have some issue that Felix and I are debugging. Let's not
> block the announcement for that.
>
> Thanks
>
> Shivaram
>
> On Wed, Dec 13, 2017 at 5:59 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Looks like Maven artifacts are up, site's up -- what about the Python and
>> R artifacts?
>> I can also move the spark.apache/docs/latest link to point to 2.2.1 if
>> it's pretty ready.
>> We should announce the release officially too then.
>>
>> On Wed, Dec 6, 2017 at 5:00 PM Felix Cheung <felixche...@apache.org>
>> wrote:
>>
>>> I saw the svn move on Monday so I’m working on the website updates.
>>>
>>> I will look into maven today. I will ask if I couldn’t do it.
>>>
>>>
>>> On Wed, Dec 6, 2017 at 10:49 AM Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Pardon, did this release finish? I don't see it in Maven. I know there
>>>> was some question about getting a hand in finishing the release process,
>>>> including copying artifacts in svn. Was there anything else you're waiting
>>>> on someone to do?
>>>>
>>>>
>>>> On Fri, Dec 1, 2017 at 2:10 AM Felix Cheung <felixche...@apache.org>
>>>> wrote:
>>>>
>>>>> This vote passes. Thanks everyone for testing this release.
>>>>>
>>>>>
>>>>> +1:
>>>>>
>>>>> Sean Owen (binding)
>>>>>
>>>>> Herman van Hövell tot Westerflier (binding)
>>>>>
>>>>> Wenchen Fan (binding)
>>>>>
>>>>> Shivaram Venkataraman (binding)
>>>>>
>>>>> Felix Cheung
>>>>>
>>>>> Henry Robinson
>>>>>
>>>>> Hyukjin Kwon
>>>>>
>>>>> Dongjoon Hyun
>>>>>
>>>>> Kazuaki Ishizaki
>>>>>
>>>>> Holden Karau
>>>>>
>>>>> Weichen Xu
>>>>>
>>>>>
>>>>> 0: None
>>>>>
>>>>> -1: None
>>>>>
>>>>
>


Re: [VOTE] Spark 2.2.1 (RC1)

2017-11-17 Thread Felix Cheung
I wasn’t able to test this out.

Is anyone else seeing this error? I see a few JVM fixes getting backported;
are they related to this?

This issue seems important enough to hold any update until we know more.

On Wed, Nov 15, 2017 at 7:01 PM Sean Owen <so...@cloudera.com> wrote:

> The signature is fine, with your new sig. Updated hashes look fine too.
> LICENSE is still fine to my knowledge.
>
> Is anyone else seeing this failure?
>
> - GenerateOrdering with ShortType
> *** RUN ABORTED ***
> java.lang.StackOverflowError:
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370)
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
> at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541)
>
> This looks like SPARK-16845 again; see
> https://issues.apache.org/jira/browse/SPARK-16845?focusedCommentId=16018840=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16018840
>
>
> On Wed, Nov 15, 2017 at 12:25 AM Felix Cheung <felixche...@apache.org>
> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.2.1. The vote is open until Monday November 20, 2017 at 23:00 UTC and
> passes if a majority of at least 3 PMC +1 votes are cast.
>
>
>
>
> [ ] +1 Release this package as Apache Spark 2.2.1
>
>
> [ ] -1 Do not release this package because ...
>
>
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
>
>
>
> The tag to be voted on is v2.2.1-rc1
> https://github.com/apache/spark/tree/v2.2.1-rc1
> (41116ab7fca46db7255b01e8727e2e5d571a3e35)
>
>
> List of JIRA tickets resolved in this release can be found here
> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>
>
>
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc1-bin/
>
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1256/
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc1-docs/_site/index.html
>
>
>
>
> FAQ
>
>
> How can I help test this release?
>
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks, in the Java/Scala you can
> add the staging repository to your projects resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with a out of date RC going forward).
>
>
> What should happen to JIRA tickets still targeting 2.2.1?
>
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.2.2.
>
>
> But my bug isn't fixed!??!
>
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.2.0. That being said if
> there is something which is a regression form 2.2.0 that has not been
> correctly targeted please ping a committer to help target the issue (you
> can see the open issues listed as impacting Spark 2.2.1 / 2.2.2 here
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.2.1%20OR%20affectedVersion%20%3D%202.2.2)>
> .
>
>
> What are the unresolved issues targeted for 2.2.1
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.1>
> ?
>
>
> At the time of the writing, there is one resolved SPARK-22471
> <https://issues.apache.org/jira/browse/SPARK-22471> would help stability,
> and one in progress on joins SPARK-22042
> <https://issues.apache.org/jira/browse/SPARK-22042>
>
>
>
>
>


Re: Cutting the RC for Spark 2.2.1 release

2017-11-14 Thread Felix Cheung
Now I’m seeing an error when closing the Nexus staging repository.

staged_repo_id=orgapachespark-1254

< HTTP/1.1 401 Unauthorized
< Date: Tue, 14 Nov 2017 12:32:57 GMT
< Server: Nexus/2.13.0-01
< X-Frame-Options: SAMEORIGIN
< X-Content-Type-Options: nosniff
* Authentication problem. Ignoring this.
< WWW-Authenticate: BASIC realm="Sonatype Nexus Repository Manager API"
< Content-Length: 0
< Via: 1.1 repository.apache.org

Does working with Nexus require special permission in LDAP?
I couldn’t log in to the web interface at repository.apache.org either.
____
From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Monday, November 13, 2017 11:23:44 AM
To: Sean Owen
Cc: Holden Karau; dev@spark.apache.org
Subject: Re: Cutting the RC for Spark 2.2.1 release

Ouch ;) yes that works and RC1 is tagged.



From: Sean Owen <so...@cloudera.com>
Sent: Monday, November 13, 2017 10:54:48 AM
To: Felix Cheung
Cc: Holden Karau; dev@spark.apache.org
Subject: Re: Cutting the RC for Spark 2.2.1 release

It's repo.maven.apache.org ?

On Mon, Nov 13, 2017 at 12:52 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
I did change it, but getting unknown host?

[ERROR] Non-resolvable parent POM for 
org.apache.spark:spark-parent_2.11:2.2.1-SNAPSHOT: Could not transfer artifact 
org.apache:apache:pom:14 from/to central (https://repo.maven.org/maven2): 
repo.maven.org: Name or service not known and 'parent.relativePath' points at 
wrong local POM @ line 22, column 11: Unknown host repo.maven.org: Name or 
service not known -> [Help 2]




Re: Cutting the RC for Spark 2.2.1 release

2017-11-13 Thread Felix Cheung
Anything that builds with Maven on a clean machine.
It couldn’t connect to the Maven Central repo.


From: Holden Karau <hol...@pigscanfly.ca>
Sent: Monday, November 13, 2017 10:38:03 AM
To: Felix Cheung
Cc: dev@spark.apache.org
Subject: Re: Cutting the RC for Spark 2.2.1 release

Which script is this from?

On Mon, Nov 13, 2017 at 10:37 AM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Build/test looks good but I’m hitting a new issue with sonatype when tagging

"Host name 'repo1.maven.org<http://repo1.maven.org>' does not match the 
certificate subject provided by the peer 
(CN=repo.maven.apache.org<http://repo.maven.apache.org>, O="Sonatype, Inc", 
L=Fulton, ST=MD, C=US)"

https://issues.sonatype.org/browse/MVNCENTRAL-1369

Stay tuned.

____
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Monday, November 13, 2017 12:00:41 AM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Cutting the RC for Spark 2.2.1 release

Quick update:

We merged 6 fixes Friday and 7 fixes today (thanks!), since some are 
hand-merged I’m waiting for clean builds from Jenkins and test passes. As of 
now it looks like we need to take one more fix for Scala 2.10.

With any luck we should be tagging for build tomorrow morning (PT).

There should not be any issue targeting 2.2.1 except for SPARK-22042. As it is 
not a regression and it seems it might take a while, we won’t be blocking the 
release.

_
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Wednesday, November 8, 2017 3:57 PM
Subject: Cutting the RC for Spark 2.2.1 release
To: <dev@spark.apache.org<mailto:dev@spark.apache.org>>


Hi!

As we are closing down on the few known issues I think we are ready to tag and 
cut the 2.2.1 release.

If you are aware of any issue that you think should go into this release please 
feel free to ping me and mark the JIRA as targeting 2.2.1. I will be scrubbing 
JIRA in the next few days.

So unless we hear otherwise, I’m going to tag and build the RC starting 
Saturday EOD (PT). Please be patient since I’m going to be new at this :) but 
will keep the dev@ posted for any update.

Yours
RM for 2.2.1




--
Twitter: https://twitter.com/holdenkarau


Re: Cutting the RC for Spark 2.2.1 release

2017-11-13 Thread Felix Cheung
Ouch ;) yes that works and RC1 is tagged.



From: Sean Owen <so...@cloudera.com>
Sent: Monday, November 13, 2017 10:54:48 AM
To: Felix Cheung
Cc: Holden Karau; dev@spark.apache.org
Subject: Re: Cutting the RC for Spark 2.2.1 release

It's repo.maven.apache.org<http://repo.maven.apache.org> ?

On Mon, Nov 13, 2017 at 12:52 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
I did change it, but getting unknown host?

[ERROR] Non-resolvable parent POM for 
org.apache.spark:spark-parent_2.11:2.2.1-SNAPSHOT: Could not transfer artifact 
org.apache:apache:pom:14 from/to central (https://repo.maven.org/maven2): 
repo.maven.org<http://repo.maven.org>: Name or service not known and 
'parent.relativePath' points at wrong local POM @ line 22, column 11: Unknown 
host repo.maven.org<http://repo.maven.org>: Name or service not known -> [Help 
2]




Re: Cutting the RC for Spark 2.2.1 release

2017-11-13 Thread Felix Cheung
Build/test looks good but I’m hitting a new issue with sonatype when tagging

"Host name 'repo1.maven.org' does not match the certificate subject provided by 
the peer (CN=repo.maven.apache.org, O="Sonatype, Inc", L=Fulton, ST=MD, C=US)"

https://issues.sonatype.org/browse/MVNCENTRAL-1369

Stay tuned.

________
From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Monday, November 13, 2017 12:00:41 AM
To: dev@spark.apache.org
Subject: Re: Cutting the RC for Spark 2.2.1 release

Quick update:

We merged 6 fixes Friday and 7 fixes today (thanks!), since some are 
hand-merged I’m waiting for clean builds from Jenkins and test passes. As of 
now it looks like we need to take one more fix for Scala 2.10.

With any luck we should be tagging for build tomorrow morning (PT).

There should not be any issue targeting 2.2.1 except for SPARK-22042. As it is 
not a regression and it seems it might take a while, we won’t be blocking the 
release.

_________
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Wednesday, November 8, 2017 3:57 PM
Subject: Cutting the RC for Spark 2.2.1 release
To: <dev@spark.apache.org<mailto:dev@spark.apache.org>>


Hi!

As we are closing down on the few known issues I think we are ready to tag and 
cut the 2.2.1 release.

If you are aware of any issue that you think should go into this release please 
feel free to ping me and mark the JIRA as targeting 2.2.1. I will be scrubbing 
JIRA in the next few days.

So unless we hear otherwise, I’m going to tag and build the RC starting 
Saturday EOD (PT). Please be patient since I’m going to be new at this :) but 
will keep the dev@ posted for any update.

Yours
RM for 2.2.1






Re: Cutting the RC for Spark 2.2.1 release

2017-11-13 Thread Felix Cheung
I did change it, but getting unknown host?

[ERROR] Non-resolvable parent POM for 
org.apache.spark:spark-parent_2.11:2.2.1-SNAPSHOT: Could not transfer artifact 
org.apache:apache:pom:14 from/to central (https://repo.maven.org/maven2): 
repo.maven.org<http://repo.maven.org>: Name or service not known and 
'parent.relativePath' points at wrong local POM @ line 22, column 11: Unknown 
host repo.maven.org<http://repo.maven.org>: Name or service not known -> [Help 
2]

_
From: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>
Sent: Monday, November 13, 2017 10:48 AM
Subject: Re: Cutting the RC for Spark 2.2.1 release
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: Holden Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>, 
<dev@spark.apache.org<mailto:dev@spark.apache.org>>


I'm not seeing a problem building, myself. However we could change the location 
of the Maven Repository in our POM to https://repo.maven.apache.org/maven2/ 
without any consequence.

The only reason we overrode it was to force it to use HTTPS which still doesn't 
look like the default (!): 
https://maven.apache.org/guides/introduction/introduction-to-the-pom.html#Super_POM

On a related note, we could also update the POM to inherit from the latest 
Apache parent POM, while we're at it, to get the latest declarations relevant 
to the ASF. Doesn't need to happen in 2.2.x

On Mon, Nov 13, 2017 at 12:39 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Anything to build with maven on a clean machine.
It couldn’t connect to maven central repo.





Re: [CANCEL] Spark 2.2.1 (RC1)

2017-11-19 Thread Felix Cheung
This vote is cancelled due to no vote.

I’m going to test or track down a few issues (please see link below for
those targeting this release) and roll RC2 in a few days if we could make
progress.


On Tue, Nov 14, 2017 at 10:25 PM Felix Cheung <felixche...@apache.org>
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.2.1. The vote is open until Monday November 20, 2017 at 23:00 UTC and
> passes if a majority of at least 3 PMC +1 votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.2.1
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
>
> The tag to be voted on is v2.2.1-rc1
> https://github.com/apache/spark/tree/v2.2.1-rc1
> (41116ab7fca46db7255b01e8727e2e5d571a3e35)
>
> List of JIRA tickets resolved in this release can be found here
> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1256/
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc1-docs/_site/index.html
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks; in Java/Scala you can
> add the staging repository to your project's resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with an out-of-date RC going forward).
>
> *What should happen to JIRA tickets still targeting 2.2.1?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.2.2.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.2.0. That being said, if
> there is something which is a regression from 2.2.0 that has not been
> correctly targeted, please ping a committer to help target the issue (you
> can see the open issues listed as impacting Spark 2.2.1 / 2.2.2 here
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.2.1%20OR%20affectedVersion%20%3D%202.2.2)>
> .
>
> What are the unresolved issues targeted for 2.2.1
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.1>
> ?
>
> At the time of writing, there is one resolved issue, SPARK-22471
> <https://issues.apache.org/jira/browse/SPARK-22471>, that would help stability,
> and one in progress on joins, SPARK-22042
> <https://issues.apache.org/jira/browse/SPARK-22042>
>
>
>
>


Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Felix Cheung
For the 2.2.1, we are still working through a few bugs. Hopefully it won't be 
long.



From: Kevin Grealish <kevin...@microsoft.com>
Sent: Thursday, November 2, 2017 9:51:56 AM
To: Felix Cheung; Sean Owen; Holden Karau
Cc: dev@spark.apache.org
Subject: RE: Kicking off the process around Spark 2.2.1

Any update on expected 2.2.1 (or 2.3.0) release process?

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Thursday, October 26, 2017 10:04 AM
To: Sean Owen <so...@cloudera.com>; Holden Karau <hol...@pigscanfly.ca>
Cc: dev@spark.apache.org
Subject: Re: Kicking off the process around Spark 2.2.1

Yes! I can take on RM for 2.2.1.

We are still working out what to do with temp files created by Hive and Java 
that cause the policy issue with CRAN and will report back shortly, hopefully.


From: Sean Owen <so...@cloudera.com<mailto:so...@cloudera.com>>
Sent: Wednesday, October 25, 2017 4:39:15 AM
To: Holden Karau
Cc: Felix Cheung; dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Kicking off the process around Spark 2.2.1

It would be reasonably consistent with the timing of other x.y.1 releases, and 
more release managers sounds useful, yeah.

Note also that in theory the code freeze for 2.3.0 starts in about 2 weeks.

On Wed, Oct 25, 2017 at 12:29 PM Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
Now that Spark 2.1.2 is out it seems like now is a good time to get started on 
the Spark 2.2.1 release. There are some streaming fixes I’m aware of that would 
be good to get into a release, is there anything else people are working on for 
2.2.1 we should be tracking?

To switch it up I’d like to suggest Felix to be the RM for this since there are 
also likely some R packaging changes to be included in the release. This also 
gives us a chance to see if my updated release documentation is enough for a 
new RM to get started from.

What do folks think?
--
Twitter: https://twitter.com/holdenkarau


Re: Kicking off the process around Spark 2.2.1

2017-11-08 Thread Felix Cheung
Ok I think we are there, after getting rounds of fixes through in the last few 
weeks.

I'm going to kick off a separate thread on this to be absolutely clear.


From: holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau 
<hol...@pigscanfly.ca>
Sent: Thursday, November 2, 2017 12:47:13 PM
To: Reynold Xin
Cc: Felix Cheung; Sean Owen; dev@spark.apache.org
Subject: Re: Kicking off the process around Spark 2.2.1

I agree, except in this case we probably want some of the fixes that are going 
into the maintenance release to be present in the new feature release (like the 
CRAN issue).

On Thu, Nov 2, 2017 at 12:12 PM, Reynold Xin 
<r...@databricks.com<mailto:r...@databricks.com>> wrote:
Why tie a maintenance release to a feature release? They are supposed to be 
independent and we should be able to make a lot of maintenance releases as 
needed.

On Thu, Nov 2, 2017 at 7:13 PM Sean Owen 
<so...@cloudera.com<mailto:so...@cloudera.com>> wrote:
The feature freeze is "mid November" : 
http://spark.apache.org/versioning-policy.html
Let's say... Nov 15? Anybody have a better date?

Although it'd be nice to get 2.2.1 out sooner than later in all events, and 
kind of makes sense to get out first, they need not go in order. It just might 
be distracting to deal with 2 at once.

(BTW there was still one outstanding issue from the last release: 
https://issues.apache.org/jira/browse/SPARK-22401 )

On Thu, Nov 2, 2017 at 6:06 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
I think it will be great to set a feature freeze date for 2.3.0 first, as a 
minor release. There are a few new things that would be good to have, and then we 
will likely need time to stabilize, before cutting RCs.




--
Twitter: https://twitter.com/holdenkarau


Cutting the RC for Spark 2.2.1 release

2017-11-08 Thread Felix Cheung
Hi!

As we are closing down on the few known issues I think we are ready to tag and 
cut the 2.2.1 release.

If you are aware of any issue that you think should go into this release please 
feel free to ping me and mark the JIRA as targeting 2.2.1. I will be scrubbing 
JIRA in the next few days.

So unless we hear otherwise, I’m going to tag and build the RC starting 
Saturday EOD (PT). Please be patient since I’m going to be new at this :) but 
will keep the dev@ posted for any update.

Yours
RM for 2.2.1




Re: Cutting the RC for Spark 2.2.1 release

2017-11-08 Thread Felix Cheung
Thanks Dongjoon! I will track that.


From: Dongjoon Hyun <dongjoon.h...@gmail.com>
Sent: Wednesday, November 8, 2017 7:41:20 PM
To: Holden Karau
Cc: Felix Cheung; dev@spark.apache.org
Subject: Re: Cutting the RC for Spark 2.2.1 release

It's great, Felix!

As of today, `branch-2.2` seems to be broken due to SPARK-22211 (Scala UT 
failure) and SPARK-22417 (Python UT failure).
I pinged you at both.

Bests,
Dongjoon.


On Wed, Nov 8, 2017 at 5:51 PM, Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
Thanks for stepping up and running the 2.2.1 release :)

On Wed, Nov 8, 2017 at 3:57 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Hi!

As we are closing down on the few known issues I think we are ready to tag and 
cut the 2.2.1 release.

If you are aware of any issue that you think should go into this release please 
feel free to ping me and mark the JIRA as targeting 2.2.1. I will be scrubbing 
JIRA in the next few days.

So unless we hear otherwise, I’m going to tag and build the RC starting 
Saturday EOD (PT). Please be patient since I’m going to be new at this :) but 
will keep the dev@ posted for any update.

Yours
RM for 2.2.1


--
Twitter: https://twitter.com/holdenkarau



Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-06 Thread Felix Cheung
I saw the svn move on Monday so I’m working on the website updates.

I will look into Maven today. I will ask for help if I can’t do it.


On Wed, Dec 6, 2017 at 10:49 AM Sean Owen <so...@cloudera.com> wrote:

> Pardon, did this release finish? I don't see it in Maven. I know there was
> some question about getting a hand in finishing the release process,
> including copying artifacts in svn. Was there anything else you're waiting
> on someone to do?
>
>
> On Fri, Dec 1, 2017 at 2:10 AM Felix Cheung <felixche...@apache.org>
> wrote:
>
>> This vote passes. Thanks everyone for testing this release.
>>
>>
>> +1:
>>
>> Sean Owen (binding)
>>
>> Herman van Hövell tot Westerflier (binding)
>>
>> Wenchen Fan (binding)
>>
>> Shivaram Venkataraman (binding)
>>
>> Felix Cheung
>>
>> Henry Robinson
>>
>> Hyukjin Kwon
>>
>> Dongjoon Hyun
>>
>> Kazuaki Ishizaki
>>
>> Holden Karau
>>
>> Weichen Xu
>>
>>
>> 0: None
>>
>> -1: None
>>
>


Re: CRAN SparkR package removed?

2017-10-25 Thread Felix Cheung
Yes - unfortunately something was found after it was published and made 
available publicly.

We have a JIRA on this and are working on the best course of action.


_
From: Holden Karau
Sent: Wednesday, October 25, 2017 1:35 AM
Subject: CRAN SparkR package removed?
To: dev@spark.apache.org


Looking at https://cran.r-project.org/web/packages/SparkR/ it seems like the 
package has been removed. Any ideas what's up?

(Just asking since I'm working on the release e-mail and it was also mentioned 
in the keynote just now).

--
Twitter: https://twitter.com/holdenkarau




Re: Revisiting Online serving of Spark models?

2018-05-10 Thread Felix Cheung
Huge +1 on this!


From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* This architecture may require some awkward APIs currently to have model 
prediction logic in mllib-local, local model classes in mllib-local, and 
regular (DataFrame-friendly) model classes in mllib.  We might find it helpful 
to break some DeveloperApis in Spark 3.0 to facilitate this architecture while 
making it feasible for 3rd party developers to extend MLlib APIs (especially in 
Java).
I agree this could be interesting, and feed into the other discussion around 
when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to 
avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as 
important as per-Row transformations, but they would be helpful for batching 
for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau 
> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as 
any to revisit the online serving situation in Spark ML. DB & others have done 
some excellent work moving a lot of the necessary tools into a local linear 
algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, 
but currently our individual transform/predict methods are private, so they 
either need to copy or re-implement (or put themselves in org.apache.spark) to 
access them. How would folks feel about adding a new trait for ML pipeline 
stages that exposes transformation of single-element inputs (or local 
collections) and that could be optionally implemented by stages which support this? 
That way we can have less copy-and-paste code possibly getting out of sync with 
our model training.
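
To make the idea concrete, here is a minimal sketch of what such an opt-in
trait could look like. The names (LocalTransformer, transformRow) are purely
illustrative and not an existing Spark API:

    import org.apache.spark.sql.Row

    // Illustrative sketch only; not an existing Spark API.
    // A fitted pipeline stage that can score locally could mix this in, so
    // serving tools don't have to copy or re-implement model internals.
    trait LocalTransformer {
      // Transform one input Row using only local state (e.g. fitted coefficients).
      def transformRow(row: Row): Row

      // Convenience for small local batches, useful for higher-throughput serving.
      def transformLocal(rows: Seq[Row]): Seq[Row] = rows.map(transformRow)
    }

Stages that cannot serve locally would simply not mix the trait in, which keeps
the change optional and backwards compatible.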

I think continuing to have on-line serving grow in different projects is 
probably the right path forward (folks have different needs), but I'd love to 
see us make it simpler for other projects to build reliable serving tools.

I realize this maybe puts some of the folks in an awkward position with their 
own commercial offerings, but hopefully if we make it easier for everyone the 
commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com



--
Twitter: https://twitter.com/holdenkarau


Re: Running lint-java during PR builds?

2018-05-21 Thread Felix Cheung
One concern is with the volume of test runs on Travis.

In ASF projects Travis could get significantly
backed up since - if I recall - all of ASF shares one queue.

Given the number of PRs Spark has, this could be a big issue.



From: Marcelo Vanzin 
Sent: Monday, May 21, 2018 9:08:28 AM
To: Hyukjin Kwon
Cc: Dongjoon Hyun; dev
Subject: Re: Running lint-java during PR builds?

I'm fine with it. I tried to use the existing checkstyle sbt plugin
(trying to fix SPARK-22269), but it depends on an ancient version of
checkstyle, and I don't know sbt enough to figure out how to hack
classpaths and class loaders when applying rules, so gave up.

On Mon, May 21, 2018 at 1:47 AM, Hyukjin Kwon  wrote:
> I am going to open an INFRA JIRA if there's no explicit objection in few
> days.
>
> 2018-05-21 13:09 GMT+08:00 Hyukjin Kwon :
>>
>> I would like to revive this proposal: Travis CI. Shall we give this a try? I
>> think it's worth trying it.
>>
>> 2016-11-17 3:50 GMT+08:00 Dongjoon Hyun :
>>>
>>> Hi, Marcelo and Ryan.
>>>
>>> That was the main purpose of my proposal about Travis.CI.
>>> IMO, that is the only way to achieve that without any harmful side-effect
>>> on Jenkins infra.
>>>
>>> Spark is already ready for that. Like AppVeyor, if one of you files an
>>> INFRA jira issue to enable that, they will turn it on. Then, we can try it
>>> and see the result. Also, you can easily turn it off again if you don't want to.
>>>
>>> Without this, we will consume more community efforts. For example, we
>>> merged lint-java error fix PR seven hours ago, but the master branch still
>>> has one lint-java error.
>>>
>>> https://travis-ci.org/dongjoon-hyun/spark/jobs/176351319
>>>
>>> Actually, I've been monitoring the history here. (It's synced every 30
>>> minutes.)
>>>
>>> https://travis-ci.org/dongjoon-hyun/spark/builds
>>>
>>> Could we give a change to this?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On 2016-11-15 13:40 (-0800), "Shixiong(Ryan) Zhu"
>>>  wrote:
>>> > I remember it's because you need to run `mvn install` before running
>>> > lint-java if the maven cache is empty, and `mvn install` is pretty
>>> > heavy.
>>> >
>>> > On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin 
>>> > wrote:
>>> >
>>> > > Hey all,
>>> > >
>>> > > Is there a reason why lint-java is not run during PR builds? I see it
>>> > > seems to be maven-only, is it really expensive to run after an sbt
>>> > > build?
>>> > >
>>> > > I see a lot of PRs coming in to fix Java style issues, and those all
>>> > > seem a little unnecessary. Either we're enforcing style checks or
>>> > > we're not, and right now it seems we aren't.
>>> > >
>>> > > --
>>> > > Marcelo
>>> > >
>>> > > -
>>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >
>>> > >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>



--
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Revisiting Online serving of Spark models?

2018-05-21 Thread Felix Cheung
+1 on meeting up!


From: Holden Karau <hol...@pigscanfly.ca>
Sent: Monday, May 21, 2018 2:52:20 PM
To: Joseph Bradley
Cc: Felix Cheung; dev
Subject: Re: Revisiting Online serving of Spark models?

(Oh also the write API has already been extended to take formats).

On Mon, May 21, 2018 at 2:51 PM Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
I like that idea. I’ll be around Spark Summit.

On Mon, May 21, 2018 at 1:52 PM Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.
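
For concreteness, a rough sketch of the API shape Joseph describes above; the
DataFrame writer call and the plain model save are current API, while the
format-taking model writer (commented out) is the proposed extension and may
not exist for a given model class:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.DataFrame

    // DataFrame writers already take a pluggable format; the idea is to mirror
    // this shape for model persistence.
    def saveData(df: DataFrame, path: String): Unit =
      df.write.format("json").save(path)

    // Current behavior: model metadata is written as JSON, model data as Parquet.
    def saveModel(model: PipelineModel, path: String): Unit =
      model.write.overwrite().save(path)

    // Proposed shape (hypothetical, mirroring the DataFrame writer):
    //   model.write.format("json").save(path)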

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_____
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>, Joseph 
Bradley <jos...@databricks.com<mailto:jos...@databricks.com>>
Cc: dev <dev@spark.apache.org<mailto:dev@spark.apache.org>>



Huge +1 on this!


From: holden.ka...@gmail.com<mailto:holden.ka...@gmail.com> 
<holden.ka...@gmail.com<mailto:holden.ka...@gmail.com>> on behalf of Holden 
Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* This architecture may require some awkward APIs currently to have model 
prediction logic in mllib-local, local model classes in mllib-local, and 
regular (DataFrame-friendly) model classes in mllib.  We might find it helpful 
to break some DeveloperApis in Spark 3.0 to facilitate this architecture while 
making it feasible for 3rd party developers to extend MLlib APIs (especially in 
Java).
I agree this could be interesting, and feed into the other discussion around 
when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to 
avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as 
important as per-Row transformations, but they would be helpful for batching 
for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau 
<hol...@pigscanfly.ca<mailto:h

Re: SparkR was removed from CRAN on 2018-05-01

2018-05-25 Thread Felix Cheung
This is the fix
https://github.com/apache/spark/commit/f27a035daf705766d3445e5c6a99867c11c552b0#diff-e1e1d3d40573127e9ee0480caf1283d6

I don't have the email though.


From: Hossein 
Sent: Friday, May 25, 2018 10:58:42 AM
To: dev@spark.apache.org
Subject: SparkR was removed from CRAN on 2018-05-01

Would you please forward the email from CRAN? Is there a JIRA?

Thanks,
--Hossein


Re: Revisiting Online serving of Spark models?

2018-05-20 Thread Felix Cheung
Specifically I’d like to bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_
From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley <jos...@databricks.com>
Cc: dev <dev@spark.apache.org>


Huge +1 on this!


From: holden.ka...@gmail.com <holden.ka...@gmail.com> on behalf of Holden Karau 
<hol...@pigscanfly.ca>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* This architecture may require some awkward APIs currently to have model 
prediction logic in mllib-local, local model classes in mllib-local, and 
regular (DataFrame-friendly) model classes in mllib.  We might find it helpful 
to break some DeveloperApis in Spark 3.0 to facilitate this architecture while 
making it feasible for 3rd party developers to extend MLlib APIs (especially in 
Java).
I agree this could be interesting, and feed into the other discussion around 
when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in to 
avoid breaking the current APIs but I could be wrong on that point.
* It could also be worth discussing local DataFrames.  They might not be as 
important as per-Row transformations, but they would be helpful for batching 
for higher throughput.
That could be interesting as well.

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a time as 
any to revisit the online serving situation in Spark ML. DB & others have done 
some excellent work moving a lot of the necessary tools into a local linear 
algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions around this, 
but currently our individual transform/predict methods are private, so they 
either need to copy or re-implement (or put themselves in org.apache.spark) to 
access them. How would folks feel about adding a new trait for ML pipeline 
stages that exposes transformation of single-element inputs (or local 
collections) and that could be optionally implemented by stages which support this? 
That way we can have less copy-and-paste code possibly getting out of sync with 
our model training.

I think continuing to have on-line serving grow in different projects is 
probably the right path forward (folks have different needs), but I'd love to 
see us make it simpler for other projects to build reliable serving tools.

I realize this maybe puts some of the folks in an awkward position with their 
own commercial offerings, but hopefully if we make it easier for everyone the 
commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com



--
Twitter: https://twitter.com/holdenkarau




Re: Integrating ML/DL frameworks with Spark

2018-05-20 Thread Felix Cheung
Very cool. We would be very interested in this.

What is the plan forward to make progress in each of the three areas?



From: Bryan Cutler 
Sent: Monday, May 14, 2018 11:37:20 PM
To: Xiangrui Meng
Cc: Reynold Xin; dev
Subject: Re: Integrating ML/DL frameworks with Spark

Thanks for starting this discussion, I'd also like to see some improvements in 
this area and glad to hear that the Pandas UDFs / Arrow functionality might be 
useful.  I'm wondering if from your initial investigations you found anything 
lacking from the Arrow format or possible improvements that would simplify the 
data representation?  Also, while data could be handed off in a UDF, would it 
make sense to also discuss a more formal way to externalize the data in a way 
that would also work for the Scala API?

Thanks,
Bryan

On Wed, May 9, 2018 at 4:31 PM, Xiangrui Meng 
> wrote:
Shivaram: Yes, we can call it "gang scheduling" or "barrier synchronization". 
Spark doesn't support it now. The proposal is to have proper support in 
Spark's job scheduler, so we can integrate well with MPI-like frameworks.
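
As a rough illustration of the execution model being proposed (this mirrors the
barrier execution mode that later shipped in Spark 2.4, and is only a sketch):

    import org.apache.spark.BarrierTaskContext
    import org.apache.spark.sql.SparkSession

    // Sketch of a gang-scheduled ("barrier") stage, assuming Spark 2.4+.
    object BarrierSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("barrier-sketch").getOrCreate()
        val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

        // All 4 tasks are launched together or not at all; a task failure retries
        // the whole stage rather than a single task, matching MPI-style frameworks.
        val n = rdd.barrier().mapPartitions { iter =>
          val ctx = BarrierTaskContext.get()
          // ctx.getTaskInfos() exposes the peer task addresses, e.g. to bootstrap
          // an MPI ring or an all-reduce group.
          ctx.barrier() // global synchronization point across all tasks in the stage
          iter
        }.count()

        println(s"processed $n records")
        spark.stop()
      }
    }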


On Tue, May 8, 2018 at 11:17 AM Nan Zhu 
> wrote:
.how I skipped the last part

On Tue, May 8, 2018 at 11:16 AM, Reynold Xin 
> wrote:
Yes, Nan, totally agree. To be on the same page, that's exactly what I wrote 
wasn't it?

On Tue, May 8, 2018 at 11:14 AM Nan Zhu 
> wrote:
besides that, one of the things which is needed by multiple frameworks is to 
schedule tasks in a single wave

i.e.

if a framework like xgboost/mxnet requires 50 parallel workers, Spark should 
provide a capability to ensure that either we run all 50 tasks at once, 
or we quit the complete application/job after some timeout period

Best,

Nan

On Tue, May 8, 2018 at 11:10 AM, Reynold Xin 
> wrote:
I think that's what Xiangrui was referring to. Instead of retrying a single 
task, retry the entire stage, and the entire stage of tasks need to be 
scheduled all at once.


On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman 
> wrote:


  *   Fault tolerance and execution model: Spark assumes fine-grained task 
recovery, i.e. if something fails, only that task is rerun. This doesn’t match 
the execution model of distributed ML/DL frameworks that are typically 
MPI-based, and rerunning a single task would lead to the entire system hanging. 
A whole stage needs to be re-run.

This is not only useful for integrating with 3rd-party frameworks, but also 
useful for scaling MLlib algorithms. One of my earliest attempts in Spark MLlib 
was to implement All-Reduce primitive 
(SPARK-1485). But we ended up 
with some compromised solutions. With the new execution model, we can set up a 
hybrid cluster and do all-reduce properly.

Is there a particular new execution model you are referring to or do we plan to 
investigate a new execution model ?  For the MPI-like model, we also need gang 
scheduling (i.e. schedule all tasks at once or none of them) and I don't think 
we have support for that in the scheduler right now.

--

Xiangrui Meng

Software Engineer

Databricks Inc. http://databricks.com



--

Xiangrui Meng

Software Engineer

Databricks Inc. http://databricks.com



Re: Scala 2.12 support

2018-06-07 Thread Felix Cheung
+1

Spoke to Dean as well and mentioned the problem with 2.11.12 
https://github.com/scala/bug/issues/10913

_
From: Sean Owen 
Sent: Wednesday, June 6, 2018 12:23 PM
Subject: Re: Scala 2.12 support
To: Holden Karau 
Cc: Dean Wampler , Reynold Xin , 
dev 


If it means no change to 2.11 support, seems OK to me for Spark 2.4.0. The 2.12 
support is separate and has never been mutually compatible with 2.11 builds 
anyway. (I also hope, suspect that the changes are minimal; tests are already 
almost entirely passing with no change to the closure cleaner when built for 
2.12)

On Wed, Jun 6, 2018 at 1:33 PM Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
Just chatted with Dean @ the summit and it sounds like from Adriaan there is a 
fix in 2.13 for the API change issue that could be back ported to 2.12 so how 
about we try and get this ball rolling?

It sounds like it would also need a closure cleaner change, which could be 
backwards compatible but since it’s such a core component and we might want to 
be cautious with it, we could when building for 2.11 use the old cleaner code 
and for 2.12 use the new code so we don’t break anyone.

How do folks feel about this?





Re: Time for 2.2.2 release

2018-06-07 Thread Felix Cheung
+1 and thanks!


From: Tom Graves 
Sent: Wednesday, June 6, 2018 7:54:43 AM
To: Dev
Subject: Time for 2.2.2 release

Hello all,

I think it's time for another 2.2 release.
I took a look at Jira and I don't see anything explicitly targeted for 2.2.2 
that is not yet complete.

So I'd like to propose to release 2.2.2 soon. If there are important
fixes that should go into the release, please let those be known (by
replying here or updating the bug in Jira), otherwise I'm volunteering
to prepare the first RC soon-ish (by early next week since Spark Summit is this 
week).

Thanks!
Tom Graves



Re: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

2018-06-12 Thread Felix Cheung
For #1 is system requirements not honored?

For #2 it looks like Oracle JDK?


From: Shivaram Venkataraman 
Sent: Tuesday, June 12, 2018 3:17:52 PM
To: dev
Cc: Felix Cheung
Subject: Fwd: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1

Corresponding to the Spark 2.3.1 release, I submitted the SparkR build
to CRAN yesterday. Unfortunately it looks like there are a couple of
issues (full message from CRAN is forwarded below)

1. There are some builds started with Java 10
(http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Debian/00check.log)
which are right now counted as test failures. I wonder if we should
somehow mark them as skipped ? I can ping the CRAN team about this.

2. There is another issue with Java version parsing which
unfortunately affects even Java 8 builds. I've created
https://issues.apache.org/jira/browse/SPARK-24535 to track this.
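
(Not the actual SparkR fix; just an illustration of the kind of parsing pitfall
behind point 2: legacy version strings like "1.8.0_171" and JEP 223-style
strings like "10.0.1" need different handling to recover the major version.)

    // Illustrative only; the real check in SparkR is implemented in R.
    def javaMajorVersion(versionString: String): Int = {
      val trimmed = versionString.trim.stripPrefix("\"").stripSuffix("\"")
      val parts = trimmed.split("[._\\-+]")
      if (parts.head == "1") parts(1).toInt // "1.8.0_171" -> 8
      else parts.head.toInt                 // "10.0.1"    -> 10
    }

    // javaMajorVersion("1.8.0_171") == 8
    // javaMajorVersion("10.0.1")    == 10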

Thanks
Shivaram


-- Forwarded message -
From: 
Date: Mon, Jun 11, 2018 at 11:24 AM
Subject: [CRAN-pretest-archived] CRAN submission SparkR 2.3.1
To: 
Cc: 


Dear maintainer,

package SparkR_2.3.1.tar.gz does not pass the incoming checks
automatically, please see the following pre-tests:
Windows: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/Windows/00check.log>
Status: 2 ERRORs, 1 NOTE
Debian: 
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/Debian/00check.log>
Status: 1 ERROR, 1 WARNING, 1 NOTE

Last released version's CRAN status: ERROR: 1, OK: 1
See: <https://CRAN.R-project.org/web/checks/check_results_SparkR.html>

CRAN Web: <https://cran.r-project.org/package=SparkR>

Please fix all problems and resubmit a fixed version via the webform.
If you are not sure how to fix the problems shown, please ask for help
on the R-package-devel mailing list:
<https://stat.ethz.ch/mailman/listinfo/r-package-devel>
If you are fairly certain the rejection is a false positive, please
reply-all to this message and explain.

More details are given in the directory:
<https://win-builder.r-project.org/incoming_pretest/SparkR_2.3.1_20180611_200923/>
The files will be removed after roughly 7 days.

No strong reverse dependencies to be checked.

Best regards,
CRAN teams' auto-check service
Flavor: r-devel-linux-x86_64-debian-gcc, r-devel-windows-ix86+x86_64
Check: CRAN incoming feasibility, Result: NOTE
  Maintainer: 'Shivaram Venkataraman '

  New submission

  Package was archived on CRAN

  Possibly mis-spelled words in DESCRIPTION:
Frontend (4:10, 5:28)

  CRAN repository db overrides:
X-CRAN-Comment: Archived on 2018-05-01 as check problems were not
  corrected despite reminders.

Flavor: r-devel-windows-ix86+x86_64
Check: running tests for arch 'i386', Result: ERROR
Running 'run-all.R' [30s]
  Running the tests in 'tests/run-all.R' failed.
  Complete output:
> #
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> #
>
> library(testthat)
> library(SparkR)

Attaching package: 'SparkR'

The following objects are masked from 'package:testthat':

describe, not

The following objects are masked from 'package:stats':

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

>
> # Turn all warnings into errors
> options("warn" = 2)
>
> if (.Platform$OS.type == "windows") {
+   Sys.setenv(TZ = "GMT")
+ }
>
> # Setup global test environment
> # Install Spark first to set SPARK_HOME
>
> # NOTE(shivaram): We set overwrite to handle any old tar.gz
files or directories left behind on
> # CRAN machines. For Jenkins we should already have SPARK_HOME set.
> install.spark(overwrite = TRUE)
Overwrite = TRUE: download and overwrite the tar fileand Spark
package directo

Re: Revisiting Online serving of Spark models?

2018-05-26 Thread Felix Cheung
Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
Summit?

(I propose we meet at the venue entrance so we can accommodate people who might 
not be in the conference)


From: Saikat Kanjilal <sxk1...@hotmail.com>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model 
serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice 
<maximilianofel...@gmail.com<mailto:maximilianofel...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the 
discussions and I'm a heavy user of Spark. This topic caught my attention, as 
we're currently facing this issue at work. I'm attending the summit and was 
wondering if it would be possible for me to join that meeting. I might be 
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh 
<leif.wa...@gmail.com<mailto:leif.wa...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is python (and probably if not now, then soon, R), you should look 
in to it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>, Joseph 
Bradley <jos...@databricks.com<mailto:jos...@databricks.com>>
Cc: dev <dev@spark.apache.org<mailto:dev@spark.apache.org>>



Huge +1 on this!


From: holden.ka...@gmail.com<mailto:holden.ka...@gmail.com> 
<holden.ka...@gmail.com<mailto:holden.ka...@gmail.com>> on behalf of Holden 
Karau <hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
<jos...@databricks.com<mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql

Re: Revisiting Online serving of Spark models?

2018-05-30 Thread Felix Cheung
Hi!

Thank you! Let’s meet then

June 6 4pm

Moscone West Convention Center
800 Howard Street, San Francisco, CA 94103

Ground floor (outside of conference area - should be available for all) - we 
will meet and decide where to go

(Would not send invite because that would be too much noise for dev@)

To paraphrase Joseph, we will use this to kick off the discussion and post 
notes after and follow up online. As for Seattle, I would be very interested to 
meet in person later and discuss ;)


_
From: Saikat Kanjilal 
Sent: Tuesday, May 29, 2018 11:46 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Maximiliano Felice 
Cc: Felix Cheung , Holden Karau 
, Joseph Bradley , Leif Walsh 
, dev 


Would love to join but am in Seattle, thoughts on how to make this work?

Regards

Sent from my iPhone

On May 29, 2018, at 10:35 AM, Maximiliano Felice 
mailto:maximilianofel...@gmail.com>> wrote:

Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which is the place Holden is 
talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung 
mailto:felixcheun...@hotmail.com>>:
You had me at blue bottle!

_
From: Holden Karau mailto:hol...@pigscanfly.ca>>
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung mailto:felixcheun...@hotmail.com>>
Cc: Saikat Kanjilal mailto:sxk1...@hotmail.com>>, 
Maximiliano Felice 
mailto:maximilianofel...@gmail.com>>, Joseph 
Bradley mailto:jos...@databricks.com>>, Leif Walsh 
mailto:leif.wa...@gmail.com>>, dev 
mailto:dev@spark.apache.org>>



I'm down for that, we could all go for a walk maybe to the mint plaza blue 
bottle and grab coffee (if the weather holds have our design meeting outside 
:p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Bump.


From: Felix Cheung mailto:felixcheun...@hotmail.com>>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
Summit?

(I propose we meet at the venue entrance so we can accommodate people who might 
not be in the conference)


From: Saikat Kanjilal mailto:sxk1...@hotmail.com>>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model 
serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice 
mailto:maximilianofel...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the 
discussions and I'm a heavy user of Spark. This topic caught my attention, as 
we're currently facing this issue at work. I'm attending the summit and was 
wondering if it would be possible for me to join that meeting. I might be 
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh 
mailto:leif.wa...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is python (and probably if not now, then soon, R), you should look 
in to it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, 
and various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing  trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_
From: Felix Cheung mai

Re: [VOTE] SPIP ML Pipelines in R

2018-05-31 Thread Felix Cheung
+1
With my concerns in the SPIP discussion.


From: Hossein 
Sent: Wednesday, May 30, 2018 2:03:03 PM
To: dev@spark.apache.org
Subject: [VOTE] SPIP ML Pipelines in R

Hi,

I started discussion 
thread
 for a new R package to expose MLlib pipelines in 
R.

To summarize we will work on utilities to generate R wrappers for MLlib 
pipeline API for a new R package. This will lower the burden for exposing new 
API in future.

Following the SPIP 
process, I am proposing 
the SPIP for a vote.

+1: Let's go ahead and implement the SPIP.
+0: Don't really care.
-1: I do not think this is a good idea for the following reasons.

Thanks,
--Hossein


Re: Revisiting Online serving of Spark models?

2018-05-29 Thread Felix Cheung
Bump.


From: Felix Cheung 
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev
Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
Summit?

(I propose we meet at the venue entrance so we can accommodate people who might 
not be in the conference)


From: Saikat Kanjilal 
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model 
serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice 
mailto:maximilianofel...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the 
discussions and I'm a heavy user of Spark. This topic caught my attention, as 
we're currently facing this issue at work. I'm attending the summit and was 
wondering if it would be possible for me to join that meeting. I might be 
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh 
mailto:leif.wa...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is Python (and probably if not now, then soon, R), you should look 
into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, 
and the various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_____
From: Felix Cheung mailto:felixcheun...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau mailto:hol...@pigscanfly.ca>>, Joseph 
Bradley mailto:jos...@databricks.com>>
Cc: dev mailto:dev@spark.apache.org>>



Huge +1 on this!


From: holden.ka...@gmail.com on behalf of Holden Karau <hol...@pigscanfly.ca>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark itself.  
Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking a 
Row to the current Models.  Instead, it would be ideal to have local, 
lightweight versions of models in mllib-local, outside of the main mllib 
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to utilize 
elements of Spark SQL, particularly Rows and Types, which could be moved into a 
local sql package.
* This architecture may require some awkward APIs currently to have model 
prediction logic in mllib-lo

Re: Revisiting Online serving of Spark models?

2018-05-29 Thread Felix Cheung
You had me at blue bottle!

_
From: Holden Karau 
Sent: Tuesday, May 29, 2018 9:47 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Felix Cheung 
Cc: Saikat Kanjilal , Maximiliano Felice 
, Joseph Bradley , Leif 
Walsh , dev 


I'm down for that; we could all go for a walk, maybe to the Mint Plaza Blue 
Bottle, and grab coffee (if the weather holds, have our design meeting outside 
:p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Bump.


From: Felix Cheung mailto:felixcheun...@hotmail.com>>
Sent: Saturday, May 26, 2018 1:05:29 PM
To: Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
Cc: Leif Walsh; Holden Karau; dev

Subject: Re: Revisiting Online serving of Spark models?

Hi! How about we meet the community and discuss on June 6 4pm at (near) the 
Summit?

(I propose we meet at the venue entrance so we can accommodate people who might 
not be in the conference)


From: Saikat Kanjilal mailto:sxk1...@hotmail.com>>
Sent: Tuesday, May 22, 2018 7:47:07 AM
To: Maximiliano Felice
Cc: Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
Subject: Re: Revisiting Online serving of Spark models?

I’m in the same exact boat as Maximiliano and have use cases as well for model 
serving and would love to join this discussion.

Sent from my iPhone

On May 22, 2018, at 6:39 AM, Maximiliano Felice 
mailto:maximilianofel...@gmail.com>> wrote:

Hi!

I don't usually write a lot on this list, but I keep up to date with the 
discussions and I'm a heavy user of Spark. This topic caught my attention, as 
we're currently facing this issue at work. I'm attending the summit and was 
wondering if it would be possible for me to join that meeting. I might be 
able to share some helpful use cases and ideas.

Thanks,
Maximiliano Felice

El mar., 22 de may. de 2018 9:14 AM, Leif Walsh 
mailto:leif.wa...@gmail.com>> escribió:
I’m with you on json being more readable than parquet, but we’ve had success 
using pyarrow’s parquet reader and have been quite happy with it so far. If 
your target is Python (and probably if not now, then soon, R), you should look 
into it.

On Mon, May 21, 2018 at 16:52 Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Regarding model reading and writing, I'll give quick thoughts here:
* Our approach was to use the same format but write JSON instead of Parquet.  
It's easier to parse JSON without Spark, and using the same format simplifies 
architecture.  Plus, some people want to check files into version control, and 
JSON is nice for that.
* The reader/writer APIs could be extended to take format parameters (just like 
DataFrame reader/writers) to handle JSON (and maybe, eventually, handle Parquet 
in the online serving setting).

This would be a big project, so proposing a SPIP might be best.  If people are 
around at the Spark Summit, that could be a good time to meet up & then post 
notes back to the dev list.

On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
Specifically I’d like to bring part of the discussion to Model and PipelineModel, 
and the various ModelReader and SharedReadWrite implementations that rely on 
SparkContext. This is a big blocker on reusing trained models outside of Spark 
for online serving.

What’s the next step? Would folks be interested in getting together to 
discuss/get some feedback?


_____
From: Felix Cheung mailto:felixcheun...@hotmail.com>>
Sent: Thursday, May 10, 2018 10:10 AM
Subject: Re: Revisiting Online serving of Spark models?
To: Holden Karau mailto:hol...@pigscanfly.ca>>, Joseph 
Bradley mailto:jos...@databricks.com>>
Cc: dev mailto:dev@spark.apache.org>>



Huge +1 on this!


From: holden.ka...@gmail.com on behalf of Holden Karau <hol...@pigscanfly.ca>
Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
mailto:jos...@databricks.com>> wrote:
Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.
This was one of the original goals for mllib-local: to have local versions of 
MLlib models which could be deployed without the big Spark JARs and without a 
SparkContext or SparkSession.  There are related commercial offerings like this 
: ) but the overhead of maintaining those offerings is pretty high.  Building 
good APIs within MLlib to avoid copying logic across libraries will be well 
worth it.

We've talked about this need at Databricks and have also been syncing with the 
creators of MLeap.  It'd be great to get this functionality into Spark

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-27 Thread Felix Cheung
(I don’t want to block the release(s) per se...)

We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)

This is fixed in 2.3 back in Nov 2017 
https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6

Perhaps we don't get Jenkins runs on these branches? It should have been 
detected.

* checking for code/documentation mismatches ... WARNING
Codoc mismatches from documentation object 'attach':
attach
Code: function(what, pos = 2L, name = deparse(substitute(what),
backtick = FALSE), warn.conflicts = TRUE)
Docs: function(what, pos = 2L, name = deparse(substitute(what)),
warn.conflicts = TRUE)
Mismatches in argument default values:
Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: 
deparse(substitute(what))

Codoc mismatches from documentation object 'glm':
glm
Code: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
NULL, ...)
Docs: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, contrasts = NULL, ...)
Argument names in code not in docs:
singular.ok
Mismatches in argument names:
Position: 16 Code: singular.ok Docs: contrasts
Position: 17 Code: contrasts Docs: ...
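
For anyone reproducing this locally: the drift comes from base R itself changing, so a quick hedged check in the same R version Jenkins uses shows the mismatch directly:

```
# Hedged sketch: print the live base-R signatures and compare them with the
# "Docs:" lines in the WARNING above. Newer R adds `backtick = FALSE` inside
# the default for attach()'s `name`, and `singular.ok = TRUE` to glm().
args(base::attach)
args(stats::glm)
# The SPARK-22281 change referenced above updates the generated SparkR usage
# sections so the check passes on newer R versions.
```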


From: Sean Owen 
Sent: Wednesday, June 27, 2018 5:02:37 AM
To: Marcelo Vanzin
Cc: dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

+1 from me too for the usual reasons.

On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin  
wrote:
Please vote on releasing the following candidate as Apache Spark version 2.1.3.

The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.1.3
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
https://github.com/apache/spark/tree/v2.1.3-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1275/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/

The list of bug fixes going into 2.1.3 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12341660

Notes:

- RC1 was not sent for a vote. I had trouble building it, and by the time I got
  things fixed, there was a blocker bug filed. It was already tagged in git
  at that time.

- If testing the source package, I recommend using Java 8, even though 2.1
  supports Java 7 (and the RC was built with JDK 7). This is because Maven
  Central has updated some configuration that makes the default Java 7 SSL
  config not work.

- There are Maven artifacts published for Scala 2.10, but binary
releases are only
  available for Scala 2.11. This matches the previous release (2.1.2),
but if there's
  a need / desire to have pre-built distributions for Scala 2.10, I can probably
  amend the RC without having to create a new one.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
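
For SparkR specifically, a minimal hedged sketch for exercising this RC (assuming the binary tarball from the dist URL above has been unpacked and SPARK_HOME points at it; the exact directory name is an assumption):

```
# Hedged sketch: load the SparkR package bundled in the unpacked RC binary
# distribution and run a trivial end-to-end job against it.
Sys.setenv(SPARK_HOME = "/path/to/spark-2.1.3-bin-hadoop2.7")  # assumption
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sparkR.session(master = "local[2]")
df <- createDataFrame(faithful)
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
sparkR.session.stop()
```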

===
What should happen to JIRA tickets still targeting 2.1.3?
===

The current list of open tickets targeted at 2.1.3 can be found at:
https://s.apache.org/spark-2.1.3

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a 

Re: [VOTE] Spark 2.1.3 (RC2)

2018-06-27 Thread Felix Cheung
Yes, this is broken with newer versions of R.

We check explicitly for warnings in the R check, which should fail the test run.


From: Marcelo Vanzin 
Sent: Wednesday, June 27, 2018 6:55 PM
To: Felix Cheung
Cc: Marcelo Vanzin; Tom Graves; dev
Subject: Re: [VOTE] Spark 2.1.3 (RC2)

Not sure I understand that bug. Is it a compatibility issue with new
versions of R?

It's at least marked as fixed in 2.2(.1).

We do run jenkins on these branches, but that seems like just a
warning, which would not fail those builds...

On Wed, Jun 27, 2018 at 6:12 PM, Felix Cheung  wrote:
> (I don’t want to block the release(s) per se...)
>
> We need to backport SPARK-22281 (to branch-2.1 and branch-2.2)
>
> This is fixed in 2.3 back in Nov 2017
> https://github.com/apache/spark/commit/2ca5aae47a25dc6bc9e333fb592025ff14824501#diff-e1e1d3d40573127e9ee0480caf1283d6
>
> Perhaps we don't get Jenkins run on these branches? It should have been
> detected.
>
> * checking for code/documentation mismatches ... WARNING
> Codoc mismatches from documentation object 'attach':
> attach
> Code: function(what, pos = 2L, name = deparse(substitute(what),
> backtick = FALSE), warn.conflicts = TRUE)
> Docs: function(what, pos = 2L, name = deparse(substitute(what)),
> warn.conflicts = TRUE)
> Mismatches in argument default values:
> Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs:
> deparse(substitute(what))
>
> Codoc mismatches from documentation object 'glm':
> glm
> Code: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
> NULL, ...)
> Docs: function(formula, family = gaussian, data, weights, subset,
> na.action, start = NULL, etastart, mustart, offset,
> control = list(...), model = TRUE, method = "glm.fit",
> x = FALSE, y = TRUE, contrasts = NULL, ...)
> Argument names in code not in docs:
> singular.ok
> Mismatches in argument names:
> Position: 16 Code: singular.ok Docs: contrasts
> Position: 17 Code: contrasts Docs: ...
>
> 
> From: Sean Owen 
> Sent: Wednesday, June 27, 2018 5:02:37 AM
> To: Marcelo Vanzin
> Cc: dev
> Subject: Re: [VOTE] Spark 2.1.3 (RC2)
>
> +1 from me too for the usual reasons.
>
> On Tue, Jun 26, 2018 at 3:25 PM Marcelo Vanzin 
> wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.3.
>>
>> The vote is open until Fri, June 29th @ 9PM UTC (2PM PDT) and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.3
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.3-rc2 (commit b7eac07b):
>> https://github.com/apache/spark/tree/v2.1.3-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1275/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.1.3-rc2-docs/
>>
>> The list of bug fixes going into 2.1.3 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12341660
>>
>> Notes:
>>
>> - RC1 was not sent for a vote. I had trouble building it, and by the time
>> I got
>> things fixed, there was a blocker bug filed. It was already tagged in
>> git
>> at that time.
>>
>> - If testing the source package, I recommend using Java 8, even though 2.1
>> supports Java 7 (and the RC was built with JDK 7). This is because Maven
>> Central has updated some configuration that makes the default Java 7 SSL
>> config not work.
>>
>> - There are Maven artifacts published for Scala 2.10, but binary
>> releases are only
>> available for Scala 2.11. This matches the previous release (2.1.2),
>> but if there's
>> a need / desire to have pre-built distributions for Scala 2.10, I can
>> probably
>> amend the RC without having to create a new one.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> ===

Re: SparkR test failures in PR builder

2018-05-03 Thread Felix Cheung
This is resolved.

Please see https://issues.apache.org/jira/browse/SPARK-24152
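
For anyone who hits this locally in the meantime, a hedged sketch (the environment variable is the standard R CMD check switch for this probe; the script path assumes the Spark source tree as the working directory):

```
# Hedged sketch: the flaky step is the "CRAN incoming feasibility" probe, which
# phones home to the CRAN servers. R CMD check lets it be disabled through an
# environment variable while keeping the rest of the --as-cran checks.
Sys.setenv("_R_CHECK_CRAN_INCOMING_" = "FALSE")
system2("./R/check-cran.sh")   # the child process inherits the variable above
# or, straight from a shell:
#   _R_CHECK_CRAN_INCOMING_=FALSE ./R/check-cran.sh
```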


From: Kazuaki Ishizaki 
Sent: Wednesday, May 2, 2018 4:51:11 PM
To: dev
Cc: Joseph Bradley; Hossein Falaki
Subject: Re: SparkR test failures in PR builder

I am not familiar with SparkR or CRAN. However, I remember that we had a 
similar situation.

Here is some great work from that time. Having just visited this PR, I think 
we have a similar situation (i.e. a format error) again.
https://github.com/apache/spark/pull/20005

Any other comments are appreciated.

Regards,
Kazuaki Ishizaki



From: Joseph Bradley 
To: dev 
Cc: Hossein Falaki 
Date: 2018/05/03 07:31
Subject: SparkR test failures in PR builder




Hi all,

Does anyone know why the PR builder keeps failing on SparkR's CRAN checks?  
I've seen this in a lot of unrelated PRs.  E.g.: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90065/console

Hossein spotted this line:
```
* checking CRAN incoming feasibility ...Error in 
.check_package_CRAN_incoming(pkgdir) :
  dims [product 24] do not match the length of object [0]
```
and suggested that it could be CRAN flakiness.  I'm not familiar with CRAN, but 
do others have thoughts about how to fix this?

Thanks!
Joseph

--
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.




Re: Kicking off the process around Spark 2.2.1

2017-10-26 Thread Felix Cheung
Yes! I can take on RM for 2.2.1.

We are still working out what to do with temp files created by Hive and Java 
that cause the policy issue with CRAN and will report back shortly, hopefully.


From: Sean Owen <so...@cloudera.com>
Sent: Wednesday, October 25, 2017 4:39:15 AM
To: Holden Karau
Cc: Felix Cheung; dev@spark.apache.org
Subject: Re: Kicking off the process around Spark 2.2.1

It would be reasonably consistent with the timing of other x.y.1 releases, and 
more release managers sounds useful, yeah.

Note also that in theory the code freeze for 2.3.0 starts in about 2 weeks.

On Wed, Oct 25, 2017 at 12:29 PM Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
Now that Spark 2.1.2 is out it seems like now is a good time to get started on 
the Spark 2.2.1 release. There are some streaming fixes I’m aware of that would 
be good to get into a release; is there anything else people are working on for 
2.2.1 that we should be tracking?

To switch it up I’d like to suggest Felix to be the RM for this since there are 
also likely some R packaging changes to be included in the release. This also 
gives us a chance to see if my updated release documentation is enough for a 
new RM to get started from.

What do folks think?
--
Twitter: https://twitter.com/holdenkarau


Re: Kubernetes backend and docker images

2018-01-06 Thread Felix Cheung
+1

Thanks for taking on this.
That was my feedback on one of the long comment threads as well: I think we 
should have one docker image instead of 3 (also pending in the fork are Python 
and R variants; we should consider having one that we officially release instead 
of 9, for example).



From: 蒋星博 
Sent: Friday, January 5, 2018 10:57:53 PM
To: Marcelo Vanzin
Cc: dev
Subject: Re: Kubernetes backend and docker images

Agreed, it would be nice to have this simplification, and users can still create 
their custom images by copying/modifying the default one.
Thanks for bringing this up, Marcelo!

2018-01-05 17:06 GMT-08:00 Marcelo Vanzin:
Hey all, especially those working on the k8s stuff.

Currently we have 3 docker images that need to be built and provided
by the user when starting a Spark app: driver, executor, and init
container.

When the initial review went by, I asked why do we need 3, and I was
told that's because they have different entry points. That never
really convinced me, but well, everybody wanted to get things in to
get the ball rolling.

But I still think that's not the best way to go. I did some pretty
simple hacking and got things to work with a single image:

https://github.com/vanzin/spark/commit/k8s-img

Is there a reason why that approach would not work? You could still
create separate images for driver and executor if wanted, but there's
no reason I can see why we should need 3 images for the simple case.

Note that the code there can be cleaned up still, and I don't love the
idea of using env variables to propagate arguments to the container,
but that works for now.

--
Marcelo

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org




Re: Integration testing and Scheduler Backends

2018-01-08 Thread Felix Cheung
How would (2) be uncommon elsewhere?

On Mon, Jan 8, 2018 at 10:16 PM Anirudh Ramanathan 
wrote:

> This is with regard to the Kubernetes Scheduler Backend and scaling the
> process to accept contributions. Given we're moving past upstreaming
> changes from our fork, and into getting *new* patches, I wanted to start
> this discussion sooner than later. This is more of a post-2.3 question -
> not something we're looking to solve right away.
>
> While unit tests are handy, they're not nearly as good at giving us
> confidence as a successful run of our integration tests against
> single/multi-node k8s clusters. Currently, we have integration testing
> setup at https://github.com/apache-spark-on-k8s/spark-integration and
> it's running continuously against apache/spark:master in
> pepperdata-jenkins (on minikube) & k8s-testgrid (in
> GKE clusters). Now, the question is - how do we make integration-tests
> part of the PR author's workflow?
>
> 1. Keep the integration tests in the separate repo and require that
> contributors run them, add new tests prior to accepting their PRs as a
> policy. Given minikube is easy to set up and can run on a single node, it
> would certainly be possible.
> Friction however, stems from contributors potentially having to modify the
> integration test code hosted in that separate repository when
> adding/changing functionality in the scheduler backend. Also, it's
> certainly going to lead to at least brief inconsistencies between the two
> repositories.
>
> 2. Alternatively, we check in the integration tests alongside the actual
> scheduler backend code. This would work really well and is what we did in
> our fork. It would have to be a separate package which would take certain
> parameters (like cluster endpoint) and run integration test code against a
> local or remote cluster. It would include least some code dealing with
> accessing the cluster, reading results from K8s containers, test fixtures,
> etc.
>
> I see value in adopting (2), given it's a clearer path for contributors
> and lets us keep the two pieces consistent, but it seems uncommon
> elsewhere. How do the other backends, i.e. YARN, Mesos and Standalone deal
> with accepting patches and ensuring that they do not break existing
> clusters? Is there automation employed for this thus far? Would love to get
> opinions on (1) v/s (2).
>
> Thanks,
> Anirudh
>
>
>


Re: data source v2 online meetup

2018-02-01 Thread Felix Cheung
+1 hangout


From: Xiao Li 
Sent: Wednesday, January 31, 2018 10:46:26 PM
To: Ryan Blue
Cc: Reynold Xin; dev; Wenchen Fen; Russell Spitzer
Subject: Re: data source v2 online meetup

Hi, Ryan,

Wow, your Iceberg already uses the data source V2 API! That is pretty cool! I am 
just afraid these new APIs are not stable yet. We might deprecate or change some 
data source v2 APIs in the next version (2.4). Sorry for the inconvenience it 
might introduce.

Thanks for your feedback always,

Xiao


2018-01-31 15:54 GMT-08:00 Ryan Blue 
>:
Thanks for suggesting this, I think it's a great idea. I'll definitely attend 
and can talk about the changes that we've made DataSourceV2 to enable our new 
table format, Iceberg.

On Wed, Jan 31, 2018 at 2:35 PM, Reynold Xin 
> wrote:
The data source v2 API is one of the larger changes in Spark 2.3, and whatever 
has already been committed is only the first version; we'd need more 
work post-2.3 to improve and stabilize it.

I think at this point we should stop making changes to it in branch-2.3, and 
instead focus on using the existing API and getting feedback for 2.4. Would 
people be interested in doing an online hangout to discuss this, perhaps in the 
month of Feb?

It'd be more productive if people attending the hangout have tried the API by 
implementing some new sources or porting an existing source over.





--
Ryan Blue
Software Engineer
Netflix



Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Felix Cheung
Any idea why the SQL func docs search results return broken links, as below?


From: Felix Cheung <felixcheun...@hotmail.com>
Sent: Sunday, February 18, 2018 10:05:22 AM
To: Sameer Agarwal; Sameer Agarwal
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)

Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml



From: Sameer Agarwal <sameer.a...@gmail.com>
Sent: Saturday, February 17, 2018 1:43:39 PM
To: Sameer Agarwal
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)

I'll start with a +1 once again.

All blockers reported against RC3 have been resolved and the builds are healthy.

On 17 February 2018 at 13:41, Sameer Agarwal 
<samee...@apache.org<mailto:samee...@apache.org>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.3.0. 
The vote is open until Thursday February 22, 2018 at 8:00:00 am UTC and passes 
if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc4: 
https://github.com/apache/spark/tree/v2.3.0-rc4 
(44095cb65500739695b0324c177c19dfa1471472)

List of JIRA tickets resolved in this release can be found here: 
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1265/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/index.html


FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are 
currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can add 
the staging repository to your project's resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.1 or 2.4.0 as appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.2.0. That being said, if there is 
something which is a regression from 2.2.0 and has not been correctly targeted 
please ping me or a committer to help target the issue (you can see the open 
issues listed as impacting Spark 2.3.0 at https://s.apache.org/WmoI).



--
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Felix Cheung
These are two separate things:

Does the search result links work for you?

The second is the dist location we are voting on has a .iml file.

_
From: Sean Owen <sro...@gmail.com>
Sent: Tuesday, February 20, 2018 2:19 AM
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: dev <dev@spark.apache.org>


Maybe I misunderstand, but I don't see any .iml file in the 4 results on that 
page? it looks reasonable.

On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Any idea with sql func docs search result returning broken links as below?

From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Sunday, February 18, 2018 10:05:22 AM
To: Sameer Agarwal; Sameer Agarwal

Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
Quick questions:

is there search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml





Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Felix Cheung
Ah, sorry, I realize my wording was unclear (not enough zzz or coffee).

So to clarify:
1) When searching for a word in the SQL function doc, the search result page 
itself comes back correctly; however, none of the links in the results open the 
actual doc page. Taking the search I included as an example, if you click on 
approx_percentile, for instance, it opens the web directory listing instead.

2) The dist location we are voting on has a .iml file, which is normally not 
included in a release or release RC, and it is unsigned and has no hash 
(therefore it seems like it should not be in the release).

Thanks!

_
From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
Sent: Tuesday, February 20, 2018 2:24 AM
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: Sean Owen <sro...@gmail.com>, dev <dev@spark.apache.org>


FWIW The search result link works for me

Shivaram

On Mon, Feb 19, 2018 at 6:21 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
These are two separate things:

Does the search result links work for you?

The second is the dist location we are voting on has a .iml file.

_
From: Sean Owen <sro...@gmail.com<mailto:sro...@gmail.com>>
Sent: Tuesday, February 20, 2018 2:19 AM
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: dev <dev@spark.apache.org<mailto:dev@spark.apache.org>>



Maybe I misunderstand, but I don't see any .iml file in the 4 results on that 
page? it looks reasonable.

On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Any idea with sql func docs search result returning broken links as below?

From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Sunday, February 18, 2018 10:05:22 AM
To: Sameer Agarwal; Sameer Agarwal

Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)
Quick questions:

is there search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml








Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-18 Thread Felix Cheung
Quick questions:

is the search link for sql functions quite right? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/search.html?q=app

this file shouldn't be included? 
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml



From: Sameer Agarwal 
Sent: Saturday, February 17, 2018 1:43:39 PM
To: Sameer Agarwal
Cc: dev
Subject: Re: [VOTE] Spark 2.3.0 (RC4)

I'll start with a +1 once again.

All blockers reported against RC3 have been resolved and the builds are healthy.

On 17 February 2018 at 13:41, Sameer Agarwal 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.3.0. 
The vote is open until Thursday February 22, 2018 at 8:00:00 am UTC and passes 
if a majority of at least 3 PMC +1 votes are cast.


[ ] +1 Release this package as Apache Spark 2.3.0

[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v2.3.0-rc4: 
https://github.com/apache/spark/tree/v2.3.0-rc4 
(44095cb65500739695b0324c177c19dfa1471472)

List of JIRA tickets resolved in this release can be found here: 
https://issues.apache.org/jira/projects/SPARK/versions/12339551

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/

Release artifacts are signed with the following key:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1265/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/index.html


FAQ

===
What are the unresolved issues targeted for 2.3.0?
===

Please see https://s.apache.org/oXKi. At the time of writing, there are 
currently no known release blockers.

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

If you're working in PySpark you can set up a virtual env and install the 
current RC and see if anything important breaks; in Java/Scala you can add 
the staging repository to your project's resolvers and test with the RC (make 
sure to clean up the artifact cache before/after so you don't end up building 
with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.0?
===

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.1 or 2.4.0 as appropriate.

===
Why is my bug not fixed?
===

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.2.0. That being said, if there is 
something which is a regression from 2.2.0 and has not been correctly targeted 
please ping me or a committer to help target the issue (you can see the open 
issues listed as impacting Spark 2.3.0 at https://s.apache.org/WmoI).



--
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag


Re: Help needed in R documentation generation

2018-02-25 Thread Felix Cheung
This is a recent change. The html file column_math_functions.html should have the 
right help content.

What is the problem you are experiencing?
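
For reference, a hedged sketch of how to check this locally, assuming the docs were rebuilt with ./R/create-docs.sh from the Spark source tree (the help alias below is an assumption based on the grouped Rd file name):

```
# Hedged sketch: the grouped page is where the help text now lives, so open it
# directly after regenerating the docs...
browseURL(file.path("R", "pkg", "html", "column_math_functions.html"))
# ...or, with SparkR installed on the library path, pull up the same topic:
help("column_math_functions", package = "SparkR")
```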


From: Mihály Tóth 
Sent: Sunday, February 25, 2018 10:42:50 PM
To: dev@spark.apache.org
Subject: Help needed in R documentation generation

Hi,

I am having difficulties generating R documentation.

In the R/pkg/html/index.html file, the individual function entries reference
column_math_functions.html instead of the function's own page. For example:

asin

Have you met with such a problem?

Thanks,

  Misi




Re: Timeline for Spark 2.3

2017-12-20 Thread Felix Cheung
+1
I think the earlier we cut a branch the better.


From: Michael Armbrust 
Sent: Tuesday, December 19, 2017 4:41:44 PM
To: Holden Karau
Cc: Sameer Agarwal; Erik Erlandson; dev
Subject: Re: Timeline for Spark 2.3

Do people really need to be around for the branch cut (modulo the person 
cutting the branch)?

1st or 2nd doesn't really matter to me, but I am +1 kicking this off as soon as 
we enter the new year :)

Michael

On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau 
> wrote:
Sounds reasonable, although I'd choose the 2nd perhaps just since lots of folks 
are off on the 1st?

On Tue, Dec 19, 2017 at 4:36 PM, Sameer Agarwal 
> wrote:
Let's aim for the 2.3 branch cut on 1st Jan and RC1 a week after that (i.e., 
week of 8th Jan)?


On Fri, Dec 15, 2017 at 12:54 AM, Holden Karau 
> wrote:
So personally I’d be in favour of pushing to early January; doing a release 
over the holidays is a little rough with herding all of the people to vote.

On Thu, Dec 14, 2017 at 11:49 PM Erik Erlandson 
> wrote:
I wanted to check in on the state of the 2.3 freeze schedule.  Original 
proposal was "late Dec", which is a bit open to interpretation.

We are working to get some refactoring done on the integration testing for the 
Kubernetes back-end in preparation for testing upcoming release candidates, 
however holiday vacation time is about to begin taking its toll both on 
upstream reviewing and on the "downstream" spark-on-kube fork.

If the freeze is pushed into January, that would take some of the pressure off the 
kube back-end upstreaming. Regardless, I was wondering if the dates 
could be clarified.
Cheers,
Erik


On Mon, Nov 13, 2017 at 5:13 PM, dji...@dataxu.com 
> wrote:
Hi,

What is the process to request an issue/fix to be included in the next
release? Is there a place to vote for features?
I am interested in https://issues.apache.org/jira/browse/SPARK-13127, to see
if we can get Spark upgrade parquet to 1.9.0, which addresses the
https://issues.apache.org/jira/browse/PARQUET-686.
Can we include the fix in Spark 2.3 release?

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org


--
Twitter: https://twitter.com/holdenkarau



--
Sameer Agarwal
Software Engineer | Databricks Inc.
http://cs.berkeley.edu/~sameerag



--
Twitter: https://twitter.com/holdenkarau


