Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Luciano Resende
On Thu, Apr 23, 2015 at 5:47 PM, Hari Shreedharan hshreedha...@cloudera.com
 wrote:

 You’d need to add them as a contributor in the JIRA admin page. Once you
 do that, you should be able to assign the jira to that person



Is this documented, and does every PMC member (or committer) have access to
do that?




 Thanks, Hari

 On Thu, Apr 23, 2015 at 5:33 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:

  A related question that has affected me in the past: If we get a PR from
 a
  new developer I sometimes find that I am not able to assign an issue to
  them after merging the PR. Is there a process we need follow to get new
  contributors on to a particular group in JIRA ? Or does it somehow happen
  automatically ?
  Thanks
  Shivaram
  On Thu, Apr 23, 2015 at 5:26 PM, Sean Owen so...@cloudera.com wrote:
  Following my comment earlier that I think we set Assignee for Fixed
  JIRAs consistently, I found there are actually 880 counter examples.
  Lots of them are old, and I'll try to fix as many that are recent (for
  the 1.4.0 release credits) as I can stand to click through.
 
  Let's set Assignee after resolving consistently though. In various
  ways I've heard that people do really like the bit of credit, and I
  don't think anybody disputed setting Assignee *after* it was resolved
  as a way of giving credit.
 
  People who know they're missing a credit are welcome to ping me
  directly to get it fixed.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 




-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Let's set Assignee for Fixed JIRAs

2015-04-23 Thread Luciano Resende
On Thu, Apr 23, 2015 at 5:26 PM, Sean Owen so...@cloudera.com wrote:

 Following my comment earlier that I think we set Assignee for Fixed
 JIRAs consistently, I found there are actually 880 counter examples.
 Lots of them are old, and I'll try to fix as many that are recent (for
 the 1.4.0 release credits) as I can stand to click through.

 Let's set Assignee after resolving consistently though. In various
 ways I've heard that people do really like the bit of credit, and I
 don't think anybody disputed setting Assignee *after* it was resolved
 as a way of giving credit.

 People who know they're missing a credit are welcome to ping me
 directly to get it fixed.



+1, this will help with giving people that bit of credit, and I guess it also
makes it easier to recognize community contributors on their way to becoming
committers.



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-23 Thread Luciano Resende
On Fri, Aug 21, 2015 at 9:28 AM, Sean Owen so...@cloudera.com wrote:

 Signatures, license, etc. look good. I'm getting some fairly
 consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive
 -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these? they
 are likely just test problems, but worth asking. Stack traces are at
 the end.

 There are currently 79 issues targeted for 1.5.0, of which 19 are
 bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.)
 That's significantly better than at the last release. I presume a lot
 of what's still targeted is not critical and can now be
 untargeted/retargeted.

 It occurs to me that the flurry of planning that took place at the
 start of the 1.5 QA cycle a few weeks ago was quite helpful, and is
 the kind of thing that would be even more useful at the start of a
 release cycle. So would be great to do this for 1.6 in a few weeks.
 Indeed there are already 267 issues targeted for 1.6.0 -- a decent
 roadmap already.


 Test failures:

 Core

 - Unpersisting TorrentBroadcast on executors and driver in distributed
 mode *** FAILED ***
   java.util.concurrent.TimeoutException: Can't find 2 executors before
 1 milliseconds elapsed
   at
 org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
   at
 org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
   at org.apache.spark.broadcast.BroadcastSuite.org
 $apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
   at
 org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
   at
 org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
   at
 org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
   at
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...

 Streaming

 - stop slow receiver gracefully *** FAILED ***
   0 was not greater than 0 (StreamingContextSuite.scala:324)

 Kafka

 - offset recovery *** FAILED ***
   The code passed to eventually never returned normally. Attempted 191
 times over 10.043196973 seconds. Last failure message:
 strings.forall({
 ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
   }) was false. (DirectKafkaStreamSuite.scala:249)



Hi Sean,

Were you able to resolve this? I am trying it on Linux (Ubuntu 14.04.3 LTS),
and when building with a clean Maven repo I am getting failures where it
can't find lib_managed/jars when building the Launcher. But it looks like you
got a bit further than this?

Running org.apache.spark.launcher.SparkSubmitCommandBuilderSuite
Tests run: 7, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 0.018 sec
<<< FAILURE! - in org.apache.spark.launcher.SparkSubmitCommandBuilderSuite
testDriverCmdBuilder(org.apache.spark.launcher.SparkSubmitCommandBuilderSuite)
Time elapsed: 0.005 sec  <<< ERROR!
java.lang.IllegalStateException: Library directory
'/home/lresende/dev/spark/source/releases/spark-1.5.0/lib_managed/jars'
does not exist.
at
org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:249)
at
org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:218)
at
org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:115)
at
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:196)
at
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:121)
at
org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.testCmdBuilder(SparkSubmitCommandBuilderSuite.java:174)
at
org.apache.spark.launcher.SparkSubmitCommandBuilderSuite.testDriverCmdBuilder(SparkSubmitCommandBuilderSuite.java:51)


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Bringing up JDBC Tests to trunk

2015-10-21 Thread Luciano Resende
I have started looking into PR-8101 [1] and what is required to merge it
into trunk, which will also unblock me on SPARK-10521 [2].

So here is the minimal plan I was thinking about:

- pin the Docker image version so we are sure we are using the same image
every time (a rough sketch of what this could look like is below)
- pull the required images on the Jenkins executors ahead of time so tests
are not delayed or timed out waiting for Docker images to download
- create a profile to run the JDBC tests
- create daily jobs for running the JDBC tests
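
To make the first two points concrete, here is a minimal, hypothetical sketch
of the shape such a suite could take (the suite name, image tag and port are
made up, and the actual PR drives Docker through a client library rather than
shelling out):

import java.sql.DriverManager
import scala.sys.process._
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class PostgresIntegrationSuite extends FunSuite with BeforeAndAfterAll {
  // Pin the image to an explicit tag so every run exercises the same server version.
  private val image = "postgres:9.4"
  private val container = "jdbc-integration-test-pg"

  override def beforeAll(): Unit = {
    // Pre-pulling this image on the Jenkins executors makes this a no-op there,
    // so the test itself never waits on a slow download.
    require(s"docker pull $image".! == 0, s"Failed to pull $image")
    require(s"docker run -d -p 15432:5432 --name $container $image".! == 0,
      s"Failed to start $container")
    // A real suite would retry here until the database accepts connections.
  }

  override def afterAll(): Unit = {
    s"docker rm -f $container".!
  }

  test("basic JDBC round trip") {
    // Assumes the PostgreSQL JDBC driver is on the test classpath.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:15432/postgres?user=postgres")
    try assert(!conn.isClosed) finally conn.close()
  }
}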


In parallel, I learned that Alan Chin from my team is working with the
AmpLab team to expand the build capacity for Spark, so I will use some of
the nodes he is preparing to test/run these builds for now.

Please let me know if there is anything else needed around this.


[1] https://github.com/apache/spark/pull/8101
[2] https://issues.apache.org/jira/browse/SPARK-10521

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-02 Thread Luciano Resende
 add newer versions of Hadoop for use by
> spark-ec2
> >> >> > and
> >> >> > similar tools, or should we just be getting that stuff via an
> Apache
> >> >> > mirror?
> >> >> > The latest version is 2.7.1, by the way.
> >> >> >
> >> >> >
> >> >> > you should be grabbing the artifacts off the ASF and then verifying
> >> >> > their
> >> >> > SHA1 checksums as published on the ASF HTTPS web site
> >> >> >
> >> >> >
> >> >> > The problem with the Apache mirrors, if I am not mistaken, is that
> >> >> > you
> >> >> > cannot use a single URL that automatically redirects you to a
> working
> >> >> > mirror
> >> >> > to download Hadoop. You have to pick a specific mirror and pray it
> >> >> > doesn't
> >> >> > disappear tomorrow.
> >> >> >
> >> >> >
> >> >> > They don't go away, especially http://mirror.ox.ac.uk , and in
> the us
> >> >> > the
> >> >> > apache.osuosl.org, osu being a where a lot of the ASF servers are
> >> >> > kept.
> >> >> >
> >> >> > full list with availability stats
> >> >> >
> >> >> > http://www.apache.org/mirrors/
> >> >> >
> >> >> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 1.4.1 (RC4)

2015-07-09 Thread Luciano Resende
+1 (non-binding), mostly looking into the legal aspects of the release.

On Wed, Jul 8, 2015 at 10:55 PM, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.4.1!

 This release fixes a handful of known issues in Spark 1.4.0, listed here:
 http://s.apache.org/spark-1.4.1

 The tag to be voted on is v1.4.1-rc4 (commit dbaa5c2):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
 dbaa5c294eb565f84d7032e387e4b8c1a56e4cd2

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc4-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.1]
 https://repository.apache.org/content/repositories/orgapachespark-1125/
 [published as version: 1.4.1-rc4]
 https://repository.apache.org/content/repositories/orgapachespark-1126/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc4-docs/

 Please vote on releasing this package as Apache Spark 1.4.1!

 The vote is open until Sunday, July 12, at 06:55 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


JDBC Dialect tests

2015-09-14 Thread Luciano Resende
I was looking for the code mentioned in SPARK-9818 and SPARK-6136 that
supposedly tests MySQL and PostgreSQL using Docker, and it seems that this
code has been removed. Could anyone give me a pointer to where these tests
are actually located at the moment, and how they are integrated with the
Spark build?

My goal is to integrate some DB2 JDBC Dialect tests as mentioned in
SPARK-10521



[1] https://issues.apache.org/jira/browse/SPARK-9818
[2] https://issues.apache.org/jira/browse/SPARK-6136
[3] https://issues.apache.org/jira/browse/SPARK-10521
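
For anyone following along, the dialects themselves are small classes under
org.apache.spark.sql.jdbc. The DB2 sketch below is simplified and hypothetical
(the type mapping is only an example), but it illustrates the shape of what
such tests would exercise:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Hypothetical, simplified DB2 dialect.
case object DB2DialectSketch extends JdbcDialect {
  // Claim JDBC URLs that point at DB2.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")

  // Map driver-reported types to Spark SQL types where the defaults fall short.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    typeName match {
      case "XML" => Some(StringType)  // example mapping only
      case _ => None                  // fall back to the default mapping
    }
}

// A test suite would register it in its setup, e.g.:
//   JdbcDialects.registerDialect(DB2DialectSketch)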

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: JDBC Dialect tests

2015-09-17 Thread Luciano Resende
Thanks Reynold,

Also, what is the status of the associated PR [1]? Are we planning to merge
it soon? This will help me with the DB2 dialect test framework using Docker.

Thanks

[1] https://github.com/apache/spark/pull/8101

On Mon, Sep 14, 2015 at 1:47 PM, Reynold Xin <r...@databricks.com> wrote:

> SPARK-9818 you link to actually links to a pull request trying to bring
> them back.
>
>
> On Mon, Sep 14, 2015 at 1:34 PM, Luciano Resende <luckbr1...@gmail.com>
> wrote:
>
>> I was looking for the code mentioned in SPARK-9818 and SPARK-6136 that
>> supposedly is testing MySQL and PostgreSQL using Docker and it seems that
>> this code has been removed. Could anyone provide me a pointer on where are
>> these tests actually located at the moment, and how they are integrated
>> with the Spark build ?
>>
>> My goal is to integrate some DB2 JDBC Dialect tests as mentioned in
>> SPARK-10521
>>
>>
>>
>> [1] https://issues.apache.org/jira/browse/SPARK-9818
>> [2] https://issues.apache.org/jira/browse/SPARK-6136
>> [3] https://issues.apache.org/jira/browse/SPARK-10521
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Open Issues for Contributors

2015-09-22 Thread Luciano Resende
You can use JIRA filters to narrow down the scope of issues you might want to
address. For instance, I use this filter to look into open issues that are
unassigned:

https://issues.apache.org/jira/issues/?filter=12333428

For a specific release, you can also filter on the target version; I think
Reynold sent this one a few days ago for 1.5.1:

https://issues.apache.org/jira/issues/?filter=1221
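
(For reference, the first filter above boils down to JQL roughly like:

project = SPARK AND status = Open AND assignee is EMPTY

so it is easy to tweak further, for example by restricting it to a given
target version.)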


On Tue, Sep 22, 2015 at 8:50 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> Where is the best place to look at open issues that haven't been
> assigned/started for the next release? I am interested in working on
> something, but I don't know what issues are higher priority for the next
> release.
>
> On a similar note, is there somewhere which outlines the overall goals for
> the next release (be it 1.5.1 or 1.6) with some parent issues along with
> smaller child issues to work on (like the built ins ticket from 1.5)?
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 208-340-1703
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: SparkR package path

2015-09-24 Thread Luciano Resende
>> >>
>> >>
>> >>
>> >>
>> >> --Hossein
>> >>
>> >>
>> >>
>> >> On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman
>> >> <shiva...@eecs.berkeley.edu> wrote:
>> >>
>> >> As Rui says it would be good to understand the use case we want to
>> >> support (supporting CRAN installs could be one for example). I don't
>> >> think it should be very hard to do as the RBackend itself doesn't use
>> >> the R source files. The RRDD does use it and the value comes from
>> >>
>> >>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
>> >> AFAIK -- So we could introduce a new config flag that can be used for
>> >> this new mode.
>> >>
>> >> Thanks
>> >> Shivaram
>> >>
>> >>
>> >> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui <rui@intel.com> wrote:
>> >> > Hossein,
>> >> >
>> >> >
>> >> >
>> >> > Any strong reason to download and install SparkR source package
>> >> > separately
>> >> > from the Spark distribution?
>> >> >
>> >> > An R user can simply download the spark distribution, which contains
>> >> > SparkR
>> >> > source and binary package, and directly use sparkR. No need to
>> install
>> >> > SparkR package at all.
>> >> >
>> >> >
>> >> >
>> >> > From: Hossein [mailto:fal...@gmail.com]
>> >> > Sent: Tuesday, September 22, 2015 9:19 AM
>> >> > To: dev@spark.apache.org
>> >> > Subject: SparkR package path
>> >> >
>> >> >
>> >> >
>> >> > Hi dev list,
>> >> >
>> >> >
>> >> >
>> >> > SparkR backend assumes SparkR source files are located under
>> >> > "SPARK_HOME/R/lib/." This directory is created by running
>> >> > R/install-dev.sh.
>> >> > This setting makes sense for Spark developers, but if an R user
>> >> > downloads
>> >> > and installs SparkR source package, the source files are going to be
>> >> > placed in different locations.
>> >> >
>> >> >
>> >> >
>> >> > In the R runtime it is easy to find location of package files using
>> >> > path.package("SparkR"). But we need to make some changes to R backend
>> >> > and/or
>> >> > spark-submit so that, JVM process learns the location of worker.R and
>> >> > daemon.R and shell.R from the R runtime.
>> >> >
>> >> >
>> >> >
>> >> > Do you think this change is feasible?
>> >> >
>> >> >
>> >> >
>> >> > Thanks,
>> >> >
>> >> > --Hossein
>> >>
>> >>
>> >
>> >
>>
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Luciano Resende
+1 (non-binding)

Compiled on Mac OS with:
build/mvn -Pyarn,sparkr,hive,hive-thriftserver
-Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package

Checked around R
Looked into legal files

All looks good.


On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin <r...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.1
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 81 known issues in Spark 1.5.0, listed here:
> http://s.apache.org/spark-1.5.1
>
> The tag to be voted on is v1.5.1-rc1:
>
> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (1.5.1) can be found at:
> *https://repository.apache.org/content/repositories/orgapachespark-1148/
> <https://repository.apache.org/content/repositories/orgapachespark-1148/>*
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
> present in 1.5.0 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.1?
> ===
> Please target 1.5.2 or 1.6.0.
>
>
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Luciano Resende
 partition insert
 - DataSourceRegister interface for external data sources to specify short
 names

 SparkR

 - YARN cluster mode in R
 - GLMs with R formula, binomial/Gaussian families, and elastic-net
 regularization
 - Improved error messages
 - Aliases to make DataFrame functions more R-like

 Streaming

 - Backpressure for handling bursty input streams.
 - Improved Python support for streaming sources (Kafka offsets, Kinesis,
 MQTT, Flume)
 - Improved Python streaming machine learning algorithms (K-Means, linear
 regression, logistic regression)
 - Native reliable Kinesis stream support
 - Input metadata like Kafka offsets made visible in the batch details UI
 - Better load balancing and scheduling of receivers across cluster
 - Include streaming storage in web UI

 Machine Learning and Advanced Analytics

 - Feature transformers: CountVectorizer, Discrete Cosine transformation,
 MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
 - Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
 regression.
 - Algorithms: multilayer perceptron classifier, PrefixSpan for sequential
 pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov
 test.
 - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
 - More efficient Pregel API implementation for GraphX
 - Model summary for linear and logistic regression.
 - Python API: distributed matrices, streaming k-means and linear models,
 LDA, power iteration clustering, etc.
 - Tuning and evaluation: train-validation split and multiclass
 classification evaluator.
 - Documentation: document the release version of public API methods




-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: IntelliJ license for committers?

2015-12-02 Thread Luciano Resende
On Wed, Dec 2, 2015 at 3:02 PM, Reynold Xin <r...@databricks.com> wrote:

> For IntelliJ I think the free version is sufficient for Spark development.
>
>
And I believe anyone with an @apache.org e-mail address can request their
own personal license for IntelliJ. That's what I personally did.

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Bringing up JDBC Tests to trunk

2015-12-03 Thread Luciano Resende
On Mon, Nov 30, 2015 at 1:53 PM, Josh Rosen <joshro...@databricks.com>
wrote:

> The JDBC drivers are currently being pulled in as test-scope dependencies
> of the `sql/core` module:
> https://github.com/apache/spark/blob/f2fbfa444f6e8d27953ec2d1c0b3abd603c963f9/sql/core/pom.xml#L91
>
> In SBT, these wind up on the Docker JDBC tests' classpath as a transitive
> dependency of the `spark-sql` test JAR. However, what we *should* be
> doing is adding them as explicit test dependencies of the
> `docker-integration-tests` subproject, since Maven handles transitive test
> JAR dependencies differently than SBT (see
> https://github.com/apache/spark/pull/9876#issuecomment-158593498 for some
> discussion). If you choose to make that fix as part of your PR, be sure to
> move the version handling to the root POM's  section
> so that the versions in both modules stay in sync. We might also be able to
> just simply move the JDBC driver dependencies to docker-integration-tests'
> POM if it turns out that they're not used anywhere else (that's my hunch).
>
>

So, the issue I am having now is that the DB2 JDBC driver is not available in
any public Maven repository, so the plan I am going with is:

- Before running the DB2 Docker tests, the client machine needs to download
the JDBC driver and install it into its local Maven repository (or the sbt
equivalent), roughly as sketched below (instructions to be provided in either
the README or the POM file)

- We would need help with installing the DB2 JDBC driver on the Jenkins slave
machines

- We could also create a new profile for the DB2 Docker tests, so that these
tests only run when this profile is enabled.
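
For the first step, a rough sketch of the manual install (the group/artifact
coordinates, jar name and version below are only illustrative; the real ones
would go into the README/POM instructions):

mvn install:install-file -Dfile=db2jcc4.jar -DgroupId=com.ibm.db2 \
  -DartifactId=db2jcc4 -Dversion=4.x -Dpackaging=jar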

I could probably think of other options, but they all sound a lot hackier.

Thoughts? Any suggestions?

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-19 Thread Luciano Resende
rg/jira/browse/SPARK-9836> R-like
>   statistics for GLMs - (Partial) R-like stats for ordinary least
>   squares via summary(model)
>   - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
>   interactions in R formula - Interaction operator ":" in R formula
>- Python API - Many improvements to Python API to approach feature
>parity
>
> Misc improvements
>
>- SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
>SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
>weights for GLMs - Logistic and Linear Regression can take instance
>weights
>- SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
>and bivariate statistics in DataFrames - Variance, stddev,
>correlations, etc.
>- SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
>data source - LIBSVM as a SQL data sourceDocumentation improvements
>- SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
>versions - Documentation includes initial version when classes and
>methods were added
>- SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
>example code - Automated testing for code in user guide examples
>
> Deprecations
>
>- In spark.mllib.clustering.KMeans, the "runs" parameter has been
>deprecated.
>- In spark.ml.classification.LogisticRegressionModel and
>spark.ml.regression.LinearRegressionModel, the "weights" field has been
>deprecated, in favor of the new name "coefficients." This helps
>disambiguate from instance (row) weights given to algorithms.
>
> Changes of behavior
>
>- spark.mllib.tree.GradientBoostedTrees validationTol has changed
>semantics in 1.6. Previously, it was a threshold for absolute change in
>error. Now, it resembles the behavior of GradientDescent convergenceTol:
>For large errors, it uses relative error (relative to the previous error);
>for small errors (< 0.01), it uses absolute error.
>- spark.ml.feature.RegexTokenizer: Previously, it did not convert
>strings to lowercase before tokenizing. Now, it converts to lowercase by
>default, with an option not to. This matches the behavior of the simpler
>Tokenizer transformer.
>- Spark SQL's partition discovery has been changed to only discover
>partition directories that are children of the given path. (i.e. if
>path="/my/data/x=1" then x=1 will no longer be considered a partition
>but only children of x=1.) This behavior can be overridden by manually
>specifying the basePath that partitioning discovery should start with (
>SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
>- When casting a value of an integral type to timestamp (e.g. casting
>a long value to timestamp), the value is treated as being in seconds
>instead of milliseconds (SPARK-11724
><https://issues.apache.org/jira/browse/SPARK-11724>).
>- With the improved query planner for queries having distinct
>aggregations (SPARK-9241
><https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>query having a single distinct aggregation has been changed to a more
>robust version. To switch back to the plan generated by Spark 1.5's
>planner, please set spark.sql.specializeSingleDistinctAggPlanning to
>true (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Bringing up JDBC Tests to trunk

2015-11-22 Thread Luciano Resende
Hey Josh,

Thanks for helping bring this up. I have just pushed a WIP PR for getting the
DB2 tests running on Docker, and I have a question about how the JDBC drivers
are actually set up for the other data sources (MySQL and PostgreSQL): are
these installed directly on the Jenkins slaves? I didn't see the jars or
anything specific in the POM or other files...


Thanks

On Wed, Oct 21, 2015 at 1:26 PM, Josh Rosen <rosenvi...@gmail.com> wrote:

> Hey Luciano,
>
> This sounds like a reasonable plan to me. One of my colleagues has written
> some Dockerized MySQL testing utilities, so I'll take a peek at those to
> see if there are any specifics of their solution that we should adapt for
> Spark.
>
> On Wed, Oct 21, 2015 at 1:16 PM, Luciano Resende <luckbr1...@gmail.com>
> wrote:
>
>> I have started looking into PR-8101 [1] and what is required to merge it
>> into trunk which will also unblock me around SPARK-10521 [2].
>>
>> So here is the minimal plan I was thinking about :
>>
>> - make the docker image version fixed so we make sure we are using the
>> same image all the time
>> - pull the required images on the Jenkins executors so tests are not
>> delayed/timedout because it is waiting for docker images to download
>> - create a profile to run the JDBC tests
>> - create daily jobs for running the JDBC tests
>>
>>
>> In parallel, I learned that Alan Chin from my team is working with the
>> AmpLab team to expand the build capacity for Spark, so I will use some of
>> the nodes he is preparing to test/run these builds for now.
>>
>> Please let me know if there is anything else needed around this.
>>
>>
>> [1] https://github.com/apache/spark/pull/8101
>> [2] https://issues.apache.org/jira/browse/SPARK-10521
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Luciano Resende
On Mon, Jun 6, 2016 at 12:05 PM, Reynold Xin  wrote:

> The bahir one was a good argument actually. I just clicked the button to
> push it into Maven central.
>
>
Thank You !!!


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Luciano Resende
On Mon, Jun 6, 2016 at 9:51 AM, Sean Owen <so...@cloudera.com> wrote:

> I still don't know where this "severely compromised builds of limited
> usefulness" thing comes from? what's so bad? You didn't veto its
> release, after all. And rightly so: a release doesn't mean "definitely
> works"; it means it was created the right way. It's OK to say it's
> buggy alpha software; this isn't an argument to not really release it.
>
> But aside from that: if it should be used by someone, then who did you
> have in mind?
>
> It would be coherent at least to decide not to make alpha-like
> release, but, we agreed to, which is why this argument sort of
> surprises me.
>
> I share some concerns about piling on Databricks. Nothing here is by
> nature about an organization. However, this release really began in
> response to a thread (which not everyone here can see) about
> Databricks releasing a "2.0.0 preview" option in their product before
> it existed. I presume employees of that company sort of endorse this,
> which has put this same release into the hands of not just developers
> or admins but end users -- even with caveats and warnings.
>
> (And I think that's right!)
>
>

In this case, I would only expect the 2.0.0 preview to be treated as just
any other release, period.


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Luciano Resende
On Mon, Jun 6, 2016 at 11:12 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Is there any way to remove artifacts from Maven Central? Maybe that would
> help clean these things up long-term, though it would create problems for
> users who for some reason decide to rely on these previews.
>
> In any case, if people are *really* concerned about this, we should just
> put it there. My thought was that it's better for users to do something
> special to link to this release (e.g. add a reference to the staging repo)
> so that they are more likely to know that it's a special, unstable thing.
> Same thing they do to use snapshots.
>
> Matei
>
>
So, consider this thread started on another project :
https://www.mail-archive.com/dev@bahir.apache.org/msg00038.html

What would be your recommendation?
   - Start a release based on the Apache Spark 2.0.0-preview staging repo? I
would reject that...
   - Start a release on a set of artifacts that are going to be deleted? I
would also reject that.

To me, if companies are using the release in their products, and other
projects are relying on the release to provide a way for users to test, it
should be treated like any other release and published permanently; at some
point it will become obsolete and users will move on to more stable releases.

Thanks



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Luciano Resende
On Mon, Jun 6, 2016 at 10:08 AM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> I still don't know where this "severely compromised builds of limited
>> usefulness" thing comes from? what's so bad? You didn't veto its
>> release, after all.
>
>
> I simply mean that it was released with the knowledge that there are still
> significant bugs in the preview that definitely would warrant a veto if
> this were intended to be on a par with other releases.  There have been
> repeated announcements to that effect, but developers finding the preview
> artifacts on Maven Central months from now may well not also see those
> announcements and related discussion.  The artifacts will be very stale and
> no longer useful for their limited testing purpose, but will persist in the
> repository.
>
>
A few months from now, why would a developer choose a preview, alpha, or beta
over the GA 2.0 release?

As for being stale, that is true of every release anyone puts out there.


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Labeling Jiras

2016-05-25 Thread Luciano Resende
I recently used labels to mark a couple of JIRAs that my team and I have some
interest in, so it's easier to share a query and check their status. But I
noticed that these labels were removed.

Are there any issues with labeling JIRAs? Any other suggestions?



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-25 Thread Luciano Resende
On Wed, May 25, 2016 at 6:53 AM, Marcin Tustin <mtus...@handybook.com>
wrote:

> Would it be useful to start baking docker images? Would anyone find that a
> boon to their testing?
>
>
+1. I had done one (still based on 1.6) for some SystemML experiments; I
could easily rebase it on a nightly build.

https://github.com/lresende/docker-spark

One question though: how often should the image be updated? Every night?
Every week? I could see if I can automate the build + publish in a CI job
on one of our Jenkins servers (Apache or otherwise)...



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Labeling Jiras

2016-05-25 Thread Luciano Resende
On Wed, May 25, 2016 at 2:33 PM, Sean Owen <so...@cloudera.com> wrote:

> I don't think we generally use labels at all except "starter". I
> sometimes remove labels when I'm editing a JIRA otherwise, perhaps to
> make that point. I don't recall doing this recently.
>

We have used them for other things in the past, for example to identify the
big-endian related issues:
https://issues.apache.org/jira/browse/SPARK-15154?jql=labels%20%3D%20big-endian


> However I'd say they should not be used to tag JIRAs for your internal
> purposes. Have you looked at things like JIRA Client from Almworks?
> It's free and I highly recommend it, and IIRC it lets you manage some
> private labels locally.
>
>
The issue with maintaining anything locally is that it's not easily
shareable (e.g. I can't just send someone a link to a query).

The question is more: what issues can be caused by using labels?


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [ANNOUNCE] Apache Spark 2.0.0-preview release

2016-05-25 Thread Luciano Resende
On Wed, May 25, 2016 at 2:34 PM, Sean Owen <so...@cloudera.com> wrote:

> I don't think the project would bless anything but the standard
> release artifacts since only those are voted on. People are free to
> maintain whatever they like and even share it, as long as it's clear
> it's not from the Apache project.
>
>
+1


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Labeling Jiras

2016-05-25 Thread Luciano Resende
On Wed, May 25, 2016 at 3:45 PM, Reynold Xin <r...@databricks.com> wrote:

> I think the risk is everybody starts following this, then this will be
> unmanageable, given the size of the number of organizations involved.
>
> The two main labels that we actually use are starter + releasenotes.
>
>
Well, if we consider the worst-case scenario and we have a JIRA with, let's
say, a few labels, what harm does it do?


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Welcoming Yanbo Liang as a committer

2016-06-04 Thread Luciano Resende
Congratulations Yanbo !!!

On Fri, Jun 3, 2016 at 7:48 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Hi all,
>
> The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been a
> super active contributor in many areas of MLlib. Please join me in
> welcoming Yanbo!
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Luciano Resende
On Wed, Jun 22, 2016 at 7:46 AM, Cody Koeninger <c...@koeninger.org> wrote:

> As far as I know the only thing blocking it at this point is lack of
> committer review / approval.
>
> It's technically adding a new feature after spark code-freeze, but it
> doesn't change existing code, and the kafka project didn't release
> 0.10 until the end of may.
>
>

To be fair with the Kafka 0.10 PR assessment:

I was expecting a somewhat easy transition for customers moving from the 0.8
connector to the 0.10 one, but 0.10 seems to have been treated as a completely
new extension. Also, there is no Python support, no samples on the PR
demonstrating how to use the security capabilities, and no documentation
updates.

Thanks

-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Spark streaming connectors available for testing

2016-06-27 Thread Luciano Resende
The Apache Bahir project is voting on a release based on Spark 2.0.0-preview.
https://www.mail-archive.com/dev@bahir.apache.org/msg00085.html

It currently provides the following Apache Spark Streaming connectors:

streaming-akka
streaming-mqtt
streaming-twitter
streaming-zeromq

While we continue to work towards a release supporting Spark 2.0.0, we would
appreciate your help testing this release and the current Spark Streaming
connectors.

To add the connectors to your Scala application, the best way is to build the
Bahir source with 'mvn clean install', which will make the necessary
dependencies available in your local Maven repository. This enables you to
reference the connectors in your application and also to submit your
application to a local Spark test environment using --packages.

Build:
mvn clean install

Add repository to your scala application (build.sbt):
resolvers += "Local Maven Repository" at "file://" +
Path.userHome.absolutePath + "/.m2/repository"
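
With the local resolver in place, the connector can then be referenced like
any other dependency, for example (matching the akka connector coordinates
used in the spark-submit line below):

// assumes scalaVersion is set to 2.11.x so that %% resolves to the _2.11 artifact
libraryDependencies += "org.apache.bahir" %% "spark-streaming-akka" % "2.0.0-preview"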

Submit your application to a local Spark test environment:
bin/spark-submit --master spark://127.0.0.1:7077 --packages
org.apache.bahir:spark-streaming-akka_2.11:2.0.0-preview --class
org.apache.spark.examples.streaming.akka.ActorWordCount
~/opensource/apache/bahir/streaming-akka-examples/target/scala-2.11/streaming-akka-examples_2.11-1.0.jar
localhost 


The Bahir community welcomes questions, comments, bug reports and all your
feedback.

http://bahir.apache.org/community/

Thanks


Re: Welcoming two new committers

2016-02-08 Thread Luciano Resende
On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Hi all,
>
> The PMC has recently added two new Spark committers -- Herman van Hovell
> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
> adding new features, optimizations and APIs. Please join me in welcoming
> Herman and Wenchen.
>
> Matei
>

Congratulations !!!

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Scala 2.11 default build

2016-02-04 Thread Luciano Resende
There were a few more issues; I have started tracking them at

https://issues.apache.org/jira/browse/SPARK-13189

On Thu, Feb 4, 2016 at 2:08 AM, Prashant Sharma <scrapco...@gmail.com>
wrote:

> Yes, That should be changed to 2.11.7. Mind sending a patch ?
>
> Prashant Sharma
>
>
>
> On Thu, Feb 4, 2016 at 2:11 PM, zzc <441586...@qq.com> wrote:
>
>> hi, rxin, in pom.xml file, 'scala.version' still is 2.10.5, does  it need
>> to
>> be modified to 2.11.7?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-2-11-default-build-tp16157p16207.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Spark 1.6.1

2016-02-22 Thread Luciano Resende
On Mon, Feb 22, 2016 at 9:08 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> An update: people.apache.org has been shut down so the release scripts
> are broken. Will try again after we fix them.
>
>
If you skip uploading to people.a.o, it should still be available in nexus
for review.

The other option is to add the RC into
https://dist.apache.org/repos/dist/dev/



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Luciano Resende
If the intention is to actually decouple these connectors and give them a
life of their own, I would have expected them to still be hosted as separate
git repositories inside Apache, even though users would not really see much
difference since they would still be mirrored on GitHub. This also makes it
much easier on the legal departments of upstream consumers and customers,
because the code still follows the well-received and trusted Apache
governance and release policies. As for implementation details, we could have
multiple repositories if we expect a lot of fragmented releases, or a single
"connectors" repository, which on our side would make administration easier.

On Thu, Mar 17, 2016 at 2:33 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Hi Reynold, thanks for the info.
>
> On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
> > If one really feels strongly that we should go through all the overhead
> to
> > setup an ASF subproject for these modules that won't work with the new
> > structured streaming, and want to spearhead to setup separate repos
> > (preferably one subproject per connector), CI, separate JIRA, governance,
> > READMEs, voting, we can discuss that. Until then, I'd keep the github
> option
> > open because IMHO it is what works the best for end users (including
> > discoverability, issue tracking, release publishing, ...).
>

Agree that there might be a little overhead, but there are ways to minimize
it, and I am sure there are volunteers willing to help in favor of having a
more unified project. Breaking things into multiple projects and having to
manage the matrix of supported versions would be a far worse overhead.


>
> For those of us who are not exactly familiar with the inner workings
> of administrating ASF projects, would you mind explaining in more
> detail what this overhead is?
>
> From my naive point of view, when I say "sub project" I assume that
> it's a simple as having a separate git repo for it, tied to the same
> parent project. Everything else - JIRA, committers, bylaws, etc -
> remains the same. And since the project we're talking about are very
> small, CI should be very simple (Travis?) and, assuming sporadic
> releases, things overall should not be that expensive to maintain.
>
>
Subprojects, or even sending this back to the incubator as a "connectors"
project, would be better than a public GitHub repo per package, in my opinion.


Now, if this move is signaling to customers that the Streaming API as in 1.x
is going away in favor of the new Structured Streaming APIs, then I guess this
is a completely different discussion.


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: SPARK-13843 and future of streaming backends

2016-03-18 Thread Luciano Resende
On Fri, Mar 18, 2016 at 10:07 AM, Marcelo Vanzin <van...@cloudera.com>
wrote:

> Hi Steve, thanks for the write up.
>
> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
> > If you want a separate project, eg. SPARK-EXTRAS, then it *generally*
> needs to go through incubation. While normally its the incubator PMC which
> sponsors/oversees the incubating project, it doesn't have to be the case:
> the spark project can do it.
> >
> > Also Apache Arrow managed to make it straight to toplevel without that
> process. Given that the spark extras are already ASF source files, you
> could try the same thing, add all the existing committers, then look for
> volunteers to keep things.
>
> Am I to understand from your reply that it's not possible for a single
> project to have multiple repos?
>
>
It can have multiple repos, but this still puts the maintenance overhead on
the PMC, which was brought up previously in this thread, and it might not be
the direction the PMC wants to take (but I might be mistaken).

Another approach is to make this "extras" project just a subproject, with its
own set of committers etc., which puts less of a burden on the Spark PMC.

Anyway, my main issue here is not who will manage it or how, but that it
continues under Apache governance.

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-28 Thread Luciano Resende
+1. I also checked with a few projects inside IBM that consume Spark, and
they seem to be OK with the direction of dropping JDK 7.

On Mon, Mar 28, 2016 at 11:24 AM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> +1 from Mesosphere
>
> On Mon, Mar 28, 2016 at 5:12 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>>
>> > On 25 Mar 2016, at 01:59, Mridul Muralidharan <mri...@gmail.com> wrote:
>> >
>> > Removing compatibility (with jdk, etc) can be done with a major
>> release- given that 7 has been EOLed a while back and is now unsupported,
>> we have to decide if we drop support for it in 2.0 or 3.0 (2+ years from
>> now).
>> >
>> > Given the functionality & performance benefits of going to jdk8, future
>> enhancements relevant in 2.x timeframe ( scala, dependencies) which
>> requires it, and simplicity wrt code, test & support it looks like a good
>> checkpoint to drop jdk7 support.
>> >
>> > As already mentioned in the thread, existing yarn clusters are
>> unaffected if they want to continue running jdk7 and yet use spark2
>> (install jdk8 on all nodes and use it via JAVA_HOME, or worst case
>> distribute jdk8 as archive - suboptimal).
>>
>> you wouldn't want to dist it as an archive; it's not just the binaries,
>> it's the install phase. And you'd better remember to put the JCE jar in on
>> top of the JDK for kerberos to work.
>>
>> setting up environment vars to point to JDK8 in the launched
>> app/container avoids that. Yes, the ops team do need to install java, but
>> if you offer them the choice of "installing a centrally managed Java" and
>> "having my code try and install it", they should go for the managed option.
>>
>> One thing to consider for 2.0 is to make it easier to set up those env
>> vars for both python and java. And, as the techniques for mixing JDK
>> versions is clearly not that well known, documenting it.
>>
>> (FWIW I've done code which even uploads it's own hadoop-* JAR, but what
>> gets you is changes in the hadoop-native libs; you do need to get the PATH
>> var spot on)
>>
>>
>> > I am unsure about mesos (standalone might be easier upgrade I guess ?).
>> >
>> >
>> > Proposal is for 1.6x line to continue to be supported with critical
>> fixes; newer features will require 2.x and so jdk8
>> >
>> > Regards
>> > Mridul
>> >
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>
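
To make the env-var approach described in the quoted thread concrete: on YARN
this amounts to pointing the application master and executor environments at
the JDK 8 install, for example via spark-defaults.conf (the path below is of
course site-specific):

spark.yarn.appMasterEnv.JAVA_HOME   /usr/lib/jvm/java-8-openjdk
spark.executorEnv.JAVA_HOME         /usr/lib/jvm/java-8-openjdk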



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-03-26 Thread Luciano Resende
ing-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> >
> > On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan <mri...@gmail.com>
> wrote:
> >> I am not referring to code edits - but to migrating submodules and
> >> code currently in Apache Spark to 'outside' of it.
> >> If I understand correctly, assets from Apache Spark are being moved
> >> out of it into thirdparty external repositories - not owned by Apache.
> >>
> >> At a minimum, dev@ discussion (like this one) should be initiated.
> >> As PMC is responsible for the project assets (including code), signoff
> >> is required for it IMO.
> >>
> >> More experienced Apache members might be opine better in case I got it
> wrong !
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
> >>> Why would a PMC vote be necessary on every code deletion?
> >>>
> >>> There was a Jira and pull request discussion about the submodules that
> >>> have been removed so far.
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-13843
> >>>
> >>> There's another ongoing one about Kafka specifically
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-13877
> >>>
> >>>
> >>> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mri...@gmail.com>
> wrote:
> >>>>
> >>>> I was not aware of a discussion in Dev list about this - agree with
> most of
> >>>> the observations.
> >>>> In addition, I did not see PMC signoff on moving (sub-)modules out.
> >>>>
> >>>> Regards
> >>>> Mridul
> >>>>
> >>>>
> >>>>
> >>>> On Thursday, March 17, 2016, Marcelo Vanzin <van...@cloudera.com>
> wrote:
> >>>>>
> >>>>> Hello all,
> >>>>>
> >>>>> Recently a lot of the streaming backends were moved to a separate
> >>>>> project on github and removed from the main Spark repo.
> >>>>>
> >>>>> While I think the idea is great, I'm a little worried about the
> >>>>> execution. Some concerns were already raised on the bug mentioned
> >>>>> above, but I'd like to have a more explicit discussion about this so
> >>>>> things don't fall through the cracks.
> >>>>>
> >>>>> Mainly I have three concerns.
> >>>>>
> >>>>> i. Ownership
> >>>>>
> >>>>> That code used to be run by the ASF, but now it's hosted in a github
> >>>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
> >>>>> problematic.
> >>>>>
> >>>>> ii. Governance
> >>>>>
> >>>>> Similar to the above; who has commit access to the above repos? Will
> >>>>> all the Spark committers, present and future, have commit access to
> >>>>> all of those repos? Are they still going to be considered part of
> >>>>> Spark and have release management done through the Spark community?
> >>>>>
> >>>>>
> >>>>> For both of the questions above, why are they not turned into
> >>>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
> >>>>> a mechanism to do that, without the need to keep the code in the main
> >>>>> Spark repo, right?
> >>>>>
> >>>>> iii. Usability
> >>>>>
> >>>>> This is another thing I don't see discussed. For Scala-based code
> >>>>> things don't change much, I guess, if the artifact names don't change
> >>>>> (another reason to keep things in the ASF?), but what about python?
> >>>>> How are pyspark users expected to get that code going forward, since
> >>>>> it's not in Spark's pyspark.zip anymore?
> >>>>>
> >>>>>
> >>>>> Is there an easy way of keeping these things within the ASF Spark
> >>>>> project? I think that would be better for everybody.
> >>>>>
> >>>>> --
> >>>>> Marcelo
> >>>>>
> >>>>> -
> >>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>>>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>>>
> >>>>
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-03-26 Thread Luciano Resende
On Sat, Mar 26, 2016 at 10:20 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Luciano,
>
> If we take the "pure" technical vision, there's pros and cons of having
> spark-extra (or whatever the name we give) still as an Apache project:
>
> Pro:
>  - Governance & Quality Insurance: we follow the Apache rules, meaning
> that a release has to be staged and voted by the PMC. It's a form of
> governance of the project and quality (as the releases are reviewed).
>  - Software origin: users know where the connector comes from, and they
> have the guarantee in term of licensing, etc.
>  - IP/ICLA: We know the committers of this project, and we know they agree
> with the ICL agreement.
>
> Cons:
>  - Third licenses support. As an Apache project, the "connectors" will be
> allowed to use only Apache or Category B licensed dependencies. For
> instance, if I would like to create a Spark connector for couchbase, I
> can't do it at Apache.
>

Yes, this does not solve the incompatible-license problem.


>  - Release cycle. As an Apache project, it means we have to follow the
> rules, meaning that the release cycle can appear strict and long due to the
> staging and vote process. For me, it's a huge benefit but some can see as
> too strict ;)
>

IMHO, this is a small price to pay for all the good stuff you mentioned under
the pros.


>
> Maybe, we can imagine both, as we have in ServiceMix or Camel:
> - all modules/connectors matching the Apache rule (especially in term of
> licensing) should be in the Apache Spark-Modules (or Spark-Extensions, or
> whatever). It's like the ServiceMix Bundles.
>

If you are talking here about Spark proper, then we are currently seeing that
this is going to be hard. If there were a way to have more flexibility to host
these directly in Spark proper, I would never have started this thread, as we
would get all the pros you mentioned by hosting them directly in Spark.


> - all modules/connectors that can't fit into the Apache rule (due to
> licensing issue) can go into GitHub Spark-Extra (or Spark-Package). It's
> like the ServiceMix Extra or Camel Extra on github.
>
>
We could look into this, but that might be a "Spark Extras" discussion on how
we can help foster a community around the connectors with incompatible
licenses.


> My $0.01.
>
> Regards
> JB
>
>
> On 03/26/2016 06:07 PM, Luciano Resende wrote:
>
>> I believe some of this has been resolved in the context of some parts
>> that had interest in one extra connector, but we still have a few
>> removed, and as you mentioned, we still don't have a simple way or
>> willingness to manage and be current on new packages like kafka. And
>> based on the fact that this thread is still alive, I believe that other
>> community members might have other concerns as well.
>>
>> After some thought, I believe having a separate project (what was
>> mentioned here as Spark Extras) to handle Spark Connectors and Spark
>> add-ons in general could be very beneficial to Spark and the overall
>> Spark community, which would have a central place in Apache to
>> collaborate around related Spark components.
>>
>> Some of the benefits on this approach
>>
>> - Enables maintaining the connectors inside Apache, following the Apache
>> governance and release rules, while allowing Spark proper to focus on
>> the core runtime.
>> - Provides more flexibility in controlling the direction (currency) of
>> the existing connectors (e.g. willing to find a solution and maintain
>> multiple versions of same connectors like kafka 0.8x and 0.9x)
>> - Becomes a home for other types of Spark related connectors helping
>> expanding the community around Spark (e.g. Zeppelin see most of it's
>> current contribution around new/enhanced connectors)
>>
>> What are some requirements for Spark Extras to be successful:
>>
>> - Be up to date with Spark Trunk APIs (based on daily CIs against
>> SNAPSHOT)
>> - Adhere to Spark release cycles (have a very little window compared to
>> Spark release)
>> - Be more open and flexible to the set of connectors it will accept and
>> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we
>> have today)
>>
>> Where to start Spark Extras
>>
>> Depending on the interest here, we could follow the steps of (Apache
>> Arrow) and start this directly as a TLP, or start as an incubator
>> project. I would consider the first option first.
>>
>> Who would participate
>>
>> Have thought about this for a bit, and if we go to the direction of TLP,
>> I would say Spark Committers and Apache Members can request 

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-04-04 Thread Luciano Resende
>>>>
>>>>> I understand both sides. But if you look at what I've been asking
>>>>> since the beginning, it's all about the cost and benefits of dropping
>>>>> support for java 1.7.
>>>>>
>>>>> The biggest argument in your original e-mail is about testing. And the
>>>>> testing cost is much bigger for supporting scala 2.10 than it is for
>>>>> supporting java 1.7. If you read one of my earlier replies, it should
>>>>> be even possible to just do everything in a single job - compile for
>>>>> java 7 and still be able to test things in 1.8, including lambdas,
>>>>> which seems to be the main thing you were worried about.
>>>>>
>>>>>
>>>>> > On Thu, Mar 24, 2016 at 4:48 PM, Marcelo Vanzin <van...@cloudera.com>
>>>>> wrote:
>>>>> >>
>>>>> >> On Thu, Mar 24, 2016 at 4:46 PM, Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>> >> > Actually it's *way* harder to upgrade Scala from 2.10 to 2.11,
>>>>> than
>>>>> >> > upgrading the JVM runtime from 7 to 8, because Scala 2.10 and
>>>>> 2.11 are
>>>>> >> > not
>>>>> >> > binary compatible, whereas JVM 7 and 8 are binary compatible
>>>>> except
>>>>> >> > certain
>>>>> >> > esoteric cases.
>>>>> >>
>>>>> >> True, but ask anyone who manages a large cluster how long it would
>>>>> >> take them to upgrade the jdk across their cluster and validate all
>>>>> >> their applications and everything... binary compatibility is a tiny
>>>>> >> drop in that bucket.
>>>>> >>
>>>>> >> --
>>>>> >> Marcelo
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Marcelo
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>>
>>>>>
>>>
>>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-28 Thread Luciano Resende
Just want to provide a quick update that we have submitted the "Spark
Extras" proposal for review by the Apache board (see link below with the
contents).

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing

Note that we are still searching for a project name that does not include
"Spark", and we will provide an update here once we find a suitable name.
Suggestions are welcome (please send them directly to my inbox to avoid
flooding the mailing list).

Thanks


On Sun, Apr 17, 2016 at 9:16 AM, Luciano Resende <luckbr1...@gmail.com>
wrote:

>
>
> On Sat, Apr 16, 2016 at 11:12 PM, Reynold Xin <r...@apache.org> wrote:
>
>> First, really thank you for leading the discussion.
>>
>> I am concerned that it'd hurt Spark more than it helps. As many others
>> have pointed out, this unnecessarily creates a new tier of connectors or
>> 3rd party libraries appearing to be endorsed by the Spark PMC or the ASF.
>> We can alleviate this concern by not having "Spark" in the name, and the
>> project proposal and documentation should label clearly that this is not
>> affiliated with Spark.
>>
>
> I really thought we could use the Spark name (e.g. similar to
> spark-packages) as this project is really aligned and dedicated to curating
> extensions to Apache Spark and that's why we were inviting Spark PMC
> members to join the new project PMC so that Apache Spark has the necessary
> oversight and influence on the project direction. I understand folks have
> concerns with the name, and thus we will start looking into name
> alternatives unless there is any way I could address the community concerns
> around this.
>
>
>>
>> Also Luciano - assuming you are interested in creating a project like
>> this and find a home for the connectors that were removed, I find it
>> surprising that few of the initially proposed PMC members have actually
>> contributed much to the connectors, and people that have contributed a lot
>> were left out. I am sure that is just an oversight.
>>
>>
> Reynold, thanks for your concern, we are not leaving anyone out, we took
> the following criteria to identify initial PMC/Committers list as described
> on the first e-mail on this thread:
>
>- Spark Committers and Apache Members can request to participate as PMC
> members
>- All active spark committers (committed on the last one year) will
> have write access to the project (committer access)
>- Other committers can request to become committers.
>- Non committers would be added based on meritocracy after the start of
> the project.
>
> Based on this criteria, all people that have expressed interest in joining
> the project PMC has been added to it, but I don't feel comfortable adding
> names to it at my will. And I have updated the list of committers and
> currently we have the following on the draft proposal:
>
>
> Initial PMC
>
>
>- Luciano Resende (lresende AT apache DOT org) (Apache Member)
>- Chris Mattmann (mattmann AT apache DOT org) (Apache Member, Apache board member)
>- Steve Loughran (stevel AT apache DOT org) (Apache Member)
>- Jean-Baptiste Onofré (jbonofre AT apache DOT org) (Apache Member)
>- Marcelo Masiero Vanzin (vanzin AT apache DOT org) (Apache Spark committer)
>- Sean R. Owen (srowen AT apache DOT org) (Apache Member and Spark PMC)
>- Mridul Muralidharan (mridulm80 AT apache DOT org) (Apache Spark PMC)
>
>
> Initial Committers (write access to active Spark committers that have
> committed in the last one year)
>
>
>- Andy Konwinski (andrew AT apache DOT org) (Apache Spark)
>- Andrew Or (andrewor14 AT apache DOT org) (Apache Spark)
>- Ankur Dave (ankurdave AT apache DOT org) (Apache Spark)
>- Davies Liu (davies AT apache DOT org) (Apache Spark)
>- DB Tsai (dbtsai AT apache DOT org) (Apache Spark)
>- Haoyuan Li (haoyuan AT apache DOT org) (Apache Spark)
>- Ram Sriharsha (harsha AT apache DOT org) (Apache Spark)
>- Herman van Hövell (hvanhovell AT apache DOT org) (Apache Spark)
>- Imran Rashid (irashid AT apache DOT org) (Apache Spark)
>- Joseph Kurata Bradley (jkbradley AT apache DOT org) (Apache Spark)
>- Josh Rosen (joshrosen AT apache DOT org) (Apache Spark)
>- Kay Ousterhout (kayousterhout AT apache DOT org) (Apache Spark)
>- Cheng Lian (lian AT apache DOT org) (Apache Spark)
>-

Re: Spark driver and yarn behavior

2016-05-19 Thread Luciano Resende
On Thu, May 19, 2016 at 3:16 PM, Shankar Venkataraman <
shankarvenkataraman...@gmail.com> wrote:

> Hi!
>
> We are running into an interesting behavior with the Spark driver. We
> Spark running under Yarn. The spark driver seems to be sending work to a
> dead executor for 3 hours before it recognizes it. The workload seems to
> have been processed by other executors just fine and we see no loss in
> overall through put. This Jira -
> https://issues.apache.org/jira/browse/SPARK-10586 - seems to indicate a
> similar behavior.
>
> The yarn resource manager log indicates the following:
>
> 2016-05-02 21:36:40,081 INFO  util.AbstractLivelinessMonitor 
> (AbstractLivelinessMonitor.java:run(127)) - Expired:dn-a01.example.org:45454 
> Timed out after 600 secs
> 2016-05-02 21:36:40,082 INFO  rmnode.RMNodeImpl 
> (RMNodeImpl.java:transition(746)) - Deactivating Node 
> dn-a01.example.org:45454 as it is now LOST
>
> The Executor is not reachable for 10 minutes according to this log message
> but the Excutor's log shows plenty of RDD processing during that time frame.
> This seems like a pretty big issue because the orphan executor seems to
> cause a memory leak in the Driver and the Driver becomes non-respondent due
> to heavy Full GC.
>
> Has anyone else run into a similar situation?
>
> Thanks for any and all feedback / suggestions.
>
> Shankar
>
>
I am not sure if this is exactly the same issue, but while we were doing
heavy processing of a large history of tweet data via streaming, we had
similar issues due to the load on the executors, and we bumped some
configurations to avoid losing some of these executors (they were alive,
but too busy to heartbeat, or something along those lines).

Some of these are described at
https://github.com/SparkTC/redrock/blob/master/twitter-decahose/src/main/scala/com/decahose/ApplicationContext.scala
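
For reference, here is a rough sketch of the kind of submit-time settings we
bumped. The values are illustrative assumptions only (tune them for your own
workload), and the application class and jar names are made up, but
spark.network.timeout and spark.executor.heartbeatInterval are standard Spark
configuration keys:

  # The class and jar below are hypothetical placeholders for your application.
  spark-submit \
    --master yarn \
    --conf spark.network.timeout=800s \
    --conf spark.executor.heartbeatInterval=60s \
    --class com.example.StreamingJob \
    streaming-job.jar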



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Removing module maintainer process

2016-05-22 Thread Luciano Resende
On Sunday, May 22, 2016, Matei Zaharia  wrote:

> It looks like the discussion thread on this has only had positive replies,
> so I'm going to call a VOTE. The proposal is to remove the maintainer
> process in
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
> <
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers>
> given that it doesn't seem to have had a huge impact on the project, and it
> can unnecessarily create friction in contributing. We already have +1s from
> Mridul, Tom, Andrew Or and Imran on that thread.
>
> I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
>
> Matei
>
>
Thanks Matei. Please note that a formal vote should generally be permitted
to run for at least 72 hours, to provide an opportunity for all concerned
persons to participate regardless of their geographic location.


http://www.apache.org/foundation/voting.html

Thank you
-- Luciano





-- 
Sent from my Mobile device


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Luciano Resende
On Fri, Apr 15, 2016 at 9:18 AM, Sean Owen <so...@cloudera.com> wrote:

> Why would this need to be an ASF project of its own? I don't think
> it's possible to have a yet another separate "Spark Extras" TLP (?)
>
> There is already a project to manage these bits of code on Github. How
> about all of the interested parties manage the code there, under the
> same process, under the same license, etc?
>

This whole discussion started when some of the connectors were moved from
Apache to GitHub, which in itself is a statement that the "Spark
governance" of these bits is something highly valued by the community,
consumers, and other companies that consume open source code. Being an
Apache project also allows the project to use and share the Apache
infrastructure to run the project.


>
> I'm not against calling it Spark Extras myself but I wonder if that
> needlessly confuses the situation. They aren't part of the Spark TLP
> on purpose, so trying to give it some special middle-ground status
> might just be confusing. The thing that comes to mind immediately is
> "Connectors for Apache Spark", spark-connectors, etc.
>
>
I know the name might be confusing, but I also think the projects have a
very strong synergy, more like sibling projects, where "Spark Extras"
extends the Spark community and develops/maintains components for, and
pretty much only for, Apache Spark. Based on your comment above, if
"Spark-Extras" would be a more acceptable name, that is fine as well.

I also understand that the Spark PMC might have concerns with branding, and
that's why we are inviting all members of the Spark PMC to join the project
and help oversee and manage the project.



>
> On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende <luckbr1...@gmail.com>
> wrote:
> > After some collaboration with other community members, we have created a
> > initial draft for Spark Extras which is available for review at
> >
> >
> https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing
> >
> > We would like to invite other community members to participate in the
> > project, particularly the Spark Committers and PMC (feel free to express
> > interest and I will update the proposal). Another option here is just to
> > give ALL Spark committers write access to "Spark Extras".
> >
> >
> > We also have couple asks from the Spark PMC :
> >
> > - Permission to use "Spark Extras" as the project name. We already
> checked
> > this with Apache Brand Management, and the recommendation was to discuss
> and
> > reach consensus with the Spark PMC.
> >
> > - We would also want to check with the Spark PMC that, in case of
> > successfully creation of  "Spark Extras", if the PMC would be willing to
> > continue the development of the remaining connectors that stayed in Spark
> > 2.0 codebase in the "Spark Extras" project.
> >
> >
> > Thanks in advance, and we welcome any feedback around this proposal
> before
> > we present to the Apache Board for consideration.
> >
> >
> >
> > On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende <luckbr1...@gmail.com>
> > wrote:
> >>
> >> I believe some of this has been resolved in the context of some parts
> that
> >> had interest in one extra connector, but we still have a few removed,
> and as
> >> you mentioned, we still don't have a simple way or willingness to
> manage and
> >> be current on new packages like kafka. And based on the fact that this
> >> thread is still alive, I believe that other community members might have
> >> other concerns as well.
> >>
> >> After some thought, I believe having a separate project (what was
> >> mentioned here as Spark Extras) to handle Spark Connectors and Spark
> add-ons
> >> in general could be very beneficial to Spark and the overall Spark
> >> community, which would have a central place in Apache to collaborate
> around
> >> related Spark components.
> >>
> >> Some of the benefits on this approach
> >>
> >> - Enables maintaining the connectors inside Apache, following the Apache
> >> governance and release rules, while allowing Spark proper to focus on
> the
> >> core runtime.
> >> - Provides more flexibility in controlling the direction (currency) of
> the
> >> existing connectors (e.g. willing to find a solution and maintain
> multiple
> >> versions of same connectors like kafka 0.8x and 0.9x)
> >> - Becomes a home for other types of Spark relate

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Luciano Resende
On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Given that not all of the connectors were removed, I think this
> creates a weird / confusing three tier system
>
> 1. connectors in the official project's spark/extras or spark/external
> 2. connectors in "Spark Extras"
> 3. connectors in some random organization's github
>
>
Agreed, Cody, and I think this is one of the goals of "Spark Extras": to
centralize the development of these connectors in one place at Apache.
That's why one of our asks is to invite the Spark PMC to continue
developing, in "Spark Extras", the remaining connectors that stayed in
Spark proper. We will also discuss process policies that lower the bar for
proposing these other GitHub extensions for inclusion in "Spark Extras",
while also considering a way to move code to a maintenance-mode location.


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Luciano Resende
After some collaboration with other community members, we have created an
initial draft for Spark Extras, which is available for review at

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing

We would like to invite other community members to participate in the
project, particularly the Spark Committers and PMC (feel free to express
interest and I will update the proposal). Another option here is just to
give ALL Spark committers write access to "Spark Extras".


We also have a couple of asks of the Spark PMC:

- Permission to use "Spark Extras" as the project name. We already checked
this with Apache Brand Management, and the recommendation was to discuss
and reach consensus with the Spark PMC.

- We would also like to check with the Spark PMC whether, in case of
successful creation of "Spark Extras", the PMC would be willing to continue
the development of the remaining connectors that stayed in the Spark 2.0
codebase in the "Spark Extras" project.


Thanks in advance, and we welcome any feedback on this proposal before we
present it to the Apache Board for consideration.



On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende <luckbr1...@gmail.com>
wrote:

> I believe some of this has been resolved in the context of some parts that
> had interest in one extra connector, but we still have a few removed, and
> as you mentioned, we still don't have a simple way or willingness to manage
> and be current on new packages like kafka. And based on the fact that this
> thread is still alive, I believe that other community members might have
> other concerns as well.
>
> After some thought, I believe having a separate project (what was
> mentioned here as Spark Extras) to handle Spark Connectors and Spark
> add-ons in general could be very beneficial to Spark and the overall Spark
> community, which would have a central place in Apache to collaborate around
> related Spark components.
>
> Some of the benefits on this approach
>
> - Enables maintaining the connectors inside Apache, following the Apache
> governance and release rules, while allowing Spark proper to focus on the
> core runtime.
> - Provides more flexibility in controlling the direction (currency) of the
> existing connectors (e.g. willing to find a solution and maintain multiple
> versions of same connectors like kafka 0.8x and 0.9x)
> - Becomes a home for other types of Spark related connectors helping
> expanding the community around Spark (e.g. Zeppelin see most of it's
> current contribution around new/enhanced connectors)
>
> What are some requirements for Spark Extras to be successful:
>
> - Be up to date with Spark Trunk APIs (based on daily CIs against SNAPSHOT)
> - Adhere to Spark release cycles (have a very little window compared to
> Spark release)
> - Be more open and flexible to the set of connectors it will accept and
> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we
> have today)
>
> Where to start Spark Extras
>
> Depending on the interest here, we could follow the steps of (Apache
> Arrow) and start this directly as a TLP, or start as an incubator project.
> I would consider the first option first.
>
> Who would participate
>
> Have thought about this for a bit, and if we go to the direction of TLP, I
> would say Spark Committers and Apache Members can request to participate as
> PMC members, while other committers can request to become committers. Non
> committers would be added based on meritocracy after the start of the
> project.
>
> Project Name
>
> It would be ideal if we could have a project name that shows close ties to
> Spark (e.g. Spark Extras or Spark Connectors) but we will need permission
> and support from whoever is going to evaluate the project proposal (e.g.
> Apache Board)
>
>
> Thoughts ?
>
> Does anyone have any big disagreement or objection to moving into this
> direction ?
>
> Otherwise, who would be interested in joining the project, so I can start
> working on some concrete proposal ?
>
>
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-18 Thread Luciano Resende
Evan,

As long as you meet the criteria we discussed on this thread, you are
welcome to join.

Having said that, I have already seen other contributors who are very
active on some of the connectors but are not Apache committers yet; I
wanted to be fair, and also to avoid using the project as an avenue to
bring new committers to Apache.


On Sun, Apr 17, 2016 at 10:07 PM, Evan Chan <vel...@gmail.com> wrote:

> Hi Luciano,
>
> I see that you are inviting all the Spark committers to this new project.
> What about the chief maintainers of important Spark ecosystem projects,
> which are not on the Spark PMC?
>
> For example, I am the chief maintainer of the Spark Job Server, which is
> one of the most active projects in the larger Spark ecosystem.  Would
> projects like this be part of your vision?   If so, it would be a good step
> of faith to reach out to us that maintain the active ecosystem projects.
>  (I’m not saying you should put me in :)  but rather suggesting that if
> this is your aim, it would be good to reach out beyond just the Spark PMC
> members.
>
> thanks,
> Evan
>
> On Apr 17, 2016, at 9:16 AM, Luciano Resende <luckbr1...@gmail.com> wrote:
>
>
>
> On Sat, Apr 16, 2016 at 11:12 PM, Reynold Xin <r...@apache.org> wrote:
>
>> First, really thank you for leading the discussion.
>>
>> I am concerned that it'd hurt Spark more than it helps. As many others
>> have pointed out, this unnecessarily creates a new tier of connectors or
>> 3rd party libraries appearing to be endorsed by the Spark PMC or the ASF.
>> We can alleviate this concern by not having "Spark" in the name, and the
>> project proposal and documentation should label clearly that this is not
>> affiliated with Spark.
>>
>
> I really thought we could use the Spark name (e.g. similar to
> spark-packages) as this project is really aligned and dedicated to curating
> extensions to Apache Spark and that's why we were inviting Spark PMC
> members to join the new project PMC so that Apache Spark has the necessary
> oversight and influence on the project direction. I understand folks have
> concerns with the name, and thus we will start looking into name
> alternatives unless there is any way I could address the community concerns
> around this.
>
>
>>
>> Also Luciano - assuming you are interested in creating a project like
>> this and find a home for the connectors that were removed, I find it
>> surprising that few of the initially proposed PMC members have actually
>> contributed much to the connectors, and people that have contributed a lot
>> were left out. I am sure that is just an oversight.
>>
>>
> Reynold, thanks for your concern, we are not leaving anyone out, we took
> the following criteria to identify initial PMC/Committers list as described
> on the first e-mail on this thread:
>
>- Spark Committers and Apache Members can request to participate as PMC
> members
>- All active spark committers (committed on the last one year) will
> have write access to the project (committer access)
>- Other committers can request to become committers.
>- Non committers would be added based on meritocracy after the start of
> the project.
>
> Based on this criteria, all people that have expressed interest in joining
> the project PMC has been added to it, but I don't feel comfortable adding
> names to it at my will. And I have updated the list of committers and
> currently we have the following on the draft proposal:
>
>
> Initial PMC
>
>
>- Luciano Resende (lresende AT apache DOT org) (Apache Member)
>- Chris Mattmann (mattmann  AT apache DOT org) (Apache Member, Apache
>board member)
>- Steve Loughran (stevel AT apache DOT org) (Apache Member)
>- Jean-Baptiste Onofré (jbonofre  AT apache DOT org) (Apache Member)
>- Marcelo Masiero Vanzin (vanzin AT apache DOT org) (Apache Spark
>committer)
>- Sean R. Owen (srowen AT apache DOT org) (Apache Member and Spark PMC)
>- Mridul Muralidharan (mridulm80 AT apache DOT org) (Apache Spark PMC)
>
>
> Initial Committers (write access to active Spark committers that have
> committed in the last one year)
>
>
>- Andy Konwinski (andrew AT apache DOT org) (Apache Spark)
>- Andrew Or (andrewor14 AT apache DOT org) (Apache Spark)
>- Ankur Dave (ankurdave AT apache DOT org) (Apache Spark)
>- Davies Liu (davies AT apache DOT org) (Apache Spark)
>- DB Tsai (dbtsai AT apache DOT org) (Apache Spark)
>- Haoyuan Li (haoyuan AT apache DOT org) (Apache Spark)
>- Ram Sriharsha (harsha AT apache DOT org) (Apache Spark)
>- Herman van Hövell (hvanhovell AT apache DOT org) (Apache Spar

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-16 Thread Luciano Resende
On Sat, Apr 16, 2016 at 5:38 PM, Evan Chan <velvia.git...@gmail.com> wrote:

> Hi folks,
>
> Sorry to join the discussion late.  I had a look at the design doc
> earlier in this thread, and it was not mentioned what types of
> projects are the targets of this new "spark extras" ASF umbrella
>
> Is the desire to have a maintained set of spark-related projects that
> keep pace with the main Spark development schedule?  Is it just for
> streaming connectors?  what about data sources, and other important
> projects in the Spark ecosystem?
>

The proposal draft below has some more details on what types of projects
are targeted, but in summary, "Spark-Extras" would be a good place for any
of the components you mentioned.

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing


>
> I'm worried that this would relegate spark-packages to third tier
> status,


Owen answered a similar question about spark-packages earlier in this
thread; while "Spark-Extras" would be a place at Apache for collaboration
on the development of these extensions, they could still be published to
spark-packages, as the existing streaming connectors are today.


> and the promotion of a select set of committers, and the
> project itself, to top level ASF status (a la Arrow) would create a
> further split in the community.
>
>
As for the select set of committers, we have invited all Spark committers
to be committers on the project, and I have updated the project proposal
with the existing set of active Spark committers (those who have committed
in the last year).


>
> -Evan
>
> On Sat, Apr 16, 2016 at 4:46 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
> >
> >
> >
> >
> >
> > On 15/04/2016, 17:41, "Mattmann, Chris A (3980)" <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> >>Yeah in support of this statement I think that my primary interest in
> >>this Spark Extras and the good work by Luciano here is that anytime we
> >>take bits out of a code base and “move it to GitHub” I see a bad
> precedent
> >>being set.
> >>
> >>Creating this project at the ASF creates a synergy between *Apache Spark*
> >>which is *at the ASF*.
> >>
> >>We welcome comments and as Luciano said, this is meant to invite and be
> >>open to those in the Apache Spark PMC to join and help.
> >>
> >>Cheers,
> >>Chris
> >
> > As one of the people named, here's my rationale:
> >
> > Throwing stuff into github creates that world of branches, and its no
> longer something that could be managed through the ASF, where managed is:
> governance, participation and a release process that includes auditing
> dependencies, code-signoff, etc,
> >
> >
> > As an example, there's a mutant hive JAR which spark uses, that's
> something which currently evolved between my repo and Patrick Wendell's;
> now that Josh Rosen has taken on the bold task of "trying to move spark and
> twill to Kryo 3", he's going to own that code, and now the reference branch
> will move somewhere else.
> >
> > In contrast, if there was an ASF location for this, then it'd be
> something anyone with commit rights could maintain and publish
> >
> > (actually, I've just realised life is hard here as the hive is a fork of
> ASF hive —really the spark branch should be a separate branch in Hive's own
> repo ... But the concept is the same: those bits of the codebase which are
> core parts of the spark project should really live in or near it)
> >
> >
> > If everyone on the spark commit list gets write access to this extras
> repo, moving things is straightforward. Release wise, things could/should
> be in sync.
> >
> > If there's a risk, its the eternal problem of the contrib/ dir 
> Stuff ends up there that never gets maintained. I don't see that being any
> worse than if things were thrown to the wind of a thousand github repos: at
> least now there'd be a central issue tracking location.
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-17 Thread Luciano Resende
On Sat, Apr 16, 2016 at 11:12 PM, Reynold Xin <r...@apache.org> wrote:

> First, really thank you for leading the discussion.
>
> I am concerned that it'd hurt Spark more than it helps. As many others
> have pointed out, this unnecessarily creates a new tier of connectors or
> 3rd party libraries appearing to be endorsed by the Spark PMC or the ASF.
> We can alleviate this concern by not having "Spark" in the name, and the
> project proposal and documentation should label clearly that this is not
> affiliated with Spark.
>

I really thought we could use the Spark name (e.g. similar to
spark-packages), as this project is aligned with and dedicated to curating
extensions to Apache Spark; that's why we were inviting Spark PMC members
to join the new project PMC, so that Apache Spark has the necessary
oversight and influence on the project's direction. I understand folks have
concerns about the name, so we will start looking into name alternatives
unless there is a way I can address the community's concerns around this.


>
> Also Luciano - assuming you are interested in creating a project like this
> and find a home for the connectors that were removed, I find it surprising
> that few of the initially proposed PMC members have actually contributed
> much to the connectors, and people that have contributed a lot were left
> out. I am sure that is just an oversight.
>
>
Reynold, thanks for your concern; we are not leaving anyone out. We used
the following criteria to identify the initial PMC/committer list, as
described in the first e-mail in this thread:

   - Spark committers and Apache Members can request to participate as PMC
members
   - All active Spark committers (who have committed in the last year) will
have write access to the project (committer access)
   - Other committers can request to become committers.
   - Non-committers would be added based on meritocracy after the project
starts.

Based on these criteria, everyone who has expressed interest in joining the
project PMC has been added to it, but I don't feel comfortable adding names
at my own discretion. I have updated the list of committers, and currently
we have the following in the draft proposal:


Initial PMC


   - Luciano Resende (lresende AT apache DOT org) (Apache Member)
   - Chris Mattmann (mattmann AT apache DOT org) (Apache Member, Apache board member)
   - Steve Loughran (stevel AT apache DOT org) (Apache Member)
   - Jean-Baptiste Onofré (jbonofre AT apache DOT org) (Apache Member)
   - Marcelo Masiero Vanzin (vanzin AT apache DOT org) (Apache Spark committer)
   - Sean R. Owen (srowen AT apache DOT org) (Apache Member and Spark PMC)
   - Mridul Muralidharan (mridulm80 AT apache DOT org) (Apache Spark PMC)


Initial Committers (write access for active Spark committers who have
committed in the last year)


   - Andy Konwinski (andrew AT apache DOT org) (Apache Spark)
   - Andrew Or (andrewor14 AT apache DOT org) (Apache Spark)
   - Ankur Dave (ankurdave AT apache DOT org) (Apache Spark)
   - Davies Liu (davies AT apache DOT org) (Apache Spark)
   - DB Tsai (dbtsai AT apache DOT org) (Apache Spark)
   - Haoyuan Li (haoyuan AT apache DOT org) (Apache Spark)
   - Ram Sriharsha (harsha AT apache DOT org) (Apache Spark)
   - Herman van Hövell (hvanhovell AT apache DOT org) (Apache Spark)
   - Imran Rashid (irashid AT apache DOT org) (Apache Spark)
   - Joseph Kurata Bradley (jkbradley AT apache DOT org) (Apache Spark)
   - Josh Rosen (joshrosen AT apache DOT org) (Apache Spark)
   - Kay Ousterhout (kayousterhout AT apache DOT org) (Apache Spark)
   - Cheng Lian (lian AT apache DOT org) (Apache Spark)
   - Mark Hamstra (markhamstra AT apache DOT org) (Apache Spark)
   - Michael Armbrust (marmbrus AT apache DOT org) (Apache Spark)
   - Matei Alexandru Zaharia (matei AT apache DOT org) (Apache Spark)
   - Xiangrui Meng (meng AT apache DOT org) (Apache Spark)
   - Prashant Sharma (prashant AT apache DOT org) (Apache Spark)
   - Patrick Wendell (pwendell AT apache DOT org) (Apache Spark)
   - Reynold Xin (rxin AT apache DOT org) (Apache Spark)
   - Sanford Ryza (sandy AT apache DOT org) (Apache Spark)
   - Kousuke Saruta (sarutak AT apache DOT org) (Apache Spark)
   - Shivaram Venkataraman (shivaram AT apache DOT org) (Apache Spark)
   - Tathagata Das (tdas AT apache DOT org) (Apache Spark)
   - Thomas Graves (tgraves AT apache DOT org) (Apache Spark)
   - Wenchen Fan (wenchen AT apache DOT org) (Apache Spark)
   - Yin Huai (yhuai AT apache DOT org) (Apache Spark)
   - Shixiong Zhu (zsxwing AT apache DOT org) (Apache Spark)



BTW, it would be really good to have you on the PMC as well, along with any
others who volunteer based on the criteria above. May I add you as a PMC
member to the new project proposal?



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Luciano Resende
+ 1 (non-binding)

Found a minor issue when trying to run some of the docker tests, but
nothing blocking the release. Will create a JIRA for that.

On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin <r...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc5
> (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1195/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-25 Thread Luciano Resende
When are we planning to push the release Maven artifacts? We are waiting
for this in order to push an official Apache Bahir release supporting Spark
2.0.

On Sat, Jul 23, 2016 at 7:05 AM, Reynold Xin <r...@databricks.com> wrote:

> The vote has passed with the following +1 votes and no -1 votes. I will
> work on packaging the new release next week.
>
>
> +1
>
> Reynold Xin*
> Sean Owen*
> Shivaram Venkataraman*
> Jonathan Kelly
> Joseph E. Gonzalez*
> Krishna Sankar
> Dongjoon Hyun
> Ricardo Almeida
> Joseph Bradley*
> Matei Zaharia*
> Luciano Resende
> Holden Karau
> Michael Armbrust*
> Felix Cheung
> Suresh Thalamati
> Kousuke Saruta
> Xiao Li
>
>
> * binding votes
>
>
> On July 19, 2016 at 7:35:19 PM, Reynold Xin (r...@databricks.com) wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc5
> (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1195/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: spark-packages with maven

2016-07-15 Thread Luciano Resende
On Fri, Jul 15, 2016 at 10:48 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> +1000
>
> Thanks Ismael for bringing this up! I meant to have send it earlier too
> since I've been struggling with a sbt-based Scala project for a Spark
> package myself this week and haven't yet found out how to do local
> publishing.
>
> If such a guide existed for Maven I could use it for sbt easily too :-)
>
> Ping me Ismael if you don't hear back from the group so I feel invited for
> digging into the plugin's sources.
>
> Best,
> Jacek
>
> On 15 Jul 2016 2:29 p.m., "Ismaël Mejía" <ieme...@gmail.com> wrote:
>
> Hello, I would like to know if there is an easy way to package a new
> spark-package
> with maven, I just found this repo, but I am not an sbt user.
>
> https://github.com/databricks/sbt-spark-package
>
> One more question, is there a formal specification or documentation of
> what do
> you need to include in a spark-package (any special file, manifest, etc) ?
> I
> have not found any doc in the website.
>
> Thanks,
> Ismael
>
>
>

I was under the impression that spark-packages was more a place for one to
list/advertise their extensions. When you run spark-submit with --packages,
Maven is used to resolve your package, and as long as resolution succeeds
the package will be used (e.g. you can do mvn clean install for your local
packages and then use --packages with a Spark server running on that same
machine).

From sbt, I think you can just use publishTo and define a local repository,
something like:

publishTo := Some("Local Maven Repository" at
  "file://" + Path.userHome.absolutePath + "/.m2/repository")



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: welcoming Burak and Holden as committers

2017-01-25 Thread Luciano Resende
Congrats to both !!!

On Tue, Jan 24, 2017 at 10:13 AM, Reynold Xin <r...@databricks.com> wrote:

> Hi all,
>
> Burak and Holden have recently been elected as Apache Spark committers.
>
> Burak has been very active in a large number of areas in Spark, including
> linear algebra, stats/maths functions in DataFrames, Python/R APIs for
> DataFrames, dstream, and most recently Structured Streaming.
>
> Holden has been a long time Spark contributor and evangelist. She has
> written a few books on Spark, as well as frequent contributions to the
> Python API to improve its usability and performance.
>
> Please join me in welcoming the two!
>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Unable to run docker jdbc integrations test ?

2016-09-07 Thread Luciano Resende
It looks like nobody is running these tests, and after some dependency
upgrades in Spark 2.0 they have stopped working. I have tried to bring them
up, but I am having some issues getting the right dependencies loaded and
satisfying the docker-client expectations.

The question then is: does the community find value in having these tests
available? If so, we can focus on bringing them up, and I can push my
previous experiments as a WIP PR. Otherwise we should just get rid of these
tests.

Thoughts ?


On Tue, Sep 6, 2016 at 4:05 PM, Suresh Thalamati <suresh.thalam...@gmail.com
> wrote:

> Hi,
>
>
> I am getting the following error , when I am trying to run jdbc docker
> integration tests on my laptop.   Any ideas , what I might be be doing
> wrong ?
>
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0  -Phive-thriftserver
> -Phive -DskipTests clean install
> build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11
> compile test
>
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
> MaxPermSize=512m; support was removed in 8.0
> Discovery starting.
> Discovery completed in 200 milliseconds.
> Run starting. Expected test count is: 10
> MySQLIntegrationSuite:
>
> Error:
> 16/09/06 11:52:00 INFO BlockManagerMaster: Registered BlockManager
> BlockManagerId(driver, 9.31.117.25, 51868)
> *** RUN ABORTED ***
>   java.lang.AbstractMethodError:
>   at org.glassfish.jersey.model.internal.CommonConfig.
> configureAutoDiscoverableProviders(CommonConfig.java:622)
>   at org.glassfish.jersey.client.ClientConfig$State.
> configureAutoDiscoverableProviders(ClientConfig.java:357)
>   at org.glassfish.jersey.client.ClientConfig$State.
> initRuntime(ClientConfig.java:392)
>   at org.glassfish.jersey.client.ClientConfig$State.access$000(
> ClientConfig.java:88)
>   at org.glassfish.jersey.client.ClientConfig$State$3.get(
> ClientConfig.java:120)
>   at org.glassfish.jersey.client.ClientConfig$State$3.get(
> ClientConfig.java:117)
>   at org.glassfish.jersey.internal.util.collection.Values$
> LazyValueImpl.get(Values.java:340)
>   at org.glassfish.jersey.client.ClientConfig.getRuntime(
> ClientConfig.java:726)
>   at org.glassfish.jersey.client.ClientRequest.getConfiguration(
> ClientRequest.java:285)
>   at org.glassfish.jersey.client.JerseyInvocation.
> validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> 16/09/06 11:52:00 INFO SparkContext: Invoking stop() from shutdown hook
> 16/09/06 11:52:00 INFO MapOutputTrackerMasterEndpoint:
> MapOutputTrackerMasterEndpoint stopped!
>
>
>
> Thanks
> -suresh
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Unable to run docker jdbc integrations test ?

2016-09-07 Thread Luciano Resende
That might be a reasonable and much simpler approach to try... but if we
resolve these issues, we should make it part of some frequently run build,
to make sure that neither the build nor the actual functionality regresses.
Let me look into this again...

On Wed, Sep 7, 2016 at 2:46 PM, Josh Rosen <joshro...@databricks.com> wrote:

> I think that these tests are valuable so I'd like to keep them. If
> possible, though, we should try to get rid of our dependency on the Spotify
> docker-client library, since it's a dependency hell nightmare. Given our
> relatively simple use of Docker here, I wonder whether we could just write
> some simple scripting over the `docker` command-line tool instead of
> pulling in such a problematic library.
>
> On Wed, Sep 7, 2016 at 2:36 PM Luciano Resende <luckbr1...@gmail.com>
> wrote:
>
>> It looks like there is nobody running these tests, and after some
>> dependency upgrades in Spark 2.0 this has stopped working. I have tried to
>> bring up this but I am having some issues with getting the right
>> dependencies loaded and satisfying the docker-client expectations.
>>
>> The question then is: Does the community find value on having these tests
>> available ? Then we can focus on bringing them up and I can go push my
>> previous experiments as a WIP PR. Otherwise we should just get rid of these
>> tests.
>>
>> Thoughts ?
>>
>>
>> On Tue, Sep 6, 2016 at 4:05 PM, Suresh Thalamati <
>> suresh.thalam...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> I am getting the following error , when I am trying to run jdbc docker
>>> integration tests on my laptop.   Any ideas , what I might be be doing
>>> wrong ?
>>>
>>> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0
>>> -Phive-thriftserver -Phive -DskipTests clean install
>>> build/mvn -Pdocker-integration-tests -pl 
>>> :spark-docker-integration-tests_2.11
>>> compile test
>>>
>>> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>>> MaxPermSize=512m; support was removed in 8.0
>>> Discovery starting.
>>> Discovery completed in 200 milliseconds.
>>> Run starting. Expected test count is: 10
>>> MySQLIntegrationSuite:
>>>
>>> Error:
>>> 16/09/06 11:52:00 INFO BlockManagerMaster: Registered BlockManager
>>> BlockManagerId(driver, 9.31.117.25, 51868)
>>> *** RUN ABORTED ***
>>>   java.lang.AbstractMethodError:
>>>   at org.glassfish.jersey.model.internal.CommonConfig.
>>> configureAutoDiscoverableProviders(CommonConfig.java:622)
>>>   at org.glassfish.jersey.client.ClientConfig$State.
>>> configureAutoDiscoverableProviders(ClientConfig.java:357)
>>>   at org.glassfish.jersey.client.ClientConfig$State.
>>> initRuntime(ClientConfig.java:392)
>>>   at org.glassfish.jersey.client.ClientConfig$State.access$000(
>>> ClientConfig.java:88)
>>>   at org.glassfish.jersey.client.ClientConfig$State$3.get(
>>> ClientConfig.java:120)
>>>   at org.glassfish.jersey.client.ClientConfig$State$3.get(
>>> ClientConfig.java:117)
>>>   at org.glassfish.jersey.internal.util.collection.Values$
>>> LazyValueImpl.get(Values.java:340)
>>>   at org.glassfish.jersey.client.ClientConfig.getRuntime(
>>> ClientConfig.java:726)
>>>   at org.glassfish.jersey.client.ClientRequest.getConfiguration(
>>> ClientRequest.java:285)
>>>   at org.glassfish.jersey.client.JerseyInvocation.
>>> validateHttpMethodAndEntity(JerseyInvocation.java:126)
>>>   ...
>>> 16/09/06 11:52:00 INFO SparkContext: Invoking stop() from shutdown hook
>>> 16/09/06 11:52:00 INFO MapOutputTrackerMasterEndpoint:
>>> MapOutputTrackerMasterEndpoint stopped!
>>>
>>>
>>>
>>> Thanks
>>> -suresh
>>>
>>>
>>
>>
>> --
>> Luciano Resende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
>>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Real time streaming in Spark

2016-08-29 Thread Luciano Resende
There were some prototypes/discussions built on top of Spark Streaming, and
they covered how that would fit with Structured Streaming, which was in the
design phase at that time. See
https://issues.apache.org/jira/browse/SPARK-14745 for some details and a
link to the PR.

On Mon, Aug 29, 2016 at 1:13 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com>
wrote:

> Hi everyone,
>
>
> I wonder if there are plans to implement real time streaming in Spark. I
> see that in Spark 2.0 Trigger can have more implementations than
> ProcessingTime.
>
>
> In my opinion Real Time streaming (so reaction on every event - like
> continous queries in Apache Ignite) will be very useful and will fill gap
> that is currently in Spark. Now, if we must implement both real-time
> streaming and batch jobs, the streaming must be done in other frameworks as
> Spark allows us only to process event in Micro Batches. Matei Zaharia
> wrote in Databricks blog about  Continuous Applications [1], in my
> opinion adding EventTrigger will be next big step to Continuous
> Applications.
>
>
> What do you think about it? Are there any plans to implement such
> event-based trigger? Of course I can help with implementation, however I'm
> just starting learning Spark internals and it will take a while before I
> would be able to write something.
>
>
> Pozdrawiam / Best regards,
>
> Tomek
>
>
> [1] https://databricks.com/blog/2016/07/28/continuous-
> applications-evolving-streaming-in-apache-spark-2-0.html
>
> <https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html>
> Continuous Applications: Evolving Streaming in Apache Spark 2.0
> <https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html>
> databricks.com
> Apache Spark 2.0 lays the foundation for Continuous Applications, a
> simplified and unified way to write end-to-end streaming applications that
> reacts to data in real-time.
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Luciano Resende
+1 (non-binding)

On Sat, Sep 24, 2016 at 3:08 PM, Reynold Xin <r...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.1
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5c
> 2cc2730d17)
>
> This release candidate resolves 290 issues: https://s.apache.org/spark-2.
> 0.1-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1201/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.0.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series.  Bugs already
> present in 2.0.0, missing features, or bugs related to new features will
> not necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.
>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-05 Thread Luciano Resende
It usually doesn't take that long to sync; I still don't see any
2.0.1-related artifacts on Maven Central:

http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.spark%22%20AND%20v%3A%222.0.1%22


On Tue, Oct 4, 2016 at 1:23 PM, Reynold Xin <r...@databricks.com> wrote:

> They have been published yesterday, but can take a while to propagate.
>
>
> On Tue, Oct 4, 2016 at 12:58 PM, Prajwal Tuladhar <p...@infynyxx.com>
> wrote:
>
>> Hi,
>>
>> It seems like, 2.0.1 artifact hasn't been published to Maven Central. Can
>> anyone confirm?
>>
>> On Tue, Oct 4, 2016 at 5:39 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> We are happy to announce the availability of Spark 2.0.1!
>>>
>>> Apache Spark 2.0.1 is a maintenance release containing 300 stability and
>>> bug fixes. This release is based on the branch-2.0 maintenance branch of
>>> Spark. We strongly recommend all 2.0.0 users to upgrade to this stable
>>> release.
>>>
>>> To download Apache Spark 2.0.1, visit http://spark.apache.org/downlo
>>> ads.html
>>>
>>> We would like to acknowledge all community members for contributing
>>> patches to this release.
>>>
>>>
>>>
>>
>>
>> --
>> --
>> Cheers,
>> Praj
>>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [ANNOUNCE] Announcing Spark 2.0.1

2016-10-06 Thread Luciano Resende
I have created an INFRA JIRA to track the issue with the Maven artifacts
for Spark 2.0.1.

On Wed, Oct 5, 2016 at 10:18 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Yeah I see the apache maven repos have the 2.0.1 artifacts at
> https://repository.apache.org/content/repositories/releases/
> org/apache/spark/spark-core_2.11/
> -- Not sure why they haven't synced to maven central yet
>
> Shivaram
>
> On Wed, Oct 5, 2016 at 8:37 PM, Luciano Resende <luckbr1...@gmail.com>
> wrote:
> > It usually don't take that long to be synced, I still don't see any 2.0.1
> > related artifacts on maven central
> >
> > http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.
> apache.spark%22%20AND%20v%3A%222.0.1%22
> >
> >
> > On Tue, Oct 4, 2016 at 1:23 PM, Reynold Xin <r...@databricks.com> wrote:
> >>
> >> They have been published yesterday, but can take a while to propagate.
> >>
> >>
> >> On Tue, Oct 4, 2016 at 12:58 PM, Prajwal Tuladhar <p...@infynyxx.com>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> It seems like, 2.0.1 artifact hasn't been published to Maven Central.
> Can
> >>> anyone confirm?
> >>>
> >>> On Tue, Oct 4, 2016 at 5:39 PM, Reynold Xin <r...@databricks.com>
> wrote:
> >>>>
> >>>> We are happy to announce the availability of Spark 2.0.1!
> >>>>
> >>>> Apache Spark 2.0.1 is a maintenance release containing 300 stability
> and
> >>>> bug fixes. This release is based on the branch-2.0 maintenance branch
> of
> >>>> Spark. We strongly recommend all 2.0.0 users to upgrade to this stable
> >>>> release.
> >>>>
> >>>> To download Apache Spark 2.0.1, visit
> >>>> http://spark.apache.org/downloads.html
> >>>>
> >>>> We would like to acknowledge all community members for contributing
> >>>> patches to this release.
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> --
> >>> Cheers,
> >>> Praj
> >>
> >>
> >
> >
> >
> > --
> > Luciano Resende
> > http://twitter.com/lresende1975
> > http://lresende.blogspot.com/
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Removing published kinesis, ganglia artifacts due to license issues?

2016-09-07 Thread Luciano Resende
On Wed, Sep 7, 2016 at 11:57 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> I think you should ask legal about how to have some Maven artifacts for
> these. Both Ganglia and Kinesis are very widely used, so it's weird to ask
> users to build them from source. Maybe the Maven artifacts can be marked as
> being under a different license?
>
>
As long as they are not part of an "Apache licensed" distribution. Note
that Ganglia seems to have changed its license to BSD, so we might be able
to better support that.


> In the initial discussion for LEGAL-198, we were told the following:
>
> "If the component that uses this dependency is not required for the rest
> of Spark to function then you can have a subproject to build the component.
> See http://www.apache.org/legal/resolved.html#optional. This means you
> will have to provide instructions for users to enable the optional
> component (which IMO should provide pointers to the licensing)."
>
> It's not clear whether "enable the optional component" means "every user
> must build it from source", or whether we could tell users "here's a Maven
> coordinate you can add to your project if you're okay with the licensing".
>

I think the key here is "optional": while Kinesis is optional for Spark
(which makes it OK to have it in Spark), it is not optional for the Kinesis
extension itself, which then, IMHO, does not allow us to publish the Kinesis
artifact either.

But let's wait on the response from Legal before we actually implement a
solution.


>
> Matei
>
> > On Sep 7, 2016, at 11:35 AM, Sean Owen <so...@cloudera.com> wrote:
> >
> > (Credit to Luciano for pointing it out)
> >
> > Yes it's clear why the assembly can't be published but I had the same
> > question about the non-assembly Kinesis (and ganglia) artifact,
> > because the published artifact has no code from Kinesis.
> >
> > See the related discussion at
> > https://issues.apache.org/jira/browse/LEGAL-198 ; the point I took
> > from there is that the Spark Kinesis artifact is optional with respect
> > to Spark, but still something published by Spark, and it requires the
> > Amazon-licensed code non-optionally.
> >
> > I'll just ask that question to confirm or deny.
> >
> > (It also has some background on why the Amazon License is considered
> > "Category X" in ASF policy due to field of use restrictions. I myself
> > take that as read rather than know the details of that decision.)
> >
> > On Wed, Sep 7, 2016 at 6:44 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
> >> I don't see a reason to remove the non-assembly artifact, why would
> >> you?  You're not distributing copies of Amazon licensed code, and the
> >> Amazon license goes out of its way not to over-reach regarding
> >> derivative works.
> >>
> >> This seems pretty clearly to fall in the spirit of
> >>
> >> http://www.apache.org/legal/resolved.html#optional
> >>
> >> I certainly think the majority of Spark users will still want to use
> >> Spark without adding Kinesis
> >>
> >> On Wed, Sep 7, 2016 at 3:29 AM, Sean Owen <so...@cloudera.com> wrote:
> >>> It's worth calling attention to:
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-17418
> >>> https://issues.apache.org/jira/browse/SPARK-17422
> >>>
> >>> It looks like we need to at least not publish the kinesis *assembly*
> >>> Maven artifact because it contains Amazon Software Licensed-code
> >>> directly.
> >>>
> >>> However there's a reasonably strong reason to believe that we'd have
> >>> to remove the non-assembly Kinesis artifact too, as well as the
> >>> Ganglia one. This doesn't mean it goes away from the project, just
> >>> means it would no longer be published as a Maven artifact. (These have
> >>> never been bundled in the main Spark artifacts.)
> >>>
> >>> I wanted to give a heads up to see if anyone a) believes this
> >>> conclusion is wrong or b) wants to take it up with legal@? I'm
> >>> inclined to believe we have to remove them given the interpretation
> >>> Luciano has put forth.
> >>>
> >>> Sean
> >>>
> >>>
> >
> >
>
>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Removing published kinesis, ganglia artifacts due to license issues?

2016-09-07 Thread Luciano Resende
On Wed, Sep 7, 2016 at 12:20 PM, Mridul Muralidharan <mri...@gmail.com>
wrote:

>
> It is good to get clarification, but the way I read it, the issue is
> whether we publish it as official Apache artifacts (in maven, etc).
>
> Users can of course build it directly (and we can make it easy to do so) -
> as they are explicitly agreeing to additional licenses.
>
> Regards
> Mridul
>
>
+1. By providing instructions on how the user would build it, and attaching
the license details to those instructions, we are then safe on the legal
aspects of it.



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Luciano Resende
+1 (non-binding)

On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin <r...@databricks.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.1
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa4
> 577ba4be38)
>
> This release candidate resolves 301 issues: https://s.apache.org/spark-2.
> 0.1-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1203/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.0.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series.  Bugs already
> present in 2.0.0, missing features, or bugs related to new features will
> not necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
> (i.e. RC5) is cut, I will change the fix version of those patches to 2.0.1.
>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: welcoming Xiao Li as a committer

2016-10-04 Thread Luciano Resende
Congratulations, Xiao!!!

On Monday, October 3, 2016, Reynold Xin  wrote:

> Hi all,
>
> Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
> committer. Xiao has been a super active contributor to Spark SQL. Congrats
> and welcome, Xiao!
>
> - Reynold
>
>

-- 
Sent from my Mobile device


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Luciano Resende
+1 (non-binding)

On Thu, Oct 27, 2016 at 9:18 AM, Reynold Xin <r...@databricks.com> wrote:

> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1 (1c2908eeb8890fdc91413a3f5bad2b
> b3d114db6c)
>
> This release candidate resolves 75 issues: https://s.apache.org/spark-2.
> 0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1208/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Signing releases with pwendell or release manager's key?

2017-09-19 Thread Luciano Resende
Manually signing seems like a good compromise for now, but note that there are
two places where this needs to happen: the artifacts that go to dist.a.o as
well as the ones that are published to Maven.
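
On the checksum point raised further down the thread: for anyone verifying a
candidate, the signature side is gpg --verify against the published .asc
files, and the checksum side can be done with nothing but the JDK. A small
Scala sketch, using SHA-512 as an example digest (adjust to whatever the
release actually publishes); the file name and expected value are
placeholders, not real release values:

  import java.io.FileInputStream
  import java.security.{DigestInputStream, MessageDigest}

  // Compare a downloaded release artifact against the digest value published
  // alongside it (or in the VOTE thread). Formatting characters in the
  // published value are stripped before comparing.
  val artifact = "spark-x.y.z-bin-hadoop2.7.tgz"  // placeholder file name
  val expected = ""                               // paste the published value here

  val md  = MessageDigest.getInstance("SHA-512")
  val in  = new DigestInputStream(new FileInputStream(artifact), md)
  val buf = new Array[Byte](8192)
  while (in.read(buf) != -1) {}
  in.close()

  val actual     = md.digest().map("%02x".format(_)).mkString
  val normalized = expected.toLowerCase.replaceAll("[^0-9a-f]", "")
  println(if (actual == normalized) "checksum OK" else "checksum MISMATCH")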

On Tue, Sep 19, 2017 at 8:53 AM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> +1. Thanks for coming up with a solution, everyone! I think the manually
> signed RC as a work around will work well, and it will be an improvement
> for the rest to be updated.
>
> On Mon, Sep 18, 2017 at 8:25 PM, Patrick Wendell <patr...@databricks.com>
> wrote:
>
>> Sounds good - thanks Holden!
>>
>> On Mon, Sep 18, 2017 at 8:21 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> That sounds like a pretty good temporary workaround. If folks agree, I'll
>>> cancel the release vote for 2.1.2 and work on getting an RC2 out later this
>>> week, manually signed. I've filed JIRA SPARK-22055 & SPARK-22054 to port the
>>> release scripts and allow injecting of the RM's key.
>>>
>>> On Mon, Sep 18, 2017 at 8:11 PM, Patrick Wendell <patr...@databricks.com
>>> > wrote:
>>>
>>>> For the current release - maybe Holden could just sign the artifacts
>>>> with her own key manually, if this is a concern. I don't think that would
>>>> require modifying the release pipeline, except to just remove/ignore the
>>>> existing signatures.
>>>>
>>>> - Patrick
>>>>
>>>> On Mon, Sep 18, 2017 at 7:56 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Does anybody know whether this is a hard blocker? If it is not, we
>>>>> should probably push 2.1.2 forward quickly and do the infrastructure
>>>>> improvement in parallel.
>>>>>
>>>>> On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> I'm more than willing to help migrate the scripts as part of either
>>>>>> this release or the next.
>>>>>>
>>>>>> It sounds like there is a consensus developing around changing the
>>>>>> process -- should we hold off on the 2.1.2 release or roll this into the
>>>>>> next one?
>>>>>>
>>>>>> On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin <van...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 to this. There should be a script in the Spark repo that has all
>>>>>>> the logic needed for a release. That script should take the RM's key
>>>>>>> as a parameter.
>>>>>>>
>>>>>>> if there's a desire to keep the current Jenkins job to create the
>>>>>>> release, it should be based on that script. But from what I'm seeing
>>>>>>> there are currently too many unknowns in the release process.
>>>>>>>
>>>>>>> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>> > I don't understand why it is necessary to share a release key. If
>>>>>>> this is
>>>>>>> > something that can be automated in a Jenkins job, then can it be a
>>>>>>> script
>>>>>>> > with a reasonable set of build requirements for Mac and Ubuntu?
>>>>>>> That's the
>>>>>>> > approach I've seen the most in other projects.
>>>>>>> >
>>>>>>> > I'm also not just concerned about release managers. Having a key
>>>>>>> stored
>>>>>>> > persistently on outside infrastructure adds the most risk, as
>>>>>>> Luciano noted
>>>>>>> > as well. We should also start publishing checksums in the Spark
>>>>>>> VOTE thread,
>>>>>>> > which are currently missing. The risk I'm concerned about is that
>>>>>>> if the key
>>>>>>> > were compromised, it would be possible to replace binaries with
>>>>>>> perfectly
>>>>>>> > valid ones, at least on some mirrors. If the Apache copy were
>>>>>>> replaced, then
>>>>>>> > we wouldn't even be able to catch that it had happened. Given the
>>>>>>> high
>>>>>>> > profile of Spark and the number of companies that run it, I think
>>>>>>> we need to
>>>>>>> > take extra care to make sure that can't happen, even if it is an
>>>>>>> annoyance
>>>>>>> > for the release managers.
>>>>>>>
>>>>>>> --
>>>>>>> Marcelo
>>>>>>>
>>>>>>> 
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271 <(425)%20233-8271>
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Luciano Resende
>>>>>>> Yeah I had meant to ask about that in the past. While I presume
>>>>>>> Patrick consents to this and all that, it does mean that anyone with 
>>>>>>> access
>>>>>>> to said Jenkins scripts can create a signed Spark release, regardless of
>>>>>>> who they are.
>>>>>>>
>>>>>>> I haven't thought through whether that's a theoretical issue we can
>>>>>>> ignore or something we need to fix up. For example you can't get a 
>>>>>>> release
>>>>>>> on the ASF mirrors without more authentication.
>>>>>>>
>>>>>>> How hard would it be to make the script take in a key? it sort of
>>>>>>> looks like the script already takes GPG_KEY, but don't know how to 
>>>>>>> modify
>>>>>>> the jobs. I suppose it would be ideal, in any event, for the actual 
>>>>>>> release
>>>>>>> manager to sign.
>>>>>>>
>>>>>>> On Fri, Sep 15, 2017 at 8:28 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> That's a good question, I built the release candidate however the
>>>>>>>> Jenkins scripts don't take a parameter for configuring who signs them
>>>>>>>> rather it always signs them with Patrick's key. You can see this from
>>>>>>>> previous releases which were managed by other folks but still signed by
>>>>>>>> Patrick.
>>>>>>>>
>>>>>>>> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue <rb...@netflix.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The signature is valid, but why was the release signed with
>>>>>>>>> Patrick Wendell's private key? Did Patrick build the release 
>>>>>>>>> candidate?
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Should Flume integration be behind a profile?

2017-10-02 Thread Luciano Resende
On Mon, Oct 2, 2017 at 12:34 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> I'd agree with #1 or #2. Deprecation now seems fine.
>
> Perhaps this should be raised on the user list also?
>
> And perhaps it makes sense to look at moving the Flume support into Apache
> Bahir if there is interest (I've cc'ed Bahir dev list here)? That way the
> current state of the connector could keep going for those users who may
> need it.
>
>
+1

Apache Bahir's main goal is to provide extensions to multiple distributed
analytic platforms, extending their reach with a diversity of streaming
connectors and SQL data sources. Apache Bahir would welcome proposals to
move extensions from Apache Spark to itself; this would give the Spark dev
community more flexibility, as they could focus on core functionality
without losing the ability to enhance these extensions, since most Spark
committers have write access to the Bahir repositories. Also, users should
not see much difference, as Bahir has been creating releases for every
Spark release.

If the Spark dev community decides to go this route, please create a JIRA
on the Bahir project, and we can use this thread or a new, dedicated one to
discuss the details.

Thanks

-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-28 Thread Luciano Resende
+1 (non-binding)

Minor comment:
Apache Infra has a staging repository for adding release candidates, and it
might be better/simpler to use that instead of home.a.o. See
https://dist.apache.org/repos/dist/dev/spark/.
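
For the Java/Scala part of the testing instructions quoted below, a minimal
build.sbt sketch of adding the staging repository to a project's resolvers
(the repository URL is the one from this RC; the assumption here is that the
staged Maven artifacts carry the plain 2.1.2 version string):

  scalaVersion := "2.11.8"

  // Staging repository for this release candidate.
  resolvers += ("Apache Spark RC staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1251")

  libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.2"

As mentioned below, clean the local artifact cache before and after testing
so a stale RC does not linger in later builds.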



On Tue, Sep 26, 2017 at 9:47 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Please vote on releasing the following candidate as Apache Spark version 2
> .1.2. The vote is open until Wednesday October 4th at 23:59 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.1.2-rc2
> <https://github.com/apache/spark/tree/v2.1.2-rc2> (fabbb7f59e47590
> 114366d14e15fbbff8c88593c)
>
> List of JIRA tickets resolved in this release can be found with this
> filter.
> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~holden/spark-2.1.2-rc2-bin/
>
> Release artifacts are signed with a key from:
> https://people.apache.org/~holden/holdens_keys.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1251
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~holden/spark-2.1.2-rc2-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks; in Java/Scala you
> can add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
>
> *What should happen to JIRA tickets still targeting 2.1.2?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.3.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1. That being said if
> there is something which is a regression form 2.1.1 that has not been
> correctly targeted please ping a committer to help target the issue (you
> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.1.2%20OR%20affectedVersion%20%3D%202.1.1)>
> )
>
> *What are the unresolved* issues targeted for 2.1.2
> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.2>
> ?
>
> At this time there are no open unresolved issues.
>
> *Is there anything different about this release?*
>
> This is the first release in awhile not built on the AMPLAB Jenkins. This
> is good because it means future releases can more easily be built and
> signed securely (and I've been updating the documentation in
> https://github.com/apache/spark-website/pull/66 as I progress), however
> the chances of a mistake are higher with any change like this. If there is
> something you normally take for granted as correct when checking a release,
> please double check this time :)
>
> *Should I be committing code to branch-2.1?*
>
> Thanks for asking! Please treat this stage in the RC process as "code
> freeze" so bug fixes only. If you're uncertain if something should be back
> ported please reach out. If you do commit to branch-2.1 please tag your
> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
> 2.1.3 fixed into 2.1.2 as appropriate.
>
> *Why the longer voting window?*
>
> Since there is a large industry big data conference this week I figured
> I'd add a little bit of extra buffer time just to make sure everyone has a
> chance to take a look.
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/