Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Btw - the documnetation on github represents the source code of our
docs, which is versioned with each release. Unfortunately github will
always try to render ".md" files so it could look to a passerby like
this is supposed to represent published docs. This is a feature
limitation of github, AFAIK we cannot disable it.

The official published docs are associated with each release and
available on the apache.org website. I think "/latest" is a common
convention for referring to the latest *published release* docs, so
probably we can't change that (the audience for /latest is orders of
magnitude larger than for snapshot docs). However we could just add
/snapshot and publish docs there.

- Patrick

On Sun, Nov 30, 2014 at 6:15 PM, Patrick Wendell  wrote:
> Hey Ryan,
>
> The existing JIRA also covers publishing nightly docs:
> https://issues.apache.org/jira/browse/SPARK-1517
>
> - Patrick
>
> On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
>  wrote:
>> Thanks Nicholas, glad to hear that some of this info will be pushed to the
>> main site soon, but this brings up yet another point of confusion that I've
>> struggled with, namely whether the documentation on github or that on
>> spark.apache.org should be considered the primary reference for people
>> seeking to learn about best practices for developing Spark.
>>
>> Trying to read docs starting from
>> https://github.com/apache/spark/blob/master/docs/index.md right now, I find
>> that all of the links to other parts of the documentation are broken: they
>> point to relative paths that end in ".html", which will work when published
>> on the docs-site, but that would have to end in ".md" if a person was to be
>> able to navigate them on github.
>>
>> So expecting people to use the up-to-date docs on github (where all
>> internal URLs 404 and the main github README suggests that the "latest
>> Spark documentation" can be found on the actually-months-old docs-site
>> <https://github.com/apache/spark#online-documentation>) is not a good
>> solution. On the other hand, consulting months-old docs on the site is also
>> problematic, as this thread and your last email have borne out.  The result
>> is that there is no good place on the internet to learn about the most
>> up-to-date best practices for using/developing Spark.
>>
>> Why not build http://spark.apache.org/docs/latest/ nightly (or every
>> commit) off of what's in github, rather than having that URL point to the
>> last release's docs (up to ~3 months old)? This way, casual users who want
>> the docs for the released version they happen to be using (which is already
>> frequently != "/latest" today, for many Spark users) can (still) find them
>> at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
>> point people to a site (/latest) that actually has up-to-date docs that
>> reflect ToT and whose links work.
>>
>> If there are concerns about existing semantics around "/latest" URLs being
>> broken, some new URL could be used, like
>> http://spark.apache.org/docs/snapshot/, but given that everything under
>> http://spark.apache.org/docs/latest/ is in a state of
>> planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
>> that serious an issue to me; anyone sending around permanent links to
>> things under /latest is already going to have those links break / not make
>> sense in the near future.
>>
>>
>> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>>
>>>- currently the docs only contain information about building with
>>>maven,
>>>and even then don't cover many important cases
>>>
>>>  All other points aside, I just want to point out that the docs document
>>> both how to use Maven and SBT and clearly state
>>> <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
>>> that Maven is the "build of reference" while SBT may be preferable for
>>> day-to-day development.
>>>
>>> I believe the main reason most people miss this documentation is that,
>>> though it's up-to-date on GitHub, it has't been published yet to the docs
>>> site. It should go out with the 1.2 release.
>>>
>>> Improvements to the documentation on building Spark belong here:
>>> https://github.com/apache/spark/blob/master/docs/building-spark.md
>>>
>>> If there are clear recommendations that come out of this thread but are

Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Hey Ryan,

The existing JIRA also covers publishing nightly docs:
https://issues.apache.org/jira/browse/SPARK-1517

- Patrick

On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
 wrote:
> Thanks Nicholas, glad to hear that some of this info will be pushed to the
> main site soon, but this brings up yet another point of confusion that I've
> struggled with, namely whether the documentation on github or that on
> spark.apache.org should be considered the primary reference for people
> seeking to learn about best practices for developing Spark.
>
> Trying to read docs starting from
> https://github.com/apache/spark/blob/master/docs/index.md right now, I find
> that all of the links to other parts of the documentation are broken: they
> point to relative paths that end in ".html", which will work when published
> on the docs-site, but that would have to end in ".md" if a person was to be
> able to navigate them on github.
>
> So expecting people to use the up-to-date docs on github (where all
> internal URLs 404 and the main github README suggests that the "latest
> Spark documentation" can be found on the actually-months-old docs-site
> <https://github.com/apache/spark#online-documentation>) is not a good
> solution. On the other hand, consulting months-old docs on the site is also
> problematic, as this thread and your last email have borne out.  The result
> is that there is no good place on the internet to learn about the most
> up-to-date best practices for using/developing Spark.
>
> Why not build http://spark.apache.org/docs/latest/ nightly (or every
> commit) off of what's in github, rather than having that URL point to the
> last release's docs (up to ~3 months old)? This way, casual users who want
> the docs for the released version they happen to be using (which is already
> frequently != "/latest" today, for many Spark users) can (still) find them
> at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
> point people to a site (/latest) that actually has up-to-date docs that
> reflect ToT and whose links work.
>
> If there are concerns about existing semantics around "/latest" URLs being
> broken, some new URL could be used, like
> http://spark.apache.org/docs/snapshot/, but given that everything under
> http://spark.apache.org/docs/latest/ is in a state of
> planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
> that serious an issue to me; anyone sending around permanent links to
> things under /latest is already going to have those links break / not make
> sense in the near future.
>
>
> On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>>
>>- currently the docs only contain information about building with
>>maven,
>>and even then don't cover many important cases
>>
>>  All other points aside, I just want to point out that the docs document
>> both how to use Maven and SBT and clearly state
>> <https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt>
>> that Maven is the "build of reference" while SBT may be preferable for
>> day-to-day development.
>>
>> I believe the main reason most people miss this documentation is that,
>> though it's up-to-date on GitHub, it has't been published yet to the docs
>> site. It should go out with the 1.2 release.
>>
>> Improvements to the documentation on building Spark belong here:
>> https://github.com/apache/spark/blob/master/docs/building-spark.md
>>
>> If there are clear recommendations that come out of this thread but are
>> not in that doc, they should be added in there. Other, less important
>> details may possibly be better suited for the Contributing to Spark
>> <https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark>
>> guide.
>>
>> Nick
>>
>>
>> On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell 
>> wrote:
>>
>>> Hey Ryan,
>>>
>>> A few more things here. You should feel free to send patches to
>>> Jenkins to test them, since this is the reference environment in which
>>> we regularly run tests. This is the normal workflow for most
>>> developers and we spend a lot of effort provisioning/maintaining a
>>> very large jenkins cluster to allow developers access this resource. A
>>> common development approach is to locally run tests that you've added
>>> in a patch, then send it to jenkins for the full run, and then try to
>>> debug locally if you see specific unanticipated test failures.
>>>

Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Hey Ryan,

A few more things here. You should feel free to send patches to
Jenkins to test them, since this is the reference environment in which
we regularly run tests. This is the normal workflow for most
developers and we spend a lot of effort provisioning/maintaining a
very large jenkins cluster to allow developers access this resource. A
common development approach is to locally run tests that you've added
in a patch, then send it to jenkins for the full run, and then try to
debug locally if you see specific unanticipated test failures.

One challenge we have is that given the proliferation of OS versions,
Java versions, Python versions, ulimits, etc. there is a combinatorial
number of environments in which tests could be run. It is very hard in
some cases to figure out post-hoc why a given test is not working in a
specific environment. I think a good solution here would be to use a
standardized docker container for running Spark tests and asking folks
to use that locally if they are trying to run all of the hundreds of
Spark tests.

Another solution would be to mock out every system interaction in
Spark's tests including e.g. filesystem interactions to try and reduce
variance across environments. However, that seems difficult.

As the number of developers of Spark increases, it's definitely a good
idea for us to invest in developer infrastructure including things
like snapshot releases, better documentation, etc. Thanks for bringing
this up as a pain point.

- Patrick


On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
 wrote:
> thanks for the info, Matei and Brennon. I will try to switch my workflow to
> using sbt. Other potential action items:
>
> - currently the docs only contain information about building with maven,
> and even then don't cover many important cases, as I described in my
> previous email. If SBT is as much better as you've described then that
> should be made much more obvious. Wasn't it the case recently that there
> was only a page about building with SBT, and not one about building with
> maven? Clearer messaging around this needs to exist in the documentation,
> not just on the mailing list, imho.
>
> - +1 to better distinguishing between unit and integration tests, having
> separate scripts for each, improving documentation around common workflows,
> expectations of brittleness with each kind of test, advisability of just
> relying on Jenkins for certain kinds of tests to not waste too much time,
> etc. Things like the compiler crash should be discussed in the
> documentation, not just in the mailing list archives, if new contributors
> are likely to run into them through no fault of their own.
>
> - What is the algorithm you use to decide what tests you might have broken?
> Can we codify it in some scripts that other people can use?
>
>
>
> On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia 
> wrote:
>
>> Hi Ryan,
>>
>> As a tip (and maybe this isn't documented well), I normally use SBT for
>> development to avoid the slow build process, and use its interactive
>> console to run only specific tests. The nice advantage is that SBT can keep
>> the Scala compiler loaded and JITed across builds, making it faster to
>> iterate. To use it, you can do the following:
>>
>> - Start the SBT interactive console with sbt/sbt
>> - Build your assembly by running the "assembly" target in the assembly
>> project: assembly/assembly
>> - Run all the tests in one module: core/test
>> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
>> also supports tab completion)
>>
>> Running all the tests does take a while, and I usually just rely on
>> Jenkins for that once I've run the tests for the things I believed my patch
>> could break. But this is because some of them are integration tests (e.g.
>> DistributedSuite, which creates multi-process mini-clusters). Many of the
>> individual suites run fast without requiring this, however, so you can pick
>> the ones you want. Perhaps we should find a way to tag them so people  can
>> do a "quick-test" that skips the integration ones.
>>
>> The assembly builds are annoying but they only take about a minute for me
>> on a MacBook Pro with SBT warmed up. The assembly is actually only required
>> for some of the "integration" tests (which launch new processes), but I'd
>> recommend doing it all the time anyway since it would be very confusing to
>> run those with an old assembly. The Scala compiler crash issue can also be
>> a problem, but I don't see it very often with SBT. If it happens, I exit
>> SBT and do sbt clean.
>>
>> Anyway, this is useful feedback and I think we should try to improve some
>> of these suites, but hopefully you can also try the faster SBT process. At
>> the end of the day, if we want integration tests, the whole test process
>> will take an hour, but most of the developers I know leave that to Jenkins
>> and only run individual tests locally before submitting a patch.
>>
>> Matei
>>
>>
>> > On Nov 30, 2014, at 2:39

Re: Trouble testing after updating to latest master

2014-11-29 Thread Patrick Wendell
Sounds good. Glad you got it working.

On Sat, Nov 29, 2014 at 11:16 PM, Ganelin, Ilya
 wrote:
> I am able to successfully run sbt/sbt-compile and run the tests after
> running git clean -fdx. I¹m guessing network issues wound up corrupting
> some of the files that had been downloaded. Thanks, Patrick!
>
>
> On 11/29/14, 10:52 PM, "Patrick Wendell"  wrote:
>
>>Thanks for reporting this. One thing to try is to just do a git clean
>>to make sure you have a totally clean working space ("git clean -fdx"
>>will blow away any differences you have from the repo, of course only
>>do that if you don't have other files around). Can you reproduce this
>>if you just run "sbt/sbt compile"? Also, if you can, can you reproduce
>>it if you checkout only the spark master branch and not merged with
>>your own code? Finally, if you can reproduce it on master, can you
>>perform a bisection to find out which commit caused it?
>>
>>- Patrick
>>
>>On Sat, Nov 29, 2014 at 10:29 PM, Ganelin, Ilya
>> wrote:
>>> Hi all - I've just merged in the latest changes from the Spark master
>>>branch to my local branch. I am able to build just fine with
>>> mvm clean package
>>> However, when I attempt to run dev/run-tests, I get the following error:
>>>
>>> Using /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home
>>>as default JAVA_HOME.
>>> Note, this will be overridden by -java-home if it is set.
>>> Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
>>> [error] Got a return code of 1 on line 163 of the run-tests script.
>>>
>>> With an individual test I get the same error. I have tried downloading
>>>a new copy of SBT 0.13.6 but it has not helped. Does anyone have any
>>>suggestions for getting this running? Things worked fine before updating
>>>Spark.
>>> 
>>>
>>> The information contained in this e-mail is confidential and/or
>>>proprietary to Capital One and/or its affiliates. The information
>>>transmitted herewith is intended only for use by the individual or
>>>entity to which it is addressed.  If the reader of this message is not
>>>the intended recipient, you are hereby notified that any review,
>>>retransmission, dissemination, distribution, copying or other use of, or
>>>taking of any action in reliance upon this information is strictly
>>>prohibited. If you have received this communication in error, please
>>>contact the sender and delete the material from your computer.
>
> 
>
> The information contained in this e-mail is confidential and/or proprietary 
> to Capital One and/or its affiliates. The information transmitted herewith is 
> intended only for use by the individual or entity to which it is addressed.  
> If the reader of this message is not the intended recipient, you are hereby 
> notified that any review, retransmission, dissemination, distribution, 
> copying or other use of, or taking of any action in reliance upon this 
> information is strictly prohibited. If you have received this communication 
> in error, please contact the sender and delete the material from your 
> computer.
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Trouble testing after updating to latest master

2014-11-29 Thread Patrick Wendell
Thanks for reporting this. One thing to try is to just do a git clean
to make sure you have a totally clean working space ("git clean -fdx"
will blow away any differences you have from the repo, of course only
do that if you don't have other files around). Can you reproduce this
if you just run "sbt/sbt compile"? Also, if you can, can you reproduce
it if you checkout only the spark master branch and not merged with
your own code? Finally, if you can reproduce it on master, can you
perform a bisection to find out which commit caused it?

- Patrick

On Sat, Nov 29, 2014 at 10:29 PM, Ganelin, Ilya
 wrote:
> Hi all - I've just merged in the latest changes from the Spark master branch 
> to my local branch. I am able to build just fine with
> mvm clean package
> However, when I attempt to run dev/run-tests, I get the following error:
>
> Using /Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home as 
> default JAVA_HOME.
> Note, this will be overridden by -java-home if it is set.
> Error: Invalid or corrupt jarfile sbt/sbt-launch-0.13.6.jar
> [error] Got a return code of 1 on line 163 of the run-tests script.
>
> With an individual test I get the same error. I have tried downloading a new 
> copy of SBT 0.13.6 but it has not helped. Does anyone have any suggestions 
> for getting this running? Things worked fine before updating Spark.
> 
>
> The information contained in this e-mail is confidential and/or proprietary 
> to Capital One and/or its affiliates. The information transmitted herewith is 
> intended only for use by the individual or entity to which it is addressed.  
> If the reader of this message is not the intended recipient, you are hereby 
> notified that any review, retransmission, dissemination, distribution, 
> copying or other use of, or taking of any action in reliance upon this 
> information is strictly prohibited. If you have received this communication 
> in error, please contact the sender and delete the material from your 
> computer.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-29 Thread Patrick Wendell
Thanks for pointing this out, Matei. I don't think a minor typo like
this is a big deal. Hopefully it's clear to everyone this is the 1.2.0
release vote, as indicated by the subject and all of the artifacts.

On Sat, Nov 29, 2014 at 1:26 AM, Matei Zaharia  wrote:
> Hey Patrick, unfortunately you got some of the text here wrong, saying 1.1.0 
> instead of 1.2.0. Not sure it will matter since there can well be another RC 
> after testing, but we should be careful.
>
> Matei
>
>> On Nov 28, 2014, at 9:16 PM, Patrick Wendell  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 1.2.0!
>>
>> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1048/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.2.0!
>>
>> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.1.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> == What justifies a -1 vote for this release? ==
>> This vote is happening very late into the QA period compared with
>> previous votes, so -1 votes should only occur for significant
>> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>> regressions, or bugs related to new features will not block this
>> release.
>>
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.shuffle.blockTransferService" has been
>> changed to "netty"
>> --> Old behavior can be restored by switching to "nio"
>>
>> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
>> --> Old behavior can be restored by setting "spark.shuffle.manager" to 
>> "hash".
>>
>> == Other notes ==
>> Because this vote is occurring over a weekend, I will likely extend
>> the vote if this RC survives until the end of the vote period.
>>
>> - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-28 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.0!

The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1048/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.0!

The vote is open until Tuesday, December 02, at 05:15 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.1.X, minor
regressions, or bugs related to new features will not block this
release.

== What default changes should I be aware of? ==
1. The default value of "spark.shuffle.blockTransferService" has been
changed to "netty"
--> Old behavior can be restored by switching to "nio"

2. The default value of "spark.shuffle.manager" has been changed to "sort".
--> Old behavior can be restored by setting "spark.shuffle.manager" to "hash".

== Other notes ==
Because this vote is occurring over a weekend, I will likely extend
the vote if this RC survives until the end of the vote period.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Notes on writing complex spark applications

2014-11-23 Thread Patrick Wendell
Hey Evan,

It might be nice to merge this into existing documentation. In
particular, a lot of this could serve to update the current tuning
section and programming guides.

It could also work to paste this wholesale as a reference for Spark
users, but in that case it's less likely to get updated when other
things change, or be found by users reading through the spark docs.

- Patrick

On Sun, Nov 23, 2014 at 8:27 PM, Inkyu Lee  wrote:
> Very helpful!!
>
> thank you very much!
>
> 2014-11-24 2:17 GMT+09:00 Sam Bessalah :
>
>> Thanks Evan, this is great.
>> On Nov 23, 2014 5:58 PM, "Evan R. Sparks"  wrote:
>>
>> > Hi all,
>> >
>> > Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been
>> > working on a short document about writing high performance Spark
>> > applications based on our experience developing MLlib, GraphX, ml-matrix,
>> > pipelines, etc. It may be a useful document both for users and new Spark
>> > developers - perhaps it should go on the wiki?
>> >
>> > The document itself is here:
>> >
>> >
>> https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
>> > and I've created SPARK-4565
>> >  to track this.
>> >
>> > - Evan
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-23 Thread Patrick Wendell
Hey Stephen,

Thanks for bringing this up. Technically when we call a release vote
it needs to be on the exact commit that will be the final release.
However, one thing I've thought of doing for a while would be to
publish the maven artifacts using a version tag with $VERSION-rcX even
if the underlying commit has $VERSION in the pom files. Some recent
changes I've made to the way we do publishing in branch 1.2 should
make this pretty easy - it wasn't very easy before because we used
maven's publishing plugin which makes modifying the published version
tricky. Our current approach is, indeed, problematic because maven
artifacts are supposed to be immutable once they have a specific
version identifier.

I created SPARK-4568 to track this:
https://issues.apache.org/jira/browse/SPARK-4568

- Patrick

On Sun, Nov 23, 2014 at 8:11 PM, Matei Zaharia  wrote:
> Interesting, perhaps we could publish each one with two IDs, of which the rc 
> one is unofficial. The problem is indeed that you have to vote on a hash for 
> a potentially final artifact.
>
> Matei
>
>> On Nov 23, 2014, at 7:54 PM, Stephen Haberman  
>> wrote:
>>
>> Hi,
>>
>> I wanted to try 1.1.1-rc2 because we're running into SPARK-3633, but
>> the"rc" releases not being tagged with "-rcX" means the pre-built artifacts
>> are basically useless to me.
>>
>> (Pedantically, to test a release, I have to upload it into our internal
>> repo, to compile jobs, start clusters, etc. Invariably when an rcX artifact
>> ends up not being final, then I'm screwed, because I would have to clear
>> the local cache of any of our machines, dev/Jenkins/etc., that ever
>> downloaded the "formerly known as 1.1.1 but not really" rc artifacts.)
>>
>> What's frustrating is that I know other Apache projects do rc releases, and
>> even get them into Maven central, e.g.:
>>
>> http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.tapestry%22%20AND%20a%3A%22tapestry-ioc%22
>>
>> So, I apologize for the distraction from getting real work done, but
>> perhaps you guys could find a creative way to work around the
>> well-intentioned mandate on artifact voting?
>>
>> (E.g. perhaps have multiple votes, one for each successive rc (with -rcX
>> suffix), then, once blessed, another one on the actually-final/no-rcX
>> artifact (built from the last rc's tag); or publish no-rcX artifacts for
>> official voting, as today, but then, at the same time, add -rcX artifacts
>> to Maven central for non-binding/3rd party testing, etc.)
>>
>> Thanks,
>> Stephen
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-23 Thread Patrick Wendell
+1 (binding).

Don't see any evidence of regressions at this point. The issue
reported by Hector was not related to this rlease.

On Sun, Nov 23, 2014 at 9:50 AM, Debasish Das  wrote:
> -1 from me...same FetchFailed issue as what Hector saw...
>
> I am running Netflix dataset and dumping out recommendation for all users.
> It shuffles around 100 GB data on disk to run a reduceByKey per user on
> utils.BoundedPriorityQueue...The code runs fine with MovieLens1m dataset...
>
> I gave Spark 10 nodes, 8 cores, 160 GB of memory.
>
> Fails with the following FetchFailed errors.
>
> 14/11/23 11:51:22 WARN TaskSetManager: Lost task 28.0 in stage 188.0 (TID
> 2818, tblpmidn08adv-hdp.tdc.vzwcorp.com): FetchFailed(BlockManagerId(1,
> tblpmidn03adv-hdp.tdc.vzwcorp.com, 52528, 0), shuffleId=35, mapId=28,
> reduceId=28)
>
> It's a consistent behavior on master as well.
>
> I tested it both on YARN and Standalone. I compiled spark-1.1 branch
> (assuming it has all the fixes from RC2 tag.
>
> I am now compiling spark-1.0 branch and see if this issue shows up there as
> well. If it is related to hash/sort based shuffle most likely it won't show
> up on 1.0.
>
> Thanks.
>
> Deb
>
> On Thu, Nov 20, 2014 at 12:16 PM, Hector Yee  wrote:
>
>> Whoops I must have used the 1.2 preview and mixed them up.
>>
>> spark-shell -version shows  version 1.2.0
>>
>> Will update the bug https://issues.apache.org/jira/browse/SPARK-4516 to
>> 1.2
>>
>> On Thu, Nov 20, 2014 at 11:59 AM, Matei Zaharia 
>> wrote:
>>
>> > Ah, I see. But the spark.shuffle.blockTransferService property doesn't
>> > exist in 1.1 (AFAIK) -- what exactly are you doing to get this problem?
>> >
>> > Matei
>> >
>> > On Nov 20, 2014, at 11:50 AM, Hector Yee  wrote:
>> >
>> > This is whatever was in http://people.apache.org/~andrewor14/spark-1
>> > .1.1-rc2/
>> >
>> > On Thu, Nov 20, 2014 at 11:48 AM, Matei Zaharia > >
>> > wrote:
>> >
>> >> Hector, is this a comment on 1.1.1 or on the 1.2 preview?
>> >>
>> >> Matei
>> >>
>> >> > On Nov 20, 2014, at 11:39 AM, Hector Yee 
>> wrote:
>> >> >
>> >> > I think it is a race condition caused by netty deactivating a channel
>> >> while
>> >> > it is active.
>> >> > Switched to nio and it works fine
>> >> > --conf spark.shuffle.blockTransferService=nio
>> >> >
>> >> > On Thu, Nov 20, 2014 at 10:44 AM, Hector Yee 
>> >> wrote:
>> >> >
>> >> >> I'm still seeing the fetch failed error and updated
>> >> >> https://issues.apache.org/jira/browse/SPARK-3633
>> >> >>
>> >> >> On Thu, Nov 20, 2014 at 10:21 AM, Marcelo Vanzin <
>> van...@cloudera.com>
>> >> >> wrote:
>> >> >>
>> >> >>> +1 (non-binding)
>> >> >>>
>> >> >>> . ran simple things on spark-shell
>> >> >>> . ran jobs in yarn client & cluster modes, and standalone cluster
>> mode
>> >> >>>
>> >> >>> On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or 
>> >> wrote:
>> >>  Please vote on releasing the following candidate as Apache Spark
>> >> version
>> >>  1.1.1.
>> >> 
>> >>  This release fixes a number of bugs in Spark 1.1.0. Some of the
>> >> notable
>> >> >>> ones
>> >>  are
>> >>  - [SPARK-3426] Sort-based shuffle compression settings are
>> >> incompatible
>> >>  - [SPARK-3948] Stream corruption issues in sort-based shuffle
>> >>  - [SPARK-4107] Incorrect handling of Channel.read() led to data
>> >> >>> truncation
>> >>  The full list is at http://s.apache.org/z9h and in the CHANGES.txt
>> >> >>> attached.
>> >> 
>> >>  Additionally, this candidate fixes two blockers from the previous
>> RC:
>> >>  - [SPARK-4434] Cluster mode jar URLs are broken
>> >>  - [SPARK-4480][SPARK-4467] Too many open files exception from
>> shuffle
>> >> >>> spills
>> >> 
>> >>  The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d):
>> >>  http://s.apache.org/p8
>> >> 
>> >>  The release files, including signatures, digests, etc can be found
>> >> at:
>> >>  http://people.apache.org/~andrewor14/spark-1.1.1-rc2/
>> >> 
>> >>  Release artifacts are signed with the following key:
>> >>  https://people.apache.org/keys/committer/andrewor14.asc
>> >> 
>> >>  The staging repository for this release can be found at:
>> >> 
>> >> https://repository.apache.org/content/repositories/orgapachespark-1043/
>> >> 
>> >>  The documentation corresponding to this release can be found at:
>> >>  http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/
>> >> 
>> >>  Please vote on releasing this package as Apache Spark 1.1.1!
>> >> 
>> >>  The vote is open until Saturday, November 22, at 23:00 UTC and
>> >> passes if
>> >>  a majority of at least 3 +1 PMC votes are cast.
>> >>  [ ] +1 Release this package as Apache Spark 1.1.1
>> >>  [ ] -1 Do not release this package because ...
>> >> 
>> >>  To learn more about Apache Spark, please see
>> >>  http://spark.apache.org/
>> >> 
>> >>  Cheers,
>> >>  Andrew
>> >> 
>> >> 
>> >> 
>> --

Re: Apache infra github sync down

2014-11-22 Thread Patrick Wendell
Hi All,

Unfortunately this went back down again. I've opened a new JIRA to track it:

https://issues.apache.org/jira/browse/INFRA-8688

- Patrick

On Tue, Nov 18, 2014 at 10:24 PM, Patrick Wendell  wrote:
> Hey All,
>
> The Apache-->github mirroring is not working right now and hasn't been
> working fo more than 24 hours. This means that pull requests will not
> appear as closed even though they have been merged. It also causes
> diffs to display incorrectly in some cases. If you'd like to follow
> progress by Apache infra on this issue you can watch this JIRA:
>
> https://issues.apache.org/jira/browse/INFRA-8654
>
> - Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How spark and hive integrate in long term?

2014-11-22 Thread Patrick Wendell
There are two distinct topics when it comes to hive integration. Part
of the 1.3 roadmap will likely be better defining the plan for Hive
integration as Hive adds future versions.

1. Ability to interact with Hive metastore's from different versions
==> I.e. if a user has a metastore, can Spark SQL read the data? This
one we want need to solve by asking Hive for a stable metastore thrift
API, or adding sufficient features to the HCatalog API so we can use
that.

2. Compatibility with HQL over time as Hive adds new features.
==> This relates to how often we update our internal library
dependency on Hive and/or build support for new Hive features
internally.

On Sat, Nov 22, 2014 at 10:01 AM, Zhan Zhang  wrote:
> Thanks Cheng for the insights.
>
> Regarding the HCatalog, I did some initial investigation too and agree with 
> you. As of now, it seems not a good solution. I will try to talk to Hive 
> people to see whether there is such guarantee for downward compatibility for 
> thrift protocol. By the way, I tried some basic functions using hive-0.13 
> connect to hive-0.14 metastore, and it looks like they are compatible.
>
> Thanks.
>
> Zhan Zhang
>
>
> On Nov 22, 2014, at 7:14 AM, Cheng Lian  wrote:
>
>> Should emphasize that this is still a quick and rough conclusion, will 
>> investigate this in more detail after 1.2.0 release. Anyway we really like 
>> to provide Hive support in Spark SQL as smooth and clean as possible for 
>> both developers and end users.
>>
>> On 11/22/14 11:05 PM, Cheng Lian wrote:
>>>
>>> Hey Zhan,
>>>
>>> This is a great question. We are also seeking for a stable API/protocol 
>>> that works with multiple Hive versions (esp. 0.12+). SPARK-4114 
>>>  was opened for this. Did 
>>> some research into HCatalog recently, but I must confess that I'm not an 
>>> expert on HCatalog, actually spent only 1 day on exploring it. So please 
>>> don't hesitate to correct me if I was wrong about the conclusions I made 
>>> below.
>>>
>>> First, although HCatalog API is more pleasant to work with, it's 
>>> unfortunately feature incomplete. It only provides a subset of most 
>>> commonly used operations. For example, |HCatCreateTableDesc| maps only a 
>>> subset of |CreateTableDesc|, properties like |storeAsSubDirectories|, 
>>> |skewedColNames| and |skewedColValues| are missing. It's also impossible to 
>>> alter table properties via HCatalog API (Spark SQL uses this to implement 
>>> the |ANALYZE| command). The |hcat| CLI tool provides all those features 
>>> missing in HCatalog API via raw Metastore API, and is structurally similar 
>>> to the old Hive CLI.
>>>
>>> Second, HCatalog API itself doesn't ensure compatibility, it's the Thrift 
>>> protocol that matters. HCatalog is directly built upon raw Metastore API, 
>>> and talks the same Metastore Thrift protocol. The problem we encountered in 
>>> Spark SQL is that, usually we deploy Spark SQL Hive support with embedded 
>>> mode (for testing) or local mode Metastore, and this makes us suffer from 
>>> things like Metastore database schema changes. If Hive Metastore Thrift 
>>> protocol is guaranteed to be downward compatible, then hopefully we can 
>>> resort to remote mode Metastore and always depend on most recent Hive APIs. 
>>> I had a glance of Thrift protocol version handling code in Hive, it seems 
>>> that downward compatibility is not an issue. However I didn't find any 
>>> official documents about Thrift protocol compatibility.
>>>
>>> That said, in the future, hopefully we can only depend on most recent Hive 
>>> dependencies and remove the Hive shim layer introduced in branch 1.2. For 
>>> users who use exactly the same version of Hive as Spark SQL, they can use 
>>> either remote or local/embedded Metastore; while for users who want to 
>>> interact with existing legacy Hive clusters, they have to setup a remote 
>>> Metastore and let the Thrift protocol to handle compatibility.
>>>
>>> -- Cheng
>>>
>>> On 11/22/14 6:51 AM, Zhan Zhang wrote:
>>>
 Now Spark and hive integration is a very nice feature. But I am wondering
 what the long term roadmap is for spark integration with hive. Both of 
 these
 two projects are undergoing fast improvement and changes. Currently, my
 understanding is that spark hive sql part relies on hive meta store and
 basic parser to operate, and the thrift-server intercept hive query and
 replace it with its own engine.

 With every release of hive, there need a significant effort on spark part 
 to
 support it.

 For the metastore part, we may possibly replace it with hcatalog. But given
 the dependency of other parts on hive, e.g., metastore, thriftserver,
 hcatlog may not be able to help much.

 Does anyone have any insight or idea in mind?

 Thanks.

 Zhan Zhang



 --
 View this message in 
 context:http://apache-spark-developers-list.10015

Automated github closing of issues is not working

2014-11-21 Thread Patrick Wendell
After we merge pull requests in Spark they are closed via a special
message we put in each commit description ("Closes #XXX"). This
feature stopped working around 21 hours ago causing already-merged
pull requests to display as open.

I've contacted Github support with the issue. No word from them yet.

It is not clear whether this relates to recently delays syncing with Github.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark development with IntelliJ

2014-11-20 Thread Patrick Wendell
Hi All,

I noticed people sometimes struggle to get Spark set up in IntelliJ.
I'd like to maintain comprehensive instructions on our Wiki to make
this seamless for future developers. Due to some nuances of our build,
getting to the point where you can build + test every module from
within the IDE is not trivial. I created a reference here:

https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA

I'd love people to independently test this and/or share potential improvements.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Build break

2014-11-19 Thread Patrick Wendell
Hey All,

Just a heads up. I merged this patch last night which caused the Spark
build to break:

https://github.com/apache/spark/commit/397d3aae5bde96b01b4968dde048b6898bb6c914

The patch itself was fine and previously had passed on Jenkins. The
issue was that other intermediate changes merged since it last passed,
and the combination of those changes with the patch caused an issue
with our binary compatibility tests. This kind of race condition can
happen from time to time.

I've merged in a hot fix that should resolve this:
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0df02ca463a4126e5437b37114c6759a57ab71ee

We'll keep an eye this and make sure future builds are passing.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Apache infra github sync down

2014-11-18 Thread Patrick Wendell
Hey All,

The Apache-->github mirroring is not working right now and hasn't been
working fo more than 24 hours. This means that pull requests will not
appear as closed even though they have been merged. It also causes
diffs to display incorrectly in some cases. If you'd like to follow
progress by Apache infra on this issue you can watch this JIRA:

https://issues.apache.org/jira/browse/INFRA-8654

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Patrick Wendell
Hey Kevin,

If you are upgrading from 1.0.X to 1.1.X checkout the upgrade notes
here [1] - it could be that default changes caused a regression for
your workload. Do you still see a regression if you restore the
configuration changes?

It's great to hear specifically about issues like this, so please fork
a new thread and describe your workload if you see a regression. The
main focus of a patch release vote like this is to test regressions
against the previous release on the same line (e.g. 1.1.1 vs 1.1.0)
though of course we still want to be cognizant of 1.0-to-1.1
regressions and make sure we can address them down the road.

[1] https://spark.apache.org/releases/spark-release-1-1-0.html

On Mon, Nov 17, 2014 at 2:04 PM, Kevin Markey  wrote:
> +0 (non-binding)
>
> Compiled Spark, recompiled and ran application with 1.1.1 RC1 with Yarn,
> plain-vanilla Hadoop 2.3.0. No regressions.
>
> However, 12% to 22% increase in run time relative to 1.0.0 release.  (No
> other environment or configuration changes.)  Would have recommended +1 were
> it not for added latency.
>
> Not sure if added latency a function of 1.0 vs 1.1 or 1.0 vs 1.1.1 changes,
> as we've never tested with 1.1.0. But thought I'd share the results.  (This
> is somewhat disappointing.)
>
> Kevin Markey
>
>
> On 11/17/2014 11:42 AM, Debasish Das wrote:
>>
>> Andrew,
>>
>> I put up 1.1.1 branch and I am getting shuffle failures while doing
>> flatMap
>> followed by groupBy...My cluster memory is less than the memory I need and
>> therefore flatMap does around 400 GB of shuffle...memory is around 120
>> GB...
>>
>> 14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in stage 191.0 (TID
>> 4084, istgbd020.hadoop.istg.verizon.com): FetchFailed(null, shuffleId=4,
>> mapId=-1, reduceId=22)
>>
>> I searched on user-list and this issue has been found over there:
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-partitionBy-FetchFailed-td14760.html
>>
>> I wanted to make sure whether 1.1.1 does not have the same bug...-1 from
>> me
>> till we figure out the root cause...
>>
>> Thanks.
>>
>> Deb
>>
>> On Mon, Nov 17, 2014 at 10:33 AM, Andrew Or  wrote:
>>
>>> This seems like a legitimate blocker. We will cut another RC to include
>>> the
>>> revert.
>>>
>>> 2014-11-16 17:29 GMT-08:00 Kousuke Saruta :
>>>
 Now I've finished to revert for SPARK-4434 and opened PR.


 (2014/11/16 17:08), Josh Rosen wrote:

> -1
>
> I found a potential regression in 1.1.1 related to spark-submit and
> cluster
> deploy mode: https://issues.apache.org/jira/browse/SPARK-4434
>
> I think that this is worth fixing.
>
> On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian 
> wrote:
>
>   +1
>>
>>
>> Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues
>> are
>> fixed. Hive version inspection works as expected.
>>
>>
>> On 11/15/14 8:25 AM, Zach Fry wrote:
>>
>>   +0
>>>
>>>
>>> I expect to start testing on Monday but won't have enough results to
>>> change
>>> my vote from +0
>>> until Monday night or Tuesday morning.
>>>
>>> Thanks,
>>> Zach
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-
>>> developers-list.1001551.n3.nabble.com/VOTE-Release-
>>> Apache-Spark-1-1-1-RC1-tp9311p9370.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>>
>>>
>>> -
>>
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>>

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-17 Thread Patrick Wendell
Hi All,

I've just posted a preview of the Spark 1.2.0. release for community
regression testing.

Issues reported now will get close attention, so please help us test!
You can help by running an existing Spark 1.X workload on this and
reporting any regressions. As we start voting, etc, the bar for
reported issues to hold the release will get higher and higher, so
test early!

The tag to be is v1.2.0-snapshot1 (commit 38c1fbd96)

The release files, including signatures, digests, etc can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-snapshot1

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1038/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-snapshot1-docs/

== Notes ==
- Maven artifacts are published for both Scala 2.10 and 2.11. Binary
distributions are not posted for Scala 2.11 yet, but will be posted
soon.

- There are two significant config default changes that users may want
to revert if doing A:B testing against older versions.

"spark.shuffle.manager" default has changed to "sort" (was "hash")
"spark.shuffle.blockTransferService" default has changed to "netty" (was "nio")

- This release contains a shuffle service for YARN. This jar is
present in all Hadoop 2.X binary packages in
"lib/spark-1.2.0-yarn-shuffle.jar"

Cheers,
Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Patrick Wendell
Neither is strictly optimal which is why we ended up supporting both.
Our reference build for packaging is Maven so you are less likely to
run into unexpected dependency issues, etc. Many developers use sbt as
well. It's somewhat religion and the best thing might be to try both
and see which you prefer.

- Patrick

On Sun, Nov 16, 2014 at 1:47 PM, Mark Hamstra  wrote:
>>
>> The console mode of sbt (just run
>> sbt/sbt and then a long running console session is started that will accept
>> further commands) is great for building individual subprojects or running
>> single test suites.  In addition to being faster since its a long running
>> JVM, its got a lot of nice features like tab-completion for test case
>> names.
>
>
> We include the scala-maven-plugin in spark/pom.xml, so equivalent
> functionality is available using Maven.  You can start a console session
> with `mvn scala:console`.
>
>
> On Sun, Nov 16, 2014 at 1:23 PM, Michael Armbrust 
> wrote:
>
>> I'm going to have to disagree here.  If you are building a release
>> distribution or integrating with legacy systems then maven is probably the
>> correct choice.  However most of the core developers that I know use sbt,
>> and I think its a better choice for exploration and development overall.
>> That said, this probably falls into the category of a religious argument so
>> you might want to look at both options and decide for yourself.
>>
>> In my experience the SBT build is significantly faster with less effort
>> (and I think sbt is still faster even if you go through the extra effort of
>> installing zinc) and easier to read.  The console mode of sbt (just run
>> sbt/sbt and then a long running console session is started that will accept
>> further commands) is great for building individual subprojects or running
>> single test suites.  In addition to being faster since its a long running
>> JVM, its got a lot of nice features like tab-completion for test case
>> names.
>>
>> For example, if I wanted to see what test cases are available in the SQL
>> subproject you can do the following:
>>
>> [marmbrus@michaels-mbp spark (tpcds)]$ sbt/sbt
>> [info] Loading project definition from
>> /Users/marmbrus/workspace/spark/project/project
>> [info] Loading project definition from
>>
>> /Users/marmbrus/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
>> [info] Set current project to spark-parent (in build
>> file:/Users/marmbrus/workspace/spark/)
>> > sql/test-only **
>> --
>>  org.apache.spark.sql.CachedTableSuite
>> org.apache.spark.sql.DataTypeSuite
>>  org.apache.spark.sql.DslQuerySuite
>> org.apache.spark.sql.InsertIntoSuite
>> ...
>>
>> Another very useful feature is the development console, which starts an
>> interactive REPL including the most recent version of the code and a lot of
>> useful imports for some subprojects.  For example in the hive subproject it
>> automatically sets up a temporary database with a bunch of test data
>> pre-loaded:
>>
>> $ sbt/sbt hive/console
>> > hive/console
>> ...
>> import org.apache.spark.sql.hive._
>> import org.apache.spark.sql.hive.test.TestHive._
>> import org.apache.spark.sql.parquet.ParquetTestData
>> Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.7.0_45).
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala> sql("SELECT * FROM src").take(2)
>> res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86])
>>
>> Michael
>>
>> On Sun, Nov 16, 2014 at 3:27 AM, Dinesh J. Weerakkody <
>> dineshjweerakk...@gmail.com> wrote:
>>
>> > Hi Stephen and Sean,
>> >
>> > Thanks for correction.
>> >
>> > On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen  wrote:
>> >
>> > > No, the Maven build is the main one.  I would use it unless you have a
>> > > need to use the SBT build in particular.
>> > > On Nov 16, 2014 2:58 AM, "Dinesh J. Weerakkody" <
>> > > dineshjweerakk...@gmail.com> wrote:
>> > >
>> > >> Hi Yiming,
>> > >>
>> > >> I believe that both SBT and MVN is supported in SPARK, but SBT is
>> > >> preferred
>> > >> (I'm not 100% sure about this :) ). When I'm using MVN I got some
>> build
>> > >> failures. After that used SBT and works fine.
>> > >>
>> > >> You can go through these discussions regarding SBT vs MVN and learn
>> pros
>> > >> and cons of both [1] [2].
>> > >>
>> > >> [1]
>> > >>
>> > >>
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-td2315.html
>> > >>
>> > >> [2]
>> > >>
>> > >>
>> >
>> https://groups.google.com/forum/#!msg/spark-developers/OxL268v0-Qs/fBeBY8zmh3oJ
>> > >>
>> > >> Thanks,
>> > >>
>> > >> On Sun, Nov 16, 2014 at 7:11 AM, Yiming (John) Zhang <
>> sdi...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> >
>> > >> >
>> > >> > I am new in developing Spark and my current focus is about
>> > >> co-scheduling of
>> > >> > spark tasks. However, I am confused with the building tools:
>> sometimes
>> > >> the
>> > >> > doc

Re: Has anyone else observed this build break?

2014-11-15 Thread Patrick Wendell
Sounds like this is pretty specific to my environment so not a big
deal then. However, if we can safely exclude those packages it's worth
doing.

On Sat, Nov 15, 2014 at 7:27 AM, Ted Yu  wrote:
> I couldn't reproduce the problem using:
>
> java version "1.6.0_65"
> Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
> Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)
>
> Since hbase-annotations is a transitive dependency, I created the following
> pull request to exclude it from various hbase modules:
> https://github.com/apache/spark/pull/3286
>
> Cheers
>
> https://github.com/apache/spark/pull/3286
>
> On Sat, Nov 15, 2014 at 6:56 AM, Ted Yu  wrote:
>>
>> Sorry for the late reply.
>>
>> I tested my patch on Mac with the following JDK:
>>
>> java version "1.7.0_60"
>> Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
>> Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
>>
>> Let me see if the problem can be solved upstream in HBase
>> hbase-annotations module.
>>
>> Cheers
>>
>> On Fri, Nov 14, 2014 at 12:32 PM, Patrick Wendell 
>> wrote:
>>>
>>> I think in this case we can probably just drop that dependency, so
>>> there is a simpler fix. But mostly I'm curious whether anyone else has
>>> observed this.
>>>
>>> On Fri, Nov 14, 2014 at 12:24 PM, Hari Shreedharan
>>>  wrote:
>>> > Seems like a comment on that page mentions a fix, which would add yet
>>> > another profile though -- specifically telling mvn that if it is an
>>> > apple
>>> > jdk, use the classes.jar as the tools.jar as well, since Apple-packaged
>>> > JDK
>>> > 6 bundled them together.
>>> >
>>> > Link:
>>> > http://permalink.gmane.org/gmane.comp.java.maven-plugins.mojo.user/4320
>>> >
>>> > I didn't test it, but maybe this can fix it?
>>> >
>>> > Thanks,
>>> > Hari
>>> >
>>> >
>>> > On Fri, Nov 14, 2014 at 12:21 PM, Patrick Wendell 
>>> > wrote:
>>> >>
>>> >> A work around for this fix is identified here:
>>> >>
>>> >>
>>> >> http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html
>>> >>
>>> >> However, if this affects more users I'd prefer to just fix it properly
>>> >> in our build.
>>> >>
>>> >> On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell 
>>> >> wrote:
>>> >> > A recent patch broke clean builds for me, I am trying to see how
>>> >> > widespread this issue is and whether we need to revert the patch.
>>> >> >
>>> >> > The error I've seen is this when building the examples project:
>>> >> >
>>> >> > spark-examples_2.10: Could not resolve dependencies for project
>>> >> > org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
>>> >> > find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
>>> >> >
>>> >> >
>>> >> > /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar
>>> >> >
>>> >> > The reason for this error is that hbase-annotations is using a
>>> >> > "system" scoped dependency in their hbase-annotations pom, and this
>>> >> > doesn't work with certain JDK layouts such as that provided on Mac
>>> >> > OS:
>>> >> >
>>> >> >
>>> >> >
>>> >> > http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom
>>> >> >
>>> >> > Has anyone else seen this or is it just me?
>>> >> >
>>> >> > - Patrick
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>>> >>
>>> >
>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Has anyone else observed this build break?

2014-11-14 Thread Patrick Wendell
I think in this case we can probably just drop that dependency, so
there is a simpler fix. But mostly I'm curious whether anyone else has
observed this.

On Fri, Nov 14, 2014 at 12:24 PM, Hari Shreedharan
 wrote:
> Seems like a comment on that page mentions a fix, which would add yet
> another profile though -- specifically telling mvn that if it is an apple
> jdk, use the classes.jar as the tools.jar as well, since Apple-packaged JDK
> 6 bundled them together.
>
> Link:
> http://permalink.gmane.org/gmane.comp.java.maven-plugins.mojo.user/4320
>
> I didn't test it, but maybe this can fix it?
>
> Thanks,
> Hari
>
>
> On Fri, Nov 14, 2014 at 12:21 PM, Patrick Wendell 
> wrote:
>>
>> A work around for this fix is identified here:
>>
>> http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html
>>
>> However, if this affects more users I'd prefer to just fix it properly
>> in our build.
>>
>> On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell 
>> wrote:
>> > A recent patch broke clean builds for me, I am trying to see how
>> > widespread this issue is and whether we need to revert the patch.
>> >
>> > The error I've seen is this when building the examples project:
>> >
>> > spark-examples_2.10: Could not resolve dependencies for project
>> > org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
>> > find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
>> >
>> > /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar
>> >
>> > The reason for this error is that hbase-annotations is using a
>> > "system" scoped dependency in their hbase-annotations pom, and this
>> > doesn't work with certain JDK layouts such as that provided on Mac OS:
>> >
>> >
>> > http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom
>> >
>> > Has anyone else seen this or is it just me?
>> >
>> > - Patrick
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Has anyone else observed this build break?

2014-11-14 Thread Patrick Wendell
A work around for this fix is identified here:
http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html

However, if this affects more users I'd prefer to just fix it properly
in our build.

On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell  wrote:
> A recent patch broke clean builds for me, I am trying to see how
> widespread this issue is and whether we need to revert the patch.
>
> The error I've seen is this when building the examples project:
>
> spark-examples_2.10: Could not resolve dependencies for project
> org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
> find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
> /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar
>
> The reason for this error is that hbase-annotations is using a
> "system" scoped dependency in their hbase-annotations pom, and this
> doesn't work with certain JDK layouts such as that provided on Mac OS:
>
> http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom
>
> Has anyone else seen this or is it just me?
>
> - Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Has anyone else observed this build break?

2014-11-14 Thread Patrick Wendell
A recent patch broke clean builds for me, I am trying to see how
widespread this issue is and whether we need to revert the patch.

The error I've seen is this when building the examples project:

spark-examples_2.10: Could not resolve dependencies for project
org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar

The reason for this error is that hbase-annotations is using a
"system" scoped dependency in their hbase-annotations pom, and this
doesn't work with certain JDK layouts such as that provided on Mac OS:

http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom

Has anyone else seen this or is it just me?

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Patrick Wendell
> That's true, but note the code I posted activates a profile based on
> the lack of a property being set, which is why it works. Granted, I
> did not test that if you activate the other profile, the one with the
> property check will be disabled.

Ah yeah good call - I so then we'd trigger 2.11-vs-not based on the
presence of -Dscala-2.11.

Would that fix this issue then? It might be a simpler fix to merge
into the 1.2 branch than Sandy's patch since we're pretty late in the
game (though that patch does other things separately that I'd like to
see end up in Spark soon).

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-13 Thread Patrick Wendell
Hey Marcelo,

I'm not sure chaining activation works like that. At least in my
experience activation based on properties only works for properties
explicitly specified at the command line rather than declared
elsewhere in the pom.

https://gist.github.com/pwendell/6834223e68f254e6945e

I any case, I think Prashant just didn't document that his patch
required -Pscala-2.10 explicitly, which is what he said further up in
the thread. And Sandy has a solution that has better behavior than
that, which is nice.

- Patrick

On Thu, Nov 13, 2014 at 10:15 AM, Sandy Ryza  wrote:
> https://github.com/apache/spark/pull/3239 addresses this
>
> On Thu, Nov 13, 2014 at 10:05 AM, Marcelo Vanzin 
> wrote:
>>
>> Hello there,
>>
>> So I just took a quick look at the pom and I see two problems with it.
>>
>> - "activatedByDefault" does not work like you think it does. It only
>> "activates by default" if you do not explicitly activate other
>> profiles. So if you do "mvn package", scala-2.10 will be activated;
>> but if you do "mvn -Pyarn package", it will not.
>>
>> - you need to duplicate the "activation" stuff everywhere where the
>> profile is declared, not just in the root pom. (I spent quite some
>> time yesterday fighting a similar issue...)
>>
>> My suggestion here is to change the activation of scala-2.10 to look like
>> this:
>>
>> 
>>   
>> !scala-2.11
>>   
>> 
>>
>> And change the scala-2.11 profile to do this:
>>
>> 
>>   true
>> 
>>
>> I haven't tested, but in my experience this will activate the
>> scala-2.10 profile by default, unless you explicitly activate the 2.11
>> profile, in which case that property will be set and scala-2.10 will
>> not activate. If you look at examples/pom.xml, that's the same
>> strategy used to choose which hbase profile to activate.
>>
>> Ah, and just to reinforce, the activation logic needs to be copied to
>> other places (e.g. examples/pom.xml, repl/pom.xml, and any other place
>> that has scala-2.x profiles).
>>
>>
>>
>> On Wed, Nov 12, 2014 at 11:14 PM, Patrick Wendell 
>> wrote:
>> > I actually do agree with this - let's see if we can find a solution
>> > that doesn't regress this behavior. Maybe we can simply move the one
>> > kafka example into its own project instead of having it in the
>> > examples project.
>> >
>> > On Wed, Nov 12, 2014 at 11:07 PM, Sandy Ryza 
>> > wrote:
>> >> Currently there are no mandatory profiles required to build Spark.
>> >> I.e.
>> >> "mvn package" just works.  It seems sad that we would need to break
>> >> this.
>> >>
>> >> On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell 
>> >> wrote:
>> >>>
>> >>> I think printing an error that says "-Pscala-2.10 must be enabled" is
>> >>> probably okay. It's a slight regression but it's super obvious to
>> >>> users. That could be a more elegant solution than the somewhat
>> >>> complicated monstrosity I proposed on the JIRA.
>> >>>
>> >>> On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma
>> >>> 
>> >>> wrote:
>> >>> > One thing we can do it is print a helpful error and break. I don't
>> >>> > know
>> >>> > about how this can be done, but since now I can write groovy inside
>> >>> > maven
>> >>> > build so we have more control. (Yay!!)
>> >>> >
>> >>> > Prashant Sharma
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell
>> >>> > 
>> >>> > wrote:
>> >>> >>
>> >>> >> Yeah Sandy and I were chatting about this today and din't realize
>> >>> >> -Pscala-2.10 was mandatory. This is a fairly invasive change, so I
>> >>> >> was
>> >>> >> thinking maybe we could try to remove that. Also if someone doesn't
>> >>> >> give -Pscala-2.10 it fails in a way that is initially silent, which
>> >>> >> is
>> >>> >> bad because most people won't know to do this.
>> >>> >>
>> >>> >> https://issues.apache.org/jira/browse/SPARK-4375
>> >>> >>
>> >>> &

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Patrick Wendell
I actually do agree with this - let's see if we can find a solution
that doesn't regress this behavior. Maybe we can simply move the one
kafka example into its own project instead of having it in the
examples project.

On Wed, Nov 12, 2014 at 11:07 PM, Sandy Ryza  wrote:
> Currently there are no mandatory profiles required to build Spark.  I.e.
> "mvn package" just works.  It seems sad that we would need to break this.
>
> On Wed, Nov 12, 2014 at 10:59 PM, Patrick Wendell 
> wrote:
>>
>> I think printing an error that says "-Pscala-2.10 must be enabled" is
>> probably okay. It's a slight regression but it's super obvious to
>> users. That could be a more elegant solution than the somewhat
>> complicated monstrosity I proposed on the JIRA.
>>
>> On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma 
>> wrote:
>> > One thing we can do it is print a helpful error and break. I don't know
>> > about how this can be done, but since now I can write groovy inside
>> > maven
>> > build so we have more control. (Yay!!)
>> >
>> > Prashant Sharma
>> >
>> >
>> >
>> > On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell 
>> > wrote:
>> >>
>> >> Yeah Sandy and I were chatting about this today and din't realize
>> >> -Pscala-2.10 was mandatory. This is a fairly invasive change, so I was
>> >> thinking maybe we could try to remove that. Also if someone doesn't
>> >> give -Pscala-2.10 it fails in a way that is initially silent, which is
>> >> bad because most people won't know to do this.
>> >>
>> >> https://issues.apache.org/jira/browse/SPARK-4375
>> >>
>> >> On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma
>> >> 
>> >> wrote:
>> >> > Thanks Patrick, I have one suggestion that we should make passing
>> >> > -Pscala-2.10 mandatory for maven users. I am sorry for not mentioning
>> >> > this
>> >> > before. There is no way around not passing that option for maven
>> >> > users(only). However, this is unnecessary for sbt users because it is
>> >> > added
>> >> > automatically if -Pscala-2.11 is absent.
>> >> >
>> >> >
>> >> > Prashant Sharma
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen 
>> >> > wrote:
>> >> >
>> >> >> - Tip: when you rebase, IntelliJ will temporarily think things like
>> >> >> the
>> >> >> Kafka module are being removed. Say 'no' when it asks if you want to
>> >> >> remove
>> >> >> them.
>> >> >> - Can we go straight to Scala 2.11.4?
>> >> >>
>> >> >> On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell
>> >> >> 
>> >> >> wrote:
>> >> >>
>> >> >> > Hey All,
>> >> >> >
>> >> >> > I've just merged a patch that adds support for Scala 2.11 which
>> >> >> > will
>> >> >> > have some minor implications for the build. These are due to the
>> >> >> > complexities of supporting two versions of Scala in a single
>> >> >> > project.
>> >> >> >
>> >> >> > 1. The JDBC server will now require a special flag to build
>> >> >> > -Phive-thriftserver on top of the existing flag -Phive. This is
>> >> >> > because some build permutations (only in Scala 2.11) won't support
>> >> >> > the
>> >> >> > JDBC server yet due to transitive dependency conflicts.
>> >> >> >
>> >> >> > 2. The build now uses non-standard source layouts in a few
>> >> >> > additional
>> >> >> > places (we already did this for the Hive project) - the repl and
>> >> >> > the
>> >> >> > examples modules. This is just fine for maven/sbt, but it may
>> >> >> > affect
>> >> >> > users who import the build in IDE's that are using these projects
>> >> >> > and
>> >> >> > want to build Spark from the IDE. I'm going to update our wiki to
>> >> >> > include full instructions for making this work well in IntelliJ.
>> >> >> >
>> >> >> > If there are any other build related issues please respond to this
>> >> >> > thread and we'll make sure they get sorted out. Thanks to Prashant
>> >> >> > Sharma who is the author of this feature!
>> >> >> >
>> >> >> > - Patrick
>> >> >> >
>> >> >> >
>> >> >> > -
>> >> >> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> >> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >> >> >
>> >> >> >
>> >> >>
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Patrick Wendell
I think printing an error that says "-Pscala-2.10 must be enabled" is
probably okay. It's a slight regression but it's super obvious to
users. That could be a more elegant solution than the somewhat
complicated monstrosity I proposed on the JIRA.

On Wed, Nov 12, 2014 at 10:37 PM, Prashant Sharma  wrote:
> One thing we can do it is print a helpful error and break. I don't know
> about how this can be done, but since now I can write groovy inside maven
> build so we have more control. (Yay!!)
>
> Prashant Sharma
>
>
>
> On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell 
> wrote:
>>
>> Yeah Sandy and I were chatting about this today and din't realize
>> -Pscala-2.10 was mandatory. This is a fairly invasive change, so I was
>> thinking maybe we could try to remove that. Also if someone doesn't
>> give -Pscala-2.10 it fails in a way that is initially silent, which is
>> bad because most people won't know to do this.
>>
>> https://issues.apache.org/jira/browse/SPARK-4375
>>
>> On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma 
>> wrote:
>> > Thanks Patrick, I have one suggestion that we should make passing
>> > -Pscala-2.10 mandatory for maven users. I am sorry for not mentioning
>> > this
>> > before. There is no way around not passing that option for maven
>> > users(only). However, this is unnecessary for sbt users because it is
>> > added
>> > automatically if -Pscala-2.11 is absent.
>> >
>> >
>> > Prashant Sharma
>> >
>> >
>> >
>> > On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen  wrote:
>> >
>> >> - Tip: when you rebase, IntelliJ will temporarily think things like the
>> >> Kafka module are being removed. Say 'no' when it asks if you want to
>> >> remove
>> >> them.
>> >> - Can we go straight to Scala 2.11.4?
>> >>
>> >> On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell 
>> >> wrote:
>> >>
>> >> > Hey All,
>> >> >
>> >> > I've just merged a patch that adds support for Scala 2.11 which will
>> >> > have some minor implications for the build. These are due to the
>> >> > complexities of supporting two versions of Scala in a single project.
>> >> >
>> >> > 1. The JDBC server will now require a special flag to build
>> >> > -Phive-thriftserver on top of the existing flag -Phive. This is
>> >> > because some build permutations (only in Scala 2.11) won't support
>> >> > the
>> >> > JDBC server yet due to transitive dependency conflicts.
>> >> >
>> >> > 2. The build now uses non-standard source layouts in a few additional
>> >> > places (we already did this for the Hive project) - the repl and the
>> >> > examples modules. This is just fine for maven/sbt, but it may affect
>> >> > users who import the build in IDE's that are using these projects and
>> >> > want to build Spark from the IDE. I'm going to update our wiki to
>> >> > include full instructions for making this work well in IntelliJ.
>> >> >
>> >> > If there are any other build related issues please respond to this
>> >> > thread and we'll make sure they get sorted out. Thanks to Prashant
>> >> > Sharma who is the author of this feature!
>> >> >
>> >> > - Patrick
>> >> >
>> >> > -
>> >> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >> >
>> >> >
>> >>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Patrick Wendell
Yeah Sandy and I were chatting about this today and din't realize
-Pscala-2.10 was mandatory. This is a fairly invasive change, so I was
thinking maybe we could try to remove that. Also if someone doesn't
give -Pscala-2.10 it fails in a way that is initially silent, which is
bad because most people won't know to do this.

https://issues.apache.org/jira/browse/SPARK-4375

On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma  wrote:
> Thanks Patrick, I have one suggestion that we should make passing
> -Pscala-2.10 mandatory for maven users. I am sorry for not mentioning this
> before. There is no way around not passing that option for maven
> users(only). However, this is unnecessary for sbt users because it is added
> automatically if -Pscala-2.11 is absent.
>
>
> Prashant Sharma
>
>
>
> On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen  wrote:
>
>> - Tip: when you rebase, IntelliJ will temporarily think things like the
>> Kafka module are being removed. Say 'no' when it asks if you want to remove
>> them.
>> - Can we go straight to Scala 2.11.4?
>>
>> On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell 
>> wrote:
>>
>> > Hey All,
>> >
>> > I've just merged a patch that adds support for Scala 2.11 which will
>> > have some minor implications for the build. These are due to the
>> > complexities of supporting two versions of Scala in a single project.
>> >
>> > 1. The JDBC server will now require a special flag to build
>> > -Phive-thriftserver on top of the existing flag -Phive. This is
>> > because some build permutations (only in Scala 2.11) won't support the
>> > JDBC server yet due to transitive dependency conflicts.
>> >
>> > 2. The build now uses non-standard source layouts in a few additional
>> > places (we already did this for the Hive project) - the repl and the
>> > examples modules. This is just fine for maven/sbt, but it may affect
>> > users who import the build in IDE's that are using these projects and
>> > want to build Spark from the IDE. I'm going to update our wiki to
>> > include full instructions for making this work well in IntelliJ.
>> >
>> > If there are any other build related issues please respond to this
>> > thread and we'll make sure they get sorted out. Thanks to Prashant
>> > Sharma who is the author of this feature!
>> >
>> > - Patrick
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[NOTICE] [BUILD] Minor changes to Spark's build

2014-11-11 Thread Patrick Wendell
Hey All,

I've just merged a patch that adds support for Scala 2.11 which will
have some minor implications for the build. These are due to the
complexities of supporting two versions of Scala in a single project.

1. The JDBC server will now require a special flag to build
-Phive-thriftserver on top of the existing flag -Phive. This is
because some build permutations (only in Scala 2.11) won't support the
JDBC server yet due to transitive dependency conflicts.

2. The build now uses non-standard source layouts in a few additional
places (we already did this for the Hive project) - the repl and the
examples modules. This is just fine for maven/sbt, but it may affect
users who import the build in IDE's that are using these projects and
want to build Spark from the IDE. I'm going to update our wiki to
include full instructions for making this work well in IntelliJ.

If there are any other build related issues please respond to this
thread and we'll make sure they get sorted out. Thanks to Prashant
Sharma who is the author of this feature!

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: JIRA + PR backlog

2014-11-11 Thread Patrick Wendell
I wonder if we should be linking to that dashboard somewhere from our
official docs or the wiki...

On Tue, Nov 11, 2014 at 12:23 PM, Nicholas Chammas
 wrote:
> Yeah, kudos to Josh for putting that together.
>
> On Tue, Nov 11, 2014 at 3:26 AM, Yu Ishikawa 
> wrote:
>
>> Great jobs!
>> I didn't know "Spark PR Dashboard."
>>
>> Thanks
>> Yu Ishikawa
>>
>>
>>
>> -
>> -- Yu Ishikawa
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/JIRA-PR-backlog-tp9157p9282.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: getting exception when trying to build spark from master

2014-11-10 Thread Patrick Wendell
I reverted that patch to see if it fixes it.

On Mon, Nov 10, 2014 at 1:45 PM, Josh Rosen  wrote:
> It looks like the Jenkins maven builds are broken, too.  Based on the
> Jenkins logs, I think that this pull request may have broken things
> (although I'm not sure why):
>
> https://github.com/apache/spark/pull/3030#issuecomment-62436181
>
> On Mon, Nov 10, 2014 at 1:42 PM, Sadhan Sood  wrote:
>
>> Getting an exception while trying to build spark in spark-core:
>>
>> [ERROR]
>>
>>  while compiling:
>>
>> /Users/dev/tellapart_spark/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala
>>
>> during phase: typer
>>
>>  library version: version 2.10.4
>>
>> compiler version: version 2.10.4
>>
>>   reconstructed args: -deprecation -feature -classpath
>>
>>
>>   last tree to typer: Ident(enumDispatcher)
>>
>>   symbol: value enumDispatcher (flags: )
>>
>>symbol definition: val enumDispatcher:
>> java.util.EnumSet[javax.servlet.DispatcherType]
>>
>>  tpe: java.util.EnumSet[javax.servlet.DispatcherType]
>>
>>symbol owners: value enumDispatcher -> value $anonfun -> method
>> addFilters -> object JettyUtils -> package ui
>>
>>   context owners: value $anonfun -> value $anonfun -> method addFilters
>> -> object JettyUtils -> package ui
>>
>>
>> == Enclosing template or block ==
>>
>>
>> Block(
>>
>>   ValDef( // val filters: Array[String]
>>
>> 
>>
>> "filters"
>>
>> AppliedTypeTree(
>>
>>   "Array"
>>
>>   "String"
>>
>> )
>>
>> Apply(
>>
>>   conf.get("spark.ui.filters", "").split(',')."map"
>>
>>   Function( // val $anonfun: , tree.tpe=String => String
>>
>> ValDef( // x$1: String
>>
>> 
>>
>>   "x$1"
>>
>>// tree.tpe=String
>>
>>   
>>
>> )
>>
>> Apply( // def trim(): String in class String, tree.tpe=String
>>
>>   "x$1"."trim" // def trim(): String in class String,
>> tree.tpe=()String
>>
>>   Nil
>>
>> )
>>
>>   )
>>
>> )
>>
>>   )
>>
>>   Apply(
>>
>> "filters"."foreach"
>>
>> Match(
>>
>>   
>>
>>   CaseDef(
>>
>> Bind( // val filter: String
>>
>>   "filter"
>>
>>   Typed(
>>
>> "_" // tree.tpe=String
>>
>> "String"
>>
>>   )
>>
>> )
>>
>> If(
>>
>>   "filter"."isEmpty"."unary_$bang"
>>
>>   Block(
>>
>> // 7 statements
>>
>> Apply(
>>
>>   "logInfo"
>>
>>   Apply( // final def +(x$1: Any): String in class String,
>> tree.tpe=String
>>
>> "Adding filter: "."$plus" // final def +(x$1: Any): String
>> in class String, tree.tpe=(x$1: Any)String
>>
>> "filter" // val filter: String, tree.tpe=String
>>
>>   )
>>
>> )
>>
>> ValDef( // val holder: org.eclipse.jetty.servlet.FilterHolder
>>
>>   
>>
>>   "holder"
>>
>>   "FilterHolder"
>>
>>   Apply(
>>
>> new FilterHolder.""
>>
>> Nil
>>
>>   )
>>
>> )
>>
>> Apply( // def setClassName(x$1: String): Unit in class Holder,
>> tree.tpe=Unit
>>
>>   "holder"."setClassName" // def setClassName(x$1: String):
>> Unit in class Holder, tree.tpe=(x$1: String)Unit
>>
>>   "filter" // val filter: String, tree.tpe=String
>>
>> )
>>
>> Apply(
>>
>>   conf.get("spark.".+(filter).+(".params"),
>> "").split(',').map(((x$2: String) => x$2.trim()))."toSet"."foreach"
>>
>>   Function( // val $anonfun: 
>>
>> ValDef( // param: String
>>
>>
>>
>>   "param"
>>
>>   "String"
>>
>>   
>>
>> )
>>
>> If(
>>
>>   "param"."isEmpty"."unary_$bang"
>>
>>   Block(
>>
>> ValDef( // val parts: Array[String]
>>
>>   
>>
>>   "parts"
>>
>>// tree.tpe=Array[String]
>>
>>   Apply( // def split(x$1: String): Array[String] in
>> class String, tree.tpe=Array[String]
>>
>> "param"."split" // def split(x$1: String):
>> Array[String] in class String, tree.tpe=(x$1: String)Array[String]
>>
>> "="
>>
>>   )
>>
>> )
>>
>> If(
>>
>>   Apply( // def ==(x: Int): Boolean in class Int,
>> tree.tpe=Boolean
>>
>> "parts"."length"."$eq$eq" // def ==(x: Int):
>> Boolean in class Int, tree.tpe=(x: Int)Boolean
>>
>> 2
>>
>>   )
>>
>>   Apply( // def setInitParameter(x$1: String,x$2:
>> String): Unit in class Holder
>>
>> "holder"."setInitParameter" 

Re: Should new YARN shuffle service work with "yarn-alpha"?

2014-11-08 Thread Patrick Wendell
Great - I think that should work, but if there are any issues we can
definitely fix them up.

On Sat, Nov 8, 2014 at 12:47 AM, Sean Owen  wrote:
> Oops, that was my mistake. I moved network/shuffle into yarn, when
> it's just that network/yarn should be removed from yarn-alpha. That
> makes yarn-alpha work. I'll run tests and open a quick JIRA / PR for
> the change.
>
> On Sat, Nov 8, 2014 at 8:23 AM, Patrick Wendell  wrote:
>> This second error is something else. Maybe you are excluding
>> network-shuffle instead of spark-network-yarn?

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Should new YARN shuffle service work with "yarn-alpha"?

2014-11-08 Thread Patrick Wendell
I think you might be conflating two things. The first error you posted
was because YARN didn't standardize the shuffle API in alpha versions
so our spark-network-yarn module won't compile. We should just disable
that module if yarn alpha is used. spark-network-yarn is a leaf in the
intra-module dependency graph, and core doesn't depend on it.

This second error is something else. Maybe you are excluding
network-shuffle instead of spark-network-yarn?



On Fri, Nov 7, 2014 at 11:50 PM, Sean Owen  wrote:
> Hm. Problem is, core depends directly on it:
>
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SecurityManager.scala:25:
> object sasl is not a member of package org.apache.spark.network
> [error] import org.apache.spark.network.sasl.SecretKeyHolder
> [error] ^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SecurityManager.scala:147:
> not found: type SecretKeyHolder
> [error] private[spark] class SecurityManager(sparkConf: SparkConf)
> extends Logging with SecretKeyHolder {
> [error]
>  ^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala:29:
> object RetryingBlockFetcher is not a member of package
> org.apache.spark.network.shuffle
> [error] import org.apache.spark.network.shuffle.{RetryingBlockFetcher,
> BlockFetchingListener, OneForOneBlockFetcher}
> [error]^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/deploy/worker/StandaloneWorkerShuffleService.scala:23:
> object sasl is not a member of package org.apache.spark.network
> [error] import org.apache.spark.network.sasl.SaslRpcHandler
> [error]
>
> ...
>
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:124:
> too many arguments for constructor ExternalShuffleClient: (x$1:
> org.apache.spark.network.util.TransportConf, x$2:
> String)org.apache.spark.network.shuffle.ExternalShuffleClient
> [error] new
> ExternalShuffleClient(SparkTransportConf.fromSparkConf(conf),
> securityManager,
> [error] ^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:39:
> object protocol is not a member of package
> org.apache.spark.network.shuffle
> [error] import org.apache.spark.network.shuffle.protocol.ExecutorShuffleInfo
> [error] ^
> [error] 
> /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/storage/BlockManager.scala:214:
> not found: type ExecutorShuffleInfo
> [error] val shuffleConfig = new ExecutorShuffleInfo(
> [error]
> ...
>
>
> More refactoring needed? Either to support YARN alpha as a separate
> shuffle module, or sever this dependency?
>
> Of course this goes away when yarn-alpha goes away too.
>
>
> On Sat, Nov 8, 2014 at 7:45 AM, Patrick Wendell  wrote:
>> I bet it doesn't work. +1 on isolating it's inclusion to only the
>> newer YARN API's.
>>
>> - Patrick
>>
>> On Fri, Nov 7, 2014 at 11:43 PM, Sean Owen  wrote:
>>> I noticed that this doesn't compile:
>>>
>>> mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean 
>>> package
>>>
>>> [error] warning: [options] bootstrap class path not set in conjunction
>>> with -source 1.6
>>> [error] 
>>> /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:26:
>>> error: cannot find symbol
>>> [error] import org.apache.hadoop.yarn.server.api.AuxiliaryService;
>>> [error] ^
>>> [error]   symbol:   class AuxiliaryService
>>> [error]   location: package org.apache.hadoop.yarn.server.api
>>> [error] 
>>> /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:27:
>>> error: cannot find symbol
>>> [error] import 
>>> org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
>>> [error] ^
>>> ...
>>>
>>> Should it work? if not shall I propose to enable the service only with 
>>> -Pyarn?
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Should new YARN shuffle service work with "yarn-alpha"?

2014-11-07 Thread Patrick Wendell
I bet it doesn't work. +1 on isolating it's inclusion to only the
newer YARN API's.

- Patrick

On Fri, Nov 7, 2014 at 11:43 PM, Sean Owen  wrote:
> I noticed that this doesn't compile:
>
> mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean 
> package
>
> [error] warning: [options] bootstrap class path not set in conjunction
> with -source 1.6
> [error] 
> /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:26:
> error: cannot find symbol
> [error] import org.apache.hadoop.yarn.server.api.AuxiliaryService;
> [error] ^
> [error]   symbol:   class AuxiliaryService
> [error]   location: package org.apache.hadoop.yarn.server.api
> [error] 
> /Users/srowen/Documents/spark/network/yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:27:
> error: cannot find symbol
> [error] import 
> org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
> [error] ^
> ...
>
> Should it work? if not shall I propose to enable the service only with -Pyarn?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Patrick Wendell
In fact, if you look at the subversion commiter list, the majority of
people here have commit access only for particular areas of the
project:

http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS

On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell  wrote:
> Hey Greg,
>
> Regarding subversion - I think the reference is to partial vs full
> committers here:
> https://subversion.apache.org/docs/community-guide/roles.html
>
> - Patrick
>
> On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein  wrote:
>> -1 (non-binding)
>>
>> This is an idea that runs COMPLETELY counter to the Apache Way, and is
>> to be severely frowned up. This creates *unequal* ownership of the
>> codebase.
>>
>> Each Member of the PMC should have *equal* rights to all areas of the
>> codebase until their purview. It should not be subjected to others'
>> "ownership" except throught the standard mechanisms of reviews and
>> if/when absolutely necessary, to vetos.
>>
>> Apache does not want "leads", "benevolent dictators" or "assigned
>> maintainers", no matter how you may dress it up with multiple
>> maintainers per component. The fact is that this creates an unequal
>> level of ownership and responsibility. The Board has shut down
>> projects that attempted or allowed for "Leads". Just a few months ago,
>> there was a problem with somebody calling themself a "Lead".
>>
>> I don't know why you suggest that Apache Subversion does this. We
>> absolutely do not. Never have. Never will. The Subversion codebase is
>> owned by all of us, and we all care for every line of it. Some people
>> know more than others, of course. But any one of us, can change any
>> part, without being subjected to a "maintainer". Of course, we ask
>> people with more knowledge of the component when we feel
>> uncomfortable, but we also know when it is safe or not to make a
>> specific change. And *always*, our fellow committers can review our
>> work and let us know when we've done something wrong.
>>
>> Equal ownership reduces fiefdoms, enhances a feeling of community and
>> project ownership, and creates a more open and inviting project.
>>
>> So again: -1 on this entire concept. Not good, to be polite.
>>
>> Regards,
>> Greg Stein
>> Director, Vice Chairman
>> Apache Software Foundation
>>
>> On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
>>> Hi all,
>>>
>>> I wanted to share a discussion we've been having on the PMC list, as well 
>>> as call for an official vote on it on a public list. Basically, as the 
>>> Spark project scales up, we need to define a model to make sure there is 
>>> still great oversight of key components (in particular internal 
>>> architecture and public APIs), and to this end I've proposed implementing a 
>>> maintainer model for some of these components, similar to other large 
>>> projects.
>>>
>>> As background on this, Spark has grown a lot since joining Apache. We've 
>>> had over 80 contributors/month for the past 3 months, which I believe makes 
>>> us the most active project in contributors/month at Apache, as well as over 
>>> 500 patches/month. The codebase has also grown significantly, with new 
>>> libraries for SQL, ML, graphs and more.
>>>
>>> In this kind of large project, one common way to scale development is to 
>>> assign "maintainers" to oversee key components, where each patch to that 
>>> component needs to get sign-off from at least one of its maintainers. Most 
>>> existing large projects do this -- at Apache, some large ones with this 
>>> model are CloudStack (the second-most active project overall), Subversion, 
>>> and Kafka, and other examples include Linux and Python. This is also 
>>> by-and-large how Spark operates today -- most components have a de-facto 
>>> maintainer.
>>>
>>> IMO, adopting this model would have two benefits:
>>>
>>> 1) Consistent oversight of design for that component, especially regarding 
>>> architecture and API. This process would ensure that the component's 
>>> maintainers see all proposed changes and consider them to fit together in a 
>>> good way.
>>>
>>> 2) More structure for new contributors and committers -- in particular, it 
>>> would be easy to look up who's responsible for each module and ask them for 
>>> reviews, etc, rather than having patches slip between the cracks.

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Patrick Wendell
Hey Greg,

Regarding subversion - I think the reference is to partial vs full
committers here:
https://subversion.apache.org/docs/community-guide/roles.html

- Patrick

On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein  wrote:
> -1 (non-binding)
>
> This is an idea that runs COMPLETELY counter to the Apache Way, and is
> to be severely frowned up. This creates *unequal* ownership of the
> codebase.
>
> Each Member of the PMC should have *equal* rights to all areas of the
> codebase until their purview. It should not be subjected to others'
> "ownership" except throught the standard mechanisms of reviews and
> if/when absolutely necessary, to vetos.
>
> Apache does not want "leads", "benevolent dictators" or "assigned
> maintainers", no matter how you may dress it up with multiple
> maintainers per component. The fact is that this creates an unequal
> level of ownership and responsibility. The Board has shut down
> projects that attempted or allowed for "Leads". Just a few months ago,
> there was a problem with somebody calling themself a "Lead".
>
> I don't know why you suggest that Apache Subversion does this. We
> absolutely do not. Never have. Never will. The Subversion codebase is
> owned by all of us, and we all care for every line of it. Some people
> know more than others, of course. But any one of us, can change any
> part, without being subjected to a "maintainer". Of course, we ask
> people with more knowledge of the component when we feel
> uncomfortable, but we also know when it is safe or not to make a
> specific change. And *always*, our fellow committers can review our
> work and let us know when we've done something wrong.
>
> Equal ownership reduces fiefdoms, enhances a feeling of community and
> project ownership, and creates a more open and inviting project.
>
> So again: -1 on this entire concept. Not good, to be polite.
>
> Regards,
> Greg Stein
> Director, Vice Chairman
> Apache Software Foundation
>
> On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
>> Hi all,
>>
>> I wanted to share a discussion we've been having on the PMC list, as well as 
>> call for an official vote on it on a public list. Basically, as the Spark 
>> project scales up, we need to define a model to make sure there is still 
>> great oversight of key components (in particular internal architecture and 
>> public APIs), and to this end I've proposed implementing a maintainer model 
>> for some of these components, similar to other large projects.
>>
>> As background on this, Spark has grown a lot since joining Apache. We've had 
>> over 80 contributors/month for the past 3 months, which I believe makes us 
>> the most active project in contributors/month at Apache, as well as over 500 
>> patches/month. The codebase has also grown significantly, with new libraries 
>> for SQL, ML, graphs and more.
>>
>> In this kind of large project, one common way to scale development is to 
>> assign "maintainers" to oversee key components, where each patch to that 
>> component needs to get sign-off from at least one of its maintainers. Most 
>> existing large projects do this -- at Apache, some large ones with this 
>> model are CloudStack (the second-most active project overall), Subversion, 
>> and Kafka, and other examples include Linux and Python. This is also 
>> by-and-large how Spark operates today -- most components have a de-facto 
>> maintainer.
>>
>> IMO, adopting this model would have two benefits:
>>
>> 1) Consistent oversight of design for that component, especially regarding 
>> architecture and API. This process would ensure that the component's 
>> maintainers see all proposed changes and consider them to fit together in a 
>> good way.
>>
>> 2) More structure for new contributors and committers -- in particular, it 
>> would be easy to look up who's responsible for each module and ask them for 
>> reviews, etc, rather than having patches slip between the cracks.
>>
>> We'd like to start with in a light-weight manner, where the model only 
>> applies to certain key components (e.g. scheduler, shuffle) and user-facing 
>> APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it 
>> if we deem it useful. The specific mechanics would be as follows:
>>
>> - Some components in Spark will have maintainers assigned to them, where one 
>> of the maintainers needs to sign off on each patch to the component.
>> - Each component with maintainers will have at least 2 maintainers.
>> - Maintainers will be assigned from the most active and knowledgeable 
>> committers on that component by the PMC. The PMC can vote to add / remove 
>> maintainers, and maintained components, through consensus.
>> - Maintainers are expected to be active in responding to patches for their 
>> components, though they do not need to be the main reviewers for them (e.g. 
>> they might just sign off on architecture / API). To prevent inactive 
>> maintainers from blocking the project, if a maintainer isn't responding in a 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Patrick Wendell
I think new committers might or might not be maintainers (it would
depend on the PMC vote). I don't think it would affect what you could
merge, you can merge in any part of the source tree, you just need to
get sign off if you want to touch a public API or make major
architectural changes. Most projects already require code review from
other committers before you commit something, so it's just a version
of that where you have specific people appointed to specific
components for review.

If you look, most large software projects have a maintainer model,
both in Apache and outside of it. Cloudstack is probably the best
example in Apache since they are the second most active project
(roughly) after Spark. They have two levels of maintainers and much
strong language - their language: "In general, maintainers only have
commit rights on the module for which they are responsible.".

I'd like us to start with something simpler and lightweight as
proposed here. Really the proposal on the table is just to codify the
current de-facto process to make sure we stick by it as we scale. If
we want to add more formality to it or strictness, we can do it later.

- Patrick

On Thu, Nov 6, 2014 at 3:29 PM, Hari Shreedharan
 wrote:
> How would this model work with a new committer who gets voted in? Does it 
> mean that a new committer would be a maintainer for at least one area -- else 
> we could end up having committers who really can't merge anything significant 
> until he becomes a maintainer.
>
>
> Thanks,
> Hari
>
> On Thu, Nov 6, 2014 at 3:00 PM, Matei Zaharia 
> wrote:
>
>> I think you're misunderstanding the idea of "process" here. The point of 
>> process is to make sure something happens automatically, which is useful to 
>> ensure a certain level of quality. For example, all our patches go through 
>> Jenkins, and nobody will make the mistake of merging them if they fail 
>> tests, or RAT checks, or API compatibility checks. The idea is to get the 
>> same kind of automation for design on these components. This is a very 
>> common process for large software projects, and it's essentially what we had 
>> already, but formalizing it will make clear that this is the process we 
>> want. It's important to do it early in order to be able to refine the 
>> process as the project grows.
>> In terms of scope, again, the maintainers are *not* going to be the only 
>> reviewers for that component, they are just a second level of sign-off 
>> required for architecture and API. Being a maintainer is also not a 
>> "promotion", it's a responsibility. Since we don't have much experience yet 
>> with this model, I didn't propose automatic rules beyond that the PMC can 
>> add / remove maintainers -- presumably the PMC is in the best position to 
>> know what the project needs. I think automatic rules are exactly the kind of 
>> "process" you're arguing against. The "process" here is about ensuring 
>> certain checks are made for every code change, not about automating 
>> personnel and development decisions.
>> In any case, I appreciate your input on this, and we're going to evaluate 
>> the model to see how it goes. It might be that we decide we don't want it at 
>> all. However, from what I've seen of other projects (not Hadoop but projects 
>> with an order of magnitude more contributors, like Python or Linux), this is 
>> one of the best ways to have consistently great releases with a large 
>> contributor base and little room for error. With all due respect to what 
>> Hadoop's accomplished, I wouldn't use Hadoop as the best example to strive 
>> for; in my experience there I've seen patches reverted because of 
>> architectural disagreements, new APIs released and abandoned, and generally 
>> an experience that's been painful for users. A lot of the decisions we've 
>> made in Spark (e.g. time-based release cycle, built-in libraries, API 
>> stability rules, etc) were based on lessons learned there, in an attempt to 
>> define a better model.
>> Matei
>>> On Nov 6, 2014, at 2:18 PM, bc Wong  wrote:
>>>
>>> On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia >> > wrote:
>>> 
>>> Ultimately, the core motivation is that the project has grown to the point 
>>> where it's hard to expect every committer to have full understanding of 
>>> every component. Some committers know a ton about systems but little about 
>>> machine learning, some are algorithmic whizzes but may not realize the 
>>> implications of changing something on the Python API, etc. This is just a 
>>> way to make sure that a domain expert has looked at the areas where it is 
>>> most likely for something to go wrong.
>>>
>>> Hi Matei,
>>>
>>> I understand where you're coming from. My suggestion is to solve this 
>>> without adding a new process. In the example above, those "algo whizzes" 
>>> committers should realize that they're touching the Python API, and loop in 
>>> some Python maintainers. Those Python maintainers would then re

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Patrick Wendell
I'm a +1 on this as well, I think it will be a useful model as we
scale the project in the future and recognizes some informal process
we have now.

To respond to Sandy's comment: for changes that fall in between the
component boundaries or are straightforward, my understanding of this
model is you wouldn't need an explicit sign off. I think this is why
unlike some other projects, we wouldn't e.g. lock down permissions to
portions of the source tree. If some obvious fix needs to go in,
people should just merge it.

- Patrick

On Wed, Nov 5, 2014 at 5:57 PM, Sandy Ryza  wrote:
> This seems like a good idea.
>
> An area that wasn't listed, but that I think could strongly benefit from
> maintainers, is the build.  Having consistent oversight over Maven, SBT,
> and dependencies would allow us to avoid subtle breakages.
>
> Component maintainers have come up several times within the Hadoop project,
> and I think one of the main reasons the proposals have been rejected is
> that, structurally, its effect is to slow down development.  As you
> mention, this is somewhat mitigated if being a maintainer leads committers
> to take on more responsibility, but it might be worthwhile to draw up more
> specific ideas on how to combat this?  E.g. do obvious changes, doc fixes,
> test fixes, etc. always require a maintainer?
>
> -Sandy
>
> On Wed, Nov 5, 2014 at 5:36 PM, Michael Armbrust 
> wrote:
>
>> +1 (binding)
>>
>> On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia 
>> wrote:
>>
>> > BTW, my own vote is obviously +1 (binding).
>> >
>> > Matei
>> >
>> > > On Nov 5, 2014, at 5:31 PM, Matei Zaharia 
>> > wrote:
>> > >
>> > > Hi all,
>> > >
>> > > I wanted to share a discussion we've been having on the PMC list, as
>> > well as call for an official vote on it on a public list. Basically, as
>> the
>> > Spark project scales up, we need to define a model to make sure there is
>> > still great oversight of key components (in particular internal
>> > architecture and public APIs), and to this end I've proposed
>> implementing a
>> > maintainer model for some of these components, similar to other large
>> > projects.
>> > >
>> > > As background on this, Spark has grown a lot since joining Apache.
>> We've
>> > had over 80 contributors/month for the past 3 months, which I believe
>> makes
>> > us the most active project in contributors/month at Apache, as well as
>> over
>> > 500 patches/month. The codebase has also grown significantly, with new
>> > libraries for SQL, ML, graphs and more.
>> > >
>> > > In this kind of large project, one common way to scale development is
>> to
>> > assign "maintainers" to oversee key components, where each patch to that
>> > component needs to get sign-off from at least one of its maintainers.
>> Most
>> > existing large projects do this -- at Apache, some large ones with this
>> > model are CloudStack (the second-most active project overall),
>> Subversion,
>> > and Kafka, and other examples include Linux and Python. This is also
>> > by-and-large how Spark operates today -- most components have a de-facto
>> > maintainer.
>> > >
>> > > IMO, adopting this model would have two benefits:
>> > >
>> > > 1) Consistent oversight of design for that component, especially
>> > regarding architecture and API. This process would ensure that the
>> > component's maintainers see all proposed changes and consider them to fit
>> > together in a good way.
>> > >
>> > > 2) More structure for new contributors and committers -- in particular,
>> > it would be easy to look up who's responsible for each module and ask
>> them
>> > for reviews, etc, rather than having patches slip between the cracks.
>> > >
>> > > We'd like to start with in a light-weight manner, where the model only
>> > applies to certain key components (e.g. scheduler, shuffle) and
>> user-facing
>> > APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
>> > it if we deem it useful. The specific mechanics would be as follows:
>> > >
>> > > - Some components in Spark will have maintainers assigned to them,
>> where
>> > one of the maintainers needs to sign off on each patch to the component.
>> > > - Each component with maintainers will have at least 2 maintainers.
>> > > - Maintainers will be assigned from the most active and knowledgeable
>> > committers on that component by the PMC. The PMC can vote to add / remove
>> > maintainers, and maintained components, through consensus.
>> > > - Maintainers are expected to be active in responding to patches for
>> > their components, though they do not need to be the main reviewers for
>> them
>> > (e.g. they might just sign off on architecture / API). To prevent
>> inactive
>> > maintainers from blocking the project, if a maintainer isn't responding
>> in
>> > a reasonable time period (say 2 weeks), other committers can merge the
>> > patch, and the PMC will want to discuss adding another maintainer.
>> > >
>> > > If you'd like to see examples for this model, check out the following

branch-1.2 has been cut

2014-11-03 Thread Patrick Wendell
Hi All,

I've just cut the release branch for Spark 1.2, consistent with then
end of the scheduled feature window for the release. New commits to
master will need to be explicitly merged into branch-1.2 in order to
be in the release.

This begins the transition into a QA period for Spark 1.2, with a
focus on testing and fixes. A few smaller features may still go in as
folks wrap up loose ends in the next 48 hours (or for developments in
alpha components).

To help with QA, I'll try to package up a SNAPSHOT release soon for
community testing; this worked well when testing Spark 1.1 before
official votes started. I might give it a few days to allow committers
to merge in back-logged fixes and other patches that were punted to
after the feature freeze.

Thanks to everyone who helped author and review patches over the last few weeks!

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: sbt scala compiler crashes on spark-sql

2014-11-02 Thread Patrick Wendell
By the way - we can report issues to the Scala/Typesafe team if we
have a way to reproduce this. I just haven't found a reliable
reproduction yet.

- Patrick

On Sun, Nov 2, 2014 at 7:48 PM, Stephen Boesch  wrote:
> Yes I have seen this same error - and for team members as well - repeatedly
> since June. A Patrick and Cheng mentioned, the next step is to do an sbt
> clean
>
> 2014-11-02 19:37 GMT-08:00 Cheng Lian :
>
>> I often see this when I first build the whole Spark project with SBT, then
>> modify some code and tries to build and debug within IDEA, or vice versa.
>> A
>> clean rebuild can always solve this.
>>
>> On Mon, Nov 3, 2014 at 11:28 AM, Patrick Wendell 
>> wrote:
>>
>> > Does this happen if you clean and recompile? I've seen failures on and
>> > off, but haven't been able to find one that I could reproduce from a
>> > clean build such that we could hand it to the scala team.
>> >
>> > - Patrick
>> >
>> > On Sun, Nov 2, 2014 at 7:25 PM, Imran Rashid 
>> > wrote:
>> > > I'm finding the scala compiler crashes when I compile the spark-sql
>> > project
>> > > in sbt.  This happens in both the 1.1 branch and master (full error
>> > > below).  The other projects build fine in sbt, and everything builds
>> > > fine
>> > > in maven.  is there some sbt option I'm forgetting?  Any one else
>> > > experiencing this?
>> > >
>> > > Also, are there up-to-date instructions on how to do common dev tasks
>> > > in
>> > > both sbt & maven?  I have only found these instructions on building
>> > > with
>> > > maven:
>> > >
>> > > http://spark.apache.org/docs/latest/building-with-maven.html
>> > >
>> > > and some general info here:
>> > >
>> > >
>> > > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>> > >
>> > > but I think this doesn't walk through a lot of the steps of a typical
>> > > dev
>> > > cycle, eg, continuous compilation, running one test, running one main
>> > > class, etc.  (especially since it seems like people still favor sbt
>> > > for
>> > > dev.)  If it doesn't already exist somewhere, I could try to put
>> > together a
>> > > brief doc for how to do the basics.  (I'm returning to spark dev after
>> > > a
>> > > little hiatus myself, and I'm hitting some stumbling blocks that are
>> > > probably common knowledge to everyone still dealing with it all the
>> > time.)
>> > >
>> > > thanks,
>> > > Imran
>> > >
>> > > --
>> > > full crash info from sbt:
>> > >
>> > >> project sql
>> > > [info] Set current project to spark-sql (in build
>> > > file:/Users/imran/spark/spark/)
>> > >> compile
>> > > [info] Compiling 62 Scala sources to
>> > > /Users/imran/spark/spark/sql/catalyst/target/scala-2.10/classes...
>> > > [info] Compiling 45 Scala sources and 39 Java sources to
>> > > /Users/imran/spark/spark/sql/core/target/scala-2.10/classes...
>> > > [error]
>> > > [error]  while compiling:
>> > >
>> >
>> > /Users/imran/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala
>> > > [error] during phase: jvm
>> > > [error]  library version: version 2.10.4
>> > > [error] compiler version: version 2.10.4
>> > > [error]   reconstructed args: -classpath
>> > >
>> >
>> > /Users/imran/spark/spark/sql/core/target/scala-2.10/classes:/Users/imran/spark/spark/core/target/scala-2.10/classes:/Users/imran/spark/spark/sql/catalyst/target/scala-2.10/classes:/Users/imran/spark/spark/lib_managed/jars/hadoop-client-1.0.4.jar:/Users/imran/spark/spark/lib_managed/jars/hadoop-core-1.0.4.jar:/Users/imran/spark/spark/lib_managed/jars/xmlenc-0.52.jar:/Users/imran/spark/spark/lib_managed/jars/commons-math-2.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-configuration-1.6.jar:/Users/imran/spark/spark/lib_managed/jars/commons-collections-3.2.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-lang-2.4.jar:/Users/imran/spark/spark/lib_managed/jars/commons-logging-1.1.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-digester-1.8.jar:/Users/imran/spark/spark/lib_managed/jars/commons-beanutils-1.7.0.jar:/Users/imran/spark/spark/li

Re: sbt scala compiler crashes on spark-sql

2014-11-02 Thread Patrick Wendell
Does this happen if you clean and recompile? I've seen failures on and
off, but haven't been able to find one that I could reproduce from a
clean build such that we could hand it to the scala team.

- Patrick

On Sun, Nov 2, 2014 at 7:25 PM, Imran Rashid  wrote:
> I'm finding the scala compiler crashes when I compile the spark-sql project
> in sbt.  This happens in both the 1.1 branch and master (full error
> below).  The other projects build fine in sbt, and everything builds fine
> in maven.  is there some sbt option I'm forgetting?  Any one else
> experiencing this?
>
> Also, are there up-to-date instructions on how to do common dev tasks in
> both sbt & maven?  I have only found these instructions on building with
> maven:
>
> http://spark.apache.org/docs/latest/building-with-maven.html
>
> and some general info here:
>
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>
> but I think this doesn't walk through a lot of the steps of a typical dev
> cycle, eg, continuous compilation, running one test, running one main
> class, etc.  (especially since it seems like people still favor sbt for
> dev.)  If it doesn't already exist somewhere, I could try to put together a
> brief doc for how to do the basics.  (I'm returning to spark dev after a
> little hiatus myself, and I'm hitting some stumbling blocks that are
> probably common knowledge to everyone still dealing with it all the time.)
>
> thanks,
> Imran
>
> --
> full crash info from sbt:
>
>> project sql
> [info] Set current project to spark-sql (in build
> file:/Users/imran/spark/spark/)
>> compile
> [info] Compiling 62 Scala sources to
> /Users/imran/spark/spark/sql/catalyst/target/scala-2.10/classes...
> [info] Compiling 45 Scala sources and 39 Java sources to
> /Users/imran/spark/spark/sql/core/target/scala-2.10/classes...
> [error]
> [error]  while compiling:
> /Users/imran/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala
> [error] during phase: jvm
> [error]  library version: version 2.10.4
> [error] compiler version: version 2.10.4
> [error]   reconstructed args: -classpath
> /Users/imran/spark/spark/sql/core/target/scala-2.10/classes:/Users/imran/spark/spark/core/target/scala-2.10/classes:/Users/imran/spark/spark/sql/catalyst/target/scala-2.10/classes:/Users/imran/spark/spark/lib_managed/jars/hadoop-client-1.0.4.jar:/Users/imran/spark/spark/lib_managed/jars/hadoop-core-1.0.4.jar:/Users/imran/spark/spark/lib_managed/jars/xmlenc-0.52.jar:/Users/imran/spark/spark/lib_managed/jars/commons-math-2.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-configuration-1.6.jar:/Users/imran/spark/spark/lib_managed/jars/commons-collections-3.2.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-lang-2.4.jar:/Users/imran/spark/spark/lib_managed/jars/commons-logging-1.1.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-digester-1.8.jar:/Users/imran/spark/spark/lib_managed/jars/commons-beanutils-1.7.0.jar:/Users/imran/spark/spark/lib_managed/jars/commons-beanutils-core-1.8.0.jar:/Users/imran/spark/spark/lib_managed/jars/commons-net-2.2.jar:/Users/imran/spark/spark/lib_managed/jars/commons-el-1.0.jar:/Users/imran/spark/spark/lib_managed/jars/hsqldb-1.8.0.10.jar:/Users/imran/spark/spark/lib_managed/jars/oro-2.0.8.jar:/Users/imran/spark/spark/lib_managed/jars/jets3t-0.7.1.jar:/Users/imran/spark/spark/lib_managed/jars/commons-httpclient-3.1.jar:/Users/imran/spark/spark/lib_managed/bundles/curator-recipes-2.4.0.jar:/Users/imran/spark/spark/lib_managed/bundles/curator-framework-2.4.0.jar:/Users/imran/spark/spark/lib_managed/bundles/curator-client-2.4.0.jar:/Users/imran/spark/spark/lib_managed/jars/zookeeper-3.4.5.jar:/Users/imran/spark/spark/lib_managed/jars/slf4j-log4j12-1.7.5.jar:/Users/imran/spark/spark/lib_managed/bundles/log4j-1.2.17.jar:/Users/imran/spark/spark/lib_managed/jars/jline-0.9.94.jar:/Users/imran/spark/spark/lib_managed/bundles/guava-14.0.1.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-plus-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/orbits/javax.transaction-1.1.1.v201105210645.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-webapp-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-xml-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-util-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-servlet-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-security-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-server-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/orbits/javax.servlet-3.0.0.v201112011016.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-continuation-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-http-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-io-8.1.14.v20131031.jar:/Users/imran/spark/spark/lib_managed/jars/jetty-jndi-8.1.14.v20131031.jar

Changes to Spark's networking subsystem

2014-11-01 Thread Patrick Wendell
== Short version ==
A recent commit replaces Spark's networking subsystem with one based
on Netty rather than raw sockets. Users running off of master can
disable this change by setting
"spark.shuffle.blockTransferService=nio". We will be testing with this
during the QA period for Spark 1.2. The new implementation is designed
to increase stability and decrease GC pressure during shuffles.

== Long version ==
For those who haven't been following the associated PR's and JIRA's:

We recently merged PR #2753 which creates a "network" package which
does not depend on Spark core. #2753 introduces a Netty-based
BlockTransferService to replace the NIO-based ConnectionManager, used
for transferring shuffle and RDD cache blocks between Executors (in
other words, the transport layer of the BlockManager).

The new BlockTransferService is intended to provide increased
stability, decreased maintenance burden, and decreased garbage
collection. By relying on Netty to take care of the low-level
networking, the actual transfer code is simpler and easier to verify.
By making use of ByteBuf pooling, we can lower both memory usage and
memory churn by reusing buffers. This was actually a critical
component of the petasort benchmark, where the code originated from.

While building this component, we realized it was a good opportunity
to extract out the core transport functionality from Spark so we could
reuse it for SPARK-3796, which calls for an external service which can
serve Spark shuffle files. Thus, we created the "network/common"
package, containing the functionality for setting up a simple control
plane and an efficient data plane over a network. This part is
functionally independent from Spark and is in fact written in Java to
further minimize dependencies.

PR #3001 finishes the work of creating an external shuffle service by
creating a "network/shuffle" package which deals with serving Spark
shuffle files from outside of an executor. The intention is that this
server can be run anywhere -- including inside the Spark Standalone
Worker or the YARN NodeManager, or as a separate service inside Mesos
-- and provide the ability to scale up and down executors without
losing shuffle data.

Thanks to Aaron, Reynold and others who have worked on these
improvements over the last month.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Surprising Spark SQL benchmark

2014-10-31 Thread Patrick Wendell
Hey Nick,

Unfortunately Citus Data didn't contact any of the Spark or Spark SQL
developers when running this. It is really easy to make one system
look better than others when you are running a benchmark yourself
because tuning and sizing can lead to a 10X performance improvement.
This benchmark doesn't share the mechanism in a reproducible way.

There are a bunch of things that aren't clear here:

1. Spark SQL has optimized parquet features, were these turned on?
2. It doesn't mention computing statistics in Spark SQL, but it does
this for Impala and Parquet. Statistics allow Spark SQL to broadcast
small tables which can make a 10X difference in TPC-H.
3. For data larger than memory, Spark SQL often performs better if you
don't call "cache", did they try this?

Basically, a self-reported marketing benchmark like this that
*shocker* concludes this vendor's solution is the best, is not
particularly useful.

If Citus data wants to run a credible benchmark, I'd invite them to
directly involve Spark SQL developers in the future. Until then, I
wouldn't give much credence to this or any other similar vendor
benchmark.

- Patrick

On Fri, Oct 31, 2014 at 10:38 AM, Nicholas Chammas
 wrote:
> I know we don't want to be jumping at every benchmark someone posts out
> there, but this one surprised me:
>
> http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
>
> This benchmark has Spark SQL failing to complete several queries in the
> TPC-H benchmark. I don't understand much about the details of performing
> benchmarks, but this was surprising.
>
> Are these results expected?
>
> Related HN discussion here: https://news.ycombinator.com/item?id=8539678
>
> Nick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to run tests properly?

2014-10-30 Thread Patrick Wendell
Some of our tests actually require spinning up a small multi-process
spark cluster. These use the normal deployment codepath for Spark
which is that we rely on the spark "assembly jar" to be present. That
jar is generated when you run "mvn package" via a special sub project
called assembly in our build. This is a bit non-standard. The reason
is that some of our tests are really mini integration tests.

- Patrick

On Thu, Oct 30, 2014 at 4:36 AM, Sean Owen  wrote:
> You are right that this is a bit weird compared to the Maven lifecycle
> semantics. Maven wants assembly to come after tests but here tests want to
> launch the final assembly as part of some tests. Yes you would not normally
> have to do this in 2 stages.
>
> On Oct 30, 2014 12:28 PM, "Niklas Wilcke"
> <1wil...@informatik.uni-hamburg.de> wrote:
>>
>> Can you please briefly explain why packaging is necessary. I thought
>> packaging would only build the jar and place it in the target folder.
>> How does that affect the tests? If tests depend on the assembly a "mvn
>> install" would be more sensible to me.
>> Probably I misunderstand the maven build life-cycle.
>>
>> Thanks,
>> Niklas
>>
>> On 29.10.2014 19:01, Patrick Wendell wrote:
>> > One thing is you need to do a "maven package" before you run tests.
>> > The "local-cluster" tests depend on Spark already being packaged.
>> >
>> > - Patrick
>> >
>> > On Wed, Oct 29, 2014 at 10:02 AM, Niklas Wilcke
>> > <1wil...@informatik.uni-hamburg.de> wrote:
>> >> Hi Sean,
>> >>
>> >> thanks for your reply. The tests still don't work. I focused on the
>> >> mllib and core tests and made some observations.
>> >>
>> >> The core tests seems to fail because of my german locale. Some tests
>> >> are
>> >> locale dependend like the
>> >> UtilsSuite.scala
>> >>  - "string formatting of time durations" - checks for locale dependend
>> >> seperators like "." and ","
>> >>  - "isBindCollision" - checks for the locale dependend exception
>> >> message
>> >>
>> >> In the MLlib it seems to be just one source of failure. The same
>> >> Exception I described in my first mail appears several times in
>> >> different tests.
>> >> The reason for all the similar failures is the line 29 in
>> >> LocalClusterSparkContext.scala.
>> >> When I change the line
>> >> .setMaster("local-cluster[2, 1, 512]")
>> >> to
>> >> .setMaster("local")
>> >> all tests run without a failure. The local-cluster mode seems to be the
>> >> reason for the failure. I tried some different configurations like
>> >> [1,1,512], [2,1,1024] etc. but couldn't get the tests run without a
>> >> failure.
>> >>
>> >> Could this be a configuration issue?
>> >>
>> >> On 28.10.2014 19:03, Sean Owen wrote:
>> >>> On Tue, Oct 28, 2014 at 6:18 PM, Niklas Wilcke
>> >>> <1wil...@informatik.uni-hamburg.de> wrote:
>> >>>> 1. via dev/run-tests script
>> >>>> This script executes all tests and take several hours to finish.
>> >>>> Some tests failed but I can't say which of them. Should this really
>> >>>> take
>> >>>> that long? Can I specify to run only MLlib tests?
>> >>> Yes, running all tests takes a long long time. It does print which
>> >>> tests failed, and you can see the errors in the test output.
>> >>>
>> >>> Did you read
>> >>> http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>> >>> ? This shows how to run just one test suite.
>> >>>
>> >>> In any Maven project you can try things like "mvn test -pl [module]"
>> >>> to run just one module's tests.
>> >> Yes I tried that as described below at point 2.
>> >>>> 2. directly via maven
>> >>>> I did the following described in the docs [0].
>> >>>>
>> >>>> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M
>> >>>> -XX:ReservedCodeCacheSize=512m"
>> >>>> mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
>> >>>> mvn -Pyarn -Phadoop-2.3 -Phive test
>> >>>>
>> >>>&g

Re: How to run tests properly?

2014-10-29 Thread Patrick Wendell
One thing is you need to do a "maven package" before you run tests.
The "local-cluster" tests depend on Spark already being packaged.

- Patrick

On Wed, Oct 29, 2014 at 10:02 AM, Niklas Wilcke
<1wil...@informatik.uni-hamburg.de> wrote:
> Hi Sean,
>
> thanks for your reply. The tests still don't work. I focused on the
> mllib and core tests and made some observations.
>
> The core tests seems to fail because of my german locale. Some tests are
> locale dependend like the
> UtilsSuite.scala
>  - "string formatting of time durations" - checks for locale dependend
> seperators like "." and ","
>  - "isBindCollision" - checks for the locale dependend exception message
>
> In the MLlib it seems to be just one source of failure. The same
> Exception I described in my first mail appears several times in
> different tests.
> The reason for all the similar failures is the line 29 in
> LocalClusterSparkContext.scala.
> When I change the line
> .setMaster("local-cluster[2, 1, 512]")
> to
> .setMaster("local")
> all tests run without a failure. The local-cluster mode seems to be the
> reason for the failure. I tried some different configurations like
> [1,1,512], [2,1,1024] etc. but couldn't get the tests run without a failure.
>
> Could this be a configuration issue?
>
> On 28.10.2014 19:03, Sean Owen wrote:
>> On Tue, Oct 28, 2014 at 6:18 PM, Niklas Wilcke
>> <1wil...@informatik.uni-hamburg.de> wrote:
>>> 1. via dev/run-tests script
>>> This script executes all tests and take several hours to finish.
>>> Some tests failed but I can't say which of them. Should this really take
>>> that long? Can I specify to run only MLlib tests?
>> Yes, running all tests takes a long long time. It does print which
>> tests failed, and you can see the errors in the test output.
>>
>> Did you read 
>> http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
>> ? This shows how to run just one test suite.
>>
>> In any Maven project you can try things like "mvn test -pl [module]"
>> to run just one module's tests.
> Yes I tried that as described below at point 2.
>>> 2. directly via maven
>>> I did the following described in the docs [0].
>>>
>>> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M
>>> -XX:ReservedCodeCacheSize=512m"
>>> mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
>>> mvn -Pyarn -Phadoop-2.3 -Phive test
>>>
>>> This also doesn't work.
>>> Why do I have to package spark bevore running the tests?
>> What doesn't work?
>> Some tests use the built assembly, which requires packaging.
> I get the same Exceptions as in every other way.
>>> 3. via sbt
>>> I tried the following. I freshly cloned spark and checked out the tag
>>> v1.1.0-rc4.
>>>
>>> sbt/sbt "project mllib" test
>>>
>>> and get the following exception in several cluster tests.
>>>
>>> [info] - task size should be small in both training and prediction ***
>>> FAILED ***
>> This just looks like a flaky test failure; I'd try again.
>>
> I don't think so. I tried for several times now in several different ways.
>
> Thanks,
> Niklas
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: HiveShim not found when building in Intellij

2014-10-28 Thread Patrick Wendell
Oops - I actually should have added v0.13.0 (i.e. to match whatever I
did in the profile).

On Tue, Oct 28, 2014 at 10:05 PM, Patrick Wendell  wrote:
> Cheng - to make it recognize the new HiveShim for 0.12 I had to click
> on spark-hive under "packages" in the left pane, then go to "Open
> Module Settings" - then explicitly add the v0.12.0/src/main/scala
> folder to the sources by navigating to it and then +click to add
> it as a source. Did you have to do this?
>
> On Tue, Oct 28, 2014 at 9:57 PM, Patrick Wendell  wrote:
>> I just started a totally fresh IntelliJ project importing from our
>> root pom. I used all the default options and I added "hadoop-2.4,
>> hive, hive-0.13.1" profiles. I was able to run spark core tests from
>> within IntelliJ. Didn't try anything beyond that, but FWIW this
>> worked.
>>
>> - Patrick
>>
>> On Tue, Oct 28, 2014 at 9:54 PM, Cheng Lian  wrote:
>>> You may first open the root pom.xml file in IDEA, and then go for menu View
>>> / Tool Windows / Maven Projects, then choose desired Maven profile
>>> combination under the "Profiles" node (e.g. I usually use hadoop-2.4 + hive
>>> + hive-0.12.0). IDEA will ask you to re-import the Maven projects, confirm,
>>> then it should be OK.
>>>
>>> I can debug within IDEA with this approach. However, you have to clean the
>>> whole project before debugging Spark within IDEA if you compiled the project
>>> outside IDEA. Haven't got time to investigate this annoying issue.
>>>
>>> Also, you can remove sub projects unrelated to your tasks to accelerate
>>> compilation and/or avoid other IDEA build issues (e.g. Avro related Spark
>>> streaming build failure in IDEA).
>>>
>>>
>>> On 10/29/14 12:42 PM, Stephen Boesch wrote:
>>>
>>> I am interested specifically in how to build (and hopefully run/debug..)
>>> under Intellij.  Your posts sound like command line maven - which has always
>>> been working already.
>>>
>>> Do you have instructions for building in IJ?
>>>
>>> 2014-10-28 21:38 GMT-07:00 Cheng Lian :
>>>>
>>>> Yes, these two combinations work for me.
>>>>
>>>>
>>>> On 10/29/14 12:32 PM, Zhan Zhang wrote:
>>>>>
>>>>> -Phive is to enable hive-0.13.1 and "-Phive -Phive-0.12.0" is to enable
>>>>> hive-0.12.0. Note that the thrift-server is not supported yet in 
>>>>> hive-0.13,
>>>>> but expected to go to upstream soon (Spark-3720).
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Zhan Zhang
>>>>>
>>>>>
>>>>>   On Oct 28, 2014, at 9:09 PM, Stephen Boesch  wrote:
>>>>>
>>>>>> Thanks Patrick for the heads up.
>>>>>>
>>>>>> I have not been successful to discover a combination of profiles (i.e.
>>>>>> enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
>>>>>> maven. Anyone who knows how to handle this - a quick note here would be
>>>>>> appreciated.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2014-10-28 20:20 GMT-07:00 Patrick Wendell :
>>>>>>
>>>>>>> Hey Stephen,
>>>>>>>
>>>>>>> In some cases in the maven build we now have pluggable source
>>>>>>> directories based on profiles using the maven build helper plug-in.
>>>>>>> This is necessary to support cross building against different Hive
>>>>>>> versions, and there will be additional instances of this due to
>>>>>>> supporting scala 2.11 and 2.10.
>>>>>>>
>>>>>>> In these cases, you may need to add source locations explicitly to
>>>>>>> intellij if you want the entire project to compile there.
>>>>>>>
>>>>>>> Unfortunately as long as we support cross-building like this, it will
>>>>>>> be an issue. Intellij's maven support does not correctly detect our
>>>>>>> use of the maven-build-plugin to add source directories.
>>>>>>>
>>>>>>> We should come up with a good set of instructions on how to import the
>>>>>>> pom files + add the few extra source directories. Off hand I am not
>>>>>>> sure exactly what the correct sequence is.

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Patrick Wendell
Cheng - to make it recognize the new HiveShim for 0.12 I had to click
on spark-hive under "packages" in the left pane, then go to "Open
Module Settings" - then explicitly add the v0.12.0/src/main/scala
folder to the sources by navigating to it and then +click to add
it as a source. Did you have to do this?

On Tue, Oct 28, 2014 at 9:57 PM, Patrick Wendell  wrote:
> I just started a totally fresh IntelliJ project importing from our
> root pom. I used all the default options and I added "hadoop-2.4,
> hive, hive-0.13.1" profiles. I was able to run spark core tests from
> within IntelliJ. Didn't try anything beyond that, but FWIW this
> worked.
>
> - Patrick
>
> On Tue, Oct 28, 2014 at 9:54 PM, Cheng Lian  wrote:
>> You may first open the root pom.xml file in IDEA, and then go for menu View
>> / Tool Windows / Maven Projects, then choose desired Maven profile
>> combination under the "Profiles" node (e.g. I usually use hadoop-2.4 + hive
>> + hive-0.12.0). IDEA will ask you to re-import the Maven projects, confirm,
>> then it should be OK.
>>
>> I can debug within IDEA with this approach. However, you have to clean the
>> whole project before debugging Spark within IDEA if you compiled the project
>> outside IDEA. Haven't got time to investigate this annoying issue.
>>
>> Also, you can remove sub projects unrelated to your tasks to accelerate
>> compilation and/or avoid other IDEA build issues (e.g. Avro related Spark
>> streaming build failure in IDEA).
>>
>>
>> On 10/29/14 12:42 PM, Stephen Boesch wrote:
>>
>> I am interested specifically in how to build (and hopefully run/debug..)
>> under Intellij.  Your posts sound like command line maven - which has always
>> been working already.
>>
>> Do you have instructions for building in IJ?
>>
>> 2014-10-28 21:38 GMT-07:00 Cheng Lian :
>>>
>>> Yes, these two combinations work for me.
>>>
>>>
>>> On 10/29/14 12:32 PM, Zhan Zhang wrote:
>>>>
>>>> -Phive is to enable hive-0.13.1 and "-Phive -Phive-0.12.0" is to enable
>>>> hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13,
>>>> but expected to go to upstream soon (Spark-3720).
>>>>
>>>> Thanks.
>>>>
>>>> Zhan Zhang
>>>>
>>>>
>>>>   On Oct 28, 2014, at 9:09 PM, Stephen Boesch  wrote:
>>>>
>>>>> Thanks Patrick for the heads up.
>>>>>
>>>>> I have not been successful to discover a combination of profiles (i.e.
>>>>> enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
>>>>> maven. Anyone who knows how to handle this - a quick note here would be
>>>>> appreciated.
>>>>>
>>>>>
>>>>>
>>>>> 2014-10-28 20:20 GMT-07:00 Patrick Wendell :
>>>>>
>>>>>> Hey Stephen,
>>>>>>
>>>>>> In some cases in the maven build we now have pluggable source
>>>>>> directories based on profiles using the maven build helper plug-in.
>>>>>> This is necessary to support cross building against different Hive
>>>>>> versions, and there will be additional instances of this due to
>>>>>> supporting scala 2.11 and 2.10.
>>>>>>
>>>>>> In these cases, you may need to add source locations explicitly to
>>>>>> intellij if you want the entire project to compile there.
>>>>>>
>>>>>> Unfortunately as long as we support cross-building like this, it will
>>>>>> be an issue. Intellij's maven support does not correctly detect our
>>>>>> use of the maven-build-plugin to add source directories.
>>>>>>
>>>>>> We should come up with a good set of instructions on how to import the
>>>>>> pom files + add the few extra source directories. Off hand I am not
>>>>>> sure exactly what the correct sequence is.
>>>>>>
>>>>>> - Patrick
>>>>>>
>>>>>> On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch 
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Matei,
>>>>>>>   Until my latest pull from upstream/master it had not been necessary
>>>>>>> to
>>>>>>> add the hive profile: is it now??
>>>>>>>
>>>>>>> I am not usin

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Patrick Wendell
I just started a totally fresh IntelliJ project importing from our
root pom. I used all the default options and I added "hadoop-2.4,
hive, hive-0.13.1" profiles. I was able to run spark core tests from
within IntelliJ. Didn't try anything beyond that, but FWIW this
worked.

- Patrick

On Tue, Oct 28, 2014 at 9:54 PM, Cheng Lian  wrote:
> You may first open the root pom.xml file in IDEA, and then go for menu View
> / Tool Windows / Maven Projects, then choose desired Maven profile
> combination under the "Profiles" node (e.g. I usually use hadoop-2.4 + hive
> + hive-0.12.0). IDEA will ask you to re-import the Maven projects, confirm,
> then it should be OK.
>
> I can debug within IDEA with this approach. However, you have to clean the
> whole project before debugging Spark within IDEA if you compiled the project
> outside IDEA. Haven't got time to investigate this annoying issue.
>
> Also, you can remove sub projects unrelated to your tasks to accelerate
> compilation and/or avoid other IDEA build issues (e.g. Avro related Spark
> streaming build failure in IDEA).
>
>
> On 10/29/14 12:42 PM, Stephen Boesch wrote:
>
> I am interested specifically in how to build (and hopefully run/debug..)
> under Intellij.  Your posts sound like command line maven - which has always
> been working already.
>
> Do you have instructions for building in IJ?
>
> 2014-10-28 21:38 GMT-07:00 Cheng Lian :
>>
>> Yes, these two combinations work for me.
>>
>>
>> On 10/29/14 12:32 PM, Zhan Zhang wrote:
>>>
>>> -Phive is to enable hive-0.13.1 and "-Phive -Phive-0.12.0" is to enable
>>> hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13,
>>> but expected to go to upstream soon (Spark-3720).
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>>
>>>   On Oct 28, 2014, at 9:09 PM, Stephen Boesch  wrote:
>>>
>>>> Thanks Patrick for the heads up.
>>>>
>>>> I have not been successful to discover a combination of profiles (i.e.
>>>> enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
>>>> maven. Anyone who knows how to handle this - a quick note here would be
>>>> appreciated.
>>>>
>>>>
>>>>
>>>> 2014-10-28 20:20 GMT-07:00 Patrick Wendell :
>>>>
>>>>> Hey Stephen,
>>>>>
>>>>> In some cases in the maven build we now have pluggable source
>>>>> directories based on profiles using the maven build helper plug-in.
>>>>> This is necessary to support cross building against different Hive
>>>>> versions, and there will be additional instances of this due to
>>>>> supporting scala 2.11 and 2.10.
>>>>>
>>>>> In these cases, you may need to add source locations explicitly to
>>>>> intellij if you want the entire project to compile there.
>>>>>
>>>>> Unfortunately as long as we support cross-building like this, it will
>>>>> be an issue. Intellij's maven support does not correctly detect our
>>>>> use of the maven-build-plugin to add source directories.
>>>>>
>>>>> We should come up with a good set of instructions on how to import the
>>>>> pom files + add the few extra source directories. Off hand I am not
>>>>> sure exactly what the correct sequence is.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch 
>>>>> wrote:
>>>>>>
>>>>>> Hi Matei,
>>>>>>   Until my latest pull from upstream/master it had not been necessary
>>>>>> to
>>>>>> add the hive profile: is it now??
>>>>>>
>>>>>> I am not using sbt gen-idea. The way to open in intellij has been to
>>>>>> Open
>>>>>> the parent directory. IJ recognizes it as a maven project.
>>>>>>
>>>>>> There are several steps to do surgery on the yarn-parent / yarn
>>>>>> projects
>>>>>
>>>>> ,
>>>>>>
>>>>>> then do a full rebuild.  That was working until one week ago.
>>>>>> Intellij/maven is presently broken in  two ways:  this hive shim
>>>>>> (which
>>>>>
>>>>> may
>>>>>>
>>

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Patrick Wendell
Btw - we should have part of the official docs that describes a full
"from scratch" build in IntelliJ including any gotchas. Then we can
update it if there are build changes that alter it. I created this
JIRA for it:

https://issues.apache.org/jira/browse/SPARK-4128

On Tue, Oct 28, 2014 at 9:42 PM, Stephen Boesch  wrote:
> I am interested specifically in how to build (and hopefully run/debug..)
> under Intellij.  Your posts sound like command line maven - which has always
> been working already.
>
> Do you have instructions for building in IJ?
>
> 2014-10-28 21:38 GMT-07:00 Cheng Lian :
>
>> Yes, these two combinations work for me.
>>
>>
>> On 10/29/14 12:32 PM, Zhan Zhang wrote:
>>>
>>> -Phive is to enable hive-0.13.1 and "-Phive -Phive-0.12.0" is to enable
>>> hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13,
>>> but expected to go to upstream soon (Spark-3720).
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>>
>>>   On Oct 28, 2014, at 9:09 PM, Stephen Boesch  wrote:
>>>
>>>> Thanks Patrick for the heads up.
>>>>
>>>> I have not been successful to discover a combination of profiles (i.e.
>>>> enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
>>>> maven. Anyone who knows how to handle this - a quick note here would be
>>>> appreciated.
>>>>
>>>>
>>>>
>>>> 2014-10-28 20:20 GMT-07:00 Patrick Wendell :
>>>>
>>>>> Hey Stephen,
>>>>>
>>>>> In some cases in the maven build we now have pluggable source
>>>>> directories based on profiles using the maven build helper plug-in.
>>>>> This is necessary to support cross building against different Hive
>>>>> versions, and there will be additional instances of this due to
>>>>> supporting scala 2.11 and 2.10.
>>>>>
>>>>> In these cases, you may need to add source locations explicitly to
>>>>> intellij if you want the entire project to compile there.
>>>>>
>>>>> Unfortunately as long as we support cross-building like this, it will
>>>>> be an issue. Intellij's maven support does not correctly detect our
>>>>> use of the maven-build-plugin to add source directories.
>>>>>
>>>>> We should come up with a good set of instructions on how to import the
>>>>> pom files + add the few extra source directories. Off hand I am not
>>>>> sure exactly what the correct sequence is.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch 
>>>>> wrote:
>>>>>>
>>>>>> Hi Matei,
>>>>>>   Until my latest pull from upstream/master it had not been necessary
>>>>>> to
>>>>>> add the hive profile: is it now??
>>>>>>
>>>>>> I am not using sbt gen-idea. The way to open in intellij has been to
>>>>>> Open
>>>>>> the parent directory. IJ recognizes it as a maven project.
>>>>>>
>>>>>> There are several steps to do surgery on the yarn-parent / yarn
>>>>>> projects
>>>>>
>>>>> ,
>>>>>>
>>>>>> then do a full rebuild.  That was working until one week ago.
>>>>>> Intellij/maven is presently broken in  two ways:  this hive shim
>>>>>> (which
>>>>>
>>>>> may
>>>>>>
>>>>>> yet hopefully be a small/simple fix - let us see) and  (2) the
>>>>>> "NoClassDefFoundError
>>>>>> on ThreadFactoryBuilder" from my prior emails -and which is quite a
>>>>>
>>>>> serious
>>>>>>
>>>>>> problem .
>>>>>>
>>>>>> 2014-10-28 19:46 GMT-07:00 Matei Zaharia :
>>>>>>
>>>>>>> Hi Stephen,
>>>>>>>
>>>>>>> How did you generate your Maven workspace? You need to make sure the
>>>>>
>>>>> Hive
>>>>>>>
>>>>>>> profile is enabled for it. For example sbt/sbt -Phive gen-idea.
>>>>>>>
>>>>>>> Matei
>>>>>>>
>>>>>>>> On Oct 28, 2014, at 7:42 PM, Stephen Boesch 
>>>>>
>>>>> wrote:
>>>>>>>>
>>>>>>>> I have run on the command line via maven and it is fine:
>>>>>>>>
>>>>>>>> mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn
>>>>>
>>>>> -Phadoop-2.3
>>>>>>>>
>>>>>>>> compile package install
>>>>>>>>
>>>>>>>>
>>>>>>>> But with the latest code Intellij builds do not work. Following is
>>>>>
>>>>> one of
>>>>>>>>
>>>>>>>> 26 similar errors:
>>>>>>>>
>>>>>>>>
>>>>>>>> Error:(173, 38) not found: value HiveShim
>>>>>>>>
>>>>>>> Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize))
>>>>>>>>
>>>>>>>> ^
>>>>>>>
>>>>>>>
>>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: HiveShim not found when building in Intellij

2014-10-28 Thread Patrick Wendell
Hey Stephen,

In some cases in the maven build we now have pluggable source
directories based on profiles using the maven build helper plug-in.
This is necessary to support cross building against different Hive
versions, and there will be additional instances of this due to
supporting scala 2.11 and 2.10.

In these cases, you may need to add source locations explicitly to
intellij if you want the entire project to compile there.

Unfortunately as long as we support cross-building like this, it will
be an issue. Intellij's maven support does not correctly detect our
use of the maven-build-plugin to add source directories.

We should come up with a good set of instructions on how to import the
pom files + add the few extra source directories. Off hand I am not
sure exactly what the correct sequence is.

- Patrick

On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch  wrote:
> Hi Matei,
>   Until my latest pull from upstream/master it had not been necessary to
> add the hive profile: is it now??
>
> I am not using sbt gen-idea. The way to open in intellij has been to Open
> the parent directory. IJ recognizes it as a maven project.
>
> There are several steps to do surgery on the yarn-parent / yarn projects ,
> then do a full rebuild.  That was working until one week ago.
> Intellij/maven is presently broken in  two ways:  this hive shim (which may
> yet hopefully be a small/simple fix - let us see) and  (2) the
> "NoClassDefFoundError
> on ThreadFactoryBuilder" from my prior emails -and which is quite a serious
>  problem .
>
> 2014-10-28 19:46 GMT-07:00 Matei Zaharia :
>
>> Hi Stephen,
>>
>> How did you generate your Maven workspace? You need to make sure the Hive
>> profile is enabled for it. For example sbt/sbt -Phive gen-idea.
>>
>> Matei
>>
>> > On Oct 28, 2014, at 7:42 PM, Stephen Boesch  wrote:
>> >
>> > I have run on the command line via maven and it is fine:
>> >
>> > mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn -Phadoop-2.3
>> > compile package install
>> >
>> >
>> > But with the latest code Intellij builds do not work. Following is one of
>> > 26 similar errors:
>> >
>> >
>> > Error:(173, 38) not found: value HiveShim
>> >
>> Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize))
>> > ^
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Support Hive 0.13 .1 in Spark SQL

2014-10-27 Thread Patrick Wendell
Hey Cheng,

Right now we aren't using stable API's to communicate with the Hive
Metastore. We didn't want to drop support for Hive 0.12 so right now
we are using a shim layer to support compiling for 0.12 and 0.13. This
is very costly to maintain.

If Hive has a stable meta-data API for talking to a Metastore, we
should use that (is HCatalog sufficient for this purpose?). Ideally we
would be able to talk to multiple versions of the Hive metastore and
we can keep a single internal version of Hive for our use of Serde's,
etc.

I've created SPARK-4114 for this:
https://issues.apache.org/jira/browse/SPARK-4114

This is a very important issue for Spark SQL, so I'd welcome comments
on that JIRA from anyone who is familiar with Hive/HCatalog internals.

- Patrick

On Mon, Oct 27, 2014 at 9:54 PM, Cheng, Hao  wrote:
> Hi, all
>
>I have some PRs blocked by hive upgrading (e.g.
> https://github.com/apache/spark/pull/2570), the problem is some internal
> hive method signature changed, it's hard to make the compatible in code
> level (sql/hive) when switching back/forth the Hive versions.
>
>
>
>   I guess the motivation of the upgrading is to support the Metastore with
> different Hive versions. So, how about just keep the metastore related hive
> jars upgrading or utilize the HCatalog directly? And of course we can either
> leaving hive-exec.jar hive-cli.jar etc as 0.12 or upgrade to 0.13.1, but not
> support them both.
>
>
>
> Sorry if I missed some discussion of Hive upgrading.
>
>
>
> Cheng Hao

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Moving PR Builder to mvn

2014-10-24 Thread Patrick Wendell
Does Zinc still help if you are just running a single totally fresh
build? For the pull request builder we purge all state from previous
builds.

- Patrick

On Fri, Oct 24, 2014 at 1:55 PM, Hari Shreedharan
 wrote:
> I have zinc server running on my mac, and I see maven compilation to be much
> better than before I had it running. Is the sbt build still faster (sorry,
> long time since I did a build with sbt).
>
> Thanks,
> Hari
>
>
> On Fri, Oct 24, 2014 at 1:46 PM, Patrick Wendell  wrote:
>>
>> Overall I think this would be a good idea. The main blocker is just
>> that I think the Maven build is much slower right now than the SBT
>> build. However, if we were able to e.g. parallelize the test build on
>> Jenkins that might make up for it.
>>
>> I'd actually like to have a trigger where we could tests pull requests
>> with either one.
>>
>> - Patrick
>>
>> On Fri, Oct 24, 2014 at 1:39 PM, Hari Shreedharan
>>  wrote:
>> > Over the last few months, it seems like we have selected Maven to be the
>> > "official" build system for Spark.
>> >
>> >
>> > I realize that removing the sbt build may not be easy, but it might be a
>> > good idea to start looking into that. We had issues over the past few days
>> > where mvn builds were fine, while sbt was failing to resolve dependencies
>> > which were test-jars causing compilation of certain tests to fail.
>> >
>> >
>> > As a first step, I am wondering if it might be a good idea to change the
>> > PR builder to mvn and test PRs consistent with the way we test releases. I
>> > am not sure how technically feasible it is, but it would be a start to
>> > standardizing on one build system.
>> >
>> > Thanks,
>> > Hari
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: your weekly git timeout update! TL;DR: i'm now almost certain we're not hitting rate limits.

2014-10-24 Thread Patrick Wendell
Thanks for the update Shane.

As a point of process, for things like this where we re debugging
specific issues - can we use JIRA instead of notifying everyone on the
spark-dev list?

I'd prefer if ops/infra announcements on the dev list are restricted
to things that are widely applicable to developers (e.g. planned or
unplanned maintenance on jenkins), since this last has hundreds of
people on it.

- Patrick

On Fri, Oct 24, 2014 at 1:32 PM, shane knapp  wrote:
> so, things look like they've stabilized significantly over the past 10 days,
> and without any changes on our end:
> 
> $ /root/tools/get_timeouts.sh 10
> timeouts by date:
> 2014-10-14 -- 2
> 2014-10-16 -- 1
> 2014-10-19 -- 1
> 2014-10-20 -- 2
> 2014-10-23 -- 5
>
> timeouts by project:
>   5 NewSparkPullRequestBuilder
>   5 SparkPullRequestBuilder
>   1 Tachyon-Pull-Request-Builder
> total builds (excepting aborted by a user):
> 602
>
> total percentage of builds timing out:
> 01
> 
>
> the NewSparkPullRequestBuilder failures are spread over five different days
> (10-14 through 10-20), and the SparkPullRequestBuilder failures all happened
> yesterday.  there were a LOT of SparkPullRequestBuilder builds yesterday
> (60), and the failures happened during these hours (first number == number
> of builds failed, second number == hour of the day):
> 
> $ cat timeouts-102414-130817 | grep SparkPullRequestBuilder | grep
> 2014-10-23 | awk '{print$3}' | awk -F":" '{print$1'} | sort | uniq -c
>   1 03
>   2 20
>   1 22
>   1 23
> 
>
> however, the number of total SparkPullRequestBuilder builds during these
> times don't seem egregious:
> 
>   4 03
>   9 20
>   4 22
>   9 23
> 
>
> nor does the total for ALL builds at those times:
> 
>   5 03
>   9 20
>   7 22
>  11 23
> 
>
> 9 builds was the largest number of SparkPullRequestBuilder builds per hour,
> but there were other hours with 5, 6 or 7 builds/hour that didn't have a
> timeout issue.
>
> in fact, hour 16 (4pm) had the most builds running total yesterday, which
> includes 7 SparkPullRequestBuilder builds, and nothing timed out.
>
> most of the pull request builder hits on github are authenticated w/an oauth
> token.  this gives us 5000 hits/hour, and unauthed gives us 60/hour.
>
> in conclusion:  there is no way are we hitting github often enough to be
> rate limited.  i think i've finally ruled that out completely.  :)
>
> --
> You received this message because you are subscribed to the Google Groups
> "amp-infra" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to amp-infra+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Moving PR Builder to mvn

2014-10-24 Thread Patrick Wendell
Overall I think this would be a good idea. The main blocker is just
that I think the Maven build is much slower right now than the SBT
build. However, if we were able to e.g. parallelize the test build on
Jenkins that might make up for it.

I'd actually like to have a trigger where we could tests pull requests
with either one.

- Patrick

On Fri, Oct 24, 2014 at 1:39 PM, Hari Shreedharan
 wrote:
> Over the last few months, it seems like we have selected Maven to be the 
> "official" build system for Spark.
>
>
> I realize that removing the sbt build may not be easy, but it might be a good 
> idea to start looking into that. We had issues over the past few days where 
> mvn builds were fine, while sbt was failing to resolve dependencies which 
> were test-jars causing compilation of certain tests to fail.
>
>
> As a first step, I am wondering if it might be a good idea to change the PR 
> builder to mvn and test PRs consistent with the way we test releases. I am 
> not sure how technically feasible it is, but it would be a start to 
> standardizing on one build system.
>
> Thanks,
> Hari

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark 1.2 feature freeze on November 1

2014-10-23 Thread Patrick Wendell
Hey All,

Just a reminder that as planned [1] we'll go into a feature freeze on
November 1. On that date I'll cut a 1.2 release branch and make the
up-or-down call on any patches that go into that branch, along with
individual committers.

It is common for us to receive a very large volume of patches near the
deadline. The highest priority will be fixes and features that are in
review and were submitted earlier in the window. As a heads up, new
feature patches that are submitted in the next week have a good chance
of being pushed after 1.2.

During this coming weeks, I'd like to invite the community to help
with code review, testing patches, helping isolate bugs, our test
infra, etc. In past releases, community participation has helped
increase our ability to merge patches substantially. Individuals
really can make a huge difference here!

[1] https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: scalastyle annoys me a little bit

2014-10-23 Thread Patrick Wendell
Hey Koert,

I think disabling the style checks in maven package could be a good
idea for the reason you point out. I was sort of mixed on that when it
was proposed for this exact reason. It's just annoying to developers.

In terms of changing the global limit, this is more religion than
anything else, but there are other cases where the current limit is
useful (e.g. if you have many windows open in a large screen).

- Patrick

On Thu, Oct 23, 2014 at 11:03 AM, Koert Kuipers  wrote:
> 100 max width seems very restrictive to me.
>
> even the most restrictive environment i have for development (ssh with
> emacs) i get a lot more characters to work with than that.
>
> personally i find the code harder to read, not easier. like i kept
> wondering why there are weird newlines in the
> middle of constructors and such, only to realise later it was because of
> the 100 character limit.
>
> also, i find "mvn package" erroring out because of style errors somewhat
> excessive. i understand that a pull request needs to conform to "the style"
> before being accepted, but this means i cant even run tests on code that
> does not conform to the style guide, which is a bit silly.
>
> i keep going out for coffee while package and tests run, only to come back
> for an annoying error that my line is 101 characters and therefore nothing
> ran.
>
> is there some maven switch to disable the style checks?
>
> best! koert

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Which part of the code deals with communication?

2014-10-22 Thread Patrick Wendell
The best documentation about communication interfaces is the
SecurityManager doc written by Tom Graves. With this as a starting
point I'd recommend digging through the code for each component.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SecurityManager.scala#L59

On Wed, Oct 22, 2014 at 4:00 AM, Theodore Si  wrote:
> Hi all,
>
> Workers will exchange data in between, right?
>
> What classes are in charge of these actions?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Patrick Wendell
Josh - the errors that broke our build indicated that JDK5 was being
used. Somehow the upgrade caused our build to use a much older Java
version. See the JIRA for more details.

On Tue, Oct 21, 2014 at 10:05 AM, Josh Rosen  wrote:
> I find it concerning that there's a JDK version that breaks out build, since
> we're supposed to support Java 7.  Is 7u71 an upgrade or downgrade from the
> JDK that we used before?  Is there an easy way to fix our build so that it
> compiles with 7u71's stricter settings?
>
> I'm not sure why the "New" PRB is failing here.  It was originally created
> as a clone of the main pull request builder job. I checked the configuration
> history and confirmed that there aren't any settings that we've forgotten to
> copy over (e.g. their configurations haven't diverged), so I'm not sure
> what's causing this.
>
> - Josh
>
> On October 21, 2014 at 6:35:39 AM, Nan Zhu (zhunanmcg...@gmail.com) wrote:
>
> weird.two buildings (one triggered by New, one triggered by Old) were
> executed in the same node, amp-jenkins-slave-01, one compiles, one not...
>
> Best,
>
> --
> Nan Zhu
>
>
> On Tuesday, October 21, 2014 at 9:39 AM, Nan Zhu wrote:
>
>> seems that all PRs built by NewSparkPRBuilder suffers from 7u71, while
>> SparkPRBuilder is working fine
>>
>> Best,
>>
>> --
>> Nan Zhu
>>
>>
>> On Tuesday, October 21, 2014 at 9:22 AM, Cheng Lian wrote:
>>
>> > It's a new pull request builder written by Josh, integrated into our
>> > state-of-the-art PR dashboard :)
>> >
>> > On 10/21/14 9:33 PM, Nan Zhu wrote:
>> > > just curious...what is this "NewSparkPullRequestBuilder"?
>> > >
>> > > Best,
>> > >
>> > > --
>> > > Nan Zhu
>> > >
>> > >
>> > > On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:
>> > >
>> > > >
>> > > > Hm, seems that 7u71 comes back again. Observed similar Kinesis
>> > > > compilation error just now:
>> > > > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull
>> > > >
>> > > >
>> > > > Checked Jenkins slave nodes, saw /usr/java/latest points to
>> > > > jdk1.7.0_71. However, /usr/bin/javac -version says:
>> > > >
>> > > > >
>> > > > > Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM
>> > > > > Corp 2000, 2008. All rights reserved.
>> > > > >
>> > > >
>> > > >
>> > > > Which JDK is actually used by Jenkins?
>> > > >
>> > > >
>> > > > Cheng
>> > > >
>> > > >
>> > > > On 10/21/14 8:28 AM, shane knapp wrote:
>> > > >
>> > > > > ok, so earlier today i installed a 2nd JDK within jenkins (7u71),
>> > > > > which fixed the SparkR build but apparently made Spark itself quite 
>> > > > > unhappy.
>> > > > > i removed that JDK, triggered a build (
>> > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
>> > > > > and it compiled kinesis w/o dying a fiery death. apparently 7u71 is 
>> > > > > stricter
>> > > > > when compiling. sad times. sorry about that! shane On Mon, Oct 20, 
>> > > > > 2014 at
>> > > > > 5:16 PM, Patrick Wendell  
>> > > > > (mailto:pwend...@gmail.com)
>> > > > > wrote:
>> > > > > > The failure is in the Kinesis compoent, can you reproduce this
>> > > > > > if you build with -Pkinesis-asl? - Patrick On Mon, Oct 20, 2014 at 
>> > > > > > 5:08 PM,
>> > > > > > shane knapp  (mailto:skn...@berkeley.edu) 
>> > > > > > wrote:
>> > > > > > > hmm, strange. i'll take a look. On Mon, Oct 20, 2014 at 5:11
>> > > > > > > PM, Nan Zhu  
>> > > > > > > (mailto:zhunanmcg...@gmail.com) wrote:
>> > > > > > > > yes, I can compile locally, too but it seems that Jenkins is
>> > > > > > > > not happy now...
>> > > > > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
>> > > > > > > >  All
>> > > > > > > > failed to compile Best, -- Nan Z

Re: something wrong with Jenkins or something untested merged?

2014-10-20 Thread Patrick Wendell
I created an issue to fix this:

https://issues.apache.org/jira/browse/SPARK-4021

On Mon, Oct 20, 2014 at 5:32 PM, Patrick Wendell  wrote:
> Thanks Shane - we should fix the source code issues in the Kinesis
> code that made stricter Java compilers reject it.
>
> - Patrick
>
> On Mon, Oct 20, 2014 at 5:28 PM, shane knapp  wrote:
>> ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which
>> fixed the SparkR build but apparently made Spark itself quite unhappy.  i
>> removed that JDK, triggered a build
>> (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
>> and it compiled kinesis w/o dying a fiery death.
>>
>> apparently 7u71 is stricter when compiling.  sad times.
>>
>> sorry about that!
>>
>> shane
>>
>>
>> On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell  wrote:
>>>
>>> The failure is in the Kinesis compoent, can you reproduce this if you
>>> build with -Pkinesis-asl?
>>>
>>> - Patrick
>>>
>>> On Mon, Oct 20, 2014 at 5:08 PM, shane knapp  wrote:
>>> > hmm, strange.  i'll take a look.
>>> >
>>> > On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu  wrote:
>>> >
>>> >> yes, I can compile locally, too
>>> >>
>>> >> but it seems that Jenkins is not happy now...
>>> >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
>>> >>
>>> >> All failed to compile
>>> >>
>>> >> Best,
>>> >>
>>> >> --
>>> >> Nan Zhu
>>> >>
>>> >>
>>> >> On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:
>>> >>
>>> >> > I performed build on latest master branch but didn't get compilation
>>> >> error.
>>> >> >
>>> >> > FYI
>>> >> >
>>> >> > On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu >> >> (mailto:zhunanmcg...@gmail.com)> wrote:
>>> >> > > Hi,
>>> >> > >
>>> >> > > I just submitted a patch
>>> >> https://github.com/apache/spark/pull/2864/files
>>> >> > > with one line change
>>> >> > >
>>> >> > > but the Jenkins told me it's failed to compile on the unrelated
>>> >> > > files?
>>> >> > >
>>> >> > >
>>> >>
>>> >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
>>> >> > >
>>> >> > >
>>> >> > > Best,
>>> >> > >
>>> >> > > Nan
>>> >> >
>>> >>
>>> >>
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: something wrong with Jenkins or something untested merged?

2014-10-20 Thread Patrick Wendell
Thanks Shane - we should fix the source code issues in the Kinesis
code that made stricter Java compilers reject it.

- Patrick

On Mon, Oct 20, 2014 at 5:28 PM, shane knapp  wrote:
> ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which
> fixed the SparkR build but apparently made Spark itself quite unhappy.  i
> removed that JDK, triggered a build
> (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
> and it compiled kinesis w/o dying a fiery death.
>
> apparently 7u71 is stricter when compiling.  sad times.
>
> sorry about that!
>
> shane
>
>
> On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell  wrote:
>>
>> The failure is in the Kinesis compoent, can you reproduce this if you
>> build with -Pkinesis-asl?
>>
>> - Patrick
>>
>> On Mon, Oct 20, 2014 at 5:08 PM, shane knapp  wrote:
>> > hmm, strange.  i'll take a look.
>> >
>> > On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu  wrote:
>> >
>> >> yes, I can compile locally, too
>> >>
>> >> but it seems that Jenkins is not happy now...
>> >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
>> >>
>> >> All failed to compile
>> >>
>> >> Best,
>> >>
>> >> --
>> >> Nan Zhu
>> >>
>> >>
>> >> On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:
>> >>
>> >> > I performed build on latest master branch but didn't get compilation
>> >> error.
>> >> >
>> >> > FYI
>> >> >
>> >> > On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu > >> (mailto:zhunanmcg...@gmail.com)> wrote:
>> >> > > Hi,
>> >> > >
>> >> > > I just submitted a patch
>> >> https://github.com/apache/spark/pull/2864/files
>> >> > > with one line change
>> >> > >
>> >> > > but the Jenkins told me it's failed to compile on the unrelated
>> >> > > files?
>> >> > >
>> >> > >
>> >>
>> >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
>> >> > >
>> >> > >
>> >> > > Best,
>> >> > >
>> >> > > Nan
>> >> >
>> >>
>> >>
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: something wrong with Jenkins or something untested merged?

2014-10-20 Thread Patrick Wendell
The failure is in the Kinesis compoent, can you reproduce this if you
build with -Pkinesis-asl?

- Patrick

On Mon, Oct 20, 2014 at 5:08 PM, shane knapp  wrote:
> hmm, strange.  i'll take a look.
>
> On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu  wrote:
>
>> yes, I can compile locally, too
>>
>> but it seems that Jenkins is not happy now...
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
>>
>> All failed to compile
>>
>> Best,
>>
>> --
>> Nan Zhu
>>
>>
>> On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:
>>
>> > I performed build on latest master branch but didn't get compilation
>> error.
>> >
>> > FYI
>> >
>> > On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu > (mailto:zhunanmcg...@gmail.com)> wrote:
>> > > Hi,
>> > >
>> > > I just submitted a patch
>> https://github.com/apache/spark/pull/2864/files
>> > > with one line change
>> > >
>> > > but the Jenkins told me it's failed to compile on the unrelated files?
>> > >
>> > >
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
>> > >
>> > >
>> > > Best,
>> > >
>> > > Nan
>> >
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Get attempt number in a closure

2014-10-20 Thread Patrick Wendell
There is a deeper issue here which is AFAIK we don't even store a
notion of attempt inside of Spark, we just use a new taskId with the
same index.

On Mon, Oct 20, 2014 at 12:38 PM, Yin Huai  wrote:
> Yeah, seems we need to pass the attempt id to executors through
> TaskDescription. I have created
> https://issues.apache.org/jira/browse/SPARK-4014.
>
> On Mon, Oct 20, 2014 at 1:57 PM, Reynold Xin  wrote:
>
>> I also ran into this earlier. It is a bug. Do you want to file a jira?
>>
>> I think part of the problem is that we don't actually have the attempt id
>> on the executors. If we do, that's great. If not, we'd need to propagate
>> that over.
>>
>> On Mon, Oct 20, 2014 at 7:17 AM, Yin Huai  wrote:
>>
>>> Hello,
>>>
>>> Is there any way to get the attempt number in a closure? Seems
>>> TaskContext.attemptId actually returns the taskId of a task (see this
>>> <
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L181
>>> >
>>>  and this
>>> <
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L47
>>> >).
>>> It looks like a bug.
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Scalastyle improvements / large code reformatting

2014-10-13 Thread Patrick Wendell
Hey Nick,

I think the best solution is really to find a way to only apply
certain rules to code modified after a certain date. I also don't
think it would be that hard to implement because git can output
per-line information about modification times. So you'd just run the
scalastyle rules and then if you saw errors from rules with a special
"if modified since" property, we'd only fail the line has been
modified after that date. That would even work for imports as well,
you'd just have a thing where if anyone modified some imports they
would have to fix all the imports in that file. It's at least worth a
try.

Overall I think that's the only real solution here. Doing things
closer to releases reduces the overhead of backporting, but overall
it's still going to be a very high overhead.

- Patrick

On Mon, Oct 13, 2014 at 5:46 AM, Nicholas Chammas
 wrote:
> The arguments against large scale refactorings make sense. Doing them, if at
> all, during QA cycles or around releases sounds like a promising idea.
>
> Coupled with that, would it be useful to implement new rules outside of
> these potential windows for refactoring in such a way that they report on
> style errors without failing the build?
>
> That way they work like a nag and encourage developers to fix style problems
> in the course of working on their original patch. Coupled with a clear
> policy to fix style only where code is being changed anyway, this could be a
> helpful way to steadily fix problems related to new rules over time. Then,
> when we get close to the "finish line", we can make a final patch to fix the
> remaining issues for a given rule and enforce it as part of the build.
> Having the style report should also make it easier for committers to review
> style. We will have to do some work to show reports on new rules in a
> digestible way, as they will probably be very large at first, but I think
> that's a tractable problem.
>
> What do committers/reviewers think of that?
>
> As an aside, enforcing new style rules for new files only is an interesting
> idea, but you'd have to track that a file was added after the new rules were
> enforced. Otherwise bad style will be allowed after the initial checkin of
> that file. Also, enforcing style rules on changes may work in some (e.g.
> space required  before "{") but is impossible in other (import ordering)
> cases, as Reynold pointed out.
>
> Nick
>
>
> 2014년 10월 13일 월요일, Matei Zaharia님이 작성한 메시지:
>
>> I'm also against these huge reformattings. They slow down development and
>> backporting for trivial reasons. Let's not do that at this point, the style
>> of the current code is quite consistent and we have plenty of other things
>> to worry about. Instead, what you can do is as you edit a file when you're
>> working on a feature, fix up style issues you see. Or, as Josh suggested,
>> some way to make this apply only to new files would help.
>>
>> Matei
>>
>> On Oct 12, 2014, at 10:16 PM, Patrick Wendell  wrote:
>>
>> > Another big problem with these patches are that they make it almost
>> > impossible to backport changes to older branches cleanly (there
>> > becomes like 100% chance of a merge conflict).
>> >
>> > One proposal is to do this:
>> > 1. We only consider new style rules at the end of a release cycle,
>> > when there is the smallest chance of wanting to backport stuff.
>> > 2. We require that they are submitted in individual patches with a (a)
>> > new style rule and (b) the associated changes. Then we can also
>> > evaluate on a case-by-case basis how large the change is for each
>> > rule. For rules that require sweeping changes across the codebase,
>> > personally I'd vote against them. For rules like import ordering that
>> > won't cause too much pain on the diff (it's pretty easy to deal with
>> > those conflicts) I'd be okay with it.
>> >
>> > If we went with this, we'd also have to warn people that we might not
>> > accept new style rules if they are too costly to enforce. I'm guessing
>> > people will still contribute even with those expectations.
>> >
>> > - Patrick
>> >
>> > On Sun, Oct 12, 2014 at 9:39 PM, Reynold Xin 
>> > wrote:
>> >> I actually think we should just take the bite and follow through with
>> >> the
>> >> reformatting. Many rules are simply not possible to enforce only on
>> >> deltas
>> >> (e.g. import ordering).
>> >>
>> >> That said, maybe there are better windows to do this, e.g. 

Re: Scalastyle improvements / large code reformatting

2014-10-12 Thread Patrick Wendell
Another big problem with these patches are that they make it almost
impossible to backport changes to older branches cleanly (there
becomes like 100% chance of a merge conflict).

One proposal is to do this:
1. We only consider new style rules at the end of a release cycle,
when there is the smallest chance of wanting to backport stuff.
2. We require that they are submitted in individual patches with a (a)
new style rule and (b) the associated changes. Then we can also
evaluate on a case-by-case basis how large the change is for each
rule. For rules that require sweeping changes across the codebase,
personally I'd vote against them. For rules like import ordering that
won't cause too much pain on the diff (it's pretty easy to deal with
those conflicts) I'd be okay with it.

If we went with this, we'd also have to warn people that we might not
accept new style rules if they are too costly to enforce. I'm guessing
people will still contribute even with those expectations.

- Patrick

On Sun, Oct 12, 2014 at 9:39 PM, Reynold Xin  wrote:
> I actually think we should just take the bite and follow through with the
> reformatting. Many rules are simply not possible to enforce only on deltas
> (e.g. import ordering).
>
> That said, maybe there are better windows to do this, e.g. during the QA
> period.
>
> On Sun, Oct 12, 2014 at 9:37 PM, Josh Rosen  wrote:
>
>> There are a number of open pull requests that aim to extend Spark's
>> automated style checks (see
>> https://issues.apache.org/jira/browse/SPARK-3849 for an umbrella JIRA).
>> These fixes are mostly good, but I have some concerns about merging these
>> patches.  Several of these patches make large reformatting changes in
>> nearly every file of Spark, which makes it more difficult to use `git
>> blame` and has the potential to introduce merge conflicts with all open PRs
>> and all backport patches.
>>
>> I feel that most of the value of automated style-checks comes from
>> allowing reviewers/committers to focus on the technical content of pull
>> requests rather than their formatting.  My concern is that the convenience
>> added by these new style rules will not outweigh the other overheads that
>> these reformatting patches will create for the committers.
>>
>> If possible, it would be great if we could extend the style checker to
>> enforce the more stringent rules only for new code additions / deletions.
>> If not, I don't think that we should proceed with the reformatting.  Others
>> might disagree, though, so I welcome comments / discussion.
>>
>> - Josh

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Unneeded branches/tags

2014-10-07 Thread Patrick Wendell
Actually - weirdly - we can delete old tags and it works with the
mirroring. Nick if you put together a list of un-needed tags I can
delete them.

On Tue, Oct 7, 2014 at 6:27 PM, Reynold Xin  wrote:
> Those branches are no longer active. However, I don't think we can delete
> branches from github due to the way ASF mirroring works. I might be wrong
> there.
>
>
>
> On Tue, Oct 7, 2014 at 6:25 PM, Nicholas Chammas > wrote:
>
>> Just curious: Are there branches and/or tags on the repo that we don't need
>> anymore?
>>
>> What are the scala-2.9 and streaming branches for, for example? And do we
>> still need branches for older versions of Spark that we are not backporting
>> stuff to, like branch-0.5?
>>
>> Nick
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: EC2 clusters ready in launch time + 30 seconds

2014-10-03 Thread Patrick Wendell
Hey All,

Just a couple notes. I recently posted a shell script for creating the
AMI's from a clean Amazon Linux AMI.

https://github.com/mesos/spark-ec2/blob/v3/create_image.sh

I think I will update the AMI's soon to get the most recent security
updates. For spark-ec2's purpose this is probably sufficient (we'll
only need to re-create them every few months).

However, it would be cool if someone wanted to tackle providing a more
general mechanism for defining Spark-friendly "images" that can be
used more generally. I had thought that docker might be a good way to
go for something like this - but maybe this packer thing is good too.

For one thing, if we had a standard image we could use it to create
containers for running Spark's unit test, which would be really cool.
This would help a lot with random issues around port and filesystem
contention we have for unit tests.

I'm not sure if the long term place for this would be inside the spark
codebase or a community library or what. But it would definitely be
very valuable to have if someone wanted to take it on.

- Patrick

On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas
 wrote:
> FYI: There is an existing issue -- SPARK-3314
>  -- about scripting the
> creation of Spark AMIs.
>
> With Packer, it looks like we may be able to script the creation of
> multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from a
> single Packer template. That's very cool.
>
> I'll be looking into this.
>
> Nick
>
>
> On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas > wrote:
>
>> Thanks for the update, Nate. I'm looking forward to seeing how these
>> projects turn out.
>>
>> David, Packer looks very, very interesting. I'm gonna look into it more
>> next week.
>>
>> Nick
>>
>>
>> On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico  wrote:
>>
>>> Bit of progress on our end, bit of lagging as well.  Our guy leading
>>> effort got little bogged down on client project to update hive/sql testbed
>>> to latest spark/sparkSQL, also launching public service so we have been bit
>>> scattered recently.
>>>
>>> Will have some more updates probably after next week.  We are planning on
>>> taking our client work around hive/spark, plus taking over the bigtop
>>> automation work to modernize and get that fit for human consumption outside
>>> or org.  All our work and puppet modules will be open sourced, documented,
>>> hopefully start to rally some other folks around effort that find it useful
>>>
>>> Side note, another effort we are looking into is gradle tests/support.
>>> We have been leveraging serverspec for some basic infrastructure tests, but
>>> with bigtop switching over to gradle builds/testing setup in 0.8 we want to
>>> include support for that in our own efforts, probably some stuff that can
>>> be learned and leveraged in spark world for repeatable/tested infrastructure
>>>
>>> If anyone has any specific automation questions to your environment you
>>> can drop me a line directly.., will try to help out best I can.  Else will
>>> post update to dev list once we get on top of our own product release and
>>> the bigtop work
>>>
>>> Nate
>>>
>>>
>>> -Original Message-
>>> From: David Rowe [mailto:davidr...@gmail.com]
>>> Sent: Thursday, October 02, 2014 4:44 PM
>>> To: Nicholas Chammas
>>> Cc: dev; Shivaram Venkataraman
>>> Subject: Re: EC2 clusters ready in launch time + 30 seconds
>>>
>>> I think this is exactly what packer is for. See e.g.
>>> http://www.packer.io/intro/getting-started/build-image.html
>>>
>>> On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*) has
>>> a bad package for httpd, whcih causes ganglia not to start. For some reason
>>> I can't get access to the raw AMI to fix it.
>>>
>>> On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com
>>> > wrote:
>>>
>>> > Is there perhaps a way to define an AMI programmatically? Like, a
>>> > collection of base AMI id + list of required stuff to be installed +
>>> > list of required configuration changes. I'm guessing that's what
>>> > people use things like Puppet, Ansible, or maybe also AWS
>>> CloudFormation for, right?
>>> >
>>> > If we could do something like that, then with every new release of
>>> > Spark we could quickly and easily create new AMIs that have everything
>>> we need.
>>> > spark-ec2 would only have to bring up the instances and do a minimal
>>> > amount of configuration, and the only thing we'd need to track in the
>>> > Spark repo is the code that defines what goes on the AMI, as well as a
>>> > list of the AMI ids specific to each release.
>>> >
>>> > I'm just thinking out loud here. Does this make sense?
>>> >
>>> > Nate,
>>> >
>>> > Any progress on your end with this work?
>>> >
>>> > Nick
>>> >
>>> >
>>> > On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman <
>>> > shiva...@eecs.berkeley.edu> wrote:
>>> >
>>> > > It should be possible to improve cluster launch time if we are
>>> > > careful

Re: Extending Scala style checks

2014-10-01 Thread Patrick Wendell
Hey Nick,

We can always take built-in rules. Back when we added this Prashant
Sharma actually did some great work that lets us write our own style
rules in cases where rules don't exist.

You can see some existing rules here:
https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle

Prashant has over time contributed a lot of our custom rules upstream
to stalastyle, so now there are only a couple there.

- Patrick

On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu  wrote:
> Please take a look at WhitespaceEndOfLineChecker under:
> http://www.scalastyle.org/rules-0.1.0.html
>
> Cheers
>
> On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas > wrote:
>
>> As discussed here , it would be
>> good to extend our Scala style checks to programmatically enforce as many
>> of our style rules as possible.
>>
>> Does anyone know if it's relatively straightforward to enforce additional
>> rules like the "no trailing spaces" rule mentioned in the linked PR?
>>
>> Nick
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: do MIMA checking before all test cases start?

2014-09-25 Thread Patrick Wendell
Yeah we can also move it first. Wouldn't hurt.

On Thu, Sep 25, 2014 at 6:39 AM, Nicholas Chammas
 wrote:
> It might still make sense to make this change if MIMA checks are always
> relatively quick, for the same reason we do style checks first.
>
> On Thu, Sep 25, 2014 at 12:25 AM, Nan Zhu  wrote:
>>
>> yeah, I tried that, but there is always an issue when I ran dev/mima,
>>
>> it always gives me some binary compatibility error on Java API part
>>
>> so I have to wait for Jenkins' result when fixing MIMA issues
>>
>> --
>> Nan Zhu
>>
>>
>> On Thursday, September 25, 2014 at 12:04 AM, Patrick Wendell wrote:
>>
>> > Have you considered running the mima checks locally? We prefer people
>> > not use Jenkins for very frequent checks since it takes resources away
>> > from other people trying to run tests.
>> >
>> > On Wed, Sep 24, 2014 at 6:44 PM, Nan Zhu > > (mailto:zhunanmcg...@gmail.com)> wrote:
>> > > Hi, all
>> > >
>> > > It seems that, currently, Jenkins makes MIMA checking after all test
>> > > cases have finished, IIRC, during the first months we introduced MIMA, 
>> > > we do
>> > > the MIMA checking before running test cases
>> > >
>> > > What's the motivation to adjust this behaviour?
>> > >
>> > > In my opinion, if you have some binary compatibility issues, you just
>> > > need to do some minor changes, but in the current environment, you can 
>> > > only
>> > > get if your change works after all test cases finished (1 hour later...)
>> > >
>> > > Best,
>> > >
>> > > --
>> > > Nan Zhu
>> > >
>> >
>> >
>> >
>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: do MIMA checking before all test cases start?

2014-09-24 Thread Patrick Wendell
Have you considered running the mima checks locally? We prefer people
not use Jenkins for very frequent checks since it takes resources away
from other people trying to run tests.

On Wed, Sep 24, 2014 at 6:44 PM, Nan Zhu  wrote:
> Hi, all
>
> It seems that, currently, Jenkins makes MIMA checking after all test cases 
> have finished, IIRC, during the first months we introduced MIMA, we do the 
> MIMA checking before running test cases
>
> What's the motivation to adjust this behaviour?
>
> In my opinion, if you have some binary compatibility issues, you just need to 
> do some minor changes, but in the current environment, you can only get if 
> your change works after all test cases finished (1 hour later...)
>
> Best,
>
> --
> Nan Zhu
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: hash vs sort shuffle

2014-09-22 Thread Patrick Wendell
Hey Cody,

In terms of Spark 1.1.1 - we wouldn't change a default value in a spot
release. Changing this to default is slotted for 1.2.0:

https://issues.apache.org/jira/browse/SPARK-3280

- Patrick

On Mon, Sep 22, 2014 at 9:08 AM, Cody Koeninger  wrote:
> Unfortunately we were somewhat rushed to get things working again and did
> not keep the exact stacktraces, but one of the issues we saw was similar to
> that reported in
>
> https://issues.apache.org/jira/browse/SPARK-3032
>
> We also saw FAILED_TO_UNCOMPRESS errors from snappy when reading the
> shuffle file.
>
>
>
> On Mon, Sep 22, 2014 at 10:54 AM, Sandy Ryza 
> wrote:
>
>> Thanks for the heads up Cody.  Any indication of what was going wrong?
>>
>> On Mon, Sep 22, 2014 at 7:16 AM, Cody Koeninger 
>> wrote:
>>
>>> Just as a heads up, we deployed 471e6a3a of master (in order to get some
>>> sql fixes), and were seeing jobs fail until we set
>>>
>>> spark.shuffle.manager=HASH
>>>
>>> I'd be reluctant to change the default to sort for the 1.1.1 release
>>>
>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: BlockManager issues

2014-09-21 Thread Patrick Wendell
Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible
that you are just having more spilling as a result of the patch and so
the filesystem is opening more files. I would try increasing the
ulimit.

How much memory do your executors have?

- Patrick

On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell  wrote:
> Hey the numbers you mentioned don't quite line up - did you mean PR 2711?
>
> On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin  wrote:
>> It seems like you just need to raise the ulimit?
>>
>>
>> On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi  wrote:
>>
>>> Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the
>>> workloads. Tried tracing the problem through change set analysis. Looks
>>> like the offending commit is 4fde28c from Aug 4th for PR1707. Please see
>>> SPARK-3633 for more details.
>>>
>>> Thanks,
>>> Nishkam
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: BlockManager issues

2014-09-21 Thread Patrick Wendell
Hey the numbers you mentioned don't quite line up - did you mean PR 2711?

On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin  wrote:
> It seems like you just need to raise the ulimit?
>
>
> On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi  wrote:
>
>> Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the
>> workloads. Tried tracing the problem through change set analysis. Looks
>> like the offending commit is 4fde28c from Aug 4th for PR1707. Please see
>> SPARK-3633 for more details.
>>
>> Thanks,
>> Nishkam
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark spilling location

2014-09-18 Thread Patrick Wendell
Yes - I believe we use the local dirs for spilling as well.

On Thu, Sep 18, 2014 at 7:57 AM, Tom Hubregtsen  wrote:
> Hi all,
>
> Just one line of context, since last post mentioned this would help:
> I'm currently writing my masters thesis (Computer Engineering) on storage
> and memory in both Spark and Hadoop.
>
> Right now I'm trying to analyze the spilling behavior of Spark, and I do not
> see what I expect. Therefor, I want to be sure that I am looking at the
> correct location.
>
> If I set spark.local.dir and SPARK_LOCAL_DIRS to, for instance, ~/temp
> instead of /tmp. Will this be the location where all data will be spilled
> to? I assume it is, based on the description of spark.local.dir at
> https://spark.apache.org/docs/latest/configuration.html:
> "Directory to use for "scratch" space in Spark, including map output files
> and RDDs that get stored on disk."
>
> Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-spilling-location-tp8471.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: greeting from new member and jira 3489

2014-09-16 Thread Patrick Wendell
Hi Mohit,

Welcome to the Spark community! We normally look at feature proposals
using github pull requests mind submitting one? The contribution
process is covered here:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

On Tue, Sep 16, 2014 at 9:16 PM, Mohit Jaggi  wrote:
> https://issues.apache.org/jira/browse/SPARK-3489
>
> Folks,
> I am Mohit Jaggi and I work for Ayasdi Inc. After experimenting with Spark
> for  a while and discovering its awesomeness(!) I made an attempt to
> provide a wrapper API that looks like R and/or pandas dataframe.
>
> https://github.com/AyasdiOpenSource/df
>
> "df" uses a collection of RDDs, each element in the collection being a
> column in a dataframe. To make rows from the columns I used zip() in a loop
> but that is not very efficient. I created JIRA 3489 requesting a zip()
> variant that zips a sequence of RDDs. I noticed that it was easy to write
> that code so I wrote that code and it seems to work. I attached the diff to
> the jira. I believe that this API would be useful in general and is not
> specific to "df". Please take a look at the request and the proposed
> solution and let me know what you think.
>
> Cheers,
> Mohit

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Wiki page for Operations/Monitoring tools?

2014-09-16 Thread Patrick Wendell
Hey Otis,

Could you describe a bit more about what your program is. Is it an
open source project? A product? This would help understand a bit where
it should go.

- Patrick

On Mon, Sep 15, 2014 at 6:49 PM, Otis Gospodnetic
 wrote:
> Hi,
>
> I'm looking for a suitable place on the Wiki to add some info about a Spark
> monitoring we've built.  The Wiki looks nice and orderly, so I didn't want
> to go in and mess it up without asking where to put such info.  I don't see
> an existing "Operations" or "Monitoring" or similar pages.  Should I just
> create a Child page under https://cwiki.apache.org/confluence/display/SPARK
> ?
>
> Thanks,
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Tests and Test Infrastructure

2014-09-13 Thread Patrick Wendell
Hey All,

Wanted to send a quick update about test infrastructure. With the
number of contributors we have and the rate of development,
maintaining a well-oiled test infra is really important.

Every time a flaky test fails a legitimate pull request, it wastes
developer time and effort.

1. Master build: Spark's master builds are back to green again in
Maven and SBT after a long time of instability. Big thanks to Josh
Rosen, Andrew Or, Nick Chammas, Shane Knapp, Sean Owen, and many
others who were involved in pinpointing and fixing fairly convoluted
test failure issues.

2. Jenkins PRB: The Jenkins Pull Request Builder is mostly functioning
again. However, we are working on a simpler technical pipeline for
testing patches, as this plug-in has been a constant source of
downtime and issues for us, and is very hard to debug.

3. Reverting flaky patches: Going forward - we may revert patches that
seem to be the root cause of flaky or failing tests. This is necessary
as these days, the test infra being down will block something like
10-30 in-flight patches on a given day. This puts the onus back on the
test writer to try and figure out what's going on - we'll of course
help debug the issue!

4. Time of tests: With hundreds (thousands?) of tests, we will have a
very high bar for tests which take several seconds or longer. Things
like Thread.sleep() bloat test time when proper synchronization
mechanisms should be used. Expect reviewers to push back on any
long-running tests, in many cases they can be re-written to be both
shorter and better.

Thanks again to everyone putting in effort on this, we've made a ton
of progress in the last few weeks. A solid test infra will help us
scale and move quickly as Spark development continues to accelerate.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Adding abstraction in MLlib

2014-09-12 Thread Patrick Wendell
We typically post design docs on JIRA's before major work starts. For
instance, pretty sure SPARk-1856 will have a design doc posted
shortly.

On Fri, Sep 12, 2014 at 12:10 PM, Erik Erlandson  wrote:
>
> Are interface designs being captured anywhere as documents that the community 
> can follow along with as the proposals evolve?
>
> I've worked on other open source projects where design docs were published as 
> "living documents" (e.g. on google docs, or etherpad, but the particular 
> mechanism isn't crucial).   FWIW, I found that to be a good way to work in a 
> community environment.
>
>
> - Original Message -
>> Hi Egor,
>>
>> Thanks for the feedback! We are aware of some of the issues you
>> mentioned and there are JIRAs created for them. Specifically, I'm
>> pushing out the design on pipeline features and algorithm/model
>> parameters this week. We can move our discussion to
>> https://issues.apache.org/jira/browse/SPARK-1856 .
>>
>> It would be nice to make tests against interfaces. But it definitely
>> needs more discussion before making PRs. For example, we discussed the
>> learning interfaces in Christoph's PR
>> (https://github.com/apache/spark/pull/2137/) but it takes time to
>> reach a consensus, especially on interfaces. Hopefully all of us could
>> benefit from the discussion. The best practice is to break down the
>> proposal into small independent piece and discuss them on the JIRA
>> before submitting PRs.
>>
>> For performance tests, there is a spark-perf package
>> (https://github.com/databricks/spark-perf) and we added performance
>> tests for MLlib in v1.1. But definitely more work needs to be done.
>>
>> The dev-list may not be a good place for discussion on the design,
>> could you create JIRAs for each of the issues you pointed out, and we
>> track the discussion on JIRA? Thanks!
>>
>> Best,
>> Xiangrui
>>
>> On Fri, Sep 12, 2014 at 10:45 AM, Reynold Xin  wrote:
>> > Xiangrui can comment more, but I believe Joseph and him are actually
>> > working on standardize interface and pipeline feature for 1.2 release.
>> >
>> > On Fri, Sep 12, 2014 at 8:20 AM, Egor Pahomov 
>> > wrote:
>> >
>> >> Some architect suggestions on this matter -
>> >> https://github.com/apache/spark/pull/2371
>> >>
>> >> 2014-09-12 16:38 GMT+04:00 Egor Pahomov :
>> >>
>> >> > Sorry, I misswrote  - I meant learners part of framework - models
>> >> > already
>> >> > exists.
>> >> >
>> >> > 2014-09-12 15:53 GMT+04:00 Christoph Sawade <
>> >> > christoph.saw...@googlemail.com>:
>> >> >
>> >> >> I totally agree, and we discovered also some drawbacks with the
>> >> >> classification models implementation that are based on GLMs:
>> >> >>
>> >> >> - There is no distinction between predicting scores, classes, and
>> >> >> calibrated scores (probabilities). For these models it is common to
>> >> >> have
>> >> >> access to all of them and the prediction function ``predict``should be
>> >> >> consistent and stateless. Currently, the score is only available after
>> >> >> removing the threshold from the model.
>> >> >> - There is no distinction between multinomial and binomial
>> >> >> classification. For multinomial problems, it is necessary to handle
>> >> >> multiple weight vectors and multiple confidences.
>> >> >> - Models are not serialisable, which makes it hard to use them in
>> >> >> practise.
>> >> >>
>> >> >> I started a pull request [1] some time ago. I would be happy to
>> >> >> continue
>> >> >> the discussion and clarify the interfaces, too!
>> >> >>
>> >> >> Cheers, Christoph
>> >> >>
>> >> >> [1] https://github.com/apache/spark/pull/2137/
>> >> >>
>> >> >> 2014-09-12 11:11 GMT+02:00 Egor Pahomov :
>> >> >>
>> >> >>> Here in Yandex, during implementation of gradient boosting in spark
>> >> >>> and
>> >> >>> creating our ML tool for internal use, we found next serious problems
>> >> in
>> >> >>> MLLib:
>> >> >>>
>> >> >>>
>> >> >>>- There is no Regression/Classification model abstraction. We were
>> >> >>>building abstract data processing pipelines, which should work just
>> >> >>> with
>> >> >>>some regression - exact algorithm specified outside this code.
>> >> >>>There
>> >> >>> is no
>> >> >>>abstraction, which will allow me to do that. *(It's main reason for
>> >> >>> all
>> >> >>>further problems) *
>> >> >>>- There is no common practice among MLlib for testing algorithms:
>> >> >>> every
>> >> >>>model generates it's own random test data. There is no easy
>> >> >>> extractable
>> >> >>>test cases applible to another algorithm. There is no benchmarks
>> >> >>>for
>> >> >>>comparing algorithms. After implementing new algorithm it's very
>> >> hard
>> >> >>> to
>> >> >>>understand how it should be tested.
>> >> >>>- Lack of serialization testing: MLlib algorithms don't contain
>> >> tests
>> >> >>>which test that model work after serialization.
>> >> >>>- During implementation of new algorithm it's hard to understand
>> >> what
>> >> >>> 

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Patrick Wendell
[moving to user@]

This would typically be accomplished with a union() operation. You
can't mutate an RDD in-place, but you can create a new RDD with a
union() which is an inexpensive operator.

On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur
 wrote:
> Hi,
>
> We have a use case where we are planning to keep sparkcontext alive in a
> server and run queries on it. But the issue is we have  a continuous
> flowing data the comes in batches of constant duration(say, 1hour). Now we
> want to exploit the schemaRDD and its benefits of columnar caching and
> compression. Is there a way I can append the new batch (uncached) to the
> older(cached) batch without losing the older data from cache and caching
> the whole dataset.
>
> Thanks and Regards,
>
>
> Archit Thakur.
> Sr Software Developer,
> Guavus, Inc.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Announcing Spark 1.1.0!

2014-09-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is
the second release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 171 developers!

This release brings operational and performance improvements in Spark
core including a new implementation of the Spark shuffle designed for
very large scale workloads. Spark 1.1 adds significant extensions to
the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a
JDBC server, byte code generation for fast expression evaluation, a
public types API, JSON support, and other features and optimizations.
MLlib introduces a new statistics library along with several new
algorithms and optimizations. Spark 1.1 also builds out Spark's Python
support and adds new components to the Spark Streaming module.

Visit the release notes [1] to read about the new features, or
download [2] the release today.

[1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html
[2] http://spark.eu.apache.org/downloads.html

NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL HOURS.

Please e-mail me directly for any type-o's in the release notes or name listing.

Thanks, and congratulations!
- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [RESULT] [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-10 Thread Patrick Wendell
Hey just a heads up to everyone - running a bit behind on getting the
final artifacts and notes up. Finalizing this release was much more
complicated than previous ones due to new binary formats (we need to
redesign the download page a bit for this to work) and the large
increase in contributor count. Next time we can pipeline this work to
avoid a delay.

I did cut the v1.1.0 tag today. We should be able to do the full
announce tomorrow.

Thanks,
Patrick

On Sun, Sep 7, 2014 at 5:50 PM, Patrick Wendell  wrote:
> This vote passes with 8 binding +1 votes and no -1 votes. I'll post
> the final release in the next 48 hours... just finishing the release
> notes and packaging (which now takes a long time given the number of
> contributors!).
>
> +1:
> Reynold Xin*
> Michael Armbrust*
> Xiangrui Meng*
> Andrew Or*
> Sean Owen
> Matthew Farrellee
> Marcelo Vanzin
> Josh Rosen*
> Cheng Lian
> Mubarak Seyed
> Matei Zaharia*
> Nan Zhu
> Jeremy Freeman
> Denny Lee
> Tom Graves*
> Henry Saputra
> Egor Pahomov
> Rohit Sinha
> Kan Zhang
> Tathagata Das*
> Reza Zadeh
>
> -1:
>
> 0:
>
> * = binding

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Patrick Wendell
I think what Michael means is people often use this to read existing
partitioned Parquet tables that are defined in a Hive metastore rather
than data generated directly from within Spark and then reading it
back as a table. I'd expect the latter case to become more common, but
for now most users connect to an existing metastore.

I think you could go this route by creating a partitioned external
table based on the on-disk layout you create. The downside is that
you'd have to go through a hive metastore whereas what you are doing
now doesn't need hive at all.

We should also just fix the case you are mentioning where a union is
used directly from within spark. But that's the context.

- Patrick

On Tue, Sep 9, 2014 at 12:01 PM, Cody Koeninger  wrote:
> Maybe I'm missing something, I thought parquet was generally a write-once
> format and the sqlContext interface to it seems that way as well.
>
> d1.saveAsParquetFile("/foo/d1")
>
> // another day, another table, with same schema
> d2.saveAsParquetFile("/foo/d2")
>
> Will give a directory structure like
>
> /foo/d1/_metadata
> /foo/d1/part-r-1.parquet
> /foo/d1/part-r-2.parquet
> /foo/d1/_SUCCESS
>
> /foo/d2/_metadata
> /foo/d2/part-r-1.parquet
> /foo/d2/part-r-2.parquet
> /foo/d2/_SUCCESS
>
> // ParquetFileReader will fail, because /foo/d1 is a directory, not a
> parquet partition
> sqlContext.parquetFile("/foo")
>
> // works, but has the noted lack of pushdown
> sqlContext.parquetFile("/foo/d1").unionAll(sqlContext.parquetFile("/foo/d2"))
>
>
> Is there another alternative?
>
>
>
> On Tue, Sep 9, 2014 at 1:29 PM, Michael Armbrust 
> wrote:
>
>> I think usually people add these directories as multiple partitions of the
>> same table instead of union.  This actually allows us to efficiently prune
>> directories when reading in addition to standard column pruning.
>>
>> On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf 
>> wrote:
>>
>>> I'm kind of surprised this was not run into before.  Do people not
>>> segregate their data by day/week in the HDFS directory structure?
>>>
>>>
>>> On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust 
>>> wrote:
>>>
 Thanks!

 On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger 
 wrote:

 > Opened
 >
 > https://issues.apache.org/jira/browse/SPARK-3462
 >
 > I'll take a look at ColumnPruning and see what I can do
 >
 > On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust <
 mich...@databricks.com>
 > wrote:
 >
 >> On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger 
 >> wrote:
 >>>
 >>> Is there a reason in general not to push projections and predicates
 down
 >>> into the individual ParquetTableScans in a union?
 >>>
 >>
 >> This would be a great case to add to ColumnPruning.  Would be awesome
 if
 >> you could open a JIRA or even a PR :)
 >>
 >
 >

>>>
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RFC: Deprecating YARN-alpha API's

2014-09-09 Thread Patrick Wendell
Hi Everyone,

This is a call to the community for comments on SPARK-3445 [1]. In a
nutshell, we are trying to figure out timelines for deprecation of the
YARN-alpha API's as Yahoo is now moving off of them. It's helpful for
us to have a sense of whether anyone else uses these.

Please comment on the JIRA if you have feeback, thanks!

[1] https://issues.apache.org/jira/browse/SPARK-3445

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-07 Thread Patrick Wendell
This vote passes with 8 binding +1 votes and no -1 votes. I'll post
the final release in the next 48 hours... just finishing the release
notes and packaging (which now takes a long time given the number of
contributors!).

+1:
Reynold Xin*
Michael Armbrust*
Xiangrui Meng*
Andrew Or*
Sean Owen
Matthew Farrellee
Marcelo Vanzin
Josh Rosen*
Cheng Lian
Mubarak Seyed
Matei Zaharia*
Nan Zhu
Jeremy Freeman
Denny Lee
Tom Graves*
Henry Saputra
Egor Pahomov
Rohit Sinha
Kan Zhang
Tathagata Das*
Reza Zadeh

-1:

0:

* = binding

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread Patrick Wendell
Hey There,

I believe this is on the roadmap for the 1.2 next release. But
Xiangrui can comment on this.

- Patrick

On Fri, Sep 5, 2014 at 9:18 AM, Yu Ishikawa
 wrote:
> Hi Evan,
>
> That's sounds interesting.
>
> Here is the ticket which I created.
> https://issues.apache.org/jira/browse/SPARK-3416
>
> thanks,
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-tp8291p8296.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: amplab jenkins is down

2014-09-04 Thread Patrick Wendell
Hm yeah it seems that it hasn't been polling since 3:45.

On Thu, Sep 4, 2014 at 4:21 PM, Nicholas Chammas
 wrote:
> It appears that our main man is having trouble hearing new requests.
>
> Do we need some smelling salts?
>
>
> On Thu, Sep 4, 2014 at 5:49 PM, shane knapp  wrote:
>>
>> i'd ping the Jenkinsmench...  the master was completely offline, so any
>> new
>> jobs wouldn't have reached it.  any jobs that were queued when power was
>> lost probably started up, but jobs that were running would fail.
>>
>>
>> On Thu, Sep 4, 2014 at 2:45 PM, Nicholas Chammas
>> > > wrote:
>>
>> > Woohoo! Thanks Shane.
>> >
>> > Do you know if queued PR builds will automatically be picked up? Or do
>> > we
>> > have to ping the Jenkinmensch manually from each PR?
>> >
>> > Nick
>> >
>> >
>> > On Thu, Sep 4, 2014 at 5:37 PM, shane knapp  wrote:
>> >
>> >> AND WE'RE UP!
>> >>
>> >> sorry that this took so long...  i'll send out a more detailed
>> >> explanation
>> >> of what happened soon.
>> >>
>> >> now, off to back up jenkins.
>> >>
>> >> shane
>> >>
>> >>
>> >> On Thu, Sep 4, 2014 at 1:27 PM, shane knapp 
>> >> wrote:
>> >>
>> >> > it's a faulty power switch on the firewall, which has been swapped
>> >> > out.
>> >> >  we're about to reboot and be good to go.
>> >> >
>> >> >
>> >> > On Thu, Sep 4, 2014 at 1:19 PM, shane knapp 
>> >> wrote:
>> >> >
>> >> >> looks like some hardware failed, and we're swapping in a
>> >> >> replacement.
>> >> i
>> >> >> don't have more specific information yet -- including *what* failed,
>> >> as our
>> >> >> sysadmin is super busy ATM.  the root cause was an incorrect circuit
>> >> being
>> >> >> switched off during building maintenance.
>> >> >>
>> >> >> on a side note, this incident will be accelerating our plan to move
>> >> >> the
>> >> >> entire jenkins infrastructure in to a managed datacenter
>> >> >> environment.
>> >> this
>> >> >> will be our major push over the next couple of weeks.  more details
>> >> about
>> >> >> this, also, as soon as i get them.
>> >> >>
>> >> >> i'm very sorry about the downtime, we'll get everything up and
>> >> >> running
>> >> >> ASAP.
>> >> >>
>> >> >>
>> >> >> On Thu, Sep 4, 2014 at 12:27 PM, shane knapp 
>> >> wrote:
>> >> >>
>> >> >>> looks like a power outage in soda hall.  more updates as they
>> >> >>> happen.
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 4, 2014 at 12:25 PM, shane knapp 
>> >> >>> wrote:
>> >> >>>
>> >>  i am trying to get things up and running, but it looks like either
>> >> the
>> >>  firewall gateway or jenkins server itself is down.  i'll update as
>> >> soon as
>> >>  i know more.
>> >> 
>> >> >>>
>> >> >>>
>> >> >>
>> >> >
>> >>
>> >
>> >  --
>> > You received this message because you are subscribed to the Google
>> > Groups
>> > "amp-infra" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> > an
>> > email to amp-infra+unsubscr...@googlegroups.com.
>> > For more options, visit https://groups.google.com/d/optout.
>> >
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "amp-infra" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to amp-infra+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: memory size for caching RDD

2014-09-03 Thread Patrick Wendell
Changing this is not supported, it si immutable similar to other spark
configuration settings.

On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷  wrote:
> Dear all:
>
> Spark uses memory to cache RDD and the memory size is specified by
> "spark.storage.memoryFraction".
>
> One the Executor starts, does Spark support adjusting/resizing memory size
> of this part dynamically?
>
> Thanks.
>
> --
> *Regards,*
> *Zhaojie*

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Patrick Wendell
Hey Nick,

Yeah we'll put those in the release notes.

On Wed, Sep 3, 2014 at 7:23 AM, Nicholas Chammas
 wrote:
> On Wed, Sep 3, 2014 at 3:24 AM, Patrick Wendell  wrote:
>>
>> == What default changes should I be aware of? ==
>> 1. The default value of "spark.io.compression.codec" is now "snappy"
>> --> Old behavior can be restored by switching to "lzf"
>>
>> 2. PySpark now performs external spilling during aggregations.
>> --> Old behavior can be restored by setting "spark.shuffle.spill" to
>> "false".
>>
>> 3. PySpark uses a new heuristic for determining the parallelism of
>> shuffle operations.
>> --> Old behavior can be restored by setting
>> "spark.default.parallelism" to the number of cores in the cluster.
>
>
> Will these changes be called out in the release notes or somewhere in the
> docs?
>
> That last one (which I believe is what we discovered as the result of
> SPARK-) could have a large impact on PySpark users.
>
> Nick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Patrick Wendell
I'll kick it off with a +1

On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.1.0!
>
> The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.1.0-rc4/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1031/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/
>
> Please vote on releasing this package as Apache Spark 1.1.0!
>
> The vote is open until Saturday, September 06, at 08:30 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.1.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == Regressions fixed since RC3 ==
> SPARK-3332 - Issue with tagging in EC2 scripts
> SPARK-3358 - Issue with regression for m3.XX instances
>
> == What justifies a -1 vote for this release? ==
> This vote is happening very late into the QA period compared with
> previous votes, so -1 votes should only occur for significant
> regressions from 1.0.2. Bugs already present in 1.0.X will not block
> this release.
>
> == What default changes should I be aware of? ==
> 1. The default value of "spark.io.compression.codec" is now "snappy"
> --> Old behavior can be restored by switching to "lzf"
>
> 2. PySpark now performs external spilling during aggregations.
> --> Old behavior can be restored by setting "spark.shuffle.spill" to "false".
>
> 3. PySpark uses a new heuristic for determining the parallelism of
> shuffle operations.
> --> Old behavior can be restored by setting
> "spark.default.parallelism" to the number of cores in the cluster.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc4/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1031/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/

Please vote on releasing this package as Apache Spark 1.1.0!

The vote is open until Saturday, September 06, at 08:30 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== Regressions fixed since RC3 ==
SPARK-3332 - Issue with tagging in EC2 scripts
SPARK-3358 - Issue with regression for m3.XX instances

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.0.X will not block
this release.

== What default changes should I be aware of? ==
1. The default value of "spark.io.compression.codec" is now "snappy"
--> Old behavior can be restored by switching to "lzf"

2. PySpark now performs external spilling during aggregations.
--> Old behavior can be restored by setting "spark.shuffle.spill" to "false".

3. PySpark uses a new heuristic for determining the parallelism of
shuffle operations.
--> Old behavior can be restored by setting
"spark.default.parallelism" to the number of cores in the cluster.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-03 Thread Patrick Wendell
I'm cancelling this release in favor of RC4. Happy voting!

On Tue, Sep 2, 2014 at 9:55 PM, Patrick Wendell  wrote:
> Thanks everyone for voting on this. There were two minor issues (one a
> blocker) were found that warrant cutting a new RC. For those who voted
> +1 on this release, I'd encourage you to +1 rc4 when it comes out
> unless you have been testing issues specific to the EC2 scripts. This
> will move the release process along.
>
> SPARK-3332 - Issue with tagging in EC2 scripts
> SPARK-3358 - Issue with regression for m3.XX instances
>
> - Patrick
>
> On Tue, Sep 2, 2014 at 6:55 PM, Nicholas Chammas
>  wrote:
>> In light of the discussion on SPARK-, I'll revoke my "-1" vote. The
>> issue does not appear to be serious.
>>
>>
>> On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas
>>  wrote:
>>>
>>> -1: I believe I've found a regression from 1.0.2. The report is captured
>>> in SPARK-.
>>>
>>>
>>> On Sat, Aug 30, 2014 at 6:07 PM, Patrick Wendell 
>>> wrote:
>>>>
>>>> Please vote on releasing the following candidate as Apache Spark version
>>>> 1.1.0!
>>>>
>>>> The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
>>>>
>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-1.1.0-rc3/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1030/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
>>>>
>>>> Please vote on releasing this package as Apache Spark 1.1.0!
>>>>
>>>> The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
>>>> a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 1.1.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see
>>>> http://spark.apache.org/
>>>>
>>>> == Regressions fixed since RC1 ==
>>>> - Build issue for SQL support:
>>>> https://issues.apache.org/jira/browse/SPARK-3234
>>>> - EC2 script version bump to 1.1.0.
>>>>
>>>> == What justifies a -1 vote for this release? ==
>>>> This vote is happening very late into the QA period compared with
>>>> previous votes, so -1 votes should only occur for significant
>>>> regressions from 1.0.2. Bugs already present in 1.0.X will not block
>>>> this release.
>>>>
>>>> == What default changes should I be aware of? ==
>>>> 1. The default value of "spark.io.compression.codec" is now "snappy"
>>>> --> Old behavior can be restored by switching to "lzf"
>>>>
>>>> 2. PySpark now performs external spilling during aggregations.
>>>> --> Old behavior can be restored by setting "spark.shuffle.spill" to
>>>> "false".
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>
>>>
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-02 Thread Patrick Wendell
Thanks everyone for voting on this. There were two minor issues (one a
blocker) were found that warrant cutting a new RC. For those who voted
+1 on this release, I'd encourage you to +1 rc4 when it comes out
unless you have been testing issues specific to the EC2 scripts. This
will move the release process along.

SPARK-3332 - Issue with tagging in EC2 scripts
SPARK-3358 - Issue with regression for m3.XX instances

- Patrick

On Tue, Sep 2, 2014 at 6:55 PM, Nicholas Chammas
 wrote:
> In light of the discussion on SPARK-, I'll revoke my "-1" vote. The
> issue does not appear to be serious.
>
>
> On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas
>  wrote:
>>
>> -1: I believe I've found a regression from 1.0.2. The report is captured
>> in SPARK-3333.
>>
>>
>> On Sat, Aug 30, 2014 at 6:07 PM, Patrick Wendell 
>> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.1.0!
>>>
>>> The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
>>>
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.1.0-rc3/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1030/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/
>>>
>>> Please vote on releasing this package as Apache Spark 1.1.0!
>>>
>>> The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
>>> a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.1.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> == Regressions fixed since RC1 ==
>>> - Build issue for SQL support:
>>> https://issues.apache.org/jira/browse/SPARK-3234
>>> - EC2 script version bump to 1.1.0.
>>>
>>> == What justifies a -1 vote for this release? ==
>>> This vote is happening very late into the QA period compared with
>>> previous votes, so -1 votes should only occur for significant
>>> regressions from 1.0.2. Bugs already present in 1.0.X will not block
>>> this release.
>>>
>>> == What default changes should I be aware of? ==
>>> 1. The default value of "spark.io.compression.codec" is now "snappy"
>>> --> Old behavior can be restored by switching to "lzf"
>>>
>>> 2. PySpark now performs external spilling during aggregations.
>>> --> Old behavior can be restored by setting "spark.shuffle.spill" to
>>> "false".
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab

2014-09-02 Thread Patrick Wendell
Hey Shane,

Thanks for your work so far and I'm really happy to see investment in
this infrastructure. This is a key productivity tool for us and
something we'd love to expand over time to improve the development
process of Spark.

- Patrick

On Tue, Sep 2, 2014 at 10:47 AM, Nicholas Chammas
 wrote:
> Hi Shane!
>
> Thank you for doing the Jenkins upgrade last week. It's nice to know that
> infrastructure is gonna get some dedicated TLC going forward.
>
> Welcome aboard!
>
> Nick
>
>
> On Tue, Sep 2, 2014 at 1:35 PM, shane knapp  wrote:
>
>> so, i had a meeting w/the databricks guys on friday and they recommended i
>> send an email out to the list to say 'hi' and give you guys a quick intro.
>>  :)
>>
>> hi!  i'm shane knapp, the new AMPLab devops engineer, and will be spending
>> time getting the jenkins build infrastructure up to production quality.
>>  much of this will be 'under the covers' work, like better system level
>> auth, backups, etc, but some will definitely be user facing:  timely
>> jenkins updates, debugging broken build infrastructure and some plugin
>> support.
>>
>> i've been working in the bay area now since 1997 at many different
>> companies, and my last 10 years has been split between google and palantir.
>>  i'm a huge proponent of OSS, and am really happy to be able to help with
>> the work you guys are doing!
>>
>> if anyone has any requests/questions/comments, feel free to drop me a line!
>>
>> shane
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Run the "Big Data Benchmark" for new releases

2014-09-01 Thread Patrick Wendell
Yeah, this wasn't detected in our performance tests. We even have a
test in PySpark that I would have though might catch this (it just
schedules a bunch of really small tasks, similar to the regression
case).

https://github.com/databricks/spark-perf/blob/master/pyspark-tests/tests.py#L51

Anyways, Josh is trying to repro the regression to see if we can
figure out what is going on. If we find something for sure we should
add a test.

On Mon, Sep 1, 2014 at 10:04 PM, Matei Zaharia  wrote:
> Nope, actually, they didn't find that (they found some other things that were 
> fixed, as well as some improvements). Feel free to send a PR, but it would be 
> good to profile the issue first to understand what slowed down. (For example 
> is the map phase taking longer or is it the reduce phase, is there some 
> difference in lengths of specific tasks, etc).
>
> Matei
>
> On September 1, 2014 at 10:03:20 PM, Nicholas Chammas 
> (nicholas.cham...@gmail.com) wrote:
>
> Oh, that's sweet. So, a related question then.
>
> Did those tests pick up the performance issue reported in SPARK-? Does it 
> make sense to add a new test to cover that case?
>
>
> On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia  
> wrote:
> Hi Nicholas,
>
> At Databricks we already run https://github.com/databricks/spark-perf for 
> each release, which is a more comprehensive performance test suite.
>
> Matei
>
> On September 1, 2014 at 8:22:05 PM, Nicholas Chammas 
> (nicholas.cham...@gmail.com) wrote:
>
> What do people think of running the Big Data Benchmark
>  (repo
> ) as part of preparing every new
> release of Spark?
>
> We'd run it just for Spark and effectively use it as another type of test
> to track any performance progress or regressions from release to release.
>
> Would doing such a thing be valuable? Do we already have a way of
> benchmarking Spark performance that we use regularly?
>
> Nick
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-08-31 Thread Patrick Wendell
For my part I'm +1 on this, though Sean it would be great separately
to fix the test environment.

For those who voted on rc2, this is almost identical, so feel free to
+1 unless you think there are issues with the two minor bug fixes.

On Sun, Aug 31, 2014 at 10:18 AM, Sean Owen  wrote:
> Fantastic. As it happens, I just fixed up Mahout's tests for Java 8
> and observed a lot of the same type of failure.
>
> I'm about to submit PRs for the two issues I identified. AFAICT these
> 3 then cover the failures I mentioned:
>
> https://issues.apache.org/jira/browse/SPARK-3329
> https://issues.apache.org/jira/browse/SPARK-3330
> https://issues.apache.org/jira/browse/SPARK-3331
>
> I'd argue that none necessarily block a release, since they just
> represent a problem with test-only code in Java 8, with the test-only
> context of Jenkins and multiple profiles, and with a trivial
> configuration in a style check for Python. Should be fixed but none
> indicate a bug in the release.
>
> On Sun, Aug 31, 2014 at 6:11 PM, Will Benton  wrote:
>> - Original Message -
>>
>>> dev/run-tests fails two tests (1 Hive, 1 Kafka Streaming) for me
>>> locally on 1.1.0-rc3. Does anyone else see that? It may be my env.
>>> Although I still see the Hive failure on Debian too:
>>>
>>> [info] - SET commands semantics for a HiveContext *** FAILED ***
>>> [info]   Expected Array("spark.sql.key.usedfortestonly=test.val.0",
>>> "spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0"),
>>> but got
>>> Array("spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0",
>>> "spark.sql.key.usedfortestonly=test.val.0") (HiveQuerySuite.scala:541)
>>
>> I've seen this error before.  (In particular, I've seen it on my OS X 
>> machine using Oracle JDK 8 but not on Fedora using OpenJDK.)  I've also seen 
>> similar errors in topic branches (but not on master) that seem to indicate 
>> that tests depend on sets of pairs arriving from Hive in a particular order; 
>> it seems that this isn't a safe assumption.
>>
>> I just submitted a (trivial) PR to fix this spurious failure:  
>> https://github.com/apache/spark/pull/2220
>>
>>
>> best,
>> wb

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.1.0 (RC3)

2014-08-30 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1030/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/

Please vote on releasing this package as Apache Spark 1.1.0!

The vote is open until Tuesday, September 02, at 23:07 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== Regressions fixed since RC1 ==
- Build issue for SQL support: https://issues.apache.org/jira/browse/SPARK-3234
- EC2 script version bump to 1.1.0.

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.0.X will not block
this release.

== What default changes should I be aware of? ==
1. The default value of "spark.io.compression.codec" is now "snappy"
--> Old behavior can be restored by switching to "lzf"

2. PySpark now performs external spilling during aggregations.
--> Old behavior can be restored by setting "spark.shuffle.spill" to "false".

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-30 Thread Patrick Wendell
Thanks to Nick Chammas and Cheng Lian who pointed out two issues with
the release candidate. I'll cancel this in favor of RC3.

On Fri, Aug 29, 2014 at 1:33 PM, Jeremy Freeman
 wrote:
> +1. Validated several custom analysis pipelines on a private cluster in
> standalone mode. Tested new PySpark support for arbitrary Hadoop input
> formats, works great!
>
> -- Jeremy
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC2-tp8107p8143.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Compie error with XML elements

2014-08-29 Thread Patrick Wendell
In some cases IntelliJ's Scala compiler can't compile valid Scala
source files. Hopefully they fix (or have fixed) this in a newer
version.

- Patrick

On Fri, Aug 29, 2014 at 11:38 AM, Yi Tian  wrote:
> Hi, Devl!
>
> I got the same problem.
>
> You can try to upgrade your scala plugins to  0.41.2
>
> It works on my mac.
>
> On Aug 12, 2014, at 15:19, Devl Devel  wrote:
>
>> When compiling the master checkout of spark. The Intellij compile fails
>> with:
>>
>>Error:(45, 8) not found: value $scope
>>  
>>   ^
>> which is caused by HTML elements in classes like HistoryPage.scala:
>>
>>val content =
>>  
>>...
>>
>> How can I compile these classes that have html node elements in them?
>>
>> Thanks in advance.
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Patrick Wendell
Oh darn - I missed this update. GRR, unfortunately I think this means
I'll need to cut a new RC. Thanks for catching this Nick.

On Fri, Aug 29, 2014 at 10:18 AM, Nicholas Chammas
 wrote:
> [Let me know if I should be posting these comments in a different thread.]
>
> Should the default Spark version in spark-ec2 be updated for this release?
>
> Nick
>
>
>
> On Fri, Aug 29, 2014 at 12:55 PM, Patrick Wendell 
> wrote:
>>
>> Hey Nicholas,
>>
>> Thanks for this, we can merge in doc changes outside of the actual
>> release timeline, so we'll make sure to loop those changes in before
>> we publish the final 1.1 docs.
>>
>> - Patrick
>>
>> On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas
>>  wrote:
>> > There were several formatting and typographical errors in the SQL docs
>> > that
>> > I've fixed in this PR. Dunno if we want to roll that into the release.
>> >
>> >
>> > On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell 
>> > wrote:
>> >>
>> >> Okay I'll plan to add cdh4 binary as well for the final release!
>> >>
>> >> ---
>> >> sent from my phone
>> >> On Aug 29, 2014 8:26 AM, "Ye Xianjin"  wrote:
>> >>
>> >> > We just used CDH 4.7 for our production cluster. And I believe we
>> >> > won't
>> >> > use CDH 5 in the next year.
>> >> >
>> >> > Sent from my iPhone
>> >> >
>> >> > > On 2014年8月29日, at 14:39, Matei Zaharia 
>> >> > > wrote:
>> >> > >
>> >> > > Personally I'd actually consider putting CDH4 back if there are
>> >> > > still
>> >> > users on it. It's always better to be inclusive, and the convenience
>> >> > of
>> >> > a
>> >> > one-click download is high. Do we have a sense on what % of CDH users
>> >> > still
>> >> > use CDH4?
>> >> > >
>> >> > > Matei
>> >> > >
>> >> > > On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com)
>> >> > > wrote:
>> >> > >
>> >> > > (Copying my reply since I don't know if it goes to the mailing
>> >> > > list)
>> >> > >
>> >> > > Great, thanks for explaining the reasoning. You're saying these
>> >> > > aren't
>> >> > > going into the final release? I think that moots any issue
>> >> > > surrounding
>> >> > > distributing them then.
>> >> > >
>> >> > > This is all I know of from the ASF:
>> >> > > https://community.apache.org/projectIndependence.html I don't read
>> >> > > it
>> >> > > as expressly forbidding this kind of thing although you can see how
>> >> > > it
>> >> > > bumps up against the spirit. There's not a bright line -- what
>> >> > > about
>> >> > > Tomcat providing binaries compiled for Windows for example? does
>> >> > > that
>> >> > > favor an OS vendor?
>> >> > >
>> >> > > From this technical ASF perspective only the releases matter -- do
>> >> > > what you want with snapshots and RCs. The only issue there is maybe
>> >> > > releasing something different than was in the RC; is that at all
>> >> > > confusing? Just needs a note.
>> >> > >
>> >> > > I think this theoretical issue doesn't exist if these binaries
>> >> > > aren't
>> >> > > released, so I see no reason to not proceed.
>> >> > >
>> >> > > The rest is a different question about whether you want to spend
>> >> > > time
>> >> > > maintaining this profile and candidate. The vendor already manages
>> >> > > their build I think and -- and I don't know -- may even prefer not
>> >> > > to
>> >> > > have a different special build floating around. There's also the
>> >> > > theoretical argument that this turns off other vendors from
>> >> > > adopting
>> >> > > Spark if it's perceived to be too connected to other vendors. I'd
>> >> > > like
>> >> > > to maximize Spark's d

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Patrick Wendell
Hey Nicholas,

Thanks for this, we can merge in doc changes outside of the actual
release timeline, so we'll make sure to loop those changes in before
we publish the final 1.1 docs.

- Patrick

On Fri, Aug 29, 2014 at 9:24 AM, Nicholas Chammas
 wrote:
> There were several formatting and typographical errors in the SQL docs that
> I've fixed in this PR. Dunno if we want to roll that into the release.
>
>
> On Fri, Aug 29, 2014 at 12:17 PM, Patrick Wendell 
> wrote:
>>
>> Okay I'll plan to add cdh4 binary as well for the final release!
>>
>> ---
>> sent from my phone
>> On Aug 29, 2014 8:26 AM, "Ye Xianjin"  wrote:
>>
>> > We just used CDH 4.7 for our production cluster. And I believe we won't
>> > use CDH 5 in the next year.
>> >
>> > Sent from my iPhone
>> >
>> > > On 2014年8月29日, at 14:39, Matei Zaharia 
>> > > wrote:
>> > >
>> > > Personally I'd actually consider putting CDH4 back if there are still
>> > users on it. It's always better to be inclusive, and the convenience of
>> > a
>> > one-click download is high. Do we have a sense on what % of CDH users
>> > still
>> > use CDH4?
>> > >
>> > > Matei
>> > >
>> > > On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com)
>> > > wrote:
>> > >
>> > > (Copying my reply since I don't know if it goes to the mailing list)
>> > >
>> > > Great, thanks for explaining the reasoning. You're saying these aren't
>> > > going into the final release? I think that moots any issue surrounding
>> > > distributing them then.
>> > >
>> > > This is all I know of from the ASF:
>> > > https://community.apache.org/projectIndependence.html I don't read it
>> > > as expressly forbidding this kind of thing although you can see how it
>> > > bumps up against the spirit. There's not a bright line -- what about
>> > > Tomcat providing binaries compiled for Windows for example? does that
>> > > favor an OS vendor?
>> > >
>> > > From this technical ASF perspective only the releases matter -- do
>> > > what you want with snapshots and RCs. The only issue there is maybe
>> > > releasing something different than was in the RC; is that at all
>> > > confusing? Just needs a note.
>> > >
>> > > I think this theoretical issue doesn't exist if these binaries aren't
>> > > released, so I see no reason to not proceed.
>> > >
>> > > The rest is a different question about whether you want to spend time
>> > > maintaining this profile and candidate. The vendor already manages
>> > > their build I think and -- and I don't know -- may even prefer not to
>> > > have a different special build floating around. There's also the
>> > > theoretical argument that this turns off other vendors from adopting
>> > > Spark if it's perceived to be too connected to other vendors. I'd like
>> > > to maximize Spark's distribution and there's some argument you do this
>> > > by not making vendor profiles. But as I say a different question to
>> > > just think about over time...
>> > >
>> > > (oh and PS for my part I think it's a good thing that CDH4 binaries
>> > > were removed. I wasn't arguing for resurrecting them)
>> > >
>> > >> On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell 
>> > wrote:
>> > >> Hey Sean,
>> > >>
>> > >> The reason there are no longer CDH-specific builds is that all newer
>> > >> versions of CDH and HDP work with builds for the upstream Hadoop
>> > >> projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and
>> > >> the Hadoop-without-Hive (also 2.4) build.
>> > >>
>> > >> For MapR - we can't officially post those artifacts on ASF web space
>> > >> when we make the final release, we can only link to them as being
>> > >> hosted by MapR specifically since they use non-compatible licenses.
>> > >> However, I felt that providing these during a testing period was
>> > >> alright, with the goal of increasing test coverage. I couldn't find
>> > >> any policy against posting these on personal web space during RC
>> > >> voting. However, we can remove them if there 

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Patrick Wendell
Okay I'll plan to add cdh4 binary as well for the final release!

---
sent from my phone
On Aug 29, 2014 8:26 AM, "Ye Xianjin"  wrote:

> We just used CDH 4.7 for our production cluster. And I believe we won't
> use CDH 5 in the next year.
>
> Sent from my iPhone
>
> > On 2014年8月29日, at 14:39, Matei Zaharia  wrote:
> >
> > Personally I'd actually consider putting CDH4 back if there are still
> users on it. It's always better to be inclusive, and the convenience of a
> one-click download is high. Do we have a sense on what % of CDH users still
> use CDH4?
> >
> > Matei
> >
> > On August 28, 2014 at 11:31:13 PM, Sean Owen (so...@cloudera.com) wrote:
> >
> > (Copying my reply since I don't know if it goes to the mailing list)
> >
> > Great, thanks for explaining the reasoning. You're saying these aren't
> > going into the final release? I think that moots any issue surrounding
> > distributing them then.
> >
> > This is all I know of from the ASF:
> > https://community.apache.org/projectIndependence.html I don't read it
> > as expressly forbidding this kind of thing although you can see how it
> > bumps up against the spirit. There's not a bright line -- what about
> > Tomcat providing binaries compiled for Windows for example? does that
> > favor an OS vendor?
> >
> > From this technical ASF perspective only the releases matter -- do
> > what you want with snapshots and RCs. The only issue there is maybe
> > releasing something different than was in the RC; is that at all
> > confusing? Just needs a note.
> >
> > I think this theoretical issue doesn't exist if these binaries aren't
> > released, so I see no reason to not proceed.
> >
> > The rest is a different question about whether you want to spend time
> > maintaining this profile and candidate. The vendor already manages
> > their build I think and -- and I don't know -- may even prefer not to
> > have a different special build floating around. There's also the
> > theoretical argument that this turns off other vendors from adopting
> > Spark if it's perceived to be too connected to other vendors. I'd like
> > to maximize Spark's distribution and there's some argument you do this
> > by not making vendor profiles. But as I say a different question to
> > just think about over time...
> >
> > (oh and PS for my part I think it's a good thing that CDH4 binaries
> > were removed. I wasn't arguing for resurrecting them)
> >
> >> On Fri, Aug 29, 2014 at 7:26 AM, Patrick Wendell 
> wrote:
> >> Hey Sean,
> >>
> >> The reason there are no longer CDH-specific builds is that all newer
> >> versions of CDH and HDP work with builds for the upstream Hadoop
> >> projects. I dropped CDH4 in favor of a newer Hadoop version (2.4) and
> >> the Hadoop-without-Hive (also 2.4) build.
> >>
> >> For MapR - we can't officially post those artifacts on ASF web space
> >> when we make the final release, we can only link to them as being
> >> hosted by MapR specifically since they use non-compatible licenses.
> >> However, I felt that providing these during a testing period was
> >> alright, with the goal of increasing test coverage. I couldn't find
> >> any policy against posting these on personal web space during RC
> >> voting. However, we can remove them if there is one.
> >>
> >> Dropping CDH4 was more because it is now pretty old, but we can add it
> >> back if people want. The binary packaging is a slightly separate
> >> question from release votes, so I can always add more binary packages
> >> whenever. And on this, my main concern is covering the most popular
> >> Hadoop versions to lower the bar for users to build and test Spark.
> >>
> >> - Patrick
> >>
> >>> On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen 
> wrote:
> >>> +1 I tested the source and Hadoop 2.4 release. Checksums and
> >>> signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't
> >>> fail any more than usual.
> >>>
> >>> FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another
> >>> project and have encountered no problems.
> >>>
> >>>
> >>> I notice that the 1.1.0 release removes the CDH4-specific build, but
> >>> adds two MapR-specific builds. Compare with
> >>> https://dist.apache.org/repos/dist/release/spark/spark-1.0.2/ I
&

<    1   2   3   4   5   6   7   >