Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-09 Thread Sean McNamara
+1 tested on OS X

Sean

 On Apr 7, 2015, at 11:46 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.1!
 
 The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5
 
 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc2/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1083/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/
 
 The patches on top of RC1 are:
 
 [SPARK-6737] Fix memory leak in OutputCommitCoordinator
 https://github.com/apache/spark/pull/5397
 
 [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
 https://github.com/apache/spark/pull/5302
 
 [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
 NoClassDefFoundError
 https://github.com/apache/spark/pull/4933
 
 Please vote on releasing this package as Apache Spark 1.3.1!
 
 The vote is open until Saturday, April 11, at 07:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 





Re: finding free ports for tests

2015-04-09 Thread Steve Loughran

On 8 Apr 2015, at 20:19, Hari Shreedharan hshreedha...@cloudera.com wrote:

One good way to guarantee your tests will work is to have your server bind to 
an ephemeral port and then query it to find the port it is running on. This 
ensures that race conditions don’t cause test failures.


yes, that's what I'm doing; the classic tactic. I find the tests fail if the
laptop doesn't know its own name, but so do others
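
For reference, the classic tactic looks roughly like this -- a minimal sketch in
plain Java socket terms; the names are illustrative only, not anything from the
Spark or Hadoop codebases:

    import java.net.ServerSocket

    // bind to port 0 so the OS picks a free ephemeral port
    val serverSocket = new ServerSocket(0)
    // query the socket for the port it was actually assigned
    val actualPort = serverSocket.getLocalPort
    try {
      // point the test client at actualPort here
    } finally {
      serverSocket.close()
    }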


Thanks,
Hari



On Wed, Apr 8, 2015 at 3:24 AM, Sean Owen so...@cloudera.com wrote:

Utils.startServiceOnPort?

On Wed, Apr 8, 2015 at 6:16 AM, Steve Loughran ste...@hortonworks.com wrote:

 I'm writing some functional tests for the SPARK-1537 JIRA, Yarn timeline 
 service integration, for which I need to allocate some free ports.

 I don't want to hard code them in as that can lead to unreliable tests, 
 especially on Jenkins.

 Before I implement the logic myself - is there a utility class/trait for
 finding ports for tests?








Re: Spark remote communication pattern

2015-04-09 Thread Reynold Xin
Take a look at the following two files:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala

and

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:

 Dear Developers,

 I'm trying to investigate the communication pattern regarding data-flow
 during execution of a Spark program defined by an RDD chain. I'm
 investigating from the Task point of view, and found out that the task type
 ResultTask (as retrieving the iterator for its RDD for a given partition),
 effectively asks the BlockManager to get the block from local or remote
 location. What I do there is to include actual location data in BlockResult
 so the task can tell where it retrieved the data from. I've found out that
 ResultTask can issue a data-flow only in this case.

 What's the case with the ShuffleMapTask? What happens there? I'm trying to
 log locations which are included in the shuffle process. I would be happy
 to receive a few hints regarding where remote communication is managed in
 case of ShuffleMapTask.

 Thanks!

 Zoltán



Re: Spark remote communication pattern

2015-04-09 Thread Reynold Xin
For torrent broadcast, data are read directly through the block manager:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L167



On Thu, Apr 9, 2015 at 7:27 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:

 Thanks! I've found the fetcher! Are there any other places or cases where
 blocks travel through the network?

 Zvara Zoltán



 mail, hangout, skype: zoltan.zv...@gmail.com

 mobile, viber: +36203129543

 bank: 10918001-0021-50480008

 address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

 elte: HSKSJZ (ZVZOAAI.ELTE)

 2015-04-09 10:24 GMT+02:00 Reynold Xin r...@databricks.com:

 Take a look at the following two files:


 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala

 and


 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

 On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara zoltan.zv...@gmail.com
 wrote:

 Dear Developers,

 I'm trying to investigate the communication pattern regarding data-flow
 during execution of a Spark program defined by an RDD chain. I'm
 investigating from the Task point of view, and found out that the task
 type
 ResultTask (as retrieving the iterator for its RDD for a given
 partition),
 effectively asks the BlockManager to get the block from local or remote
 location. What I do there is to include actual location data in
 BlockResult
 so the task can tell where it retrieved the data from. I've found out
 that
 ResultTask can issue a data-flow only in this case.

 What's the case with the ShuffleMapTask? What happens there? I'm trying
 to
 log locations which are included in the shuffle process. I would be happy
 to receive a few hints regarding where remote communication is managed in
 case of ShuffleMapTask.

 Thanks!

 Zoltán






Connect to remote YARN cluster

2015-04-09 Thread Zoltán Zvara
I'm trying to debug Spark in yarn-client mode. On my local, single-node
cluster everything works fine, but the remote YARN resource manager rejects
my request because of an authentication error. I'm running IntelliJ 14 on
Ubuntu and the driver tries to connect to YARN with my local user name. How
can I force IntelliJ to run my code as a different user? Or how can I set
up the connection to the YARN RM with auth data?

Thanks!

Zvara Zoltán



mail, hangout, skype: zoltan.zv...@gmail.com

mobile, viber: +36203129543

bank: 10918001-0021-50480008

address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

elte: HSKSJZ (ZVZOAAI.ELTE)


Re: Spark remote communication pattern

2015-04-09 Thread Zoltán Zvara
Thanks! I've found the fetcher! Are there any other places or cases where
blocks travel through the network?

Zvara Zoltán



mail, hangout, skype: zoltan.zv...@gmail.com

mobile, viber: +36203129543

bank: 10918001-0021-50480008

address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

elte: HSKSJZ (ZVZOAAI.ELTE)

2015-04-09 10:24 GMT+02:00 Reynold Xin r...@databricks.com:

 Take a look at the following two files:


 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala

 and


 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

 On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara zoltan.zv...@gmail.com
 wrote:

 Dear Developers,

 I'm trying to investigate the communication pattern regarding data-flow
 during execution of a Spark program defined by an RDD chain. I'm
 investigating from the Task point of view, and found out that the task
 type
 ResultTask (as retrieving the iterator for its RDD for a given partition),
 effectively asks the BlockManager to get the block from local or remote
 location. What I do there is to include actual location data in
 BlockResult
 so the task can tell where it retrieved the data from. I've found out that
 ResultTask can issue a data-flow only in this case.

 What's the case with the ShuffleMapTask? What happens there? I'm trying to
 log locations which are included in the shuffle process. I would be happy
 to receive a few hints regarding where remote communication is managed in
 case of ShuffleMapTask.

 Thanks!

 Zoltán





Re: Connect to remote YARN cluster

2015-04-09 Thread Steve Loughran

 On 9 Apr 2015, at 17:42, Marcelo Vanzin van...@cloudera.com wrote:
 
 If YARN is authenticating users it's probably running on kerberos, so
 you need to log in with your kerberos credentials (kinit) before
 submitting an application.

also: make sure that you have the full JCE (unlimited-strength policy files) and
not the crippled crypto; every time you upgrade the JDK you are likely to have to
re-install it. Java gives no useful error messages on this or any other Kerberos
problem.
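
For anyone hitting the same thing from an IDE: one way to avoid depending on a
kinit'd ticket cache is to log in from a keytab programmatically before creating
the SparkContext. This is just a sketch using the standard Hadoop UGI API; the
principal and keytab path are made-up placeholders:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    // log in from a keytab and get back a UGI for that principal
    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "zoltan@EXAMPLE.COM", "/path/to/zoltan.keytab")

    // run the driver-side code as that user
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        // build the SparkConf / SparkContext and talk to the YARN RM here
      }
    })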




Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python and more)

2015-04-09 Thread shane knapp
ok, we're looking good.  i'll keep an eye on this for the rest of the day,
and if you happen to notice any infrastructure failures before i do (i
updated a LOT), please let me know immediately!  :)

On Thu, Apr 9, 2015 at 8:38 AM, shane knapp skn...@berkeley.edu wrote:

 things are looking pretty good and i expect to be done within an hour.
  i've got some test builds running right now, and will give the green light
 when they successfully complete.

 On Thu, Apr 9, 2015 at 7:29 AM, shane knapp skn...@berkeley.edu wrote:

 and this is now happening.

 On Tue, Apr 7, 2015 at 4:38 PM, shane knapp skn...@berkeley.edu wrote:

 reminder!  this is happening thursday morning.

 On Fri, Apr 3, 2015 at 9:59 AM, shane knapp skn...@berkeley.edu wrote:

 welcome to python2.7+, java 8 and more!  :)

 i'll be doing a major upgrade to our build system next thursday
 morning.  here's a quick list of what's going on:

 * installation of anaconda python on all worker nodes

 * installation of pypy 2.5.1 (python 2.7) on all nodes

 * matching installation of python modules for the current system python
 (2.6), and anaconda python (2.6, 2.7 and 3.4)
   - anaconda python 2.7 will be the default for all workers (this has
 stealthily been the case on amp-jenkins-worker-01 for the past two weeks,
 and i've noticed no test failures)
   - you can now use anaconda environments to specify which version of
 python to use in your tests:  http://www.continuum.io/blog/conda

 * installation of new python 2.7 modules:  pymongo, requests, six,
 python-crontab

 * bare-bones mongodb installation on all workers

 * installation of java 1.6 and 1.8 internal to jenkins
   - jobs will default to the system java, which is 1.7.0_75
   - if you want to run your tests w/java 6 or 8, you can select the JDK
 version of your choice in the job configuration page (it'll be towards the
 top)

 these changes have actually all been tested against a variety of builds
 (yay staging!) and while i'm certain that i have all of the kinks worked
 out, i'm going to schedule a longer downtime so that i have a chance to
 identify and squash any problems that surface.

 thanks to josh rosen, k. shankari and davies liu for helping me test
 all of this and get it working.

 shane







Re: Connect to remote YARN cluster

2015-04-09 Thread Marcelo Vanzin
If YARN is authenticating users it's probably running on kerberos, so
you need to log in with your kerberos credentials (kinit) before
submitting an application.

On Thu, Apr 9, 2015 at 4:57 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:
 I'm trying to debug Spark in yarn-client mode. On my local, single-node
 cluster everything works fine, but the remote YARN resource manager rejects
 my request because of an authentication error. I'm running IntelliJ 14 on
 Ubuntu and the driver tries to connect to YARN with my local user name. How
 can I force IntelliJ to run my code as a different user? Or how can I set
 up the connection to the YARN RM with auth data?

 Thanks!

 Zvara Zoltán



 mail, hangout, skype: zoltan.zv...@gmail.com

 mobile, viber: +36203129543

 bank: 10918001-0021-50480008

 address: Hungary, 2475 Kápolnásnyék, Kossuth 6/a

 elte: HSKSJZ (ZVZOAAI.ELTE)



-- 
Marcelo




Re: enum-like types in Spark

2015-04-09 Thread Imran Rashid
any update here?  This is relevant for a currently open PR of mine -- I've
got a bunch of new public constants defined w/ format #4, but I'd gladly
switch to java enums.  (Even if we are just going to postpone this
decision, I'm still inclined to switch to java enums ...)

just to be clear about the existing problem with enums and scaladoc: right
now, the scaladoc knows about the enum class, and generates a page for it,
but it does not display the enum constants.  It is at least labeled as a
java enum, though, so a savvy user could switch to the javadocs to see the
constants.
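
For anyone who hasn't followed the earlier thread, the shape of the tradeoff is
roughly the following. This is my own paraphrase of the object-based pattern
("format #4"), not the exact code from the linked files, and the StageStatus
constant names are just for illustration:

    // the Scala object-based pattern: each constant's name is spelled out in
    // several places, and `values` has to be kept in sync by hand
    sealed abstract class StageStatus

    object StageStatus {
      case object Active extends StageStatus
      case object Complete extends StageStatus
      case object Failed extends StageStatus
      case object Pending extends StageStatus

      // the compiler will not maintain this list for you
      val values: Seq[StageStatus] = Seq(Active, Complete, Failed, Pending)

      // hand-rolled replacement for java.lang.Enum.valueOf
      def fromString(name: String): StageStatus =
        values.find(_.toString.equalsIgnoreCase(name)).getOrElse(
          throw new IllegalArgumentException("Unknown stage status: " + name))
    }

A plain java enum collapses all of that into a single list of constant names, at
the cost of the hashCode() caveat discussed below.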



On Mon, Mar 23, 2015 at 4:50 PM, Imran Rashid iras...@cloudera.com wrote:

 well, perhaps I overstated things a little, I wouldn't call it the
 official solution, just a recommendation in the never-ending debate (and
 the recommendation from folks with their hands on scala itself).

  Even if we do get this fixed in scaladoc eventually -- as it's not in the
 current versions, where does that leave this proposal?  personally I'd
 *still* prefer java enums, even if it doesn't get into scaladoc.  btw, even
 with sealed traits, the scaladoc still isn't great -- you don't see the
 values from the class, you only see them listed from the companion object.
  (though, that is somewhat standard for scaladoc, so maybe I'm reaching a
 little)



 On Mon, Mar 23, 2015 at 4:11 PM, Patrick Wendell pwend...@gmail.com
 wrote:

 If the official solution from the Scala community is to use Java
  enums, then it seems strange they aren't generated in scaladoc? Maybe
 we can just fix that w/ Typesafe's help and then we can use them.

 On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
  Yeah the fully realized #4, which gets back the ability to use it in
  switch statements (? in Scala but not Java?) does end up being kind of
  huge.
 
  I confess I'm swayed a bit back to Java enums, seeing what it
  involves. The hashCode() issue can be 'solved' with the hash of the
  String representation.
 
  On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com
 wrote:
  I've just switched some of my code over to the new format, and I just
 want
  to make sure everyone realizes what we are getting into.  I went from
 10
  lines as java enums
 
 
 https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
 
  to 30 lines with the new format:
 
 
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
 
  it's not just that it's verbose.  each name has to be repeated 4 times,
 with
  potential typos in some locations that won't be caught by the compiler.
  Also, you have to manually maintain the values as you update the set
 of
  enums; the compiler won't do it for you.
 
  The only downside I've heard for java enums is enum.hashCode().  OTOH,
 the
  downsides for this version are: maintainability / verbosity, no
 values(),
  more cumbersome to use from java, no enum map / enumset.
 
  I did put together a little util to at least get back the equivalent of
  enum.valueOf() with this format
 
 
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
 
  I'm not trying to prevent us from moving forward on this, it's fine if
 this
  is still what everyone wants, but I feel pretty strongly java enums
 make
  more sense.
 
  thanks,
  Imran
 
 





Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python and more)

2015-04-09 Thread shane knapp
and this is now happening.

On Tue, Apr 7, 2015 at 4:38 PM, shane knapp skn...@berkeley.edu wrote:

 reminder!  this is happening thursday morning.

 On Fri, Apr 3, 2015 at 9:59 AM, shane knapp skn...@berkeley.edu wrote:

 welcome to python2.7+, java 8 and more!  :)

 i'll be doing a major upgrade to our build system next thursday morning.
  here's a quick list of what's going on:

 * installation of anaconda python on all worker nodes

 * installation of pypy 2.5.1 (python 2.7) on all nodes

 * matching installation of python modules for the current system python
 (2.6), and anaconda python (2.6, 2.7 and 3.4)
   - anaconda python 2.7 will be the default for all workers (this has
 stealthily been the case on amp-jenkins-worker-01 for the past two weeks,
 and i've noticed no test failures)
   - you can now use anaconda environments to specify which version of
 python to use in your tests:  http://www.continuum.io/blog/conda

  * installation of new python 2.7 modules:  pymongo, requests, six,
  python-crontab

 * bare-bones mongodb installation on all workers

 * installation of java 1.6 and 1.8 internal to jenkins
   - jobs will default to the system java, which is 1.7.0_75
   - if you want to run your tests w/java 6 or 8, you can select the JDK
 version of your choice in the job configuration page (it'll be towards the
 top)

 these changes have actually all been tested against a variety of builds
 (yay staging!) and while i'm certain that i have all of the kinks worked
 out, i'm going to schedule a longer downtime so that i have a chance to
 identify and squash any problems that surface.

 thanks to josh rosen, k. shankari and davies liu for helping me test all
 of this and get it working.

 shane