Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1, tested on OS X.

Sean

On Apr 7, 2015, at 11:46 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1083/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/

The patches on top of RC1 are:

[SPARK-6737] Fix memory leak in OutputCommitCoordinator
https://github.com/apache/spark/pull/5397

[SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
https://github.com/apache/spark/pull/5302

[SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError
https://github.com/apache/spark/pull/4933

Please vote on releasing this package as Apache Spark 1.3.1!

The vote is open until Saturday, April 11, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/
Re: finding free ports for tests
On 8 Apr 2015, at 20:19, Hari Shreedharan hshreedha...@cloudera.com wrote:

One good way to guarantee your tests will work is to have your server bind to an ephemeral port and then query it to find the port it is running on. This ensures that race conditions don't cause test failures.

Yes, that's what I'm doing; the classic tactic. I find the tests fail if the laptop doesn't know its own name, but so do others.

Thanks,
Hari

On Wed, Apr 8, 2015 at 3:24 AM, Sean Owen so...@cloudera.com wrote:

Utils.startServiceOnPort?

On Wed, Apr 8, 2015 at 6:16 AM, Steve Loughran ste...@hortonworks.com wrote:

I'm writing some functional tests for the SPARK-1537 JIRA (YARN timeline service integration), for which I need to allocate some free ports. I don't want to hard-code them, as that can lead to unreliable tests, especially on Jenkins. Before I implement the logic myself: is there a utility class/trait for finding ports for tests?
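A minimal sketch of the ephemeral-port tactic Hari describes, using plain Java sockets (nothing Spark-specific is assumed; Spark's own Utils.startServiceOnPort takes a different route, retrying a caller-supplied start function over candidate ports):

import java.net.ServerSocket

object EphemeralPortExample {
  def main(args: Array[String]): Unit = {
    // Port 0 asks the OS to pick any free port, which removes the race
    // between "find a free port" and "bind to it".
    val socket = new ServerSocket(0)
    try {
      val port = socket.getLocalPort // query the port we actually got
      println(s"test server bound to port $port")
      // ... point the code under test at `port` here ...
    } finally {
      socket.close()
    }
  }
}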
Re: Spark remote communication pattern
Take a look at the following two files:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
and
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:

Dear Developers,

I'm trying to investigate the communication pattern regarding data flow during execution of a Spark program defined by an RDD chain. I'm investigating from the Task point of view, and have found that the task type ResultTask (when retrieving the iterator for its RDD for a given partition) effectively asks the BlockManager to get the block from a local or remote location. What I do there is include actual location data in BlockResult, so the task can tell where it retrieved the data from. I've found that ResultTask can issue a data flow only in this case.

What's the case with ShuffleMapTask? What happens there? I'm trying to log the locations involved in the shuffle process. I would be happy to receive a few hints regarding where remote communication is managed in the case of ShuffleMapTask.

Thanks!
Zoltán
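For anyone following the thread, here is a toy, self-contained sketch of the pattern those two files implement (all names below are illustrative, not Spark's actual API): a ShuffleMapTask writes its output only to its local store, and the reduce-side fetcher pulls blocks for its partition, going over the network only for blocks that live on another executor.

import scala.collection.mutable

object ShuffleFetchSketch {
  case class BlockId(shuffleId: Int, mapId: Int, reduceId: Int)

  // One store per "executor"; a map task writes only to its own store.
  val executorStores =
    mutable.Map.empty[String, mutable.Map[BlockId, Array[Byte]]]

  def write(executor: String, id: BlockId, bytes: Array[Byte]): Unit =
    executorStores.getOrElseUpdate(executor, mutable.Map.empty)(id) = bytes

  // A reduce-side fetch: local reads are free; anything else is a network
  // round trip (in Spark this is where ShuffleBlockFetcherIterator works).
  def fetch(local: String, id: BlockId): Array[Byte] =
    executorStores.collectFirst {
      case (host, store) if store.contains(id) =>
        if (host != local) println(s"remote fetch of $id from $host")
        store(id)
    }.getOrElse(sys.error(s"missing block $id"))

  def main(args: Array[String]): Unit = {
    write("exec-1", BlockId(0, 0, 0), Array[Byte](1, 2))
    write("exec-2", BlockId(0, 1, 0), Array[Byte](3, 4))
    // The reduce task for partition 0, running on exec-1, pulls from both.
    val bytes =
      Seq(BlockId(0, 0, 0), BlockId(0, 1, 0)).flatMap(fetch("exec-1", _))
    println(s"reduce 0 assembled ${bytes.length} bytes")
  }
}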
Re: Spark remote communication pattern
For torrent broadcast, data are read directly through the block manager:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L167

On Thu, Apr 9, 2015 at 7:27 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:

Thanks! I've found the fetcher! Are there any other places or cases where blocks travel through the network?

Zoltán
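A similarly toy sketch of the broadcast-read path this reply points at (again with illustrative names, not the real BlockManager API): each block is served from the local store when present and pulled from a remote peer otherwise, so only the first read of a piece generates network traffic.

import scala.collection.mutable

object TorrentFetchSketch {
  type BlockId = String

  val localStore = mutable.Map.empty[BlockId, Array[Byte]]

  // Stand-in for a network round trip to another executor's block manager.
  def fetchRemote(id: BlockId): Array[Byte] = {
    println(s"remote fetch: $id") // this is where network traffic happens
    Array.fill(4)(0.toByte)       // dummy payload
  }

  // Local reads hit the cache; misses fetch remotely and cache the result.
  def getBlock(id: BlockId): Array[Byte] =
    localStore.getOrElseUpdate(id, fetchRemote(id))

  def main(args: Array[String]): Unit = {
    val pieces = (0 until 3).map(i => s"broadcast_0_piece$i")
    val first = pieces.flatMap(getBlock)  // first read: remote fetches
    val second = pieces.flatMap(getBlock) // second read: all local
    println(s"assembled ${first.length} bytes; re-read ${second.length} locally")
  }
}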
Connect to remote YARN cluster
I'm trying to debug Spark in yarn-client mode. On my local, single-node cluster everything works fine, but the remote YARN resource manager rejects my request with an authentication error. I'm running IntelliJ 14 on Ubuntu, and the driver tries to connect to YARN with my local user name. How can I force IntelliJ to run my code as a different user? Or how can I set up the connection to the YARN RM with auth data?

Thanks!
Zoltán
Re: Connect to remote YARN cluster
On 9 Apr 2015, at 17:42, Marcelo Vanzin van...@cloudera.com wrote:

If YARN is authenticating users, it's probably running on Kerberos, so you need to log in with your Kerberos credentials (kinit) before submitting an application.

Also: make sure that you have the full JCE and not the crippled crypto; every time you upgrade the JDK you are likely to have to re-install it. Java gives no useful error messages on this, or on any other Kerberos problem.
Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python and more)
ok, we're looking good. i'll keep an eye on this for the rest of the day, and if you happen to notice any infrastructure failures before i do (i updated a LOT), please let me know immediately! :)

On Thu, Apr 9, 2015 at 8:38 AM, shane knapp skn...@berkeley.edu wrote:

things are looking pretty good and i expect to be done within an hour. i've got some test builds running right now, and will give the green light when they successfully complete.

On Thu, Apr 9, 2015 at 7:29 AM, shane knapp skn...@berkeley.edu wrote:

and this is now happening.

On Tue, Apr 7, 2015 at 4:38 PM, shane knapp skn...@berkeley.edu wrote:

reminder! this is happening thursday morning.

On Fri, Apr 3, 2015 at 9:59 AM, shane knapp skn...@berkeley.edu wrote:

welcome to python2.7+, java 8 and more! :)

i'll be doing a major upgrade to our build system next thursday morning. here's a quick list of what's going on:

* installation of anaconda python on all worker nodes
* installation of pypy 2.5.1 (python 2.7) on all nodes
* matching installation of python modules for the current system python (2.6) and anaconda python (2.6, 2.7 and 3.4)
  - anaconda python 2.7 will be the default for all workers (this has stealthily been the case on amp-jenkins-worker-01 for the past two weeks, and i've noticed no test failures)
  - you can now use anaconda environments to specify which version of python to use in your tests: http://www.continuum.io/blog/conda
* installation of new python 2.7 modules: pymongo, requests, six, python-crontab
* bare-bones mongodb installation on all workers
* installation of java 1.6 and 1.8 internal to jenkins
  - jobs will default to the system java, which is 1.7.0_75
  - if you want to run your tests w/java 6 or 8, you can select the JDK version of your choice in the job configuration page (it'll be towards the top)

these changes have actually all been tested against a variety of builds (yay staging!), and while i'm certain that i have all of the kinks worked out, i'm going to schedule a longer downtime so that i have a chance to identify and squash any problems that surface.

thanks to josh rosen, k. shankari and davies liu for helping me test all of this and get it working.

shane
Re: Connect to remote YARN cluster
If YARN is authenticating users, it's probably running on Kerberos, so you need to log in with your Kerberos credentials (kinit) before submitting an application.

On Thu, Apr 9, 2015 at 4:57 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:

I'm trying to debug Spark in yarn-client mode. On my local, single-node cluster everything works fine, but the remote YARN resource manager rejects my request with an authentication error. I'm running IntelliJ 14 on Ubuntu, and the driver tries to connect to YARN with my local user name. How can I force IntelliJ to run my code as a different user? Or how can I set up the connection to the YARN RM with auth data?

Thanks!
Zoltán

--
Marcelo
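If running kinit first is awkward when launching from an IDE, one option is to log in programmatically from a keytab before creating the SparkContext. A hedged sketch follows, using Hadoop's standard UserGroupInformation security API; the principal and keytab path are placeholders to replace with your own.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

object KerberosLoginSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Tell Hadoop's security layer to expect Kerberos, not simple auth.
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)
    // Placeholder principal and keytab -- substitute your own.
    UserGroupInformation.loginUserFromKeytab(
      "zoltan@EXAMPLE.COM", "/path/to/zoltan.keytab")
    println(s"logged in as ${UserGroupInformation.getCurrentUser}")
    // ... create the SparkContext / submit to YARN after this point ...
  }
}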
Re: enum-like types in Spark
Any update here? This is relevant for a currently open PR of mine -- I've got a bunch of new public constants defined w/ format #4, but I'd gladly switch to java enums. (Even if we are just going to postpone this decision, I'm still inclined to switch to java enums ...)

Just to be clear about the existing problem with enums in scaladoc: right now, the scaladoc knows about the enum class, and generates a page for it, but it does not display the enum constants. It is at least labeled as a java enum, though, so a savvy user could switch to the javadocs to see the constants.

On Mon, Mar 23, 2015 at 4:50 PM, Imran Rashid iras...@cloudera.com wrote:

Well, perhaps I overstated things a little; I wouldn't call it the official solution, just a recommendation in the never-ending debate (and the recommendation from folks with their hands on scala itself).

Even if we do get this fixed in scaladoc eventually -- as it's not in the current versions, where does that leave this proposal? Personally, I'd *still* prefer java enums, even if it doesn't get into scaladoc.

Btw, even with sealed traits, the scaladoc still isn't great -- you don't see the values from the class, you only see them listed from the companion object. (Though that is somewhat standard for scaladoc, so maybe I'm reaching a little.)

On Mon, Mar 23, 2015 at 4:11 PM, Patrick Wendell pwend...@gmail.com wrote:

If the official solution from the Scala community is to use Java enums, then it seems strange they aren't generated in scaladoc? Maybe we can just fix that w/ Typesafe's help and then we can use them.

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:

Yeah, the fully realized #4, which gets back the ability to use it in switch statements (? in Scala but not Java?), does end up being kind of huge. I confess I'm swayed a bit back to Java enums, seeing what it involves. The hashCode() issue can be 'solved' with the hash of the String representation.

On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com wrote:

I've just switched some of my code over to the new format, and I just want to make sure everyone realizes what we are getting into. I went from 10 lines as java enums:
https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
to 30 lines with the new format:
https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250

It's not just that it's verbose: each name has to be repeated 4 times, with potential typos in some locations that won't be caught by the compiler. Also, you have to manually maintain the values as you update the set of enums; the compiler won't do it for you.

The only downside I've heard for java enums is Enum.hashCode(). OTOH, the downsides for this version are: maintainability / verbosity, no values(), more cumbersome to use from java, no EnumMap / EnumSet.

I did put together a little util to at least get back the equivalent of Enum.valueOf() with this format:
https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala

I'm not trying to prevent us from moving forward on this; it's fine if this is still what everyone wants, but I feel pretty strongly that java enums make more sense.

thanks,
Imran
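For anyone who hasn't seen it, a rough sketch of the kind of code "format #4" implies (illustrative only, not Imran's exact code): note how each constant's name recurs and how the values list and the valueOf equivalent must be maintained by hand, which is exactly the verbosity being weighed against Enum.hashCode() above.

sealed abstract class StageStatus(val name: String) {
  override def toString: String = name
}

object StageStatus {
  case object Active extends StageStatus("ACTIVE")
  case object Complete extends StageStatus("COMPLETE")
  case object Failed extends StageStatus("FAILED")

  // Unlike a Java enum, this list is maintained by hand: forget to add a
  // new constant here and the compiler stays silent.
  val values: Seq[StageStatus] = Seq(Active, Complete, Failed)

  // Hand-rolled stand-in for Enum.valueOf.
  def fromString(s: String): StageStatus =
    values.find(_.name == s).getOrElse(
      throw new IllegalArgumentException(s"unknown stage status: $s"))
}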