Re: [VOTE] Release Apache Spark 1.3.1 (RC2)
+1, tested on OS X.

Sean

On Apr 7, 2015, at 11:46 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1083/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/

The patches on top of RC1 are:

[SPARK-6737] Fix memory leak in OutputCommitCoordinator
https://github.com/apache/spark/pull/5397

[SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
https://github.com/apache/spark/pull/5302

[SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError
https://github.com/apache/spark/pull/4933

Please vote on releasing this package as Apache Spark 1.3.1!

The vote is open until Saturday, April 11, at 07:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/
Re: finding free ports for tests
On 8 Apr 2015, at 20:19, Hari Shreedharan hshreedha...@cloudera.com wrote:

One good way to guarantee your tests will work is to have your server bind to an ephemeral port and then query it to find the port it is running on. This ensures that race conditions don't cause test failures.

Yes, that's what I'm doing; the classic tactic. I find the tests fail if the laptop doesn't know its own name, but so do others.

Thanks,
Hari

On Wed, Apr 8, 2015 at 3:24 AM, Sean Owen so...@cloudera.com wrote:

Utils.startServiceOnPort?

On Wed, Apr 8, 2015 at 6:16 AM, Steve Loughran ste...@hortonworks.com wrote:

I'm writing some functional tests for the SPARK-1537 JIRA (YARN timeline service integration), for which I need to allocate some free ports. I don't want to hard-code them, as that can lead to unreliable tests, especially on Jenkins. Before I implement the logic myself: is there a utility class/trait for finding ports for tests?
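A minimal sketch of the ephemeral-port tactic Hari describes, using plain Java sockets (nothing Spark-specific is assumed; Spark's own Utils.startServiceOnPort takes a different route, retrying a caller-supplied start function over candidate ports):

import java.net.ServerSocket

object EphemeralPortExample {
  def main(args: Array[String]): Unit = {
    // Port 0 asks the OS to pick any free port, which removes the race
    // between "find a free port" and "bind to it".
    val socket = new ServerSocket(0)
    try {
      val port = socket.getLocalPort // query the port we actually got
      println(s"test server bound to port $port")
      // ... point the code under test at `port` here ...
    } finally {
      socket.close()
    }
  }
}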
Re: Spark remote communication pattern
Take a look at the following two files:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
and
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

On Thu, Apr 9, 2015 at 1:15 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:

Dear Developers,

I'm trying to investigate the communication pattern regarding data flow during execution of a Spark program defined by an RDD chain. I'm investigating from the Task point of view, and have found that the task type ResultTask (when retrieving the iterator for its RDD for a given partition) effectively asks the BlockManager to get the block from a local or remote location. What I do there is include actual location data in BlockResult, so the task can tell where it retrieved the data from. I've found that ResultTask can issue a data flow only in this case.

What's the case with ShuffleMapTask? What happens there? I'm trying to log the locations involved in the shuffle process. I would be happy to receive a few hints regarding where remote communication is managed in the case of ShuffleMapTask.

Thanks!
Zoltán
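For anyone following the thread, here is a toy, self-contained sketch of the pattern those two files implement (all names below are illustrative, not Spark's actual API): a ShuffleMapTask writes its output only to its local store, and the reduce-side fetcher pulls blocks for its partition, going over the network only for blocks that live on another executor.

import scala.collection.mutable

object ShuffleFetchSketch {
  case class BlockId(shuffleId: Int, mapId: Int, reduceId: Int)

  // One store per "executor"; a map task writes only to its own store.
  val executorStores =
    mutable.Map.empty[String, mutable.Map[BlockId, Array[Byte]]]

  def write(executor: String, id: BlockId, bytes: Array[Byte]): Unit =
    executorStores.getOrElseUpdate(executor, mutable.Map.empty)(id) = bytes

  // A reduce-side fetch: local reads are free; anything else is a network
  // round trip (in Spark this is where ShuffleBlockFetcherIterator works).
  def fetch(local: String, id: BlockId): Array[Byte] =
    executorStores.collectFirst {
      case (host, store) if store.contains(id) =>
        if (host != local) println(s"remote fetch of $id from $host")
        store(id)
    }.getOrElse(sys.error(s"missing block $id"))

  def main(args: Array[String]): Unit = {
    write("exec-1", BlockId(0, 0, 0), Array[Byte](1, 2))
    write("exec-2", BlockId(0, 1, 0), Array[Byte](3, 4))
    // The reduce task for partition 0, running on exec-1, pulls from both.
    val bytes =
      Seq(BlockId(0, 0, 0), BlockId(0, 1, 0)).flatMap(fetch("exec-1", _))
    println(s"reduce 0 assembled ${bytes.length} bytes")
  }
}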
Re: Spark remote communication pattern
For torrent broadcast, data are read directly through the block manager:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L167

On Thu, Apr 9, 2015 at 7:27 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:

Thanks! I've found the fetcher! Are there any other places or cases where blocks travel through the network?

Zoltán
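A similarly toy sketch of the broadcast-read path this reply points at (again with illustrative names, not the real BlockManager API): each block is served from the local store when present and pulled from a remote peer otherwise, so only the first read of a piece generates network traffic.

import scala.collection.mutable

object TorrentFetchSketch {
  type BlockId = String

  val localStore = mutable.Map.empty[BlockId, Array[Byte]]

  // Stand-in for a network round trip to another executor's block manager.
  def fetchRemote(id: BlockId): Array[Byte] = {
    println(s"remote fetch: $id") // this is where network traffic happens
    Array.fill(4)(0.toByte)       // dummy payload
  }

  // Local reads hit the cache; misses fetch remotely and cache the result.
  def getBlock(id: BlockId): Array[Byte] =
    localStore.getOrElseUpdate(id, fetchRemote(id))

  def main(args: Array[String]): Unit = {
    val pieces = (0 until 3).map(i => s"broadcast_0_piece$i")
    val first = pieces.flatMap(getBlock)  // first read: remote fetches
    val second = pieces.flatMap(getBlock) // second read: all local
    println(s"assembled ${first.length} bytes; re-read ${second.length} locally")
  }
}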
Connect to remote YARN cluster
I'm trying to debug Spark in yarn-client mode. On my local, single-node cluster everything works fine, but the remote YARN resource manager rejects my request with an authentication error. I'm running IntelliJ 14 on Ubuntu, and the driver tries to connect to YARN with my local user name. How can I force IntelliJ to run my code as a different user? Or how can I set up the connection to the YARN RM with auth data?

Thanks!
Zoltán
Re: Connect to remote YARN cluster
On 9 Apr 2015, at 17:42, Marcelo Vanzin van...@cloudera.com wrote:

If YARN is authenticating users, it's probably running on Kerberos, so you need to log in with your Kerberos credentials (kinit) before submitting an application.

Also: make sure that you have the full JCE and not the crippled crypto; every time you upgrade the JDK you are likely to have to re-install it. Java gives no useful error messages on this, or on any other Kerberos problem.
Re: extended jenkins downtime, thursday april 9th 7am-noon PDT (moving to anaconda python and more)
ok, we're looking good. i'll keep an eye on this for the rest of the day, and if you happen to notice any infrastructure failures before i do (i updated a LOT), please let me know immediately! :)

On Thu, Apr 9, 2015 at 8:38 AM, shane knapp skn...@berkeley.edu wrote:

things are looking pretty good and i expect to be done within an hour. i've got some test builds running right now, and will give the green light when they successfully complete.

On Thu, Apr 9, 2015 at 7:29 AM, shane knapp skn...@berkeley.edu wrote:

and this is now happening.

On Tue, Apr 7, 2015 at 4:38 PM, shane knapp skn...@berkeley.edu wrote:

reminder! this is happening thursday morning.

On Fri, Apr 3, 2015 at 9:59 AM, shane knapp skn...@berkeley.edu wrote:

welcome to python2.7+, java 8 and more! :)

i'll be doing a major upgrade to our build system next thursday morning. here's a quick list of what's going on:

* installation of anaconda python on all worker nodes
* installation of pypy 2.5.1 (python 2.7) on all nodes
* matching installation of python modules for the current system python (2.6) and anaconda python (2.6, 2.7 and 3.4)
  - anaconda python 2.7 will be the default for all workers (this has stealthily been the case on amp-jenkins-worker-01 for the past two weeks, and i've noticed no test failures)
  - you can now use anaconda environments to specify which version of python to use in your tests: http://www.continuum.io/blog/conda
* installation of new python 2.7 modules: pymongo, requests, six, python-crontab
* bare-bones mongodb installation on all workers
* installation of java 1.6 and 1.8 internal to jenkins
  - jobs will default to the system java, which is 1.7.0_75
  - if you want to run your tests w/java 6 or 8, you can select the JDK version of your choice in the job configuration page (it'll be towards the top)

these changes have actually all been tested against a variety of builds (yay staging!), and while i'm certain that i have all of the kinks worked out, i'm going to schedule a longer downtime so that i have a chance to identify and squash any problems that surface.

thanks to josh rosen, k. shankari and davies liu for helping me test all of this and get it working.

shane
Re: Connect to remote YARN cluster
If YARN is authenticating users, it's probably running on Kerberos, so you need to log in with your Kerberos credentials (kinit) before submitting an application.

On Thu, Apr 9, 2015 at 4:57 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:

I'm trying to debug Spark in yarn-client mode. On my local, single-node cluster everything works fine, but the remote YARN resource manager rejects my request with an authentication error. I'm running IntelliJ 14 on Ubuntu, and the driver tries to connect to YARN with my local user name. How can I force IntelliJ to run my code as a different user? Or how can I set up the connection to the YARN RM with auth data?

Thanks!
Zoltán

--
Marcelo
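If running kinit first is awkward when launching from an IDE, one option is to log in programmatically from a keytab before creating the SparkContext. A hedged sketch follows, using Hadoop's standard UserGroupInformation security API; the principal and keytab path are placeholders to replace with your own.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

object KerberosLoginSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Tell Hadoop's security layer to expect Kerberos, not simple auth.
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)
    // Placeholder principal and keytab -- substitute your own.
    UserGroupInformation.loginUserFromKeytab(
      "zoltan@EXAMPLE.COM", "/path/to/zoltan.keytab")
    println(s"logged in as ${UserGroupInformation.getCurrentUser}")
    // ... create the SparkContext / submit to YARN after this point ...
  }
}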
Re: enum-like types in Spark
Any update here? This is relevant for a currently open PR of mine -- I've got a bunch of new public constants defined w/ format #4, but I'd gladly switch to java enums. (Even if we are just going to postpone this decision, I'm still inclined to switch to java enums ...)

Just to be clear about the existing problem with enums in scaladoc: right now, the scaladoc knows about the enum class, and generates a page for it, but it does not display the enum constants. It is at least labeled as a java enum, though, so a savvy user could switch to the javadocs to see the constants.

On Mon, Mar 23, 2015 at 4:50 PM, Imran Rashid iras...@cloudera.com wrote:

Well, perhaps I overstated things a little; I wouldn't call it the official solution, just a recommendation in the never-ending debate (and the recommendation from folks with their hands on scala itself).

Even if we do get this fixed in scaladoc eventually -- as it's not in the current versions, where does that leave this proposal? Personally, I'd *still* prefer java enums, even if it doesn't get into scaladoc.

Btw, even with sealed traits, the scaladoc still isn't great -- you don't see the values from the class, you only see them listed from the companion object. (Though that is somewhat standard for scaladoc, so maybe I'm reaching a little.)

On Mon, Mar 23, 2015 at 4:11 PM, Patrick Wendell pwend...@gmail.com wrote:

If the official solution from the Scala community is to use Java enums, then it seems strange they aren't generated in scaladoc? Maybe we can just fix that w/ Typesafe's help and then we can use them.

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:

Yeah, the fully realized #4, which gets back the ability to use it in switch statements (? in Scala but not Java?), does end up being kind of huge. I confess I'm swayed a bit back to Java enums, seeing what it involves. The hashCode() issue can be 'solved' with the hash of the String representation.

On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com wrote:

I've just switched some of my code over to the new format, and I just want to make sure everyone realizes what we are getting into. I went from 10 lines as java enums:
https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
to 30 lines with the new format:
https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250

It's not just that it's verbose: each name has to be repeated 4 times, with potential typos in some locations that won't be caught by the compiler. Also, you have to manually maintain the values as you update the set of enums; the compiler won't do it for you.

The only downside I've heard for java enums is Enum.hashCode(). OTOH, the downsides for this version are: maintainability / verbosity, no values(), more cumbersome to use from java, no EnumMap / EnumSet.

I did put together a little util to at least get back the equivalent of Enum.valueOf() with this format:
https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala

I'm not trying to prevent us from moving forward on this; it's fine if this is still what everyone wants, but I feel pretty strongly that java enums make more sense.

thanks,
Imran
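For anyone who hasn't seen it, a rough sketch of the kind of code "format #4" implies (illustrative only, not Imran's exact code): note how each constant's name recurs and how the values list and the valueOf equivalent must be maintained by hand, which is exactly the verbosity being weighed against Enum.hashCode() above.

sealed abstract class StageStatus(val name: String) {
  override def toString: String = name
}

object StageStatus {
  case object Active extends StageStatus("ACTIVE")
  case object Complete extends StageStatus("COMPLETE")
  case object Failed extends StageStatus("FAILED")

  // Unlike a Java enum, this list is maintained by hand: forget to add a
  // new constant here and the compiler stays silent.
  val values: Seq[StageStatus] = Seq(Active, Complete, Failed)

  // Hand-rolled stand-in for Enum.valueOf.
  def fromString(s: String): StageStatus =
    values.find(_.name == s).getOrElse(
      throw new IllegalArgumentException(s"unknown stage status: $s"))
}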