Re: Tachyon in Spark
I think the lineage is the key feature of Tachyon for reproducing the RDD when any error happens. Otherwise, there would have to be some data replicas among Tachyon nodes to ensure redundancy for fault tolerance - I think Tachyon is trying to avoid going down that path. Does it mean the off-heap solution is not ready yet if Tachyon lineage does not work right now?

Best Regards
Jun Feng Liu
IBM China Systems Technology Laboratory in Beijing

On Fri, Dec 12, 2014 at 10:22 AM, Reynold Xin r...@databricks.com wrote: Actually HY emailed me offline about this and this is supported in the latest version of Tachyon. It is a hard problem to push this into storage; need to think about how to handle isolation, resource allocation, etc. https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/master/Dependency.java

On Thu, Dec 11, 2014 at 3:54 PM, Reynold Xin r...@databricks.com wrote: I don't think the lineage thing is even turned on in Tachyon - it was mostly a research prototype, so I don't think it'd make sense for us to use that.

On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash and...@andrewash.com wrote: I'm interested in understanding this as well. One of the main ways Tachyon is supposed to realize performance gains without sacrificing durability is by storing the lineage of data rather than full copies of it (similar to Spark). But if Spark isn't sending lineage information into Tachyon, then I'm not sure how this isn't a durability concern.

On Wed, Dec 10, 2014 at 5:47 AM, Jun Feng Liu liuj...@cn.ibm.com wrote: Does Spark today really leverage Tachyon lineage to process data? It seems like the application should call the createDependency function in TachyonFS to create a new lineage node. But I did not find any place that calls that in Spark code. Did I miss anything?

Best Regards
Jun Feng Liu
IBM China Systems Technology Laboratory in Beijing
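The lineage idea under discussion - recomputing lost data from its recorded derivation rather than keeping replicas - can be illustrated with a toy sketch. This is plain Python for illustration only, not Tachyon's or Spark's actual API; all names below are invented:

```python
# Toy illustration of lineage-based fault tolerance: instead of
# replicating derived data, record how it was produced and recompute
# it on loss. Names here are invented for illustration only.

class LineageNode:
    def __init__(self, parent, transform):
        self.parent = parent          # upstream data (assumed durable)
        self.transform = transform    # how to derive this node's data
        self.data = None              # cached result; may be lost

    def materialize(self):
        if self.data is None:         # lost or never computed
            self.data = self.transform(self.parent)
        return self.data

source = [1, 2, 3, 4]
node = LineageNode(source, lambda xs: [x * x for x in xs])

assert node.materialize() == [1, 4, 9, 16]
node.data = None                      # simulate losing the cached copy
assert node.materialize() == [1, 4, 9, 16]  # recomputed from lineage
```

The trade-off the thread describes falls out directly: no replicas of the derived data are kept, but recovery costs recomputation time and requires the lineage (parent plus transform) to be durable.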
Re: jenkins downtime: 730-930am, 12/12/14
reminder: jenkins is going down NOW.

On Thu, Dec 11, 2014 at 3:08 PM, shane knapp skn...@berkeley.edu wrote: here's the plan... reboots, of course, come last. :)

pause build queue at 7am, kill off (and eventually retrigger) any stragglers at 8am. then begin maintenance:

all systems:
* yum update all servers (amp-jenkins-master, amp-jenkins-slave-{01..05}, amp-jenkins-worker-{01..08})
* reboots

jenkins slaves:
* install python2.7 (alongside 2.6, which would remain the default)
* install numpy 1.9.1 (currently on 1.4, breaking some spark branch builds)
* add new slaves to the master, remove old ones (keep them around just in case)

there will be no jenkins system or plugin upgrades at this time. things there seem to be working just fine! i'm expecting to be up and building by 9am at the latest. i'll update this thread w/any new time estimates.

word. shane, your rained-in devops guy :)

On Wed, Dec 10, 2014 at 11:28 AM, shane knapp skn...@berkeley.edu wrote: reminder -- this is happening friday morning @ 730am!

On Mon, Dec 1, 2014 at 5:10 PM, shane knapp skn...@berkeley.edu wrote: i'll send out a reminder next week, but i wanted to give a heads up: i'll be bringing down the entire jenkins infrastructure for reboots and system updates. please let me know if there are any conflicts with this, thanks! shane
Re: jenkins downtime: 730-930am, 12/12/14
downtime is extended to 10am PST so that i can finish testing the numpy upgrade... besides that, everything looks good and the system updates and reboots went off w/o a hitch. shane

On Fri, Dec 12, 2014 at 7:26 AM, shane knapp skn...@berkeley.edu wrote: reminder: jenkins is going down NOW.
Re: jenkins downtime: 730-930am, 12/12/14
ok, we're back up w/all new jenkins workers. i'll be keeping an eye on these pretty closely today for any build failures caused by the new systems, and if things look bleak, i'll switch back to the original five. thanks for your patience!

On Fri, Dec 12, 2014 at 8:47 AM, shane knapp skn...@berkeley.edu wrote: downtime is extended to 10am PST so that i can finish testing the numpy upgrade... besides that, everything looks good and the system updates and reboots went off w/o a hitch. shane
Re: zinc invocation examples
Hey York - I'm sending some feedback off-list, feel free to open a PR as well.

On Tue, Dec 9, 2014 at 12:05 PM, York, Brennon brennon.y...@capitalone.com wrote: Patrick, I've nearly completed a basic build out for the SPARK-4501 issue (at https://github.com/brennonyork/spark/tree/SPARK-4501) and it would be great to get your initial read on it. Per this thread I need to add in the -scala-home call to zinc, but it's close to ready for a PR.

On 12/5/14, 2:10 PM, Patrick Wendell pwend...@gmail.com wrote: One thing I created a JIRA for a while back was to have a similar script to sbt/sbt that transparently downloads Zinc, Scala, and Maven in a subdirectory of Spark and sets it up correctly. I.e. build/mvn. Outside of brew for MacOS there aren't good Zinc packages, and it's a pain to figure out how to set it up. https://issues.apache.org/jira/browse/SPARK-4501 Prashant Sharma looked at this for a bit but I don't think he's working on it actively any more, so if someone wanted to do this, I'd be extremely grateful. - Patrick

On Fri, Dec 5, 2014 at 11:05 AM, Ryan Williams ryan.blake.willi...@gmail.com wrote: fwiw I've been using `zinc -scala-home $SCALA_HOME -nailed -start` which:
- starts a nailgun server as well,
- uses my installed scala 2.{10,11}, as opposed to zinc's default 2.9.2

https://github.com/typesafehub/zinc#scala: "If no options are passed to locate a version of Scala then Scala 2.9.2 is used by default (which is bundled with zinc)." The latter seems like it might be especially important.

On Thu Dec 04 2014 at 4:25:32 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: Oh, derp. I just assumed from looking at all the options that there was something to it. Thanks Sean.

On Thu Dec 04 2014 at 7:47:33 AM Sean Owen so...@cloudera.com wrote: You just run it once with zinc -start and leave it running as a background process on your build machine. You don't have to do anything for each build.

On Wed, Dec 3, 2014 at 3:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: https://github.com/apache/spark/blob/master/docs/building-spark.md#speeding-up-compilation-with-zinc Could someone summarize how they invoke zinc as part of a regular build-test-etc. cycle? I'll add it in to the aforelinked page if appropriate. Nick

The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
CrossValidator API in new spark.ml package
Hi Xiangrui, It seems that it's stateless, so it will be hard to implement a regularization path. Any suggestion on how to extend it? Thanks.

Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
Newest ML-Lib on Spark 1.1
Hi all – we’re running CDH 5.2 and would be interested in having the latest and greatest ML Lib version on our cluster (with YARN). Could anyone help me out in terms of figuring out what build profiles to use to get this to play well? Will I be able to update ML-Lib independently of updating the rest of spark to 1.2 and beyond? I ran into numerous issues trying to build 1.2 against CDH’s Hadoop deployment. Alternately, if anyone has managed to get the trunk successfully built and tested against Cloudera’s YARN and Hadoop for 5.2 I would love some help. Thanks!
Re: Newest ML-Lib on Spark 1.1
For CDH this works well for me... tested up to 5.1:

./make-distribution -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -Phive -DskipTests

(-Phive is there to build with hive thriftserver support for spark-sql)

On Fri, Dec 12, 2014 at 1:41 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all – we’re running CDH 5.2 and would be interested in having the latest and greatest ML Lib version on our cluster (with YARN).
Re: Newest ML-Lib on Spark 1.1
Could you specify what problems you're seeing? There is nothing special about the CDH distribution at all. The latest and greatest is 1.1, and that is what is in CDH 5.2. You can certainly compile even master for CDH and get it to work though. The safest build flags should be -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.1. 5.3 is just around the corner, and includes 1.2, which is also just around the corner.

On Fri, Dec 12, 2014 at 9:41 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi all – we’re running CDH 5.2 and would be interested in having the latest and greatest ML Lib version on our cluster (with YARN).
IBM open-sources Spark Kernel
We are happy to announce a developer preview of the Spark Kernel which enables remote applications to dynamically interact with Spark. You can think of the Spark Kernel as a remote Spark Shell that uses the IPython notebook interface to provide a common entrypoint for any application. The Spark Kernel obviates the need to submit jars using spark-submit, and can replace the existing Spark Shell.

You can try out the Spark Kernel today by installing it from our github repo at https://github.com/ibm-et/spark-kernel. To help you get a demo environment up and running quickly, the repository also includes a Dockerfile and a Vagrantfile to build a Spark Kernel container and connect to it from an IPython notebook.

We have included a number of documents with the project to help explain it and provide how-to information:
* A high-level overview of the Spark Kernel and its client library (https://issues.apache.org/jira/secure/attachment/12683624/Kernel%20Architecture.pdf).
* README (https://github.com/ibm-et/spark-kernel/blob/master/README.md) - building and testing the kernel, and deployment options including building the Docker container and packaging the kernel.
* IPython instructions (https://github.com/ibm-et/spark-kernel/blob/master/docs/IPYTHON.md) - setting up the development version of IPython and connecting a Spark Kernel.
* Client library tutorial (https://github.com/ibm-et/spark-kernel/blob/master/docs/CLIENT.md) - building and using the client library to connect to a Spark Kernel.
* Magics documentation (https://github.com/ibm-et/spark-kernel/blob/master/docs/MAGICS.md) - the magics in the kernel and how to write your own.

We think the Spark Kernel will be useful for developing applications for Spark, and we are making it available with the intention of improving these capabilities within the context of the Spark community (https://issues.apache.org/jira/browse/SPARK-4605). We will continue to develop the codebase and welcome your comments and suggestions.
Signed, Chip Senkbeil IBM Emerging Technology Software Engineer
RE: Newest ML-Lib on Spark 1.1
Hi Sean - I should clarify: I was able to build the master, but when running I hit really random-looking protobuf errors (just starting up a spark shell). I can try doing a build later today and give the exact stack trace. I know that 5.2 is running 1.1, but I believe the latest and greatest MLlib is much fresher than the one in 1.1 and specifically includes fixes for ALS to help it scale better. I had built with the exact flags you suggested below. After doing so I tried to run the test suite and run a spark shell without success. Might you have any other suggestions? Thanks!

-Original Message- From: Sean Owen [so...@cloudera.com] Sent: Friday, December 12, 2014 04:54 PM Eastern Standard Time To: Ganelin, Ilya Cc: dev Subject: Re: Newest ML-Lib on Spark 1.1

Could you specify what problems you're seeing? There is nothing special about the CDH distribution at all. The latest and greatest is 1.1, and that is what is in CDH 5.2. You can certainly compile even master for CDH and get it to work though. The safest build flags should be -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.1. 5.3 is just around the corner, and includes 1.2, which is also just around the corner.
Re: Newest ML-Lib on Spark 1.1
What errors do you see? protobuf errors usually mean you didn't build for the right version of Hadoop, but if you are using -Phadoop-2.3 or better -Phadoop-2.4 that should be fine. Yes, a stack trace would be good. I'm still not sure what error you are seeing.

On Fri, Dec 12, 2014 at 10:32 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Hi Sean - I should clarify: I was able to build the master but when running I hit really random looking protobuf errors (just starting up a spark shell).
Re: Newest ML-Lib on Spark 1.1
protobuf errors come from a missing -Phadoop-2.3

On Fri, Dec 12, 2014 at 2:34 PM, Sean Owen so...@cloudera.com wrote: What errors do you see? protobuf errors usually mean you didn't build for the right version of Hadoop, but if you are using -Phadoop-2.3 or better -Phadoop-2.4 that should be fine. Yes, a stack trace would be good. I'm still not sure what error you are seeing.
Re: CrossValidator API in new spark.ml package
Okay, I got it. In Estimator, fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Seq[M] can be overridden to implement a regularization path. Correct me if I'm wrong.

Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

On Fri, Dec 12, 2014 at 11:37 AM, DB Tsai dbt...@dbtsai.com wrote: Hi Xiangrui, It seems that it's stateless, so it will be hard to implement a regularization path.
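The point of overriding the multi-paramMap fit is that the models for different parameter values need not be fit independently - e.g. each regularization value can warm-start from the previous one's solution. A minimal sketch of that pattern in plain Python (this only illustrates the idea, it is not the spark.ml Estimator API; all function names are invented):

```python
# Illustrative sketch of a "regularization path" via warm starts:
# fit_many() reuses the previous lambda's solution as the starting
# point for the next fit, instead of fitting each model from scratch.
# Names are invented for illustration; this is not the spark.ml API.

def fit_one(xs, ys, lam, w0=0.0, steps=200, lr=0.01):
    """1-D ridge regression by gradient descent, starting from w0."""
    w = w0
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        grad += 2 * lam * w              # L2 penalty term
        w -= lr * grad
    return w

def fit_many(xs, ys, lams):
    """Fit a decreasing-lambda path, warm-starting each fit."""
    w, path = 0.0, []
    for lam in sorted(lams, reverse=True):
        w = fit_one(xs, ys, lam, w0=w)   # warm start from previous w
        path.append((lam, w))
    return path

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # data from y = 2x
path = fit_many(xs, ys, [1.0, 0.1, 0.0])
# With no penalty the weight approaches the true slope of 2, and
# stronger penalties shrink it toward 0.
```

This is exactly the kind of shared state a stateless per-ParamMap fit cannot express, which is why the batch fit(dataset, paramMaps) signature is the natural override point.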
Re: IBM open-sources Spark Kernel
Hi Sam, We developed the Spark Kernel with a focus on the newest version of the IPython message protocol (5.0) for the upcoming IPython 3.0 release. We are building around Apache Spark's REPL, which is used in the current Spark Shell implementation. The Spark Kernel was designed to be extensible through magics (https://github.com/ibm-et/spark-kernel/blob/master/docs/MAGICS.md), providing functionality that might be needed outside the Scala interpreter. Finally, a big part of our focus is on application development. Because of this, we are providing a client library for applications to connect to the Spark Kernel without needing to implement the ZeroMQ protocol.

Signed, Chip Senkbeil

From: Sam Bessalah samkiller@gmail.com To: Robert C Senkbeil/Austin/IBM@IBMUS Date: 12/12/2014 04:20 PM Subject: Re: IBM open-sources Spark Kernel

Wow. Thanks. Can't wait to try this out. Great job. How is it different from IScala or ISpark?

On Dec 12, 2014 11:17 PM, Robert C Senkbeil rcsen...@us.ibm.com wrote: We are happy to announce a developer preview of the Spark Kernel which enables remote applications to dynamically interact with Spark.
one hot encoding
Do we have one-hot encoding in Spark MLlib 1.1.1 or 1.2.0? It wasn't available in 1.1.0. Thanks.
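For reference, the transformation being asked about is simple to do by hand while waiting for library support. A minimal sketch in plain Python (not the MLlib API; the helper names here are made up):

```python
# Minimal one-hot encoding sketch: map each categorical value to a
# binary indicator vector. Not the MLlib API; for illustration only.

def one_hot_index(categories):
    """Assign each distinct category a stable column index."""
    return {c: i for i, c in enumerate(sorted(set(categories)))}

def one_hot(value, index):
    """Encode a single value as a 0/1 indicator vector."""
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec

colors = ["red", "green", "blue", "green"]
idx = one_hot_index(colors)            # {'blue': 0, 'green': 1, 'red': 2}
encoded = [one_hot(c, idx) for c in colors]
# 'red' -> [0, 0, 1], 'green' -> [0, 1, 0], 'blue' -> [1, 0, 0]
```

In practice one would also decide how to handle unseen categories at prediction time (error out, or reserve an extra column); the sketch above simply raises a KeyError.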
Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
+1. Tested using spark-perf and the Spark EC2 scripts. I didn’t notice any performance regressions that could not be attributed to changes of default configurations.

To be more specific, when running Spark 1.2.0 with the Spark 1.1.0 settings of spark.shuffle.manager=hash and spark.shuffle.blockTransferService=nio, there was no performance regression and, in fact, there were significant performance improvements for some workloads. In Spark 1.2.0, the new default settings are spark.shuffle.manager=sort and spark.shuffle.blockTransferService=netty. With these new settings, I noticed a performance regression in the scala-sort-by-key-int spark-perf test. However, Spark 1.1.0 and 1.1.1 exhibit a similar performance regression for that same test when run with spark.shuffle.manager=sort, so this regression seems explainable by the change of defaults.

Besides this, most of the other tests ran at the same speeds or faster with the new 1.2.0 defaults. Also, keep in mind that this is a somewhat artificial micro benchmark; I have heard anecdotal reports from many users that their real workloads have run faster with 1.2.0. Based on these results, I’m comfortable giving a +1 on 1.2.0 RC2. - Josh

On December 11, 2014 at 9:52:39 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote: +1 (non-binding). Tested on Ubuntu against YARN.

On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin r...@databricks.com wrote: +1 Tested on OS X.

On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.0! The tag to be voted on is v1.2.0-rc2 (commit a428c446e2): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e The release files, including signatures, digests, etc.
can be found at: http://people.apache.org/~pwendell/spark-1.2.0-rc2/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1055/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/ Please vote on releasing this package as Apache Spark 1.2.0! The vote is open until Saturday, December 13, at 21:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.2.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What justifies a -1 vote for this release? == This vote is happening relatively late into the QA period, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.1.X, minor regressions, or bugs related to new features will not block this release. == What default changes should I be aware of? == 1. The default value of spark.shuffle.blockTransferService has been changed to netty -- Old behavior can be restored by switching to nio 2. The default value of spark.shuffle.manager has been changed to sort. -- Old behavior can be restored by setting spark.shuffle.manager to hash. 
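For anyone who wants to pin the old 1.1.0 shuffle behavior cluster-wide rather than per job, the two properties named above can be set in conf/spark-defaults.conf. A sketch of such a fragment, using only the values from the defaults discussion above:

```properties
# Restore the Spark 1.1.0 shuffle defaults on a 1.2.0 deployment
spark.shuffle.manager                hash
spark.shuffle.blockTransferService   nio
```

The same settings can also be passed per application, e.g. via --conf on spark-submit.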
== How does this differ from RC1? ==
This has fixes for a handful of issues identified - some of the notable fixes are:

[Core]
SPARK-4498: Standalone Master can fail to recognize completed/failed applications

[SQL]
SPARK-4552: Query for empty parquet table in spark sql hive get IllegalArgumentException
SPARK-4753: Parquet2 does not prune based on OR filters on partition columns
SPARK-4761: With JDBC server, set Kryo as default serializer and disable reference tracking
SPARK-4785: When called with arguments referring column fields, PMOD throws NPE

- Patrick
Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
+1

On Fri, Dec 12, 2014 at 8:00 PM, Josh Rosen rosenvi...@gmail.com wrote:
[quoted message elided; identical to the +1 above]
Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
+1

Tested on OS X. Tested Scala 2.10.3, Spark SQL with Hive 0.12 / Hadoop 2.5, Thrift Server, MLlib SVD.

On Fri Dec 12 2014 at 8:57:16 PM Mark Hamstra m...@clearstorydata.com wrote:
[quoted message elided; identical to the votes above]