Re: Tachyon in Spark

2014-12-12 Thread Jun Feng Liu
I think the lineage is the key feature of Tachyon for reproducing an RDD when 
any error happens. Otherwise, there would have to be data replicas among 
Tachyon nodes to ensure redundancy for fault tolerance - I think 
Tachyon is avoiding going down that path. Does it mean the off-heap solution 
is not ready yet if Tachyon lineage does not work right now? 
 
Best Regards
 
Jun Feng Liu
IBM China Systems & Technology Laboratory in Beijing



Phone: 86-10-82452683 
E-mail: liuj...@cn.ibm.com


BLD 28, ZGC Software Park 
No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193 
China 
 

 



Reynold Xin r...@databricks.com 
2014/12/12 10:22

To
Andrew Ash and...@andrewash.com, 
cc
Jun Feng Liu/China/IBM@IBMCN, dev@spark.apache.org 
Subject
Re: Tachyon in Spark






Actually, HY emailed me offline about this, and it is supported in the
latest version of Tachyon. It is a hard problem to push this into the storage
layer; we need to think about how to handle isolation, resource allocation, etc.

https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/master/Dependency.java
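
For readers skimming the archive: the fault-tolerance idea being debated can be sketched abstractly. Instead of replicating data, a lineage-based system records the computation that produced each dataset and re-runs it when data is lost. The Scala below is a toy illustration of that principle only; the names (`Dataset`, `materialize`, `evict`) are invented and are not Tachyon's actual Dependency API linked above.

```scala
// Toy illustration of lineage-based fault tolerance: each dataset records
// the computation that produced it, so lost data can be rebuilt by
// re-running the lineage instead of reading a replica. Invented names,
// not Tachyon's API.
object LineageSketch {
  final case class Dataset(name: String, compute: () => Seq[Int]) {
    private var cached: Option[Seq[Int]] = None
    def materialize(): Seq[Int] = cached.getOrElse {
      val d = compute(); cached = Some(d); d
    }
    def evict(): Unit = cached = None          // simulate losing the data
    def map(childName: String)(f: Int => Int): Dataset =
      Dataset(childName, () => materialize().map(f)) // child lineage points at parent
  }

  def main(args: Array[String]): Unit = {
    val base    = Dataset("base", () => Seq(1, 2, 3))
    val doubled = base.map("doubled")(_ * 2)
    println(doubled.materialize())  // computed once and cached
    doubled.evict(); base.evict()   // lose both copies
    println(doubled.materialize())  // rebuilt from lineage, no replica needed
  }
}
```

The point of the sketch is the trade-off under discussion: recomputation costs time on failure, replication costs space all the time.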


On Thu, Dec 11, 2014 at 3:54 PM, Reynold Xin r...@databricks.com wrote:

 I don't think the lineage thing is even turned on in Tachyon - it was
 mostly a research prototype, so I don't think it'd make sense for us to 
use
 that.


 On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash and...@andrewash.com 
wrote:

 I'm interested in understanding this as well.  One of the main ways
 Tachyon
 is supposed to realize performance gains without sacrificing durability 
is
 by storing the lineage of data rather than full copies of it (similar 
to
 Spark).  But if Spark isn't sending lineage information into Tachyon, 
then
 I'm not sure how this isn't a durability concern.

 On Wed, Dec 10, 2014 at 5:47 AM, Jun Feng Liu liuj...@cn.ibm.com 
wrote:

  Does Spark today really leverage Tachyon lineage to process data? It seems
  like the application should call the createDependency function in TachyonFS
  to create a new lineage node. But I did not find any place that calls it in
  Spark code. Did I miss anything?
 
  Best Regards
 
 
  *Jun Feng Liu*
  IBM China Systems & Technology Laboratory in Beijing
 
 --
  *Phone:* 86-10-82452683
 
  *E-mail:* liuj...@cn.ibm.com
 
  BLD 28,ZGC Software Park
  No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
  China
 
 
 
 
 






Re: jenkins downtime: 730-930am, 12/12/14

2014-12-12 Thread shane knapp
reminder:  jenkins is going down NOW.

On Thu, Dec 11, 2014 at 3:08 PM, shane knapp skn...@berkeley.edu wrote:

 here's the plan...  reboots, of course, come last.  :)

 pause build queue at 7am, kill off (and eventually retrigger) any
 stragglers at 8am.  then begin maintenance:

 all systems:
 * yum update all servers (amp-jenkins-master, amp-jenkins-slave-{01..05},
 amp-jenkins-worker-{01..08})
 * reboots

 jenkins slaves:
 * install python2.7 (alongside 2.6, which will remain the default)
 * install numpy 1.9.1 (currently on 1.4, breaking some spark branch builds)
 * add new slaves to the master, remove old ones (keep them around just in
 case)

 there will be no jenkins system or plugin upgrades at this time.  things
 there seem to be working just fine!

 i'm expecting to be up and building by 9am at the latest.  i'll update
 this thread w/any new time estimates.

 word.

 shane, your rained-in devops guy :)

 On Wed, Dec 10, 2014 at 11:28 AM, shane knapp skn...@berkeley.edu wrote:

 reminder -- this is happening friday morning @ 730am!

 On Mon, Dec 1, 2014 at 5:10 PM, shane knapp skn...@berkeley.edu wrote:

 i'll send out a reminder next week, but i wanted to give a heads up:
  i'll be bringing down the entire jenkins infrastructure for reboots and
 system updates.

 please let me know if there are any conflicts with this, thanks!

 shane






Re: jenkins downtime: 730-930am, 12/12/14

2014-12-12 Thread shane knapp
downtime is extended to 10am PST so that i can finish testing the numpy
upgrade...  besides that, everything looks good and the system updates and
reboots went off w/o a hitch.

shane

On Fri, Dec 12, 2014 at 7:26 AM, shane knapp skn...@berkeley.edu wrote:

 reminder:  jenkins is going down NOW.








Re: jenkins downtime: 730-930am, 12/12/14

2014-12-12 Thread shane knapp
ok, we're back up w/all new jenkins workers.  i'll be keeping an eye on
these pretty closely today for any build failures caused by the new
systems, and if things look bleak, i'll switch back to the original five.

thanks for your patience!

On Fri, Dec 12, 2014 at 8:47 AM, shane knapp skn...@berkeley.edu wrote:

 downtime is extended to 10am PST so that i can finish testing the numpy
 upgrade...  besides that, everything looks good and the system updates and
 reboots went off w/o a hitch.

 shane

 On Fri, Dec 12, 2014 at 7:26 AM, shane knapp skn...@berkeley.edu wrote:

 reminder:  jenkins is going down NOW.









Re: zinc invocation examples

2014-12-12 Thread Patrick Wendell
Hey York - I'm sending some feedback off-list, feel free to open a PR as well.


On Tue, Dec 9, 2014 at 12:05 PM, York, Brennon
brennon.y...@capitalone.com wrote:
 Patrick, I've nearly completed a basic build out for the SPARK-4501 issue
 (at https://github.com/brennonyork/spark/tree/SPARK-4501) and it would be
 great to get your initial read on it. Per this thread I need to add in the
 -scala-home call to zinc, but it's close to ready for a PR.

 On 12/5/14, 2:10 PM, Patrick Wendell pwend...@gmail.com wrote:

One thing I created a JIRA for a while back was to have a similar
script to sbt/sbt that transparently downloads Zinc, Scala, and
Maven in a subdirectory of Spark and sets it up correctly. I.e.
build/mvn.

Outside of brew for MacOS there aren't good Zinc packages, and it's a
pain to figure out how to set it up.

https://issues.apache.org/jira/browse/SPARK-4501

Prashant Sharma looked at this for a bit but I don't think he's
working on it actively any more, so if someone wanted to do this, I'd
be extremely grateful.

- Patrick

On Fri, Dec 5, 2014 at 11:05 AM, Ryan Williams
ryan.blake.willi...@gmail.com wrote:
 fwiw I've been using `zinc -scala-home $SCALA_HOME -nailed -start`
which:

 - starts a nailgun server as well,
 - uses my installed scala 2.{10,11}, as opposed to zinc's default 2.9.2.
 Per https://github.com/typesafehub/zinc#scala: If no options are passed to
 locate a version of Scala then Scala 2.9.2 is used by default (which is
 bundled with zinc).

 The latter seems like it might be especially important.


 On Thu Dec 04 2014 at 4:25:32 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Oh, derp. I just assumed from looking at all the options that there was
 something to it. Thanks Sean.

 On Thu Dec 04 2014 at 7:47:33 AM Sean Owen so...@cloudera.com wrote:

  You just run it once with zinc -start and leave it running as a
  background process on your build machine. You don't have to do
  anything for each build.
 
  On Wed, Dec 3, 2014 at 3:44 PM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
   https://github.com/apache/spark/blob/master/docs/
  building-spark.md#speeding-up-compilation-with-zinc
  
   Could someone summarize how they invoke zinc as part of a regular
   build-test-etc. cycle?
  
   I'll add it in to the aforelinked page if appropriate.
  
   Nick
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


 

 The information contained in this e-mail is confidential and/or proprietary 
 to Capital One and/or its affiliates. The information transmitted herewith is 
 intended only for use by the individual or entity to which it is addressed.  
 If the reader of this message is not the intended recipient, you are hereby 
 notified that any review, retransmission, dissemination, distribution, 
 copying or other use of, or taking of any action in reliance upon this 
 information is strictly prohibited. If you have received this communication 
 in error, please contact the sender and delete the material from your 
 computer.





CrossValidator API in new spark.ml package

2014-12-12 Thread DB Tsai
Hi Xiangrui,

It seems that it's stateless, so it will be hard to implement a
regularization path. Any suggestions for extending it? Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai




Newest ML-Lib on Spark 1.1

2014-12-12 Thread Ganelin, Ilya
Hi all – we’re running CDH 5.2 and would be interested in having the latest and 
greatest ML Lib version on our cluster (with YARN). Could anyone help me out in 
terms of figuring out what build profiles to use to get this to play well? Will 
I be able to update ML-Lib independently of updating the rest of Spark to 1.2 
and beyond? I ran into numerous issues trying to build 1.2 against CDH’s Hadoop 
deployment. Alternatively, if anyone has managed to get the trunk successfully 
built and tested against Cloudera’s YARN and Hadoop for 5.2, I would love some 
help. Thanks!




Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
For CDH this works well for me...tested till 5.1...

./make-distribution -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn
-Phive -DskipTests

This builds with Hive thriftserver support for spark-sql.

On Fri, Dec 12, 2014 at 1:41 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

 Hi all – we’re running CDH 5.2 and would be interested in having the
 latest and greatest ML Lib version on our cluster (with YARN). Could anyone
 help me out in terms of figuring out what build profiles to use to get this
 to play well? Will I be able to update ML-Lib independently of updating the
 rest of spark to 1.2 and beyond? I ran into numerous issues trying to build
 1.2 against CDH’s Hadoop deployment. Alternately, if anyone has managed to
 get the trunk successfully built and tested against Cloudera’s YARN and
 Hadoop for 5.2 I would love some help. Thanks!
 




Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Sean Owen
Could you specify what problems you're seeing? There is nothing
special about the CDH distribution at all.

The latest and greatest is 1.1, and that is what is in CDH 5.2. You
can certainly compile even master for CDH and get it to work though.

The safest build flags should be -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.1.

5.3 is just around the corner, and includes 1.2, which is also just
around the corner.

On Fri, Dec 12, 2014 at 9:41 PM, Ganelin, Ilya
ilya.gane...@capitalone.com wrote:
 Hi all – we’re running CDH 5.2 and would be interested in having the latest 
 and greatest ML Lib version on our cluster (with YARN). Could anyone help me 
 out in terms of figuring out what build profiles to use to get this to play 
 well? Will I be able to update ML-Lib independently of updating the rest of 
 spark to 1.2 and beyond? I ran into numerous issues trying to build 1.2 
 against CDH’s Hadoop deployment. Alternately, if anyone has managed to get 
 the trunk successfully built and tested against Cloudera’s YARN and Hadoop 
 for 5.2 I would love some help. Thanks!
 




IBM open-sources Spark Kernel

2014-12-12 Thread Robert C Senkbeil



We are happy to announce a developer preview of the Spark Kernel which
enables remote applications to dynamically interact with Spark. You can
think of the Spark Kernel as a remote Spark Shell that uses the IPython
notebook interface to provide a common entrypoint for any application. The
Spark Kernel obviates the need to submit jars using spark-submit, and can
replace the existing Spark Shell.

You can try out the Spark Kernel today by installing it from our github
repo at https://github.com/ibm-et/spark-kernel. To help you get a demo
environment up and running quickly, the repository also includes a
Dockerfile and a Vagrantfile to build a Spark Kernel container and connect
to it from an IPython notebook.

We have included a number of documents with the project to help explain it
and provide how-to information:

* A high-level overview of the Spark Kernel and its client library (
https://issues.apache.org/jira/secure/attachment/12683624/Kernel%20Architecture.pdf
).

* README (https://github.com/ibm-et/spark-kernel/blob/master/README.md) -
building and testing the kernel, and deployment options including building
the Docker container and packaging the kernel.

* IPython instructions (
https://github.com/ibm-et/spark-kernel/blob/master/docs/IPYTHON.md) -
setting up the development version of IPython and connecting a Spark
Kernel.

* Client library tutorial (
https://github.com/ibm-et/spark-kernel/blob/master/docs/CLIENT.md) -
building and using the client library to connect to a Spark Kernel.

* Magics documentation (
https://github.com/ibm-et/spark-kernel/blob/master/docs/MAGICS.md) - the
magics in the kernel and how to write your own.

We think the Spark Kernel will be useful for developing applications for
Spark, and we are making it available with the intention of improving these
capabilities within the context of the Spark community (
https://issues.apache.org/jira/browse/SPARK-4605). We will continue to
develop the codebase and welcome your comments and suggestions.


Signed,

Chip Senkbeil
IBM Emerging Technology Software Engineer

RE: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Ganelin, Ilya
Hi Sean - I should clarify: I was able to build the master, but when running I 
hit really random-looking protobuf errors (just starting up a spark shell). I 
can try doing a build later today and give the exact stack trace.

I know that 5.2 is running 1.1, but I believe the latest and greatest MLlib is 
much fresher than the one in 1.1 and specifically includes fixes for ALS to 
help it scale better.

I had built with the exact flags you suggested below. After doing so I tried to 
run the test suite and run a spark shell without success. Might you have any 
other suggestions? Thanks!





Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Sean Owen
What errors do you see? protobuf errors usually mean you didn't build
for the right version of Hadoop, but if you are using -Phadoop-2.3 or
better -Phadoop-2.4 that should be fine. Yes, a stack trace would be
good. I'm still not sure what error you are seeing.

On Fri, Dec 12, 2014 at 10:32 PM, Ganelin, Ilya
ilya.gane...@capitalone.com wrote:
 Hi Sean - I should clarify: I was able to build the master, but when running
 I hit really random-looking protobuf errors (just starting up a spark
 shell). I can try doing a build later today and give the exact stack trace.

 I know that 5.2 is running 1.1, but I believe the latest and greatest MLlib
 is much fresher than the one in 1.1 and specifically includes fixes for ALS
 to help it scale better.

 I had built with the exact flags you suggested below. After doing so I tried
 to run the test suite and run a spark shell without success. Might you have
 any other suggestions? Thanks!






Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
The protobuf errors come from a missing -Phadoop-2.3

On Fri, Dec 12, 2014 at 2:34 PM, Sean Owen so...@cloudera.com wrote:

 What errors do you see? protobuf errors usually mean you didn't build
 for the right version of Hadoop, but if you are using -Phadoop-2.3 or
 better -Phadoop-2.4 that should be fine. Yes, a stack trace would be
 good. I'm still not sure what error you are seeing.





Re: CrossValidator API in new spark.ml package

2014-12-12 Thread DB Tsai
Okay, I got it. In Estimator, fit(dataset: SchemaRDD, paramMaps:
Array[ParamMap]): Seq[M] can be overridden to implement a
regularization path. Correct me if I'm wrong.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Dec 12, 2014 at 11:37 AM, DB Tsai dbt...@dbtsai.com wrote:
 Hi Xiangrui,

 It seems that it's stateless so will be hard to implement
 regularization path. Any suggestion to extend it? Thanks.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai
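
A side note for readers of the archive: the `fit(dataset, paramMaps): Seq[M]` shape DB Tsai mentions returns one model per parameter setting from a single call, which is what makes a regularization path expressible. The plain-Scala sketch below illustrates that shape with a 1-D ridge regression solved in closed form; the object and method names are invented stand-ins, not the spark.ml API.

```scala
// Sketch of a regularization path via one fit() call over many parameter
// settings (here just lambda values). Illustrative names, not spark.ml.
object RegPathSketch {
  final case class Model(lambda: Double, weight: Double)

  // Ridge regression on 1-D data has the closed form
  //   w = sum(x*y) / (sum(x*x) + lambda)
  def fitOne(xs: Array[Double], ys: Array[Double], lambda: Double): Model = {
    val xy = xs.zip(ys).map { case (x, y) => x * y }.sum
    val xx = xs.map(x => x * x).sum
    Model(lambda, xy / (xx + lambda))
  }

  // The fit(dataset, paramMaps): Seq[M] shape: one call, many models,
  // sweeping the regularization parameter from weakest to strongest.
  def fit(xs: Array[Double], ys: Array[Double], lambdas: Array[Double]): Seq[Model] =
    lambdas.sorted.toSeq.map(l => fitOne(xs, ys, l))

  def main(args: Array[String]): Unit = {
    val xs = Array(1.0, 2.0, 3.0)
    val ys = Array(2.0, 4.0, 6.0)  // y = 2x exactly, so w = 2 at lambda = 0
    fit(xs, ys, Array(0.0, 1.0, 10.0)).foreach(m =>
      println(f"lambda=${m.lambda}%.1f weight=${m.weight}%.4f"))
  }
}
```

With an iterative solver instead of the closed form, a stateful implementation of `fit` could warm-start each model from the previous lambda's solution, which is the efficiency the thread is after.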




Re: IBM open-sources Spark Kernel

2014-12-12 Thread Robert C Senkbeil

Hi Sam,

We developed the Spark Kernel with a focus on the newest version of the
IPython message protocol (5.0) for the upcoming IPython 3.0 release.

We are building around Apache Spark's REPL, which is used in the current
Spark Shell implementation.

The Spark Kernel was designed to be extensible through magics (
https://github.com/ibm-et/spark-kernel/blob/master/docs/MAGICS.md),
providing functionality that might be needed outside the Scala interpreter.

Finally, a big part of our focus is on application development. Because of
this, we are providing a client library for applications to connect to the
Spark Kernel without needing to implement the ZeroMQ protocol.

Signed,
Chip Senkbeil



From:   Sam Bessalah samkiller@gmail.com
To: Robert C Senkbeil/Austin/IBM@IBMUS
Date:   12/12/2014 04:20 PM
Subject:    Re: IBM open-sources Spark Kernel



Wow. Thanks. Can't wait to try this out.
Great job.
How is it different from IScala or ISpark?



one hot encoding

2014-12-12 Thread Lochana Menikarachchi
Do we have one-hot encoding in Spark MLlib 1.1.1 or 1.2.0? It wasn't 
available in 1.1.0.

Thanks.
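
In the meantime, a minimal one-hot encoding can be done in plain Scala without
any MLlib API. This is only an illustrative sketch, not Spark code: the object
and method names (OneHotSketch, oneHot) are made up here, and categories are
sorted so index assignment is deterministic.

```scala
// Hedged sketch: hand-rolled one-hot encoding in plain Scala.
// Not an MLlib API; all names here are illustrative.
object OneHotSketch {
  /** Map each distinct category to a one-hot vector (sorted for determinism). */
  def oneHot(categories: Seq[String]): Map[String, Array[Double]] = {
    val index = categories.distinct.sorted.zipWithIndex.toMap
    index.map { case (cat, i) =>
      val vec = Array.fill(index.size)(0.0)
      vec(i) = 1.0
      cat -> vec
    }
  }

  def main(args: Array[String]): Unit = {
    val enc = oneHot(Seq("red", "green", "blue", "red"))
    // sorted order is blue, green, red, so "red" gets index 2
    println(enc("red").mkString(","))
  }
}
```

To use this at scale one would typically build the category index once on the
driver and broadcast it, then map rows to vectors inside an RDD transformation.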

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-12 Thread Josh Rosen
+1.  Tested using spark-perf and the Spark EC2 scripts.  I didn’t notice any 
performance regressions that could not be attributed to changes of default 
configurations.  To be more specific, when running Spark 1.2.0 with the Spark 
1.1.0 settings of spark.shuffle.manager=hash and 
spark.shuffle.blockTransferService=nio, there was no performance regression 
and, in fact, there were significant performance improvements for some 
workloads.

In Spark 1.2.0, the new default settings are spark.shuffle.manager=sort and 
spark.shuffle.blockTransferService=netty.  With these new settings, I noticed a 
performance regression in the scala-sort-by-key-int spark-perf test.  However, 
Spark 1.1.0 and 1.1.1 exhibit a similar performance regression for that same 
test when run with spark.shuffle.manager=sort, so this regression seems 
explainable by the change of defaults.  Besides this, most of the other tests 
ran at the same speeds or faster with the new 1.2.0 defaults.  Also, keep in 
mind that this is a somewhat artificial micro benchmark; I have heard anecdotal 
reports from many users that their real workloads have run faster with 1.2.0.

Based on these results, I’m comfortable giving a +1 on 1.2.0 RC2.

- Josh

On December 11, 2014 at 9:52:39 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote:

+1 (non-binding). Tested on Ubuntu against YARN.  

On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin r...@databricks.com wrote:  

 +1  
  
 Tested on OS X.  
  
 On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com  
 wrote:  
  
  Please vote on releasing the following candidate as Apache Spark version  
  1.2.0!  
   
  The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):  
   
   
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
   
   
  The release files, including signatures, digests, etc. can be found at:  
  http://people.apache.org/~pwendell/spark-1.2.0-rc2/  
   
  Release artifacts are signed with the following key:  
  https://people.apache.org/keys/committer/pwendell.asc  
   
  The staging repository for this release can be found at:  
  https://repository.apache.org/content/repositories/orgapachespark-1055/  
   
  The documentation corresponding to this release can be found at:  
  http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/  
   
  Please vote on releasing this package as Apache Spark 1.2.0!  
   
  The vote is open until Saturday, December 13, at 21:00 UTC and passes  
  if a majority of at least 3 +1 PMC votes are cast.  
   
  [ ] +1 Release this package as Apache Spark 1.2.0  
  [ ] -1 Do not release this package because ...  
   
  To learn more about Apache Spark, please see  
  http://spark.apache.org/  
   
  == What justifies a -1 vote for this release? ==  
  This vote is happening relatively late into the QA period, so  
  -1 votes should only occur for significant regressions from  
  1.0.2. Bugs already present in 1.1.X, minor  
  regressions, or bugs related to new features will not block this  
  release.  
   
  == What default changes should I be aware of? ==  
  1. The default value of spark.shuffle.blockTransferService has been  
  changed to netty  
  -- Old behavior can be restored by switching to nio  
   
  2. The default value of spark.shuffle.manager has been changed to  
 sort.  
  -- Old behavior can be restored by setting spark.shuffle.manager to  
  hash.  
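   For reference, the two defaults above can be restored programmatically as
   well as via configuration files. A minimal sketch (assuming Spark is on the
   classpath; only the two property names are taken from the text above, the
   app name is illustrative):

```scala
import org.apache.spark.SparkConf

// Restore the Spark 1.1.x shuffle behavior under 1.2.0.
// The two property keys and values are quoted from the release notes above;
// the app name is a hypothetical placeholder.
val conf = new SparkConf()
  .setAppName("legacy-shuffle-defaults")
  .set("spark.shuffle.manager", "hash")              // 1.2.0 default: sort
  .set("spark.shuffle.blockTransferService", "nio")  // 1.2.0 default: netty
```

   The same keys can equivalently be set in spark-defaults.conf or passed with
   --conf on the spark-submit command line.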
   
  == How does this differ from RC1 ==  
  This has fixes for a handful of issues identified - some of the  
  notable fixes are:  
   
  [Core]  
  SPARK-4498: Standalone Master can fail to recognize completed/failed  
  applications  
   
  [SQL]  
  SPARK-4552: Query for empty parquet table in spark sql hive get  
  IllegalArgumentException  
  SPARK-4753: Parquet2 does not prune based on OR filters on partition  
  columns  
  SPARK-4761: With JDBC server, set Kryo as default serializer and  
  disable reference tracking  
  SPARK-4785: When called with arguments referring column fields, PMOD  
  throws NPE  
   
  - Patrick  
   
  -  
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
   
   
  


Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-12 Thread Mark Hamstra
+1

On Fri, Dec 12, 2014 at 8:00 PM, Josh Rosen rosenvi...@gmail.com wrote:




Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-12 Thread Denny Lee
+1. Tested on OS X.

Tested Scala 2.10.3, Spark SQL with Hive 0.12 / Hadoop 2.5, Thrift Server,
MLlib SVD


On Fri Dec 12 2014 at 8:57:16 PM Mark Hamstra m...@clearstorydata.com
wrote:
