Linear Regression with SGD
Hi user group,

We are using Spark's linear regression with SGD as the optimization technique, and we are getting very sub-optimal results. Can anyone shed some light on why this implementation produces such poor results compared to our own implementation? We are using a very small dataset, yet we have to use a very large number of iterations to achieve results similar to ours; we've tried normalizing the data, not normalizing the data, and tuning every parameter. Our implementation is a closed-form solution, so we are guaranteed convergence, while the Spark one is not, which is understandable, but why is it so far off? Has anyone experienced this?

Steve Carman, M.S.
Artificial Intelligence Engineer
Coldlight-PTC
scar...@coldlight.com

This e-mail is intended solely for the above-mentioned recipient and it may contain confidential or privileged information. If you have received it in error, please notify us immediately and delete the e-mail. You must not copy, distribute, disclose or take any action in reliance on it. In addition, the contents of an attachment to this e-mail may contain software viruses which could damage your own computer system. While ColdLight Solutions, LLC has taken every reasonable precaution to minimize this risk, we cannot accept liability for any damage which you sustain as a result of software viruses. You should perform your own virus checks before opening the attachment.
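Not a diagnosis of the MLlib internals, but the gap between a closed-form solver and a gradient method is easy to reproduce even on a toy problem: gradient descent needs many iterations and a well-tuned step size to approach what the normal equations give in one shot. A minimal pure-Python sketch (plain batch gradient descent standing in for SGD; all names and values here are illustrative, not Spark code):

```python
# Toy data with an exact linear relationship: y = 3x + 2.
xs = [0.1 * i for i in range(20)]
ys = [3.0 * x + 2.0 for x in xs]
n = len(xs)

# Closed-form ordinary least squares (one feature): exact in a single step.
mx = sum(xs) / n
my = sum(ys) / n
w_cf = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b_cf = my - w_cf * mx

# Gradient descent on squared error: needs many iterations and a carefully
# chosen learning rate (too large diverges, too small crawls).
w, b, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    grad_w = 2.0 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = 2.0 / n * sum((w * x + b - y) for x, y in zip(xs, ys))
    w -= lr * grad_w
    b -= lr * grad_b

print(w_cf, b_cf)  # exact: 3.0, 2.0
print(w, b)        # close to 3.0, 2.0 only after thousands of steps
```

The same mechanics apply in MLlib's SGD-based solvers: iteration count and step size dominate the result, which is why a tiny dataset can still demand a very large number of iterations.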
Where does partitioning and data loading happen?
A colleague and I were having a discussion about something in Spark/Mesos that perhaps someone can shed some light on. We have a Mesos cluster that runs Spark via a sparkHome, rather than downloading an executable and such. My colleague says that if we have Parquet files in S3, the slaves should know what data is in their partition and pull from S3 only the Parquet partitions they need, but this seems inherently wrong to me, as I have no idea how Spark or Mesos could know, on the slave, which partitions to pull. It makes much more sense to me for the partitioning to be done on the driver and then distributed to the slaves, so the slaves don't have to worry about these details. If that were the case, there would be some data loading done on the driver, correct? Or does Spark/Mesos do some magic to pass a reference so the slaves know what to pull, per se? So, in summation: where do partitioning and data loading happen? On the driver or on the executor?

Thanks,
Steve
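For what it's worth, the usual model (and, as far as I understand, Spark's) is a middle ground: the driver computes only partition *metadata* (which files or byte ranges make up each partition) and ships those descriptions to the executors; the actual bytes are read by the executor that owns the partition, so no bulk data loading happens on the driver. A toy sketch of that split (all names here are hypothetical, not Spark APIs):

```python
# "Driver" side: plan partitions as metadata only; no data is read here.
def plan_partitions(file_sizes, split_bytes):
    """Return (filename, start, end) byte-range descriptors, one per partition."""
    parts = []
    for name, size in file_sizes.items():
        for start in range(0, size, split_bytes):
            parts.append((name, start, min(start + split_bytes, size)))
    return parts

# "Executor" side: each worker receives only its own descriptors and reads them.
def load_partition(descriptor, storage):
    name, start, end = descriptor
    return storage[name][start:end]  # stand-in for an S3 range read

storage = {"a.parquet": b"x" * 250, "b.parquet": b"y" * 100}
plan = plan_partitions({n: len(d) for n, d in storage.items()}, split_bytes=100)
print(len(plan))                              # 4 descriptors: 3 for a.parquet, 1 for b.parquet
print(len(load_partition(plan[0], storage)))  # 100 bytes read on the "executor"
```

The descriptors are tiny, so shipping them from the driver is cheap, while the heavy reads stay on the workers.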
RE: swap tuple
Yeah, I wouldn't try to modify the current one, since RDDs are supposed to be immutable; just create a new one:

val newRdd = oldRdd.map(r => (r._2, r._1))

or something of that nature.

Steve

From: Evo Eftimov [evo.efti...@isecc.com]
Sent: Thursday, May 14, 2015 1:24 PM
To: 'Holden Karau'; 'Yasemin Kaya'
Cc: user@spark.apache.org
Subject: RE: swap tuple

Where is the "Tuple" supposed to be in <String, String>? You could refer to a "Tuple" if it was e.g. <String, Tuple2<String, String>>.

From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of Holden Karau
Sent: Thursday, May 14, 2015 5:56 PM
To: Yasemin Kaya
Cc: user@spark.apache.org
Subject: Re: swap tuple

Can you paste your code? Transformations return a new RDD rather than modifying an existing one, so if you swap the values of the tuple using a map, you get back a new RDD, and you would then want to print this new RDD instead of the original one.

On Thursday, May 14, 2015, Yasemin Kaya godo...@gmail.com wrote:

Hi, I have a JavaPairRDD<String, String> and I want to swap tuple._1() and tuple._2(). I use tuple.swap(), but the JavaPairRDD doesn't actually change; when I print the JavaPairRDD, the values are the same. Can anyone help me with that? Thank you. Have a nice day.
yasemin

--
hiç ender hiç

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau
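Holden's point (a transformation never mutates the collection it is called on; it returns a new one you must hold on to) is language-independent. A plain-Python analogue of the mistake and the fix:

```python
pairs = [("a", 1), ("b", 2)]

# Mistake: invoking a transformation and discarding its result.
[(v, k) for k, v in pairs]   # builds the swapped pairs, but nothing keeps them
print(pairs)                 # unchanged: [('a', 1), ('b', 2)]

# Fix: bind the new collection and use *it* from then on.
swapped = [(v, k) for k, v in pairs]
print(swapped)               # [(1, 'a'), (2, 'b')]
```

The same applies to `tuple.swap()` inside a `map` over a JavaPairRDD: the swap produces a new tuple for a new RDD; printing the original RDD will always show the old values.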
Re: Spark on Mesos
Sander,

I eventually solved this problem via the --[no-]switch_user flag, which is set to true by default. I set it to false, which has the user that owns the process run the job; otherwise it was my username (scarman) running the job, which would fail because my username obviously didn't exist there. When run as root, it worked fine with no problems whatsoever.

Hopefully this works for you too,
Steve

On May 13, 2015, at 11:45 AM, Sander van Dijk sgvand...@gmail.com wrote:

Hey all, I seem to be experiencing the same thing as Stephen. I run Spark 1.2.1 with Mesos 0.22.1, with Spark coming from the spark-1.2.1-bin-hadoop2.4.tgz prebuilt package and Mesos installed from the Mesosphere repositories. I had been running Spark standalone successfully for a while and am now trying to set up Mesos. Mesos is up and running, and the UI at port 5050 reports all slaves alive. I then run the Spark shell with `spark-shell --master mesos://1.1.1.1:5050` (with 1.1.1.1 the master's IP address), which starts up fine, with output:

I0513 15:02:45.340287 28804 sched.cpp:448] Framework registered with 20150512-150459-2618695596-5050-3956-0009
15/05/13 15:02:45 INFO mesos.MesosSchedulerBackend: Registered as framework ID 20150512-150459-2618695596-5050-3956-0009

and the framework shows up in the Mesos UI. Then trying to run something (e.g. 'val rdd = sc.textFile(path); rdd.count') fails with lost executors. In /var/log/mesos-slave.ERROR on the slave instances there are entries like:

E0513 14:57:01.198995 13077 slave.cpp:3112] Container 'eaf33d36-dde5-498a-9ef1-70138810a38c' for executor '20150512-145720-2618695596-5050-3082-S10' of framework '20150512-150459-2618695596-5050-3956-0009' failed to start: Failed to execute mesos-fetcher: Failed to chown work directory

From what I can find, the work directory is in /tmp/mesos, where indeed I see a directory structure with executor and framework IDs, with stdout and stderr files of size 0 at the leaves.
Everything there is owned by root, but I assume the processes are also run by root, so any chowning in there should be possible. I was thinking maybe it fails to fetch the Spark package executor? I uploaded spark-1.2.1-bin-hadoop2.4.tgz to HDFS, SPARK_EXECUTOR_URI is set in spark-env.sh, and in the Environment section of the web UI I see this picked up in the spark.executor.uri parameter. I checked and the URI is reachable by the slaves: an `hdfs dfs -stat $SPARK_EXECUTOR_URI` is successful. Any pointers?

Many thanks, Sander

On Fri, May 1, 2015 at 8:35 AM Tim Chen t...@mesosphere.io wrote:

Hi Stephen, it looks like the Mesos slave was most likely not able to launch some Mesos helper processes (the fetcher, probably?). How did you install Mesos? Did you build it from source yourself? Please install Mesos through a package, or if building from source, run make install and run from the installed binary.

Tim

On Mon, Apr 27, 2015 at 11:11 AM, Stephen Carman scar...@coldlight.com wrote:

So I installed Spark 1.3.1 built with hadoop2.6 on each of the slaves; I basically got the pre-built from the Spark website. I placed those compiled Spark installs on each slave at /opt/spark. My Spark properties seem to be getting picked up on my side fine (Screen Shot 2015-04-27 at 10.30.01 AM.png). The framework is registered in Mesos and shows up just fine; it doesn't matter whether I turn off the executor URI or not, I always get the same error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
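For anyone hitting the same "Failed to chown work directory" error, the fix described in the reply above amounts to either disabling user switching on the slaves or ensuring the submitting user exists there. A sketch of the two usual ways to set the flag; the flag-file path is an assumption based on Mesosphere packaging conventions, so check your own distribution:

```shell
# Option 1: pass the flag directly when starting the slave.
mesos-slave --master=zk://zk1:2181/mesos --no-switch_user

# Option 2 (Mesosphere packages): drop a flag file and restart the service.
# Path may differ by distribution; verify before relying on it.
echo false | sudo tee /etc/mesos-slave/switch_user
sudo service mesos-slave restart
```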
Re: Spark on Mesos
Yup, exactly as Tim mentioned. I went back and tried what you just suggested, and that was also perfectly fine.

Steve

On May 13, 2015, at 1:58 PM, Tim Chen t...@mesosphere.io wrote:

Hi Stephen,

You probably didn't run the Spark driver/shell as root; the Mesos scheduler will pick up your local user, try to impersonate that same user, and chown the directory before executing any task. If you run the Spark driver as root, it should resolve the problem. No switch user can also work, as it then won't try to switch user for you.

Tim
Re: Spark on Mesos
So I installed Spark 1.3.1 built with hadoop2.6 on each of the slaves; I basically got the pre-built from the Spark website. I placed those compiled Spark installs on each slave at /opt/spark. My Spark properties seem to be getting picked up on my side fine (screenshot attached). The framework is registered in Mesos and shows up just fine; it doesn't matter whether I turn off the executor URI or not, I always get the same error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

These boxes are totally open to one another, so they shouldn't have any firewall issues; everything seems to show up in Mesos and Spark just fine, but actually running anything totally blows up. There is nothing in stderr or stdout; it downloads the package and untars it but doesn't seem to do much after that. Any insights?

Steve

On Apr 24, 2015, at 5:50 PM, Yang Lei genia...@gmail.com wrote:

SPARK_PUBLIC_DNS, SPARK_LOCAL_IP, SPARK_LOCAL_HOST
Spark on Mesos
I can't for the life of me get even something simple working for Spark on Mesos. I installed a 3-master, 3-slave Mesos cluster, which is all configured, but I can't even get the Spark shell to work properly. I get errors like this:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

I tried both Mesos 0.21 and 0.22, and they both produce the same error. My version of Spark is 1.3.1 with hadoop 2.6; I just downloaded the pre-built from the site, or is that wrong and do I have to build it myself? I have MESOS_NATIVE_JAVA_LIBRARY, the Spark executor URI, and the Mesos master set in my spark-env.sh, and to the best of my ability they seem correct. I'm running this on Red Hat 7 with 8 CPU cores and 14 GB of RAM per slave, so 24 cores and 42 GB of RAM total. Does anyone have any insight into what is going on here?

Thanks,
Steve

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
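For reference, the Mesos-related settings mentioned (native library, executor URI, master address) live in conf/spark-env.sh. A sketch of what such a setup looks like; the paths, hostnames, and addresses below are placeholders, not values from this thread:

```shell
# conf/spark-env.sh -- example values only; adjust paths for your install
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://namenode:8020/spark/spark-1.3.1-bin-hadoop2.6.tgz

# then point the shell at the Mesos master
./bin/spark-shell --master mesos://10.0.0.1:5050
```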
Spark Memory Utilities
I noticed Spark has some nice memory-tracking estimators in it, but they are private. We have custom implementations of RDD and PairRDD to suit our internal needs, and it would be fantastic if we could leverage the memory estimates that already exist in Spark. Is there any chance they could be made public in the library, or exposed through some interface, so that subclasses can make use of them?

Thanks,
Stephen Carman, M.S.
AI Engineer, Coldlight Solutions, LLC
Cell - 267 240 0363
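Until those internals are opened up, a rough stand-in is a recursive deep-size walk over an object graph; crude compared to a JVM-aware estimator, but often enough for ballpark partition sizing. A minimal Python sketch of the idea (not Spark's algorithm, just the general technique):

```python
import sys

def deep_size(obj, _seen=None):
    """Rough recursive in-memory size estimate in bytes; shared objects counted once."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:
        return 0
    _seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, _seen) + deep_size(v, _seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(x, _seen) for x in obj)
    return size

record = {"id": 1, "tags": ["a", "b"], "payload": b"x" * 1024}
print(deep_size(record))  # > 1024: payload bytes plus container overhead
```

A JVM estimator additionally has to account for object headers, alignment, and reference sizes, which is exactly why having Spark's own version exposed would be preferable to reimplementing it.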