Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.
Test accuracy doesn't determine the total loss. Any threshold in (-1, 1) separates the points -1 and +1 and gives 1.0 accuracy, but the corresponding losses are different. -Xiangrui

On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi, we have used LogisticRegression with two different optimization methods, SGD and LBFGS, in MLlib. With the same dataset and the same training and test split, we get different weight vectors. For example, we use spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our training and test dataset, with LogisticRegressionWithSGD and LogisticRegressionWithLBFGS as the training methods and the same other parameters. The precision of both methods is nearly 100% and the AUCs are also near 1.0. As far as I know, a convex optimization problem should converge to the global minimum. (We use SGD with a mini-batch fraction of 1.0.) So why do I get two different weight vectors? Is this expected?
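Xiangrui's point can be made concrete in a few lines. The sketch below is plain Python rather than MLlib, using a toy 1-D dataset with one point per class: every positive weight classifies both points correctly, so accuracy is always 1.0, yet the logistic loss differs from weight to weight.

```python
import math

# Two 1-D points: x = -1 with label 0, x = +1 with label 1 (totally separable).
data = [(-1.0, 0), (1.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(w):
    # Fraction of points on the correct side of the 0.5 decision boundary.
    return sum((sigmoid(w * x) > 0.5) == (y == 1) for x, y in data) / len(data)

def logistic_loss(w):
    # Mean negative log-likelihood under the model p(y=1|x) = sigmoid(w * x).
    return sum(-(y * math.log(sigmoid(w * x))
                 + (1 - y) * math.log(1 - sigmoid(w * x)))
               for x, y in data) / len(data)

# Every w > 0 gives accuracy 1.0, but the loss keeps shrinking as w grows,
# so equal accuracy says nothing about equal weights.
for w in (1.0, 5.0, 20.0):
    print(w, accuracy(w), logistic_loss(w))
```

So two optimizers can both report near-perfect accuracy and AUC while sitting at very different points of the loss surface.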
Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.
Thank you for all your patient responses. I conclude that if the data is totally separable or overfitting occurs, the weights may differ, and this is also consistent with my experiments. I evaluated two different datasets, with the following results:

Loss function: LogisticGradient. Regularizer: L2. regParam: 1.0. numIterations: 1 (SGD).

Dataset 1: spark-1.1.0/data/mllib/sample_binary_classification_data.txt (2 classes, 100 samples, 692 features). areaUnderROC of both SGD and LBFGS reaches nearly 1.0. The loss of both optimization methods converges to nearly 1.7147811767900675E-5 (very, very small). The weights from the two methods are different but look roughly like scalar multiples of each other (not strictly), just as DB Tsai mentioned above. This dataset is probably totally separable.

Dataset 2: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#german.numer (2 classes, 1000 samples, 24 features). areaUnderROC of both SGD and LBFGS is nearly 0.8. The loss of both methods converges to nearly 0.5367041390107519. The weights from the two methods are the same.

2014-09-29 16:05 GMT+08:00 DB Tsai dbt...@dbtsai.com: Can you check the loss of both the LBFGS and SGD implementations? One reason may be that SGD doesn't converge well; you can see that by comparing the log-likelihoods. Another potential reason may be that the labels of your training data are totally separable, so you can always increase the log-likelihood by multiplying the weights by a constant. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, Sep 28, 2014 at 11:48 AM, Yanbo Liang yanboha...@gmail.com wrote: Hi, we have used LogisticRegression with two different optimization methods, SGD and LBFGS, in MLlib. With the same dataset and the same training and test split, we get different weight vectors. For example, we use spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our training and test dataset.
With LogisticRegressionWithSGD and LogisticRegressionWithLBFGS as the training methods and the same other parameters, the precision of both methods is nearly 100% and the AUCs are also near 1.0. As far as I know, a convex optimization problem should converge to the global minimum. (We use SGD with a mini-batch fraction of 1.0.) So why do I get two different weight vectors? Is this expected?
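DB Tsai's scaling argument can be checked in a few lines of plain Python (a toy 1-D illustration, not MLlib code): without regularization, scaling the weights up always improves the loss on separable data, so there is no unique finite optimum, while an L2 penalty like the regParam 1.0 used above makes the objective turn back up, giving a finite, unique minimizer that any convergent optimizer should agree on.

```python
import math

# One point per class: x = -1 (label 0), x = +1 (label 1) -- separable data.
data = [(-1.0, 0), (1.0, 1)]

def objective(w, reg):
    # Mean logistic loss plus an L2 penalty of 0.5 * reg * w^2.
    loss = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-w * x))
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(data) + 0.5 * reg * w * w

# Without regularization, larger weights are always strictly better here,
# so two optimizers can stop at different (scaled) solutions.
unreg = [objective(w, 0.0) for w in (1.0, 2.0, 4.0, 8.0)]
assert all(a > b for a, b in zip(unreg, unreg[1:]))

# With reg > 0 the objective eventually increases again, so the
# minimizer is finite and unique.
reg = [objective(w, 1.0) for w in (0.5, 1.0, 2.0, 4.0)]
assert all(a < b for a, b in zip(reg, reg[1:]))
```

This matches the observation above: on the non-separable german.numer dataset the two methods agree, while on the separable dataset the weights look like scalar multiples of each other.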
Re: How to use multi thread in RDD map function ?
Our cluster is a standalone cluster with 16 computing nodes; each node has 16 cores. When I set SPARK_WORKER_INSTANCES to 1 and SPARK_WORKER_CORES to 32, with 512 tasks all together, this helps increase the concurrency. But setting SPARK_WORKER_INSTANCES to 2 and SPARK_WORKER_CORES to 16 doesn't work well. Thank you for your reply.

Yi Tian wrote: for yarn-client mode: SPARK_EXECUTOR_CORES * SPARK_EXECUTOR_INSTANCES = 2 (or 3) * TotalCoresOnYourCluster; for standalone mode: SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES = 2 (or 3) * TotalCoresOnYourCluster. Best Regards, Yi Tian tianyi.asiainfo@

On Sep 28, 2014, at 17:59, myasuka <myasuka@> wrote: Hi, everyone. I have come across a problem with increasing the concurrency. In my program, after the shuffle write, each node should fetch 16 pairs of matrices to do matrix multiplication, such as:

    import breeze.linalg.{DenseMatrix => BDM}

    pairs.map(t => {
      val b1 = t._2._1.asInstanceOf[BDM[Double]]
      val b2 = t._2._2.asInstanceOf[BDM[Double]]
      val c = (b1 * b2).asInstanceOf[BDM[Double]]
      (new BlockID(t._1.row, t._1.column), c)
    })

Each node has 16 cores. However, no matter whether I set 16 tasks or more per node, CPU utilization cannot get above 60%, which means not every core on a node is computing. When I check the running log on the web UI, judging from the amount of shuffle read and write in each task, I see that some tasks do one matrix multiplication, some do two, and some do none. So I am thinking of using Java multi-threading to increase the concurrency. I wrote a Scala program that uses Java threads without Spark on a single node; watching 'top', that program can use up to 1500% CPU (meaning nearly every core is computing). But I have no idea how to use Java multi-threading inside an RDD transformation. Can anyone provide some example code for using Java threads in an RDD transformation, or suggest another way to increase the concurrency?
Thanks, all.
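The "thread pool inside a task" idea asked about above can be sketched in plain Python (the list's snippets are Scala; this just shows the fan-out pattern with a toy matrix multiply, and the matrices and pool size are made up). In Spark one would typically apply the same pattern inside a mapPartitions closure. Note that a JVM thread pool genuinely uses multiple cores for this kind of CPU-bound work, whereas the CPython GIL would serialize it; the sketch only illustrates the structure.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply(pair):
    # Naive dense product of two square matrices given as nested lists.
    a, b = pair
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Toy stand-ins for the matrix pairs fetched by one partition.
pairs = [([[1, 0], [0, 1]], [[5, 6], [7, 8]]),
         ([[2, 0], [0, 2]], [[1, 1], [1, 1]])]

# Inside a mapPartitions closure the pattern is the same: gather the
# partition's pairs, fan them out to a local pool, return the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(multiply, pairs))
```

The trade-off is that threads spawned inside a task are invisible to Spark's scheduler, so the cores-per-task setting should be reduced accordingly to avoid oversubscription.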
Re: How to use multi thread in RDD map function ?
Hi, myasuka. Have you checked the JVM GC time of each executor? I think you should increase SPARK_EXECUTOR_CORES or SPARK_EXECUTOR_INSTANCES until you get enough concurrency. Here is my recommended config:

SPARK_EXECUTOR_CORES=8
SPARK_EXECUTOR_INSTANCES=4
SPARK_WORKER_MEMORY=8G

Note: make sure you have enough memory on each node, more than SPARK_EXECUTOR_INSTANCES * SPARK_WORKER_MEMORY. Best Regards, Yi Tian tianyi.asiai...@gmail.com

On Sep 29, 2014, at 21:06, myasuka myas...@live.com wrote: Our cluster is a standalone cluster with 16 computing nodes; each node has 16 cores. When I set SPARK_WORKER_INSTANCES to 1 and SPARK_WORKER_CORES to 32, with 512 tasks all together, this helps increase the concurrency. But setting SPARK_WORKER_INSTANCES to 2 and SPARK_WORKER_CORES to 16 doesn't work well. Thank you for your reply. [...]
BasicOperationsSuite failing ?
Hi, Running the test suite in trunk, I got:

BasicOperationsSuite:
- map
- flatMap
- filter
- glom
- mapPartitions
- repartition (more partitions)
- repartition (fewer partitions)
- groupByKey
- reduceByKey
- reduce
- count
- countByValue
- mapValues
- flatMapValues
- union
- StreamingContext.union
- transform
- transformWith
- StreamingContext.transform
- cogroup
- join
- leftOuterJoin
- rightOuterJoin
- fullOuterJoin
- updateStateByKey
- updateStateByKey - object lifecycle
- slice
- slice - has not been initialized
- rdd cleanup - map and window
- rdd cleanup - updateStateByKey
- rdd cleanup - input blocks and persisted RDDs *** FAILED ***
  org.scalatest.exceptions.TestFailedException was thrown.
  (BasicOperationsSuite.scala:528)

However, running the same suite under sbt, it seemed to pass:

[info] - slice - has not been initialized
[info] - rdd cleanup - map and window
[info] - rdd cleanup - updateStateByKey
Exception in thread Thread-561 org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:700)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:700)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1406)
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:201)
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
    at akka.actor.ActorCell.terminate(ActorCell.scala:338)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
    at akka.dispatch.Mailbox.run(Mailbox.scala:218)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[info] - rdd cleanup - input blocks and persisted RDDs
[info] ScalaTest
[info] Run completed in 1 minute, 1 second.
[info] Total number of tests run: 31
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 31, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 31, Failed 0, Errors 0, Passed 31

java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode)
    at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678)
    at scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:2988)
    at scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:2991)
    at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.genClass(GenASM.scala:1371)
    at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:120)
    at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
    at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
    at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
    at scala.tools.nsc.Global$Run.compile(Global.scala:1662)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
    at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
    at sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:99)
    at sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:99)
    at sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:99)
    at sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:166)
    at
jenkins downtime/system upgrade wednesday morning, 730am PDT
happy monday, everyone! remember a few weeks back when i upgraded jenkins, and unwittingly began DOSing our system due to massive log spam? well, that bug has been fixed w/the current release, and i'd like to get our logging levels back to something more verbose than we have now. downtime will be from 730am-1000am PDT (i do expect this to be done well before 1000am). the update will be from 1.578 to 1.582; changelog here: http://jenkins-ci.org/changelog please let me know if there are any questions or concerns. thanks! shane, your friendly devops engineer
Re: FYI: i've doubled the jenkins executors for every build node
Thanks. We might see more failures due to contention on resources. Fingers crossed... At some point it might make sense to run the tests in a VM or container. On Mon, Sep 29, 2014 at 2:20 PM, shane knapp skn...@berkeley.edu wrote: we were running at 8 executors per node, and BARELY even stressing the machines (32 cores, ~230G RAM). in the interest of actually using system resources, and giving ourselves some headroom, i upped the executors to 16 per node. i'll be keeping an eye on ganglia for the rest of the week to make sure everything's cool. i hope you all enjoy your freshly allocated capacity! :) shane
Re: FYI: i've doubled the jenkins executors for every build node
yeah, this is why i'm gonna keep a close eye on things this week... as for VMs vs containers, please do the latter rather than the former: one of our longer-term plans here at the lab is to move most of our jenkins infra to VMs, and running tests w/nested VMs is Bad[tm]. On Mon, Sep 29, 2014 at 2:25 PM, Reynold Xin r...@databricks.com wrote: Thanks. We might see more failures due to contention on resources. Fingers crossed... At some point it might make sense to run the tests in a VM or container. On Mon, Sep 29, 2014 at 2:20 PM, shane knapp skn...@berkeley.edu wrote: we were running at 8 executors per node, and BARELY even stressing the machines (32 cores, ~230G RAM). in the interest of actually using system resources, and giving ourselves some headroom, i upped the executors to 16 per node. i'll be keeping an eye on ganglia for the rest of the week to make sure everything's cool. i hope you all enjoy your freshly allocated capacity! :) shane
Hyper Parameter Optimization Algorithms
Hi, Is there anyone who works on hyperparameter optimization algorithms? If not, is there any interest in the subject? We are thinking about implementing some of these algorithms and contributing them to Spark. Thoughts? Lochana
Re: jenkins downtime/system upgrade wednesday morning, 730am PDT
Just noticed these lines in the jenkins log:

= Running Apache RAT checks =
Attempting to fetch rat
Launching rat from /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
Error: Invalid or corrupt jarfile /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
RAT checks passed.

Something wrong? Best, -- Nan Zhu

On Monday, September 29, 2014 at 4:43 PM, shane knapp wrote: happy monday, everyone! remember a few weeks back when i upgraded jenkins, and unwittingly began DOSing our system due to massive log spam? well, that bug has been fixed w/the current release, and i'd like to get our logging levels back to something more verbose than we have now. downtime will be from 730am-1000am PDT (i do expect this to be done well before 1000am). the update will be from 1.578 to 1.582; changelog here: http://jenkins-ci.org/changelog please let me know if there are any questions or concerns. thanks! shane, your friendly devops engineer
Re: Hyper Parameter Optimization Algorithms
You should look into Evan Sparks's talk from Spark Summit 2014: http://spark-summit.org/2014/talk/model-search-at-scale I am not sure whether some of it has already been open-sourced through MLbase... On Mon, Sep 29, 2014 at 7:45 PM, Lochana Menikarachchi locha...@gmail.com wrote: Hi, Is there anyone who works on hyperparameter optimization algorithms? If not, is there any interest in the subject? We are thinking about implementing some of these algorithms and contributing them to Spark. Thoughts? Lochana
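For anyone new to the topic, the simplest hyperparameter optimization method the thread alludes to is exhaustive grid search, sketched below in plain Python. The grid values and the `validation_loss` stand-in are made up for illustration; a real version would train an MLlib model per setting (e.g. over regParam and step size) and score each on held-out data.

```python
from itertools import product

# Hypothetical search space; in MLlib these might be regParam, stepSize, etc.
grid = {"reg": [0.01, 0.1, 1.0], "step": [0.1, 0.5, 1.0]}

def validation_loss(reg, step):
    # Stand-in for "train a model with these settings and score it on a
    # validation set": a toy bowl whose minimum sits at (0.1, 0.5).
    return (reg - 0.1) ** 2 + (step - 0.5) ** 2

# Exhaustive grid search: evaluate every combination, keep the best.
best = min(product(grid["reg"], grid["step"]),
           key=lambda p: validation_loss(*p))
```

Grid search parallelizes trivially (one model fit per grid point), which is what makes it a natural first candidate for a Spark implementation; smarter methods such as random search or Bayesian optimization trade that embarrassingly parallel structure for fewer evaluations.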
Spark SQL question: why build hashtable for both sides in HashOuterJoin?
I took a look at HashOuterJoin, and it builds a hash table for both sides. This consumes quite a lot of memory when the partition is big, and it doesn't reduce the iteration over the streamed relation, right? Thanks!
Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?
Hi Haopu, My understanding is that the hash tables on both the left and right sides are used to include null values in the result efficiently. If the hash table were built on only one side, say the left, then for a left outer join we would need, for each row on the left side, a scan over the right side to confirm there are no matching tuples before emitting that row padded with nulls. Hope this helps! Liquan On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang hw...@qilinsoft.com wrote: I took a look at HashOuterJoin and it builds a hash table for both sides. This consumes quite a lot of memory when the partition is big, and it doesn't reduce the iteration over the streamed relation, right? Thanks! -- Liquan Pei Department of Physics University of Massachusetts Amherst
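Liquan's explanation can be made concrete with a toy hash join in plain Python (this is not Spark SQL's implementation, just the idea). With a hash table for each side, one pass over the union of keys finds every key missing from either relation and emits the null-padded rows directly, with no per-row rescans:

```python
from collections import defaultdict

left = [(1, "a"), (2, "b")]
right = [(2, "x"), (3, "y")]

def full_outer_hash_join(left, right):
    # Build a hash table for EACH side, keyed on the join key.
    lmap, rmap = defaultdict(list), defaultdict(list)
    for k, v in left:
        lmap[k].append(v)
    for k, v in right:
        rmap[k].append(v)
    out = []
    # One pass over the union of keys: a key absent from one table
    # yields a null-padded row without rescanning the other relation.
    for k in set(lmap) | set(rmap):
        for lv in lmap.get(k, [None]):
            for rv in rmap.get(k, [None]):
                out.append((k, lv, rv))
    return sorted(out)
```

Here key 1 appears only on the left and key 3 only on the right, so the result contains (1, "a", None) and (3, None, "y") alongside the matched (2, "b", "x"). The memory cost Haopu raises is real: both relations must fit in the tables, which is the trade-off for avoiding the quadratic scans.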