Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Xiangrui Meng
The test accuracy doesn't determine the total loss. Any decision boundary
between -1 and +1 can separate the points -1 and +1 and give you 1.0 accuracy,
but the corresponding losses are different. -Xiangrui
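
To make the point concrete, here is a small, self-contained Scala sketch (not from the thread): on a toy 1-D dataset, two different weight values both classify perfectly, yet their logistic losses differ.

// Toy illustration: both weights separate the data perfectly (accuracy 1.0),
// but the mean logistic loss keeps shrinking as the weight grows.
object LossVsAccuracy {
  // logistic loss for a label y in {0, 1} and margin m = w * x
  def logLoss(y: Double, m: Double): Double =
    if (y == 1.0) math.log1p(math.exp(-m)) else math.log1p(math.exp(m))

  def main(args: Array[String]): Unit = {
    val data = Seq((1.0, 1.0), (0.0, -1.0)) // (label, feature x)
    for (w <- Seq(1.0, 10.0)) {
      val loss = data.map { case (y, x) => logLoss(y, w * x) }.sum / data.size
      val acc  = data.count { case (y, x) => (w * x > 0.0) == (y == 1.0) }.toDouble / data.size
      println(s"w = $w, accuracy = $acc, mean log-loss = $loss")
    }
  }
}

Larger weights keep pushing the loss toward zero without changing the accuracy, which is why a separable dataset can yield very different converged weights.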

On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang yanboha...@gmail.com wrote:
 Hi

 We have used LogisticRegression with two different optimization methods, SGD
 and LBFGS, in MLlib.
 With the same dataset and the same training/test split, we get different
 weight vectors.

 For example, we use
 spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our training
 and test dataset,
 with LogisticRegressionWithSGD and LogisticRegressionWithLBFGS as the training
 methods and all other parameters the same.

 The precisions of these two methods are nearly 100% and the AUCs are also near
 1.0.
 As far as I know, a convex optimization problem will converge to the
 global minimum value. (We use SGD with a mini-batch fraction of 1.0.)
 But I got two different weight vectors. Is this expected, and does it make sense?




Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Yanbo Liang
Thank you all for your patient responses.

I can conclude that if the data is totally separable or over-fitting occurs,
the weights may be different.
This is also consistent with my experiment.

I have evaluated two different datasets and the results are as follows:
Loss function: LogisticGradient
Regularizer: L2
regParam: 1.0
numIterations: 1 (SGD)
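
As a minimal sketch (not from the thread) of how such a side-by-side run might look with the MLlib 1.1 API: the regularization settings and the train/test split are omitted for brevity, evaluation is done on the training set, and the file path is assumed to be the sample dataset mentioned above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

object CompareLRWeights {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CompareLRWeights"))
    val data = MLUtils
      .loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
      .cache()

    val sgdModel   = LogisticRegressionWithSGD.train(data, 100) // 100 iterations, other params default
    val lbfgsModel = new LogisticRegressionWithLBFGS().run(data)

    for ((name, model) <- Seq(("SGD", sgdModel), ("LBFGS", lbfgsModel))) {
      model.clearThreshold() // emit raw scores so AUC is meaningful
      val scoreAndLabel = data.map(p => (model.predict(p.features), p.label))
      val auc = new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC()
      println(s"$name: AUC = $auc, weights = ${model.weights}")
    }
    sc.stop()
  }
}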

Dataset 1: spark-1.1.0/data/mllib/sample_binary_classification_data.txt
# of classes: 2
# of samples: 100
# of features: 692
areaUnderROC of both SGD and LBFGS can reach nearly 1.0.
The loss functions of both optimization methods converge to
nearly 1.7147811767900675E-5 (very, very small).
The weights from the two optimization methods are different, but they look
roughly like scalar multiples of each other (not strictly), just as DB Tsai
mentioned above (a quick check for this is sketched after Dataset 2 below). It
might be that the dataset is totally separable.

Dataset 2:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#german.numer
# of classes: 2
# of samples: 1000
# of features: 24
areaUnderROC of both SGD and LBFGS is nearly 0.8.
The loss functions of both optimization methods converge to nearly 0.5367041390107519.
The weights from the two optimization methods are the same.
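
A quick way to check the "scalar multiple" observation for Dataset 1 is the cosine similarity of the two weight vectors; a small sketch, where the argument arrays are placeholders for the trained models' weights (e.g. sgdModel.weights.toArray and lbfgsModel.weights.toArray):

// A cosine close to +1 suggests wSgd is roughly c * wLbfgs for some c > 0.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "weight vectors must have the same dimension")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}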



2014-09-29 16:05 GMT+08:00 DB Tsai dbt...@dbtsai.com:

 Can you check the loss of both the LBFGS and SGD implementations? One
 reason may be that SGD doesn't converge well, and you can see that by
 comparing the two log-likelihoods. Another potential reason may be that the
 labels of your training data are totally separable, so you can always
 increase the log-likelihood by multiplying the weights by a constant.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai





Re: How to use multi thread in RDD map function ?

2014-09-29 Thread myasuka
Our cluster is a standalone cluster with 16 computing nodes, and each node has 16
cores. When I set SPARK_WORKER_INSTANCES to 1 and SPARK_WORKER_CORES to 32,
and we run 512 tasks all together, this helps increase the
concurrency. But if I set SPARK_WORKER_INSTANCES to 2 and SPARK_WORKER_CORES
to 16, this doesn't work as well.

Thank you for your reply.


Yi Tian wrote:
 for yarn-client mode:

 SPARK_EXECUTOR_CORES * SPARK_EXECUTOR_INSTANCES = 2 (or 3) *
 TotalCoresOnYourCluster

 for standalone mode:

 SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES = 2 (or 3) *
 TotalCoresOnYourCluster


 Best Regards,

 Yi Tian
 tianyi.asiainfo@


 On Sep 28, 2014, at 17:59, myasuka <myasuka@...> wrote:
 
 Hi, everyone
    I have come across a problem with increasing the concurrency. In a
 program, after the shuffle write, each node should fetch 16 pairs of matrices
 to do matrix multiplication, such as:

 import breeze.linalg.{DenseMatrix => BDM}

 pairs.map(t => {
    val b1 = t._2._1.asInstanceOf[BDM[Double]]
    val b2 = t._2._2.asInstanceOf[BDM[Double]]

    val c = (b1 * b2).asInstanceOf[BDM[Double]]

    (new BlockID(t._1.row, t._1.column), c)
  })
 
    Each node has 16 cores. However, no matter whether I set 16 tasks or more on
 each node, the concurrency cannot get higher than 60%, which means not every
 core on the node is computing. When I check the running log on the WebUI,
 according to the amount of shuffle read and write in every task, I see that
 some tasks do one matrix multiplication, some do two, while some do none.

    Thus, I am thinking of using Java multi-threading to increase the
 concurrency. I wrote a program in Scala which uses Java multi-threading
 without Spark on a single node; by watching the 'top' monitor, I found this
 program can use CPU up to 1500% (meaning nearly every core is computing). But
 I have no idea how to use Java multi-threading in an RDD transformation.

    Can anyone provide some example code that uses Java multi-threading
 in an RDD transformation, or give any idea to increase the concurrency?

 Thanks for all
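
One common pattern for this (a sketch, not from the thread, assuming the same pairs RDD and BlockID class as in the question above) is to run the multiplications of one partition concurrently inside mapPartitions with a local thread pool:

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import breeze.linalg.{DenseMatrix => BDM}

val results = pairs.mapPartitions { iter =>
  // One small thread pool per task; tune the thread count to cores / concurrent tasks.
  val pool = Executors.newFixedThreadPool(4)
  implicit val ec = ExecutionContext.fromExecutorService(pool)
  val futures = iter.map { case (id, (m1, m2)) =>
    Future {
      val c = m1.asInstanceOf[BDM[Double]] * m2.asInstanceOf[BDM[Double]]
      (new BlockID(id.row, id.column), c)
    }
  }.toList // materialize so all multiplications are submitted before waiting
  val out = futures.map(f => Await.result(f, Duration.Inf))
  pool.shutdown()
  out.iterator
}

Whether this helps in practice depends on how many tasks already run per node; the usual first step is still to raise the per-node task slots as suggested later in the thread.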
 
 
 
 



Re: How to use multi thread in RDD map function ?

2014-09-29 Thread Yi Tian
Hi, myasuka

Have you checked the jvm gc time of each executor? 

I think you should increase SPARK_EXECUTOR_CORES or
SPARK_EXECUTOR_INSTANCES until you get enough concurrency.

Here is my recommended config:

SPARK_EXECUTOR_CORES=8
SPARK_EXECUTOR_INSTANCES=4
SPARK_WORKER_MEMORY=8G

note: make sure you have enough memory on each node, more than
SPARK_EXECUTOR_INSTANCES * SPARK_WORKER_MEMORY

Best Regards,

Yi Tian
tianyi.asiai...@gmail.com







BasicOperationsSuite failing ?

2014-09-29 Thread Ted Yu
Hi,
Running test suite in trunk, I got:

BasicOperationsSuite:
- map
- flatMap
- filter
- glom
- mapPartitions
- repartition (more partitions)
- repartition (fewer partitions)
- groupByKey
- reduceByKey
- reduce
- count
- countByValue
- mapValues
- flatMapValues
- union
- StreamingContext.union
- transform
- transformWith
- StreamingContext.transform
- cogroup
- join
- leftOuterJoin
- rightOuterJoin
- fullOuterJoin
- updateStateByKey
- updateStateByKey - object lifecycle
- slice
- slice - has not been initialized
- rdd cleanup - map and window
- rdd cleanup - updateStateByKey
- rdd cleanup - input blocks and persisted RDDs *** FAILED ***
  org.scalatest.exceptions.TestFailedException was thrown.
  (BasicOperationsSuite.scala:528)

However, when running this test suite with sbt, it seemed to pass:

[info] - slice - has not been initialized
[info] - rdd cleanup - map and window
[info] - rdd cleanup - updateStateByKey
Exception in thread "Thread-561" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:701)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:700)
  at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
  at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:700)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1406)
  at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:201)
  at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:163)
  at akka.actor.ActorCell.terminate(ActorCell.scala:338)
  at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:431)
  at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
  at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
  at akka.dispatch.Mailbox.run(Mailbox.scala:218)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[info] - rdd cleanup - input blocks and persisted RDDs
[info] ScalaTest
[info] Run completed in 1 minute, 1 second.
[info] Total number of tests run: 31
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 31, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 31, Failed 0, Errors 0, Passed 31
java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode)
  at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678)
  at scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:2988)
  at scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:2991)
  at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.genClass(GenASM.scala:1371)
  at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:120)
  at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1583)
  at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1557)
  at scala.tools.nsc.Global$Run.compileSources(Global.scala:1553)
  at scala.tools.nsc.Global$Run.compile(Global.scala:1662)
  at xsbt.CachedCompiler0.run(CompilerInterface.scala:123)
  at xsbt.CachedCompiler0.run(CompilerInterface.scala:99)
  at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
  at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
  at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
  at sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileScala$1$1.apply$mcV$sp(AggressiveCompile.scala:99)
  at sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:99)
  at sbt.compiler.AggressiveCompile$$anonfun$3$$anonfun$compileScala$1$1.apply(AggressiveCompile.scala:99)
  at sbt.compiler.AggressiveCompile.sbt$compiler$AggressiveCompile$$timed(AggressiveCompile.scala:166)
  at

jenkins downtime/system upgrade wednesday morning, 730am PDT

2014-09-29 Thread shane knapp
happy monday, everyone!

remember a few weeks back when i upgraded jenkins, and unwittingly began
DOSing our system due to massive log spam?

well, that bug has been fixed w/the current release and i'd like to get our
logging levels back to something more verbose than we have now.

downtime will be from 730am-1000am PDT (i do expect this to be done well
before 1000am)

the update will be from 1.578 to 1.582

changelog here:  http://jenkins-ci.org/changelog

please let me know if there are any questions or concerns.  thanks!

shane, your friendly devops engineer


Re: FYI: i've doubled the jenkins executors for every build node

2014-09-29 Thread Reynold Xin
Thanks. We might see more failures due to contention on resources. Fingers
crossed ... At some point it might make sense to run the tests in a VM or
container.


On Mon, Sep 29, 2014 at 2:20 PM, shane knapp skn...@berkeley.edu wrote:

 we were running at 8 executors per node, and BARELY even stressing the
 machines (32 cores, ~230G RAM).

 in the interest of actually using system resources, and giving ourselves
 some headroom, i upped the executors to 16 per node.  i'll be keeping an
 eye on ganglia for the rest of the week to make sure everything's cool.

 i hope you all enjoy your freshly allocated capacity!  :)

 shane



Re: FYI: i've doubled the jenkins executors for every build node

2014-09-29 Thread shane knapp
yeah, this is why i'm gonna keep a close eye on things this week...

as for VMs vs containers, please do the latter more than the former.  one
of our longer-term plans here at the lab is to move most of our jenkins
infra to VMs, and running tests w/nested VMs is Bad[tm].






Hyper Parameter Optimization Algorithms

2014-09-29 Thread Lochana Menikarachchi

Hi,

Is there anyone who works on hyper-parameter optimization algorithms? If
not, is there any interest in the subject? We are thinking about
implementing some of these algorithms and contributing them to Spark. Thoughts?


Lochana
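
For reference (not part of any proposal in this thread), the simplest form of such a search with the existing MLlib API is a plain grid search over, say, the regularization parameter; a rough sketch, with made-up grid values and a made-up helper name:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// data: an RDD[LabeledPoint] prepared elsewhere; returns (best regParam, its validation AUC).
def gridSearch(data: RDD[LabeledPoint]): (Double, Double) = {
  val Array(train, valid) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
  val grid = Seq(0.0, 0.01, 0.1, 1.0) // arbitrary example values
  val scored = grid.map { reg =>
    val lr = new LogisticRegressionWithLBFGS()
    lr.optimizer.setRegParam(reg)
    val model = lr.run(train).clearThreshold()
    val scoreAndLabel = valid.map(p => (model.predict(p.features), p.label))
    (reg, new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC())
  }
  scored.maxBy(_._2)
}

The algorithms discussed in the talk below go well beyond this kind of exhaustive search.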




Re: jenkins downtime/system upgrade wednesday morning, 730am PDT

2014-09-29 Thread Nan Zhu
Just noticed these lines in the jenkins log 

=
Running Apache RAT checks
=
Attempting to fetch rat
Launching rat from /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
Error: Invalid or corrupt jarfile /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
RAT checks passed.

Something wrong?

Best, 

-- 
Nan Zhu





Re: Hyper Parameter Optimization Algorithms

2014-09-29 Thread Debasish Das
You should look into Evan Sparks' talk from Spark Summit 2014

http://spark-summit.org/2014/talk/model-search-at-scale

I am not sure if some of it is already open sourced through MLBase...





Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I took a look at HashOuterJoin and it builds a hash table for both
sides.

This consumes quite a lot of memory when the partition is big, and it
doesn't reduce the iteration over the streamed relation, right?

Thanks!
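
For reference, a toy (non-Spark) sketch of what a hash-based left outer join looks like when the hash table is built on the right side only; this is just to illustrate the shape of the operation, not Spark SQL's HashOuterJoin itself:

// Build side: hash the right relation by key; stream side: iterate the left once.
val left  = Seq((1, "a"), (2, "b"), (3, "c"))
val right = Seq((1, "x"), (1, "y"), (4, "z"))

val rightTable: Map[Int, Seq[String]] = right.groupBy(_._1).mapValues(_.map(_._2))

val joined: Seq[(Int, String, Option[String])] = left.flatMap { case (k, lv) =>
  rightTable.get(k) match {
    case Some(rvs) => rvs.map(rv => (k, lv, Some(rv)))
    case None      => Seq((k, lv, None)) // unmatched left row keeps a None/null
  }
}
// joined == Seq((1,"a",Some("x")), (1,"a",Some("y")), (2,"b",None), (3,"c",None))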




Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Liquan Pei
Hi Haopu,

My understanding is that the hash tables on both the left and right side are used
to include null values in the result in an efficient manner. If the hash table
is only built on one side, let's say the left side, and we perform a left outer
join, then for each row on the left side a scan over the right side is needed to
make sure there are no matching tuples for that row on the left side.
Hope this helps!
Liquan





-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst