Re: Kafka client - specify offsets?

2014-06-16 Thread Tobias Pfeiffer
Hi,

there are apparently helpers to tell you the offsets
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example#id-0.8.0SimpleConsumerExample-FindingStartingOffsetforReads,
but I have no idea how to pass that to the Kafka stream consumer. I am
interested in that as well.
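
For the "read from the start of the queue" part of the original question, I
think you can get away without explicit offsets: the high-level consumer that
KafkaUtils.createStream wraps honours auto.offset.reset. A rough, untested
sketch (Spark 1.0 API; the ZooKeeper address, group id and topic are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("KafkaFromEarliest"), Seconds(2))

val kafkaParams = Map(
  "zookeeper.connect" -> "zkhost:2181",        // illustrative
  "group.id"          -> "my-consumer-group",  // illustrative
  // "smallest" only applies when the group has no committed offset yet;
  // an existing group resumes from its last offset stored in ZooKeeper.
  "auto.offset.reset" -> "smallest")

val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)

stream.map(_._2).print()
ssc.start()
ssc.awaitTermination()

Jumping to an arbitrary offset is a different story -- as far as I can tell the
high-level consumer does not expose that, so it would mean going through the
SimpleConsumer API from the wiki page above.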

Tobias

On Thu, Jun 12, 2014 at 5:53 AM, Michael Campbell
michael.campb...@gmail.com wrote:
 Is there a way in the Apache Spark Kafka Utils to specify an offset to start
 reading?  Specifically, from the start of the queue, or failing that, a
 specific point?


Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Wei Da
Hi guys,
We are making choices between C++ MPI and Spark. Is there any official
comparison between them? Thanks a lot!

Wei


Spark streaming with Redis? Working with large number of model objects at spark compute nodes.

2014-06-16 Thread tnegi
We are creating a real-time stream processing system with Spark Streaming
which uses a large number (millions) of analytic models applied to RDDs in
many different types of streams. Since we do not know which Spark node
will process specific RDDs, we need to make these models available at each
Spark compute node. We are planning to use Redis as an in-memory cache over
the Spark cluster to feed these models to the Spark compute nodes. Is this
the right approach? We cannot cache all models locally at every Spark
compute node.
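
The access pattern we have in mind looks roughly like the sketch below (the
Jedis client, host name, key scheme and scoring stub are placeholders, not our
actual code): one Redis connection per partition, and only the models a
partition actually needs are fetched.

import org.apache.spark.{SparkConf, SparkContext}
import redis.clients.jedis.Jedis

object RedisModelLookup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RedisModelLookup"))
    // Toy input keyed by model id; in the real system this comes from the stream.
    val events = sc.parallelize(Seq(("model-1", "event-a"), ("model-2", "event-b")))

    val scored = events.mapPartitions { iter =>
      val jedis = new Jedis("redis-host", 6379)        // one connection per partition
      val out = iter.map { case (modelId, event) =>
        val rawModel = jedis.get("model:" + modelId)   // fetch only what this partition needs
        (event, applyModel(rawModel, event))
      }.toList                                         // materialize before closing the connection
      jedis.close()
      out.iterator
    }

    scored.collect().foreach(println)
    sc.stop()
  }

  // Placeholder for deserializing and applying a real model.
  def applyModel(rawModel: String, event: String): String =
    event + " scored with " + rawModel
}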



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-with-Redis-Working-with-large-number-of-model-objects-at-spark-compute-nodes-tp7663.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


What is the best way to handle transformations or actions that takes forever?

2014-06-16 Thread Peng Cheng
My transformations and actions have some external tool set dependencies, and
sometimes they just get stuck somewhere with no way for me to fix them. If I
don't want the job to run forever, do I need to implement several monitor
threads that throw an exception when they get stuck, or can the framework
already handle that?
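
What I have in mind is something like the sketch below: wrap each call to the
external tool in a Future and fail the task if it exceeds a time budget
(runExternalTool is a placeholder for our actual tool invocation, and the
timed-out thread is not actually killed, so this only unblocks the task).

import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}

object TimeoutGuard {
  // Placeholder for the external call that sometimes hangs.
  def runExternalTool(input: String): String = input.reverse

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TimeoutGuard"))
    val rdd = sc.parallelize(Seq("a", "b", "c"))

    val processed = rdd.map { record =>
      val work = Future { runExternalTool(record) }
      try {
        Await.result(work, 10.minutes)   // budget per record
      } catch {
        case _: TimeoutException =>
          throw new RuntimeException("External tool stuck on record: " + record)
      }
    }

    processed.collect().foreach(println)
    sc.stop()
  }
}

Is that the intended pattern, or does the framework have a per-task timeout
built in?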



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-best-way-to-handle-transformations-or-actions-that-takes-forever-tp7664.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: wholeTextFiles not working with HDFS

2014-06-16 Thread littlebird
Hi, I have the same exception. Can you tell me how you fixed it? Thank you!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7665.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Michael Cutler
Hello Wei,

I talk from the experience of writing many distributed HPC applications using
Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel
Virtual Machine (PVM) way before that, back in the 90's.  I can say with
absolute certainty:

*Any gains you believe there are because C++ is faster than Java/Scala
will be completely blown by the inordinate amount of time you spend
debugging your code and/or reinventing the wheel to do even basic tasks
like linear regression.*


There are undoubtedly some very specialised use-cases where MPI and its
brethren still dominate for High Performance Computing tasks -- like for
example the nuclear decay simulations run by the US Department of Energy on
supercomputers where they've invested billions solving that use case.

Spark is part of the wider Big Data ecosystem, and its biggest advantages
are traction amongst internet scale companies, hundreds of developers
contributing to it and a community of thousands using it.

Need a distributed fault-tolerant file system? Use HDFS.  Need a
distributed/fault-tolerant message-queue? Use Kafka.  Need to co-ordinate
between your worker processes? Use Zookeeper.  Need to run it on a flexible
grid of computing resources and handle failures? Run it on Mesos!

The barrier to entry to get going with Spark is very low: download the
latest distribution and start the Spark shell.  Language bindings for Scala
/ Java / Python are excellent, meaning you spend less time writing
boilerplate code and more time solving problems.

Even if you believe you *need* to use native code to do something specific,
like fetching HD video frames from satellite video capture cards -- wrap it
in a small native library and use the Java Native Access interface to call
it from your Java/Scala code.
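
As a concrete illustration, a minimal JNA binding is only a few lines -- the
library name and function below are made up, not a real API:

import com.sun.jna.{Library, Native}

// Hypothetical native library "framegrab" exposing: int open_device(const char *path);
trait FrameGrab extends Library {
  def open_device(path: String): Int
}

object FrameGrab {
  // Loads libframegrab.so (or framegrab.dll) from the JNA library path.
  lazy val instance: FrameGrab =
    Native.loadLibrary("framegrab", classOf[FrameGrab]).asInstanceOf[FrameGrab]
}

// Inside a Spark job it is then just an ordinary call on the executor, e.g.
// rdd.map(path => FrameGrab.instance.open_device(path))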

Have fun, and if you get stuck we're here to help!

MC


On 16 June 2014 08:17, Wei Da xwd0...@gmail.com wrote:

 Hi guys,
 We are making choices between C++ MPI and Spark. Is there any official
 comparison between them? Thanks a lot!

 Wei



Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
Hi,

I'm trying to use Accumulo with Spark by writing to AccumuloOutputFormat.
It all went well on my laptop (Accumulo MockInstance + Spark local mode).

But when I try to submit it to the YARN cluster, the YARN logs show the
following error message:

14/06/16 02:01:44 INFO cluster.YarnClientClusterScheduler:
YarnClientClusterScheduler.postStartHook done
Exception in thread main java.lang.NoSuchMethodError:
org.apache.commons.codec.binary.Base64.encodeBase64String([B)Ljava/lang/String;
at
org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.setConnectorInfo(ConfiguratorBase.java:127)
at
org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.setConnectorInfo(AccumuloOutputFormat.java:92)
at
com.paypal.rtgraph.demo.MapReduceWriter$.main(MapReduceWriter.scala:44)
at
com.paypal.rtgraph.demo.MapReduceWriter.main(MapReduceWriter.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Looks like Accumulo's dependencies have problems.

Does anyone know what's wrong with my code or settings? I've added all the
needed jars to Spark's classpath. I confirmed that commons-codec-1.7.jar
has been uploaded to HDFS.

14/06/16 04:36:02 INFO yarn.Client: Uploading
file:/x/home/jianshuang/tmp/lib/commons-codec-1.7.jar to
hdfs://manny-lvs/user/jianshuang/.sparkStaging/application_1401752249873_12662/commons-codec-1.7.jar



And here's my spark-submit cmd (all JARs needed are concatenated after
--jars):

~/spark/spark-1.0.0-bin-hadoop2/bin/spark-submit --name 'rtgraph' --class
com.paypal.rtgraph.demo.Tables --master yarn --deploy-mode cluster --jars
`find lib -type f | tr '\n' ','` --driver-memory 4G --driver-cores 4
--executor-memory 20G --executor-cores 8 --num-executors 2 rtgraph.jar

I've tried both cluster mode and client mode and neither worked.


BTW, I tried to use sbt-assembly to create a bundled jar; however, I always
get the following error:

[error] (*:assembly) deduplicate: different file contents found in the
following:
[error]
/Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
[error]
/Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
[error]
/Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
[error]
/Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA

I googled it and it looks like I need to exclude some JARs. Has anyone done
that? Your help is really appreciated.
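
Something like the following merge strategy seems to be the usual suggestion
(untested on my build; the setting name depends on the sbt-assembly version,
older releases use mergeStrategy in assembly, newer ones assemblyMergeStrategy
in assembly):

assemblyMergeStrategy in assembly := {
  // Drop signature files such as META-INF/ECLIPSEF.RSA that trigger the
  // "deduplicate: different file contents found" error.
  case PathList("META-INF", xs @ _*)
      if xs.lastOption.exists(f => f.endsWith(".RSA") || f.endsWith(".DSA") || f.endsWith(".SF")) =>
    MergeStrategy.discard
  case _ => MergeStrategy.first   // coarse default; real builds may want per-path rules
}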



Cheers,

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Akhil Das
Hi

Check your driver program's Environment tab (e.g.
http://192.168.1.39:4040/environment/). If you don't see the
commons-codec-1.7.jar there, then that's the issue.

Thanks
Best Regards


On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Hi,

 I'm trying to use Accumulo with Spark by writing to AccumuloOutputFormat.
 It all went well on my laptop (Accumulo MockInstance + Spark local mode).

 But when I try to submit it to the YARN cluster, the YARN logs show the
 following error message:

 14/06/16 02:01:44 INFO cluster.YarnClientClusterScheduler:
 YarnClientClusterScheduler.postStartHook done
 Exception in thread main java.lang.NoSuchMethodError:
 org.apache.commons.codec.binary.Base64.encodeBase64String([B)Ljava/lang/String;
 at
 org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.setConnectorInfo(ConfiguratorBase.java:127)
 at
 org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.setConnectorInfo(AccumuloOutputFormat.java:92)
 at
 com.paypal.rtgraph.demo.MapReduceWriter$.main(MapReduceWriter.scala:44)
 at
 com.paypal.rtgraph.demo.MapReduceWriter.main(MapReduceWriter.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


 Looks like Accumulo's dependencies have problems.

 Does anyone know what's wrong with my code or settings? I've added all the
 needed jars to Spark's classpath. I confirmed that commons-codec-1.7.jar
 has been uploaded to HDFS.

 14/06/16 04:36:02 INFO yarn.Client: Uploading
 file:/x/home/jianshuang/tmp/lib/commons-codec-1.7.jar to
 hdfs://manny-lvs/user/jianshuang/.sparkStaging/application_1401752249873_12662/commons-codec-1.7.jar



 And here's my spark-submit cmd (all JARs needed are concatenated after
 --jars):

 ~/spark/spark-1.0.0-bin-hadoop2/bin/spark-submit --name 'rtgraph' --class
 com.paypal.rtgraph.demo.Tables --master yarn --deploy-mode cluster --jars
 `find lib -type f | tr '\n' ','` --driver-memory 4G --driver-cores 4
 --executor-memory 20G --executor-cores 8 --num-executors 2 rtgraph.jar

 I've tried both cluster mode and client mode and neither worked.


 BTW, I tried to use sbt-assembly to create a bundled jar; however, I
 always get the following error:

 [error] (*:assembly) deduplicate: different file contents found in the
 following:
 [error]
 /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
 [error]
 /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
 [error]
 /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
 [error]
 /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA

 I googled it and it looks like I need to exclude some JARs. Has anyone done
 that? Your help is really appreciated.



 Cheers,

 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/



Re: guidance on simple unit testing with Spark

2014-06-16 Thread Daniel Siegmann
If you don't want to refactor your code, you can put your input into a test
file. After the test runs, read the data from the output file you specified
(you probably want this to be a temp file and delete it on exit). Of course,
that is not really a unit test - Matei's suggestion is preferable (this is how
we test). However, if you have a long and complex flow, you might unit test
different parts, and then have an integration test which reads from the
files and tests the whole flow together (I do this as well).
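
For example, with FunSuite and the processData refactoring Matei describes
below, a bare-bones test might look like this (sketch only; GetInfo.processData
is the function you would extract from your main):

import org.apache.spark.SparkContext
import org.scalatest.FunSuite

class GetInfoSuite extends FunSuite {
  test("processData produces the expected records") {
    val sc = new SparkContext("local", "GetInfoSuite")
    try {
      val input = sc.parallelize(Seq("line 1", "line 2"))
      val result = GetInfo.processData(input).collect()
      assert(result.toSet === Set("res 1", "res 2"))
    } finally {
      sc.stop()   // always stop the context so other tests can create one
    }
  }
}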




On Fri, Jun 13, 2014 at 10:04 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 You need to factor your program so that it’s not just a main(). This is
 not a Spark-specific issue, it’s about how you’d unit test any program in
 general. In this case, your main() creates a SparkContext, so you can’t
 pass one from outside, and your code has to read data from a file and write
 it to a file. It would be better to move your code for transforming data
 into a new function:

 def processData(lines: RDD[String]): RDD[String] = {
   // build and return your “res” variable
 }

 Then you can unit-test this directly on data you create in your program:

 val myLines = sc.parallelize(Seq(“line 1”, “line 2”))
 val result = GetInfo.processData(myLines).collect()
 assert(result.toSet === Set(“res 1”, “res 2”))

 Matei

 On Jun 13, 2014, at 2:42 PM, SK skrishna...@gmail.com wrote:

  Hi,
 
  I have looked through some of the test examples and also the brief
  documentation on unit testing at
  http://spark.apache.org/docs/latest/programming-guide.html#unit-testing, but
  still don't have a good understanding of writing unit tests using the Spark
  framework. Previously, I have written unit tests using the specs2 framework
  and have got them to work in Scalding. I tried to use the specs2 framework
  with Spark, but could not find any simple examples I could follow. I am open
  to specs2 or FunSuite, whichever works best with Spark. I would like some
  additional guidance, or some simple sample code using specs2 or FunSuite. My
  code is provided below.
 
 
  I have the following code in src/main/scala/GetInfo.scala. It reads a
 Json
  file and extracts some data. It takes the input file (args(0)) and output
  file (args(1)) as arguments.
 
  object GetInfo {

    def main(args: Array[String]) {
      val inp_file = args(0)
      val conf = new SparkConf().setAppName("GetInfo")
      val sc = new SparkContext(conf)
      val res = sc.textFile(inp_file)
        .map(line => parse(line))
        .map(json => {
          implicit lazy val formats = org.json4s.DefaultFormats
          val aid = (json \ "d" \ "TypeID").extract[Int]
          val ts = (json \ "d" \ "TimeStamp").extract[Long]
          val gid = (json \ "d" \ "ID").extract[String]
          (aid, ts, gid)
        })
        .groupBy(tup => tup._3)
        .sortByKey(true)
        .map(g => (g._1, g._2.map(_._2).max))
      res.map(tuple => "%s, %d".format(tuple._1, tuple._2)).saveAsTextFile(args(1))
    }
  }
 
 
  I would like to test the above code. My unit test is in src/test/scala.
 The
  code I have so far for the unit test appears below:
 
   import org.apache.spark._
   import org.specs2.mutable._

   class GetInfoTest extends Specification with java.io.Serializable {

     val data = List(
       """{"d": {"TypeID": 10, "Timestamp": 1234, "ID": "ID1"}}""",
       """{"d": {"TypeID": 11, "Timestamp": 5678, "ID": "ID1"}}""",
       """{"d": {"TypeID": 10, "Timestamp": 1357, "ID": "ID2"}}""",
       """{"d": {"TypeID": 11, "Timestamp": 2468, "ID": "ID2"}}"""
     )

     val expected_out = List(
       ("ID1", 5678),
       ("ID2", 2468)
     )

     "A GetInfo job" should {
       //* How do I pass the data defined above as the input and output
       // which GetInfo expects as arguments? **
       val sc = new SparkContext("local", "GetInfo")

       //*** how do I get the output ***

       // assuming out_buffer has the output, I want to match it to the
       // expected output
       "match expected output" in {
         (out_buffer == expected_out) must beTrue
       }
     }

   }
 
   I would like some help with the tasks marked with *** in the unit test
  code above. If specs2 is not the right way to go, I am also open to
  FunSuite. I would like to know how to pass the input while calling my
  program from the unit test and get the output.
 
  Thanks for your help.
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/guidance-on-simple-unit-testing-with-Spark-tp7604.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: long GC pause during file.cache()

2014-06-16 Thread Wei Tan
Thank you all for the advice, including (1) using CMS GC, (2) using multiple
worker instances, and (3) using Tachyon.

I will try (1) and (2) first and report back what I found.

I will also try JDK 7 with G1 GC.
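
For (2) I plan to try something like this in spark-env.sh (values are just
illustrative for my 192G box, not a recommendation):

# several smaller workers instead of one 180g heap
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_MEMORY=45g
SPARK_WORKER_CORES=8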

Best regards,
Wei

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



From:   Aaron Davidson ilike...@gmail.com
To: user@spark.apache.org, 
Date:   06/15/2014 09:06 PM
Subject:Re: long GC pause during file.cache()



Note also that Java does not work well with very large JVMs due to this 
exact issue. There are two commonly used workarounds:

1) Spawn multiple (smaller) executors on the same machine. This can be 
done by creating multiple Workers (via SPARK_WORKER_INSTANCES in 
standalone mode[1]).
2) Use Tachyon for off-heap caching of RDDs, allowing Spark executors to 
be smaller and avoid GC pauses

[1] See standalone documentation here: 
http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts


On Sun, Jun 15, 2014 at 3:50 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Yes, I think in the spark-env.sh.template, it is listed in the comments 
(didn’t check….) 

Best,

-- 
Nan Zhu

On Sunday, June 15, 2014 at 5:21 PM, Surendranauth Hiraman wrote:
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0?



On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you 
don’t mind the WARNING in the logs

you can set spark.executor.extraJavaOpts in your SparkConf obj

Best,

-- 
Nan Zhu

On Sunday, June 15, 2014 at 12:13 PM, Hao Wang wrote:
Hi, Wei

You may try to set JVM opts in spark-env.sh as follows to prevent or 
mitigate GC pauses:

export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC 
-Xmx2g -XX:MaxPermSize=256m"

There are more options you could add, please just Google :) 


Regards,
Wang Hao(王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
Email:wh.s...@gmail.com


On Sun, Jun 15, 2014 at 10:24 AM, Wei Tan w...@us.ibm.com wrote:
Hi, 

  I have a single node (192G RAM) stand-alone spark, with memory 
configuration like this in spark-env.sh 

SPARK_WORKER_MEMORY=180g 
SPARK_MEM=180g 


 In spark-shell I have a program like this: 

val file = sc.textFile("/localpath") //file size is 40G 
file.cache() 


val output = file.map(line => extract something from line) 

output.saveAsTextFile (...) 


When I run this program again and again, or keep trying file.unpersist() 
--> file.cache() --> output.saveAsTextFile(), the run time varies a lot, 
from 1 min to 3 min to 50+ min. Whenever the run-time is more than 1 min, 
from the stage monitoring GUI I observe big GC pause (some can be 10+ 
min). Of course when run-time is normal, say ~1 min, no significant GC 
is observed. The behavior seems somewhat random. 

Is there any JVM tuning I should do to prevent this long GC pause from 
happening? 



I used java-1.6.0-openjdk.x86_64, and my spark-shell process is something 
like this: 

root 10994  1.7  0.6 196378000 1361496 pts/51 Sl+ 22:06   0:12 
/usr/lib/jvm/java-1.6.0-openjdk.x86_64/bin/java -cp 
::/home/wtan/scala/spark-1.0.0-bin-hadoop1/conf:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-core-3.2.2.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-rdbms-3.2.1.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-api-jdo-3.2.1.jar
 
-XX:MaxPermSize=128m -Djava.library.path= -Xms180g -Xmx180g 
org.apache.spark.deploy.SparkSubmit spark-shell --class 
org.apache.spark.repl.Main 

Best regards, 
Wei 

- 
Wei Tan, PhD 
Research Staff Member 
IBM T. J. Watson Research Center 
http://researcher.ibm.com/person/us-wtan





-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io






Re: long GC pause during file.cache()

2014-06-16 Thread Wei Tan
BTW: nowadays a single machine with huge RAM (200G to 1T) is really
common. With virtualization you lose some performance. It would be ideal
to see some best practices on how to use Spark on these state-of-the-art
machines...

Best regards,
Wei

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



From:   Wei Tan/Watson/IBM@IBMUS
To: user@spark.apache.org, 
Date:   06/16/2014 10:47 AM
Subject:Re: long GC pause during file.cache()



Thank you all for the advice, including (1) using CMS GC, (2) using multiple 
worker instances, and (3) using Tachyon. 

I will try (1) and (2) first and report back what I found. 

I will also try JDK 7 with G1 GC. 

Best regards, 
Wei 

- 
Wei Tan, PhD 
Research Staff Member 
IBM T. J. Watson Research Center 
http://researcher.ibm.com/person/us-wtan 



From:Aaron Davidson ilike...@gmail.com 
To:user@spark.apache.org, 
Date:06/15/2014 09:06 PM 
Subject:Re: long GC pause during file.cache() 



Note also that Java does not work well with very large JVMs due to this 
exact issue. There are two commonly used workarounds: 

1) Spawn multiple (smaller) executors on the same machine. This can be 
done by creating multiple Workers (via SPARK_WORKER_INSTANCES in 
standalone mode[1]). 
2) Use Tachyon for off-heap caching of RDDs, allowing Spark executors to 
be smaller and avoid GC pauses 

[1] See standalone documentation here: 
http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
 



On Sun, Jun 15, 2014 at 3:50 PM, Nan Zhu zhunanmcg...@gmail.com wrote: 
Yes, I think in the spark-env.sh.template, it is listed in the comments 
(didn’t check….) 

Best, 

--  
Nan Zhu 
On Sunday, June 15, 2014 at 5:21 PM, Surendranauth Hiraman wrote: 
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0? 



On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu zhunanmcg...@gmail.com wrote: 
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t 
mind the WARNING in the logs 

you can set spark.executor.extraJavaOpts in your SparkConf obj 

Best, 

-- 
Nan Zhu 
On Sunday, June 15, 2014 at 12:13 PM, Hao Wang wrote: 
Hi, Wei 

You may try to set JVM opts in spark-env.sh as follows to prevent or 
mitigate GC pauses: 

export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC 
-Xmx2g -XX:MaxPermSize=256m" 

There are more options you could add, please just Google :) 


Regards, 
Wang Hao(王灏) 

CloudTeam | School of Software Engineering 
Shanghai Jiao Tong University 
Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 
Email:wh.s...@gmail.com 


On Sun, Jun 15, 2014 at 10:24 AM, Wei Tan w...@us.ibm.com wrote: 
Hi, 

  I have a single node (192G RAM) stand-alone spark, with memory 
configuration like this in spark-env.sh 

SPARK_WORKER_MEMORY=180g 
SPARK_MEM=180g 


 In spark-shell I have a program like this: 

val file = sc.textFile("/localpath") //file size is 40G 
file.cache() 


val output = file.map(line => extract something from line) 

output.saveAsTextFile (...) 


When I run this program again and again, or keep trying file.unpersist() 
--> file.cache() --> output.saveAsTextFile(), the run time varies a lot, 
from 1 min to 3 min to 50+ min. Whenever the run-time is more than 1 min, 
from the stage monitoring GUI I observe big GC pause (some can be 10+ 
min). Of course when run-time is normal, say ~1 min, no significant GC 
is observed. The behavior seems somewhat random. 

Is there any JVM tuning I should do to prevent this long GC pause from 
happening? 



I used java-1.6.0-openjdk.x86_64, and my spark-shell process is something 
like this: 

root 10994  1.7  0.6 196378000 1361496 pts/51 Sl+ 22:06   0:12 
/usr/lib/jvm/java-1.6.0-openjdk.x86_64/bin/java -cp 
::/home/wtan/scala/spark-1.0.0-bin-hadoop1/conf:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-core-3.2.2.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-rdbms-3.2.1.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-api-jdo-3.2.1.jar
 
-XX:MaxPermSize=128m -Djava.library.path= -Xms180g -Xmx180g 
org.apache.spark.deploy.SparkSubmit spark-shell --class 
org.apache.spark.repl.Main 

Best regards, 
Wei 

- 
Wei Tan, PhD 
Research Staff Member 
IBM T. J. Watson Research Center 
http://researcher.ibm.com/person/us-wtan 





-- 
 
SUREN HIRAMAN, VP TECHNOLOGY
Velos 
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105 
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io






pyspark regression results way off

2014-06-16 Thread jamborta
Hi all,

I am testing the regression methods (SGD) using pyspark. I tried to tune the
parameters, but the results are far off from those obtained using R. Is there
some way to set these parameters more effectively?

thanks,  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: pyspark regression results way off

2014-06-16 Thread jamborta
forgot to mention that I'm running spark 1.0



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7673.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Memory footprint of Calliope: Spark - Cassandra writes

2014-06-16 Thread Gerard Maas
Hi,

I've been doing some testing with Calliope as a way to do batch load from
Spark into Cassandra.
My initial results are promising on the performance side, but worrisome on
the memory footprint side.

I'm generating N records of about 50 bytes each and using the UPDATE
mutator to insert them into C*.  I get OOM if my memory is below 1GB per
million records, or about 50MB of raw data (without counting any
RDD/structural overhead).  (See code [1])

(So, to avoid confusion: I need 4GB of RAM to save 4M 50-byte records to
Cassandra.)  That's an order of magnitude more than the raw data.

I understood that Calliope builds on top of the Hadoop support of
Cassandra, which builds on top of SSTables and sstableloader.

I would like to know what the memory usage factor of Calliope is and what
parameters I could use to control/tune it.

Any experience/advice on that?

-kr, Gerard.

[1] https://gist.github.com/maasg/68de6016bffe5e71b78c


RE: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
Hi Xiangrui,

Thank you for the reply! I have tried customizing 
LogisticRegressionSGD.optimizer as in the example you mentioned, but the source 
code reveals that the intercept is also penalized if one is included, which is 
usually inappropriate. The developer should fix this problem.

Best,

Congrui

-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Friday, June 13, 2014 11:50 PM
To: user@spark.apache.org
Cc: user
Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic 
Regression

1. 
examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
contains example code that shows how to set regParam.

2. A static method with more than 3 parameters becomes hard to
remember and hard to maintain. Please use LogistricRegressionWithSGD's
default constructor and setters.

-Xiangrui


Need some Streaming help

2014-06-16 Thread Yana Kadiyska
Like many people, I'm trying to do hourly counts. The twist is that I don't
want to count per hour of streaming, but per hour of the actual occurrence
of the event (wall clock, say yyyy-mm-dd HH).

My thought is to make the streaming window large enough that a full hour of
streaming data would fit inside it. Since my window slides in small
increments, I want to drop the lowest hour from the stream before
persisting the results (since it would have been reduced during the previous
batch and would be a partial count in the current one). I have gotten this far.

Every line of the input files is parsed into Event(type, hour), and stream
is a DStream[RDD[Event]]

val evtCountsByHour =
  stream.map(evt => (evt, 1))
        .reduceByKeyAndWindow(_ + _, Seconds(secondsInWindow)) // hourly counts per event
        .mapPartitions(iter => iter.map(x => (x._1.hour, x)))

My understanding is that at this point, the event counts are keyed by hour.

1. How do I detect the smallest key? I have seen some examples of
partitionBy + mapPartitionsWithIndex and dropping the lowest index but
can't figure out how to do it with a DStream. My gut feeling is that the
first RDD in the stream has to contain the oldest data but that doesn't
seem to be the case(printed from inside evtCountsByHour.foreachRDD)

2. If someone is further ahead with this type of problem, could you give
some insight on how you approached it -- I think Streaming would be the
correct approach since I don't really want to worry about data that was
already processed and I want to process it continuously.  I opted on
reduceByKeyAndWindow with a large window as opposed to updateStateByKey as
the hour the event occurred in is part of the key and I don't care to keep
around that key once the next hour's events are coming in (I'm assuming
RDDs outside the window are considered unreferenced). But I'd love to hear
other suggestions if my logic is off.
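
For (1) the best I have come up with so far is to do it inside foreachRDD,
where the keys of one window are just an RDD (rough sketch, assuming the hour
key is a sortable string like "2014-06-16 14"):

import org.apache.spark.SparkContext._   // for the pair-RDD `keys` method

evtCountsByHour.foreachRDD { rdd =>
  val hours = rdd.keys.distinct().collect()
  if (hours.nonEmpty) {
    val oldestHour = hours.min                                // the partial hour at the trailing edge
    val complete = rdd.filter { case (hour, _) => hour != oldestHour }
    // persist `complete` here
  }
}

But I am not sure this is the right way.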


Fwd: spark streaming questions

2014-06-16 Thread Chen Song
Hey

I am new to spark streaming and apologize if these questions have been
asked.

* In StreamingContext, reduceByKey() seems to only work on the RDDs of the
current batch interval, not including RDDs of previous batches. Is my
understanding correct?

* If the above statement is correct, what functions should one use to do
processing on the continuous stream of batches of data? I see 2 functions,
reduceByKeyAndWindow and updateStateByKey, which serve this purpose.

My use case is an aggregation and doesn't fit a windowing scenario.

* As for updateStateByKey, I have a few questions.
** Over time, will Spark stage the original data somewhere to replay in case of
failures? Say the Spark job runs for weeks; I am wondering how that is sustained.
** Say my reduce key space is partitioned by some date field and I would
like to stop processing old dates after a period of time (this is not simply a
windowing scenario, as the date the data belongs to is not the same as the
time the data arrives). How can I handle this, i.e. tell Spark to discard data
for old dates?
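
From reading the docs it sounds like returning None from the update function
drops a key from the state -- would something like this rough sketch be the
way to age out old dates? (keyedByDate, the field names and the cutoff below
are made up.)

case class DateTotal(sum: Long, lastSeenMs: Long)

val cutoffMs = 24 * 60 * 60 * 1000L   // stop tracking a date key after a day of silence

// keyedByDate: DStream[(String, Long)] of (date, amount) pairs, assumed to exist upstream.
// (updateStateByKey also needs ssc.checkpoint(...) to be set.)
val totalsByDate = keyedByDate.updateStateByKey[DateTotal] {
  (newValues: Seq[Long], state: Option[DateTotal]) =>
    val now = System.currentTimeMillis
    val prev = state.getOrElse(DateTotal(0L, now))
    val next = DateTotal(prev.sum + newValues.sum,
                         if (newValues.nonEmpty) now else prev.lastSeenMs)
    if (now - next.lastSeenMs > cutoffMs) None else Some(next)  // None discards the key
}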

Thank you,

Best
Chen




-- 
Chen Song


Worker dies while submitting a job

2014-06-16 Thread Luis Ángel Vicente Sánchez
I'm playing with a modified version of the TwitterPopularTags example, and
when I try to submit the job to my cluster, the workers keep dying with this
message:

14/06/16 17:11:16 INFO DriverRunner: Launch Command: java -cp
/opt/spark-1.0.0-bin-hadoop1/work/driver-20140616171115-0014/spark-test-0.1-SNAPSHOT.jar:::/opt/spark-1.0.0-bin-hadoop1/conf:/opt/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar
-XX:MaxPermSize=128m -Xms512M -Xmx512M
org.apache.spark.deploy.worker.DriverWrapper
akka.tcp://sparkWorker@int-spark-worker:51676/user/Worker
org.apache.spark.examples.streaming.TwitterPopularTags
14/06/16 17:11:17 ERROR OneForOneStrategy: FAILED (of class
scala.Enumeration$Val)
scala.MatchError: FAILED (of class scala.Enumeration$Val)
at
org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:317)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/06/16 17:11:17 INFO Worker: Starting Spark worker
int-spark-app-ie005d6a3.mclabs.io:51676 with 2 cores, 6.5 GB RAM
14/06/16 17:11:17 INFO Worker: Spark home: /opt/spark-1.0.0-bin-hadoop1
14/06/16 17:11:17 INFO WorkerWebUI: Started WorkerWebUI at
http://int-spark-app-ie005d6a3.mclabs.io:8081
14/06/16 17:11:17 INFO Worker: Connecting to master
spark://int-spark-app-ie005d6a3.mclabs.io:7077...
14/06/16 17:11:17 ERROR Worker: Worker registration failed: Attempted to
re-register worker at same address: akka.tcp://
sparkwor...@int-spark-app-ie005d6a3.mclabs.io:51676

This happens when the worker receive a DriverStateChanged(driverId, state,
exception) message.

To deploy the job I copied the jar file to the temporary folder of master
node and execute the following command:

./spark-submit \
--class org.apache.spark.examples.streaming.TwitterPopularTags \
--master spark://int-spark-master:7077 \
--deploy-mode cluster \
file:///tmp/spark-test-0.1-SNAPSHOT.jar

I don't really know what the problem could be as there is a 'case _' that
should avoid that problem :S


Re: pyspark regression results way off

2014-06-16 Thread DB Tsai
Is your data normalized? Sometimes, GD doesn't work well if the data
has a wide range. If you are willing to write Scala code, you can try
the LBFGS optimizer, which converges better than GD.
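
A rough, untested sketch of what that can look like (regularization values are
arbitrary and intercept handling is omitted for brevity):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainWithLBFGS(data: RDD[LabeledPoint], numFeatures: Int): LogisticRegressionModel = {
  // LBFGS.runLBFGS expects (label, features) pairs; scale the features beforehand.
  val training = data.map(lp => (lp.label, lp.features)).cache()

  val (weights, lossHistory) = LBFGS.runLBFGS(
    training,
    new LogisticGradient(),
    new SquaredL2Updater(),
    10,     // numCorrections
    1e-4,   // convergence tolerance
    100,    // max iterations
    0.1,    // regParam
    Vectors.dense(new Array[Double](numFeatures)))  // initial weights

  new LogisticRegressionModel(weights, 0.0)  // zero intercept in this sketch
}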

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Jun 16, 2014 at 8:14 AM, jamborta jambo...@gmail.com wrote:
 forgot to mention that I'm running spark 1.0



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7673.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Accessing the per-key state maintained by updateStateByKey for transformation of JavaPairDStream

2014-06-16 Thread Gaurav Jain
Hello Spark Streaming Experts

I have a use-case, where I have a bunch of log-entries coming in, say every
10 seconds (Batch-interval). I create a JavaPairDStream[K,V] from these
log-entries. Now, there are two things I want to do with this
JavaPairDStream:

1. Use key-dependent state (updated by updateStateByKey) to apply a
transformation function on the JavaPairDStream[K, V]. I know that we get a
JavaPairDStream[K, S] as the return value of updateStateByKey. However, I can't
possibly pass a JavaPairDStream to a transformation function, nor can I
convert JavaPairDStream[K, S] to, let's say, a HashMap<K, S> (or is there a way
to do this?). Even if I could convert it to a HashMap<K, S>, could I really
pass it to a transformation function, since this HashMap<K, S> changes after
every batch computation?

2. Update key-dependent state using Iterable<V>: this should be easily
doable using updateStateByKey.

In a nutshell, how do I access the very state updated by updateStateByKey
when applying, let's say, a map function on the JavaPairDStream[K, V]? Note that
I am not using any sliding windows at all, just plain batches.
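
Or, put differently: is the intended way to keep the state as a DStream and
join it with each batch, instead of extracting a HashMap? Something like this
(Scala for brevity -- the Java API has the same join on JavaPairDStream; the
socket source and the running count are just stand-ins for my real stream):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations
import org.apache.spark.streaming.dstream.DStream

val ssc = new StreamingContext(new SparkConf().setAppName("StateJoin"), Seconds(10))
ssc.checkpoint("/tmp/state-join-checkpoint")            // required by updateStateByKey

// Stand-in for the real log source, keyed by some id in the first field.
val logs: DStream[(String, String)] = ssc.socketTextStream("localhost", 9999)
  .map(line => (line.split(" ")(0), line))

// Running count per key as the state; returning None instead would drop the key.
val state: DStream[(String, Long)] = logs.updateStateByKey[Long] {
  (entries: Seq[String], prev: Option[Long]) => Some(prev.getOrElse(0L) + entries.size)
}

// Join each batch with the state for the same keys, then transform using both.
val enriched: DStream[(String, (String, Long))] = logs.join(state)
val transformed = enriched.map { case (key, (entry, seenSoFar)) =>
  (key, entry + " [seen=" + seenSoFar + "]")
}

transformed.print()
ssc.start()
ssc.awaitTermination()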

Thanks
Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-the-per-key-state-maintained-by-updateStateByKey-for-transformation-of-JavaPairDStream-tp7680.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: pyspark serializer can't handle functions?

2014-06-16 Thread Matei Zaharia
It’s true that it can’t. You can try to use the CloudPickle library instead, 
which is what we use within PySpark to serialize functions (see 
python/pyspark/cloudpickle.py). However I’m also curious, why do you need an 
RDD of functions?

Matei

On Jun 15, 2014, at 4:49 PM, madeleine madeleine.ud...@gmail.com wrote:

 It seems that the default serializer used by pyspark can't serialize a list
 of functions.
 I've seen some posts about trying to fix this by using dill to serialize
 rather than pickle. 
 Does anyone know what the status of that project is, or whether there's
 another easy workaround?
 
 I've pasted a sample error message below. Here, regs is a function defined
 in another file myfile.py that has been included on all workers via the
 pyFiles argument to SparkContext: sc = SparkContext("local",
 "myapp", pyFiles=["myfile.py"]).
 
  File runfile.py, line 45, in __init__
regsRDD = sc.parallelize([regs]*self.n)
  File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/context.py,
 line 223, in parallelize
serializer.dump_stream(c, tempFile)
  File
 /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line
 182, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File
 /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line
 118, in dump_stream
self._write_with_length(obj, stream)
  File
 /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line
 128, in _write_with_length
serialized = self.dumps(obj)
  File
 /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line
 270, in dumps
def dumps(self, obj): return cPickle.dumps(obj, 2)
 cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup
 __builtin__.function failed
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread Xiangrui Meng
Someone is working on weighted regularization. Stay tuned. -Xiangrui

On Mon, Jun 16, 2014 at 9:36 AM, FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
fixed-term.congrui...@us.bosch.com wrote:
 Hi Xiangrui,

 Thank you for the reply! I have tried customizing 
 LogisticRegressionSGD.optimizer as in the example you mentioned, but the 
 source code reveals that the intercept is also penalized if one is included, 
 which is usually inappropriate. The developer should fix this problem.

 Best,

 Congrui

 -Original Message-
 From: Xiangrui Meng [mailto:men...@gmail.com]
 Sent: Friday, June 13, 2014 11:50 PM
 To: user@spark.apache.org
 Cc: user
 Subject: Re: MLlib-Missing Regularization Parameter and Intercept for 
 Logistic Regression

 1. 
 examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
 contains example code that shows how to set regParam.

 2. A static method with more than 3 parameters becomes hard to
 remember and hard to maintain. Please use LogistricRegressionWithSGD's
 default constructor and setters.

 -Xiangrui


Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread DB Tsai
Hi Congrui,

We're working on weighted regularization, so for intercept, you can
just set it as 0. It's also useful when the data is normalized but
want to solve the regularization with original data.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Jun 16, 2014 at 11:18 AM, Xiangrui Meng men...@gmail.com wrote:
 Someone is working on weighted regularization. Stay tuned. -Xiangrui

 On Mon, Jun 16, 2014 at 9:36 AM, FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
 fixed-term.congrui...@us.bosch.com wrote:
 Hi Xiangrui,

 Thank you for the reply! I have tried customizing 
 LogisticRegressionSGD.optimizer as in the example you mentioned, but the 
 source code reveals that the intercept is also penalized if one is included, 
 which is usually inappropriate. The developer should fix this problem.

 Best,

 Congrui

 -Original Message-
 From: Xiangrui Meng [mailto:men...@gmail.com]
 Sent: Friday, June 13, 2014 11:50 PM
 To: user@spark.apache.org
 Cc: user
 Subject: Re: MLlib-Missing Regularization Parameter and Intercept for 
 Logistic Regression

 1. 
 examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
 contains example code that shows how to set regParam.

 2. A static method with more than 3 parameters becomes hard to
 remember and hard to maintain. Please use LogistricRegressionWithSGD's
 default constructor and setters.

 -Xiangrui


RE: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
Thank you! I'm really looking forward to that.

Best,

Congrui

-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Monday, June 16, 2014 11:19 AM
To: user@spark.apache.org
Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic 
Regression

Someone is working on weighted regularization. Stay tuned. -Xiangrui

On Mon, Jun 16, 2014 at 9:36 AM, FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
fixed-term.congrui...@us.bosch.com wrote:
 Hi Xiangrui,

 Thank you for the reply! I have tried customizing 
 LogisticRegressionSGD.optimizer as in the example you mentioned, but the 
 source code reveals that the intercept is also penalized if one is included, 
 which is usually inappropriate. The developer should fix this problem.

 Best,

 Congrui

 -Original Message-
 From: Xiangrui Meng [mailto:men...@gmail.com]
 Sent: Friday, June 13, 2014 11:50 PM
 To: user@spark.apache.org
 Cc: user
 Subject: Re: MLlib-Missing Regularization Parameter and Intercept for 
 Logistic Regression

 1. 
 examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
 contains example code that shows how to set regParam.

 2. A static method with more than 3 parameters becomes hard to
 remember and hard to maintain. Please use LogistricRegressionWithSGD's
 default constructor and setters.

 -Xiangrui


Re: MLlib-a problem of example code for L-BFGS

2014-06-16 Thread DB Tsai
Hi Congrui,

I mean create your own TrainMLOR.scala with all the code provided in
the example, and have it under package org.apache.spark.mllib

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Jun 13, 2014 at 1:50 PM, Congrui Yi
fixed-term.congrui...@us.bosch.com wrote:
 Hi DB,

 Thank you for the help! I'm new to this, so could you give a bit more
 details how this could be done?

 Sincerely,

 Congrui Yi





 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-a-problem-of-example-code-for-L-BFGS-tp7589p7596.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread Congrui Yi
Hi DB,

Thank you for the reply! I'm looking forward to this change, which surely adds 
much more flexibility to the optimizers, including whether or not the intercept 
should be penalized.

Sincerely,

Congrui Yi

From: DB Tsai-2 [via Apache Spark User List] 
[mailto:ml-node+s1001560n768...@n3.nabble.com]
Sent: Monday, June 16, 2014 11:31 AM
To: FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic 
Regression

Hi Congrui,

We're working on weighted regularization, so for intercept, you can
just set it as 0. It's also useful when the data is normalized but
want to solve the regularization with original data.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Jun 16, 2014 at 11:18 AM, Xiangrui Meng [hidden email] wrote:

 Someone is working on weighted regularization. Stay tuned. -Xiangrui

 On Mon, Jun 16, 2014 at 9:36 AM, FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
 [hidden email] wrote:
 Hi Xiangrui,

 Thank you for the reply! I have tried customizing 
 LogisticRegressionSGD.optimizer as in the example you mentioned, but the 
 source code reveals that the intercept is also penalized if one is included, 
 which is usually inappropriate. The developer should fix this problem.

 Best,

 Congrui

 -Original Message-
 From: Xiangrui Meng [mailto:[hidden email]]
 Sent: Friday, June 13, 2014 11:50 PM
 To: [hidden email]
 Cc: user
 Subject: Re: MLlib-Missing Regularization Parameter and Intercept for 
 Logistic Regression

 1. 
 examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
 contains example code that shows how to set regParam.

 2. A static method with more than 3 parameters becomes hard to
 remember and hard to maintain. Please use LogistricRegressionWithSGD's
 default constructor and setters.

 -Xiangrui






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Missing-Regularization-Parameter-and-Intercept-for-Logistic-Regression-tp7588p7687.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Bertrand Dechoux
I guess you have to understand the difference in architecture. I don't know
much about C++ MPI, but it is basically MPI, whereas Spark is inspired by
Hadoop MapReduce and optimised for reading/writing large amounts of data
with a smart caching and locality strategy. Intuitively, if you have a high
CPU/message ratio then MPI might be better. But what that ratio is is hard
to say, and in the end it will depend on your specific application.
Finally, in real life, this difference in performance due to the
architecture may not be the only or the most important factor of choice,
as Michael already explained.

Bertrand

On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler mich...@tumra.com wrote:

 Hello Wei,

 I talk from experience of writing many HPC distributed application using
 Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel
 Virtual Machine (PVM) way before that back in the 90's.  I can say with
 absolute certainty:

 *Any gains you believe there are because C++ is faster than Java/Scala
 will be completely blown by the inordinate amount of time you spend
 debugging your code and/or reinventing the wheel to do even basic tasks
 like linear regression.*


  There are undoubtedly some very specialised use-cases where MPI and its
 brethren still dominate for High Performance Computing tasks -- like for
 example the nuclear decay simulations run by the US Department of Energy on
 supercomputers where they've invested billions solving that use case.

 Spark is part of the wider Big Data ecosystem, and its biggest
 advantages are traction amongst internet scale companies, hundreds of
 developers contributing to it and a community of thousands using it.

 Need a distributed fault-tolerant file system? Use HDFS.  Need a
 distributed/fault-tolerant message-queue? Use Kafka.  Need to co-ordinate
 between your worker processes? Use Zookeeper.  Need to run it on a flexible
 grid of computing resources and handle failures? Run it on Mesos!

 The barrier to entry to get going with Spark is very low, download the
 latest distribution and start the Spark shell.  Language bindings for Scala
 / Java / Python are excellent meaning you spend less time writing
 boilerplate code, and more time solving problems.

 Even if you believe you *need* to use native code to do something
 specific, like fetching HD video frames from satellite video capture cards
 -- wrap it in a small native library and use the Java Native Access
 interface to call it from your Java/Scala code.

 Have fun, and if you get stuck we're here to help!

 MC


 On 16 June 2014 08:17, Wei Da xwd0...@gmail.com wrote:

 Hi guys,
  We are making choices between C++ MPI and Spark. Is there any official
  comparison between them? Thanks a lot!

 Wei





RE: MLlib-a problem of example code for L-BFGS

2014-06-16 Thread Congrui Yi
Thank you! I'll try it out.

From: DB Tsai-2 [via Apache Spark User List] 
[mailto:ml-node+s1001560n7686...@n3.nabble.com]
Sent: Monday, June 16, 2014 11:32 AM
To: FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
Subject: Re: MLlib-a problem of example code for L-BFGS

Hi Congrui,

I mean create your own TrainMLOR.scala with all the code provided in
the example, and have it under package org.apache.spark.mllib

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Jun 13, 2014 at 1:50 PM, Congrui Yi
[hidden email] wrote:

 Hi DB,

 Thank you for the help! I'm new to this, so could you give a bit more
 details how this could be done?

 Sincerely,

 Congrui Yi





 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-a-problem-of-example-code-for-L-BFGS-tp7589p7596.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-a-problem-of-example-code-for-L-BFGS-tp7589p7689.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-16 Thread Luis Ángel Vicente Sánchez
Did you manage to make it work? I'm facing similar problems and this is a
serious blocker issue. spark-submit seems kind of broken to me if you can't
use it for spark-streaming.

Regards,

Luis


2014-06-11 1:48 GMT+01:00 lannyripple lanny.rip...@gmail.com:

 I am using Spark 1.0.0 compiled with Hadoop 1.2.1.

 I have a toy spark-streaming-kafka program.  It reads from a kafka queue
 and
 does

  stream
    .map { case (k, v) => (v, 1) }
    .reduceByKey(_ + _)
    .print()

 using a 1 second interval on the stream.

 The docs say to make Spark and Hadoop jars 'provided' but this breaks for
 spark-streaming.  Including spark-streaming (and spark-streaming-kafka) as
 'compile' to sweep them into our assembly gives collisions on javax.*
 classes.  To work around this I modified
 $SPARK_HOME/bin/compute-classpath.sh to include spark-streaming,
 spark-streaming-kafka, and zkclient.  (Note that kafka is included as
 'compile' in my project and picked up in the assembly.)

 I have set up conf/spark-env.sh as needed.  I have copied my assembly to
 /tmp/myjar.jar on all spark hosts and to my hdfs /tmp/jars directory.  I am
 running spark-submit from my spark master.  I am guided by the information
 here https://spark.apache.org/docs/latest/submitting-applications.html

  Well at this point I was going to detail all the ways spark-submit fails to
  follow its own documentation.  If I do not invoke sparkContext.setJars()
  then it just fails to find the driver class.  This is using various
 combinations of absolute path, file:, hdfs: (Warning: Skip remote jar)??,
 and local: prefixes on the application-jar and --jars arguments.

 If I invoke sparkContext.setJars() and include my assembly jar I get
 further.  At this point I get a failure from
  kafka.consumer.ConsumerConnector not being found.  I suspect this is because
  spark-streaming-kafka needs the Kafka dependency, but my assembly jar is
  too late in the classpath.

  At this point I try setting spark.files.userClassPathFirst to 'true' but
 this causes more things to blow up.

 I finally found something that works.  Namely setting environment variable
 SPARK_CLASSPATH=/tmp/myjar.jar  But silly me, this is deprecated and I'm
 helpfully informed to

   Please instead use:
- ./spark-submit with --driver-class-path to augment the driver
 classpath
- spark.executor.extraClassPath to augment the executor classpath

 which when put into a file and introduced with --properties-file does not
 work.  (Also tried spark.files.userClassPathFirst here.)  These fail with
 the kafka.consumer.ConsumerConnector error.

 At a guess what's going on is that using SPARK_CLASSPATH I have my assembly
 jar in the classpath at SparkSubmit invocation

   Spark Command: java -cp

 /tmp/myjar.jar::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.0.0-hadoop1.2.1.jar:/opt/spark/lib/spark-streaming_2.10-1.0.0.jar:/opt/spark/lib/spark-streaming-kafka_2.10-1.0.0.jar:/opt/spark/lib/zkclient-0.4.jar
 -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
 org.apache.spark.deploy.SparkSubmit --class me.KafkaStreamingWC
 /tmp/myjar.jar

 but using --properties-file then the assembly is not available for
 SparkSubmit.

 I think the root cause is either spark-submit not handling the
 spark-streaming libraries so they can be 'provided' or the inclusion of
  org.eclipse.jetty.orbit in the streaming libraries which causes

   [error] (*:assembly) deduplicate: different file contents found in the
 following:
   [error]

 /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
   [error]

 /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
   [error]

 /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
   [error]

 /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA

  I've tried applying mergeStrategy in assembly for my assembly.sbt but then I
 get

   Invalid signature file digest for Manifest main attributes

 If anyone knows the magic to get this working a reply would be greatly
 appreciated.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafka-SPARK-CLASSPATH-tp7356.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



pyspark-Failed to run first

2014-06-16 Thread Congrui Yi
Hi All,

I am just trying to compare the Scala and Python APIs on my local machine. I
just tried to import a local matrix (1000 by 10, created in R) stored in a
text file via textFile in pyspark. When I run data.first() it fails to present
the line and gives error messages, including the following:
 

Then I did nothing except changing the number of rows to 500 and importing
the file again. data.first() runs correctly. 

I also tried these in scala using spark-shell, which runs correctly for both
cases and larger matrices.

Could somebody help me with this problem? I couldn't find an answer on the
internet. It looks like pyspark has a problem with even this simplest of steps?

Best,

Congrui Yi




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Failed-to-run-first-tp7691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


No Intercept for Python

2014-06-16 Thread Naftali Harris
Hi everyone,

The Python LogisticRegressionWithSGD does not appear to estimate an
intercept.  When I run the following, the returned weights and intercept
are both 0.0:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

def main():
    sc = SparkContext(appName="NoIntercept")

    train = sc.parallelize([LabeledPoint(0, [0]), LabeledPoint(1, [0]),
                            LabeledPoint(1, [0])])

    model = LogisticRegressionWithSGD.train(train, iterations=500, step=0.1)
    print "Final weights: " + str(model.weights)
    print "Final intercept: " + str(model.intercept)

if __name__ == "__main__":
    main()


Of course, one can fit an intercept with the simple expedient of adding a
column of ones, but that's kind of annoying.  Moreover, it looks like the
scala version has an intercept option.

Am I missing something? Should I just add the column of ones? If I
submitted a PR doing that, is that the sort of thing you guys would accept?

Thanks! :-)

Naftali


Re: Master not seeing recovered nodes(Got heartbeat from unregistered worker ....)

2014-06-16 Thread Piotr Kołaczkowski
We are having the same problem. We're running Spark 0.9.1 in standalone
mode and on some heavy jobs workers become unresponsive and marked by
master as dead, even though the worker process is still running. Then they
never join the cluster again and the cluster becomes essentially unusable until
we restart each worker.

We'd like to know:
1. Why worker can become unresponsive? Are there any well known config /
usage pitfalls that we could have fallen into? We're still investigating
the issue, but maybe there are some hints?
2. Is there an option to auto-recover a worker? E.g. automatically start a
new one if the old one failed? Or at least some hooks to implement
functionality like that?

Thanks,
Piotr


2014-06-13 22:58 GMT+02:00 Gino Bustelo g...@bustelos.com:

 I get the same problem, but I'm running in a dev environment based on
 docker scripts. The additional issue is that the worker processes do not
 die and so the docker container does not exit. So I end up with worker
 containers that are not participating in the cluster.


 On Fri, Jun 13, 2014 at 9:44 AM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 I have also had trouble in worker joining the working set. I have
 typically moved to Mesos based setup. Frankly for high availability you are
 better off using a cluster manager.

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Jun 13, 2014 at 8:57 AM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 Hi, I see this has been asked before but has not gotten any satisfactory
 answer so I'll try again:

 (here is the original thread I found:
 http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E
 )

 I have a set of workers dying and coming back again. The master prints
 the following warning:

 Got heartbeat from unregistered worker 

 What is the solution to this -- rolling the master is very undesirable
 to me as I have a Shark context sitting on top of it (it's meant to be
 highly available).

 Insights appreciated -- I don't think an executor going down is very
 unexpected but it does seem odd that it won't be able to rejoin the working
 set.

 I'm running Spark 0.9.1 on CDH







-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404


Re: pyspark serializer can't handle functions?

2014-06-16 Thread madeleine
Interesting! I'm curious why you use cloudpickle internally, but then use
standard pickle to serialize RDDs?

I'd like to create an RDD of functions because (I think) it's the most
natural way to express my problem. I have a matrix of functions; I'm trying
to find a low rank matrix that minimizes the sum of these functions
evaluated at the entries of the low rank matrix. For example, the problem
is PCA on the matrix A when the (i,j)th function is lambda z: (z-A[i,j])^2.
In general, each of these functions is defined using a two argument base
function lambda a,z: (z-a)^2 and the data A[i,j]; but it's somewhat cleaner
just to express the minimization problem in terms of the one argument
functions.

One other wrinkle is that I'm using alternating minimization, so I'll be
minimizing over the rows and columns of this matrix at alternating steps;
hence I need to store both the matrix and its transpose to avoid data
thrashing.


On Mon, Jun 16, 2014 at 11:05 AM, Matei Zaharia [via Apache Spark User
List] ml-node+s1001560n7682...@n3.nabble.com wrote:

 It’s true that it can’t. You can try to use the CloudPickle library
 instead, which is what we use within PySpark to serialize functions (see
 python/pyspark/cloudpickle.py). However I’m also curious, why do you need
 an RDD of functions?

 Matei

 On Jun 15, 2014, at 4:49 PM, madeleine [hidden email]
 http://user/SendEmail.jtp?type=nodenode=7682i=0 wrote:

  It seems that the default serializer used by pyspark can't serialize a
 list
  of functions.
  I've seen some posts about trying to fix this by using dill to serialize
  rather than pickle.
  Does anyone know what the status of that project is, or whether there's
  another easy workaround?
 
  I've pasted a sample error message below. Here, regs is a function
 defined
  in another file myfile.py that has been included on all workers via the
  pyFiles argument to SparkContext: sc = SparkContext("local",
  "myapp", pyFiles=["myfile.py"]).
 
   File runfile.py, line 45, in __init__
 regsRDD = sc.parallelize([regs]*self.n)
   File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/context.py,
  line 223, in parallelize
 serializer.dump_stream(c, tempFile)
   File
  /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py,
 line
  182, in dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File
  /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py,
 line
  118, in dump_stream
 self._write_with_length(obj, stream)
   File
  /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py,
 line
  128, in _write_with_length
 serialized = self.dumps(obj)
   File
  /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py,
 line
  270, in dumps
 def dumps(self, obj): return cPickle.dumps(obj, 2)
  cPickle.PicklingError: Can't pickle type 'function': attribute lookup
  __builtin__.function failed
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.







-- 
Madeleine Udell
PhD Candidate in Computational and Mathematical Engineering
Stanford University
www.stanford.edu/~udell




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650p7694.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Can't get Master Kerberos principal for use as renewer

2014-06-16 Thread Finamore A.
Hi,

I'm a new user of Spark and I'm trying to integrate it into my cluster.
It's a small set of nodes running CDH 4.7 with Kerberos.
The other services are fine with the authentication, but I have some trouble
with Spark.

First, I used the parcel available in Cloudera Manager (SPARK
0.9.0-1.cdh4.6.0.p0.98).
Since the cluster has CDH 4.7 (not 4.6) I'm not sure whether this can create
problems.
I've also tried with the new Spark 1.0.0, with no luck ...

I've configured the environment as reported in
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html
I'm using a standalone deployment.

When launching spark-shell (for testing), everything seems fine (the
process got registered with the master).
But when I try to execute the example reported in the installation page,
Kerberos blocks the access to HDFS:
scala> val file = sc.textFile("hdfs://m1hadoop.polito.it:8020/user/finamore/data")
14/06/16 22:28:36 INFO storage.MemoryStore: ensureFreeSpace(135653) called
with curMem=0, maxMem=308713881
14/06/16 22:28:36 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 132.5 KB, free 294.3 MB)
file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
console:12

scala> val counts = file.flatMap(line => line.split(" ")).map(word =>
(word, 1)).reduceByKey(_ + _)
java.io.IOException: Can't get Master Kerberos principal for use as renewer
at
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
at
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:187)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
at
org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:354)
at $iwC$$iwC$$iwC$$iwC.init(console:14)
at $iwC$$iwC$$iwC.init(console:19)
at $iwC$$iwC.init(console:21)
at $iwC.init(console:23)
at init(console:25)
at .init(console:29)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
at

Set comparison

2014-06-16 Thread SK
Hi,

I have a Spark method that returns RDD[String], which I am converting to a
set and then comparing it to the expected output as shown in the following
code. 

1. val expected_res = Set("ID1", "ID2", "ID3")  // expected output

2. val result:RDD[String] = getData(input)  // method returns RDD[String]
3. val set_val = result.collect().toSet // convert to set.
4. println(set_val)     // prints:  Set("ID1", "ID2", "ID3")
5. println(expected_res)// prints:  Set(ID1, ID2, ID3)

// verify output
6. if( set_val == expected_res)
7.println(true)  // this does not get printed

The value returned by the method is almost the same as the expected output, but
the verification is failing. I am not sure why the expected_res in Line 5 does
not print the quotes even though Line 1 has them. Could that be the reason
the comparison is failing? What is the right way to do the above comparison?

thanks





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark master UI does not keep detailed application history

2014-06-16 Thread Andrew Or
Are you referring to accessing a SparkUI for an application that has
finished? First you need to enable event logging while the application is
still running. In Spark 1.0, you set this by adding a line to
$SPARK_HOME/conf/spark-defaults.conf:

spark.eventLog.enabled true

Other than that, the content served on the master UI is largely the same as
it was before 1.0 was introduced.
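
The same settings can also be set programmatically when the SparkContext is
built, which is sometimes easier than editing spark-defaults.conf on every
submit host. A small sketch (the app name and the hdfs:///spark-events
directory are placeholders; the directory must exist and be writable):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("my-app")
    // keep the per-application event log so the UI of a finished app can be rebuilt
    .set("spark.eventLog.enabled", "true")
    // optional: where to write the events (by default a directory under /tmp on the driver)
    .set("spark.eventLog.dir", "hdfs:///spark-events")

  val sc = new SparkContext(conf)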


2014-06-14 16:43 GMT-07:00 wxhsdp wxh...@gmail.com:

 hi, zhen
   i met the same problem in ec2, application details can not be accessed.
 but i can read stdout
   and stderr. the problem has not been solved yet



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-master-UI-does-not-keep-detailed-application-history-tp7608p7635.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Set comparison

2014-06-16 Thread Sean Owen
On Mon, Jun 16, 2014 at 10:09 PM, SK skrishna...@gmail.com wrote:

 The value returned by the method is almost same as expected output, but the
 verification is failing. I am not sure why the expected_res in Line 5 does
 not print the quotes even though Line 1 has them. Could  that be the reason
 the comparison is failing? What is the right way to do the above
 comparison?


Most likely because the strings in set_val do actually start and end with
double-quotes? That is just what the output suggests. It has a string like
"ID1", not ID1
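
If those quote characters are not actually meaningful, one way to make the
comparison robust is to strip them from the collected strings before building
the set; a small sketch along those lines:

  // remove any literal double-quote characters before comparing
  val set_val = result.collect().map(_.replaceAll("\"", "")).toSet
  val expected_res = Set("ID1", "ID2", "ID3")

  assert(set_val == expected_res, s"got $set_val, expected $expected_res")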


Re: Set comparison

2014-06-16 Thread SK
In Line 1, I have expected_res as a set of strings with quotes. So I thought
it would include the quotes during comparison.

Anyway I modified expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and
that seems to work.

thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696p7699.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: pyspark serializer can't handle functions?

2014-06-16 Thread Matei Zaharia
Ah, I see, interesting. CloudPickle is slower than the cPickle library, so 
that’s why we didn’t use it for data, but it should be possible to write a 
Serializer that uses it. Another thing you can do for this use case though is 
to define a class that represents your functions:

class MyFunc(object):
    def __call__(self, argument):
        return argument

f = MyFunc()

f(5)

Instances of a class like this should be pickle-able using the standard pickle 
serializer, though you may have to put the class in a separate .py file and 
include that in the list of .py files passed to SparkContext. And then in the 
code you can still use them as functions.

Matei

On Jun 16, 2014, at 1:12 PM, madeleine madeleine.ud...@gmail.com wrote:

 Interesting! I'm curious why you use cloudpickle internally, but then use 
 standard pickle to serialize RDDs?
 
 I'd like to create an RDD of functions because (I think) it's the most 
 natural way to express my problem. I have a matrix of functions; I'm trying 
 to find a low rank matrix that minimizes the sum of these functions evaluated 
 on the entries on the low rank matrix. For example, the problem is PCA on the 
 matrix A when the (i,j)th function is lambda z: (z-A[i,j])^2. In general, 
 each of these functions is defined using a two argument base function lambda 
 a,z: (z-a)^2 and the data A[i,j]; but it's somewhat cleaner just to express 
 the minimization problem in terms of the one argument functions. 
 
 One other wrinkle is that I'm using alternating minimization, so I'll be 
 minimizing over the rows and columns of this matrix at alternating steps; 
 hence I need to store both the matrix and its transpose to avoid data 
 thrashing.
 
 
 On Mon, Jun 16, 2014 at 11:05 AM, Matei Zaharia [via Apache Spark User List] 
 [hidden email] wrote:
 It’s true that it can’t. You can try to use the CloudPickle library instead, 
 which is what we use within PySpark to serialize functions (see 
 python/pyspark/cloudpickle.py). However I’m also curious, why do you need an 
 RDD of functions? 
 
 Matei 
 
 On Jun 15, 2014, at 4:49 PM, madeleine [hidden email] wrote: 
 
  It seems that the default serializer used by pyspark can't serialize a list 
  of functions. 
  I've seen some posts about trying to fix this by using dill to serialize 
  rather than pickle. 
  Does anyone know what the status of that project is, or whether there's 
  another easy workaround? 
  
  I've pasted a sample error message below. Here, regs is a function defined 
  in another file myfile.py that has been included on all workers via the 
  pyFiles argument to SparkContext: sc = SparkContext(local, 
  myapp,pyFiles=[myfile.py]). 
  
   File runfile.py, line 45, in __init__ 
 regsRDD = sc.parallelize([regs]*self.n) 
   File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/context.py, 
  line 223, in parallelize 
 serializer.dump_stream(c, tempFile) 
   File 
  /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 
  182, in dump_stream 
 self.serializer.dump_stream(self._batched(iterator), stream) 
   File 
  /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 
  118, in dump_stream 
 self._write_with_length(obj, stream) 
   File 
  /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 
  128, in _write_with_length 
 serialized = self.dumps(obj) 
   File 
  /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 
  270, in dumps 
 def dumps(self, obj): return cPickle.dumps(obj, 2) 
  cPickle.PicklingError: Can't pickle type 'function': attribute lookup 
  __builtin__.function failed 
  
  
  
  -- 
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 
 
 
 
 
 -- 
 Madeleine Udell
 PhD Candidate in Computational and Mathematical Engineering
 Stanford University
 www.stanford.edu/~udell
 
 View this message in context: Re: pyspark serializer can't handle functions?
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
With the help from the Accumulo guys, I probably know why.

I'm using the binary distro of Spark and Base64 is from spark-assembly.jar
and it probably uses an older version of commons-codec.

I'll need to reinstall spark from source.

Jianshi


On Mon, Jun 16, 2014 at 9:18 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Hi

 Check in your driver programs Environment, (eg:
 http://192.168.1.39:4040/environment/). If you don't see this
 commons-codec-1.7.jar jar then that's the issue.

 Thanks
 Best Regards


 On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi,

 I'm trying to use Accumulo with Spark by writing to AccumuloOutputFormat.
 It went all well on my laptop (Accumulo MockInstance + Spark Local mode).

 But when I try to submit it to the yarn cluster, the yarn logs shows the
 following error message:

 14/06/16 02:01:44 INFO cluster.YarnClientClusterScheduler:
 YarnClientClusterScheduler.postStartHook done
 Exception in thread main java.lang.NoSuchMethodError:
 org.apache.commons.codec.binary.Base64.encodeBase64String([B)Ljava/lang/String;
 at
 org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.setConnectorInfo(ConfiguratorBase.java:127)
 at
 org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.setConnectorInfo(AccumuloOutputFormat.java:92)
 at
 com.paypal.rtgraph.demo.MapReduceWriter$.main(MapReduceWriter.scala:44)
 at
 com.paypal.rtgraph.demo.MapReduceWriter.main(MapReduceWriter.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


 Looks like Accumulo's dependency has got problems.

 Does Anyone know what's wrong with my code or settings? I've added all
 needed jars to spark's classpath. I confirmed that commons-codec-1.7.jar
 has been uploaded to hdfs.

 14/06/16 04:36:02 INFO yarn.Client: Uploading
 file:/x/home/jianshuang/tmp/lib/commons-codec-1.7.jar to
 hdfs://manny-lvs/user/jianshuang/.sparkStaging/application_1401752249873_12662/commons-codec-1.7.jar



 And here's my spark-submit cmd (all JARs needed are concatenated after
 --jars):

 ~/spark/spark-1.0.0-bin-hadoop2/bin/spark-submit --name 'rtgraph' --class
 com.paypal.rtgraph.demo.Tables --master yarn --deploy-mode cluster --jars
 `find lib -type f | tr '\n' ','` --driver-memory 4G --driver-cores 4
 --executor-memory 20G --executor-cores 8 --num-executors 2 rtgraph.jar

 I've tried both cluster mode and client mode and neither worked.


 BTW, I tried to use sbt-assembly to created a bundled jar, however I
 always got the following error:

 [error] (*:assembly) deduplicate: different file contents found in the
 following:
 [error]
 /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
 [error]
 /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
 [error]
 /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
 [error]
 /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA

 I googled it and looks like I need to exclude some JARs. Anyone has done
 that? Your help is really appreciated.



 Cheers,

 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/


spark with docker: errors with akka, NAT?

2014-06-16 Thread Mohit Jaggi
Hi Folks,

I am having trouble getting the Spark driver running in Docker. If I run a
pyspark example on my Mac it works, but the same example on a Docker image
(via boot2docker) fails with the following logs. I am pointing the Spark driver
(which is running the example) to a Spark cluster (the driver is not part of
the cluster). I guess this has something to do with Docker's networking
stack (it may be getting NAT'd), but I am not sure why (if at all) the
spark-worker or spark-master is trying to create a new TCP connection to
the driver, instead of responding on the connection initiated by the driver.

I would appreciate any help in figuring this out.

Thanks,

Mohit.

logs

Spark Executor Command: java -cp
::/home/ayasdi/spark/conf:/home//spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
-Xms2g -Xmx2g -Xms512M -Xmx512M
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler 1
cobalt 24 akka.tcp://sparkWorker@:33952/user/Worker
app-20140616152201-0021




log4j:WARN No appenders could be found for logger
(org.apache.hadoop.conf.Configuration).

log4j:WARN Please initialize the log4j system properly.

log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.

14/06/16 15:22:05 INFO SparkHadoopUtil: Using Spark's default log4j
profile: org/apache/spark/log4j-defaults.properties

14/06/16 15:22:05 INFO SecurityManager: Changing view acls to: ayasdi,root

14/06/16 15:22:05 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(xxx, xxx)

14/06/16 15:22:05 INFO Slf4jLogger: Slf4jLogger started

14/06/16 15:22:05 INFO Remoting: Starting remoting

14/06/16 15:22:06 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkExecutor@:33536]

14/06/16 15:22:06 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkExecutor@:33536]

14/06/16 15:22:06 INFO CoarseGrainedExecutorBackend: Connecting to driver:
akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler

14/06/16 15:22:06 INFO WorkerWatcher: Connecting to worker
akka.tcp://sparkWorker@:33952/user/Worker

14/06/16 15:22:06 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://spark@fc31887475e3:43921]. Address is now gated for
6 ms, all messages to this address will be delivered to dead letters.

14/06/16 15:22:06 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@:33536] - [akka.tcp://spark@fc31887475e3:43921]
disassociated! Shutting down.
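
For what it's worth, the executors do initiate a connection back to the driver
(that is the 'Connecting to driver: akka.tcp://spark@fc31887475e3:43921' line
above), so the hostname and port the driver advertises must be routable from
the workers. A workaround often used for NAT'd or containerized drivers is to
pin both explicitly; a hedged sketch using standard Spark properties (the
master URL and IP below are placeholders, and the same keys can be set from
pyspark's SparkConf as well):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("docker-driver")
    .setMaster("spark://master-host:7077")      // placeholder master URL
    // advertise an address the workers can route to, not the container hostname
    .set("spark.driver.host", "192.168.59.3")   // placeholder routable IP
    // fix the driver port so it can be forwarded through boot2docker/NAT
    .set("spark.driver.port", "43921")

  val sc = new SparkContext(conf)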


Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Tom Vacek
Spark gives you four of the classical collectives: broadcast, reduce,
scatter, and gather.  There are also a few additional primitives, mostly
based on a join.  Spark is certainly less optimized than MPI for these, but
maybe that isn't such a big deal.  Spark has one theoretical disadvantage
compared to MPI: every collective operation requires the task closures to
be distributed, and---to my knowledge---this is an O(p) operation.
 (Perhaps there has been some progress on this??)  That O(p) term spoils
any parallel isoefficiency analysis.  In MPI, binaries are distributed
once, and wireup is O(log p).  In practice, it prevents
reasonable-looking strong scaling curves; with MPI, the overall runtime
will stop declining and level off with increasing p, but with Spark it can
go up sharply.  So, Spark is great for a small cluster.  For a huge
cluster, or a job with a lot of collectives, it isn't so great.


On Mon, Jun 16, 2014 at 1:36 PM, Bertrand Dechoux decho...@gmail.com
wrote:

 I guess you have to understand the difference in architecture. I don't
 know much about C++ MPI but it is basically MPI, whereas Spark is inspired
 by Hadoop MapReduce and optimised for reading/writing large amounts of
 data with a smart caching and locality strategy. Intuitively, if you have a
 high CPU/message ratio then MPI might be better. But what that ratio is is
 hard to say, and in the end it will depend on your specific
 application. Finally, in real life, this difference in performance due to
 the architecture may not be the only or the most important factor of choice,
 as Michael already explained.

 Bertrand

 On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler mich...@tumra.com wrote:

 Hello Wei,

 I talk from experience of writing many HPC distributed application using
 Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel
 Virtual Machine (PVM) way before that back in the 90's.  I can say with
 absolute certainty:

 *Any gains you believe there are because C++ is faster than Java/Scala
 will be completely blown by the inordinate amount of time you spend
 debugging your code and/or reinventing the wheel to do even basic tasks
 like linear regression.*


 There are undoubtably some very specialised use-cases where MPI and its
 brethren still dominate for High Performance Computing tasks -- like for
 example the nuclear decay simulations run by the US Department of Energy on
 supercomputers where they've invested billions solving that use case.

 Spark is part of the wider Big Data ecosystem, and its biggest
 advantages are traction amongst internet scale companies, hundreds of
 developers contributing to it and a community of thousands using it.

 Need a distributed fault-tolerant file system? Use HDFS.  Need a
 distributed/fault-tolerant message-queue? Use Kafka.  Need to co-ordinate
 between your worker processes? Use Zookeeper.  Need to run it on a flexible
 grid of computing resources and handle failures? Run it on Mesos!

 The barrier to entry to get going with Spark is very low, download the
 latest distribution and start the Spark shell.  Language bindings for Scala
 / Java / Python are excellent meaning you spend less time writing
 boilerplate code, and more time solving problems.

 Even if you believe you *need* to use native code to do something
 specific, like fetching HD video frames from satellite video capture cards
 -- wrap it in a small native library and use the Java Native Access
 interface to call it from your Java/Scala code.

 Have fun, and if you get stuck we're here to help!

 MC


 On 16 June 2014 08:17, Wei Da xwd0...@gmail.com wrote:

 Hi guys,
 We are making choices between C++ MPI and Spark. Is there any official
 comparation between them? Thanks a lot!

 Wei






Spark sql unable to connect to db2 hive metastore

2014-06-16 Thread Jenny Zhao
Hi,

My Hive configuration uses DB2 as its metastore database. I have built
Spark with the extra step sbt/sbt assembly/assembly to include the
dependency jars, and copied HIVE_HOME/conf/hive-site.xml under spark/conf.
When I ran:

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

I got the following exception (a portion of the stack trace is pasted below).
Looking at the stack, this made me wonder whether Spark supports a remote
metastore configuration; it seems Spark doesn't talk to HiveServer2 directly?
The driver jars db2jcc-10.5.jar and db2jcc_license_cisuz-10.5.jar are both
included in the classpath; otherwise it complains that it couldn't find the
driver.

Appreciate any help to resolve it.

Thanks!

Caused by: java.sql.SQLException: Unable to open a test connection to the
given database. JDBC url = jdbc:db2://localhost:50001/BIDB, username =
catalog. Terminating connection pool. Original Exception: --
java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getConnection(DriverManager.java:422)
at java.sql.DriverManager.getConnection(DriverManager.java:374)
at
com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:254)
at com.jolbox.bonecp.BoneCP.init(BoneCP.java:305)
at
com.jolbox.bonecp.BoneCPDataSource.maybeInit(BoneCPDataSource.java:150)
at
com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:112)
at
org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:479)
at
org.datanucleus.store.rdbms.RDBMSStoreManager.init(RDBMSStoreManager.java:304)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39)
at java.lang.reflect.Constructor.newInstance(Constructor.java:527)
at
org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at
org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
at
org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1069)
at
org.datanucleus.NucleusContext.initialise(NucleusContext.java:359)
at
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:768)
at
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:326)
at
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
at java.lang.reflect.Method.invoke(Method.java:611)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at
java.security.AccessController.doPrivileged(AccessController.java:277)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:275)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:304)
at
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:234)
at
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:209)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at
org.apache.hadoop.hive.metastore.RetryingRawStore.init(RetryingRawStore.java:64)
at
org.apache.hadoop.hive.metastore.RetryingRawStore.getProxy(RetryingRawStore.java:73)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:415)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:402)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:286)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.init(RetryingHMSHandler.java:54)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
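
The 'No suitable driver' at the root of this trace usually means the DB2 JDBC
driver never got registered with java.sql.DriverManager in the driver JVM,
rather than a problem with the metastore location itself. One quick check is to
force-load the driver class before creating the Hive context; a sketch, assuming
the usual DB2 JCC driver class name and that db2jcc-10.5.jar really is on the
driver's classpath (it is also worth confirming that
javax.jdo.option.ConnectionDriverName in hive-site.xml names the same class):

  // register the DB2 JCC driver with DriverManager before Hive's
  // metastore connection pool tries to open a JDBC connection
  Class.forName("com.ibm.db2.jcc.DB2Driver")

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  import hiveContext._

  hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")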
   

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-16 Thread Gino Bustelo
+1 for this issue. The documentation for spark-submit is misleading. Among many
issues, the jar support is bad. HTTP URLs do not work. This is because Spark is
using Hadoop's FileSystem class. You have to specify the jars twice to get
things to work: once for the DriverWrapper to load your classes, and a second
time in the Context to distribute them to the workers.

I would like to see some response from the contributors on this issue.

Gino B.

 On Jun 16, 2014, at 1:49 PM, Luis Ángel Vicente Sánchez 
 langel.gro...@gmail.com wrote:
 
 Did you manage to make it work? I'm facing similar problems and this is a
 serious blocker issue. spark-submit seems kind of broken to me if you can't use
 it for spark-streaming.
 
 Regards,
 
 Luis
 
 
 2014-06-11 1:48 GMT+01:00 lannyripple lanny.rip...@gmail.com:
 I am using Spark 1.0.0 compiled with Hadoop 1.2.1.
 
 I have a toy spark-streaming-kafka program.  It reads from a kafka queue and
 does
 
 stream
   .map {case (k, v) = (v, 1)}
   .reduceByKey(_ + _)
   .print()
 
 using a 1 second interval on the stream.
 
 The docs say to make Spark and Hadoop jars 'provided' but this breaks for
 spark-streaming.  Including spark-streaming (and spark-streaming-kafka) as
 'compile' to sweep them into our assembly gives collisions on javax.*
 classes.  To work around this I modified
 $SPARK_HOME/bin/compute-classpath.sh to include spark-streaming,
 spark-streaming-kafka, and zkclient.  (Note that kafka is included as
 'compile' in my project and picked up in the assembly.)
 
 I have set up conf/spark-env.sh as needed.  I have copied my assembly to
 /tmp/myjar.jar on all spark hosts and to my hdfs /tmp/jars directory.  I am
 running spark-submit from my spark master.  I am guided by the information
 here https://spark.apache.org/docs/latest/submitting-applications.html
 
 Well at this point I was going to detail all the ways spark-submit fails to
 follow it's own documentation.  If I do not invoke sparkContext.setJars()
 then it just fails to find the driver class.  This is using various
 combinations of absolute path, file:, hdfs: (Warning: Skip remote jar)??,
 and local: prefixes on the application-jar and --jars arguments.
 
 If I invoke sparkContext.setJars() and include my assembly jar I get
 further.  At this point I get a failure from
 kafka.consumer.ConsumerConnector not being found.  I suspect this is because
 spark-streaming-kafka needs the Kafka dependency it but my assembly jar is
 too late in the classpath.
 
 At this point I try setting spark.files.userClassPathfirst to 'true' but
 this causes more things to blow up.
 
 I finally found something that works.  Namely setting environment variable
 SPARK_CLASSPATH=/tmp/myjar.jar  But silly me, this is deprecated and I'm
 helpfully informed to
 
   Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
 
 which when put into a file and introduced with --properties-file does not
 work.  (Also tried spark.files.userClassPathFirst here.)  These fail with
 the kafka.consumer.ConsumerConnector error.
 
 At a guess what's going on is that using SPARK_CLASSPATH I have my assembly
 jar in the classpath at SparkSubmit invocation
 
   Spark Command: java -cp
 /tmp/myjar.jar::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.0.0-hadoop1.2.1.jar:/opt/spark/lib/spark-streaming_2.10-1.0.0.jar:/opt/spark/lib/spark-streaming-kafka_2.10-1.0.0.jar:/opt/spark/lib/zkclient-0.4.jar
 -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
 org.apache.spark.deploy.SparkSubmit --class me.KafkaStreamingWC
 /tmp/myjar.jar
 
 but using --properties-file then the assembly is not available for
 SparkSubmit.
 
 I think the root cause is either spark-submit not handling the
 spark-streaming libraries so they can be 'provided' or the inclusion of
 org.elicpse.jetty.orbit in the streaming libraries which cause
 
   [error] (*:assembly) deduplicate: different file contents found in the
 following:
   [error]
 /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
   [error]
 /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
   [error]
 /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
   [error]
 /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA
 
 I've tried applying mergeStategy in assembly for my assembly.sbt but then I
 get
 
   Invalid signature file digest for Manifest main attributes
 
 If anyone knows the magic to get this working a reply would be greatly
 appreciated.
 
 
 
 --
 View this message in context: 
 

Re: Set comparison

2014-06-16 Thread Ye Xianjin
If you want a string with quotes in it, you have to escape them with '\'. That's exactly
what you did in the modified version.

Sent from my iPhone

 On Jun 17, 2014, at 5:43, SK skrishna...@gmail.com wrote:
 
 In Line 1, I have expected_res as a set of strings with quotes. So I thought
 it would include the quotes during comparison.
 
 Anyway I modified expected_res = Set(\ID1\, \ID2\, \ID3\) and
 that seems to work.
 
 thanks.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696p7699.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


akka.FrameSize

2014-06-16 Thread Chen Jin
Hi all,

I have run into a very interesting bug which is not exactly the same as
SPARK-1112.

Here is how to reproduce the bug: I have one input CSV file and use the
partitionBy function to create an RDD, say repartitionedRDD. The
partitionBy function takes the number of partitions as a parameter,
so that we can easily vary the serialized size per partition for the
following experiments. At the end, I simply call
repartitionedRDD.collect().

1) spark.akka.frameSize = 10
If one of the partition sizes is very close to 10MB, say 9.97MB, the
execution blocks without any exception or warning. The worker finishes the
task of sending the serialized result, and then throws an exception saying
the Hadoop IPC client connection stopped (with logging set to debug
level). However, the master never receives the result and the program
just hangs.
But if the sizes of all the partitions are less than some number between 9.96MB
and 9.97MB, the program works fine.
2) spark.akka.frameSize = 9
When a partition size is just a little bit smaller than 9MB, it fails as well.
This behavior is not exactly what SPARK-1112 is about; could you
please guide me on how to open a separate bug for the case where the
serialized size is very close to 10MB.

I googled around and haven't found anything which relates to the
behavior we have found. Any insights or suggestions would be greatly
appreciated.

Thanks! :-)

-chen
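
Until the root cause is tracked down, the usual workaround when a collected
result bumps against the Akka frame size is to raise the limit well above the
largest serialized partition (the value is in MB); a small sketch:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("frame-size-repro")
    // raise the Akka frame size (in MB) comfortably above the largest
    // serialized partition that collect() has to ship back to the driver
    .set("spark.akka.frameSize", "64")

  val sc = new SparkContext(conf)

That does not explain why a result just under the configured limit hangs
instead of failing with a clear error, which still looks worth a JIRA of its
own.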


Re: spark with docker: errors with akka, NAT?

2014-06-16 Thread Andre Schumacher

Hi,

are you using the amplab/spark-1.0.0 images from the global registry?

Andre

On 06/17/2014 01:36 AM, Mohit Jaggi wrote:
 Hi Folks,
 
 I am having trouble getting spark driver running in docker. If I run a
 pyspark example on my mac it works but the same example on a docker image
 (Via boot2docker) fails with following logs. I am pointing the spark driver
 (which is running the example) to a spark cluster (driver is not part of
 the cluster). I guess this has something to do with docker's networking
 stack (it may be getting NAT'd) but I am not sure why (if at all) the
 spark-worker or spark-master is trying to create a new TCP connection to
 the driver, instead of responding on the connection initiated by the driver.
 
 I would appreciate any help in figuring this out.
 
 Thanks,
 
 Mohit.
 
 logs
 
 Spark Executor Command: java -cp
 ::/home/ayasdi/spark/conf:/home//spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 -Xms2g -Xmx2g -Xms512M -Xmx512M
 org.apache.spark.executor.CoarseGrainedExecutorBackend
 akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler 1
 cobalt 24 akka.tcp://sparkWorker@:33952/user/Worker
 app-20140616152201-0021
 
 
 
 
 log4j:WARN No appenders could be found for logger
 (org.apache.hadoop.conf.Configuration).
 
 log4j:WARN Please initialize the log4j system properly.
 
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
 more info.
 
 14/06/16 15:22:05 INFO SparkHadoopUtil: Using Spark's default log4j
 profile: org/apache/spark/log4j-defaults.properties
 
 14/06/16 15:22:05 INFO SecurityManager: Changing view acls to: ayasdi,root
 
 14/06/16 15:22:05 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(xxx, xxx)
 
 14/06/16 15:22:05 INFO Slf4jLogger: Slf4jLogger started
 
 14/06/16 15:22:05 INFO Remoting: Starting remoting
 
 14/06/16 15:22:06 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkExecutor@:33536]
 
 14/06/16 15:22:06 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sparkExecutor@:33536]
 
 14/06/16 15:22:06 INFO CoarseGrainedExecutorBackend: Connecting to driver:
 akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler
 
 14/06/16 15:22:06 INFO WorkerWatcher: Connecting to worker
 akka.tcp://sparkWorker@:33952/user/Worker
 
 14/06/16 15:22:06 WARN Remoting: Tried to associate with unreachable remote
 address [akka.tcp://spark@fc31887475e3:43921]. Address is now gated for
 6 ms, all messages to this address will be delivered to dead letters.
 
 14/06/16 15:22:06 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
 [akka.tcp://sparkExecutor@:33536] - [akka.tcp://spark@fc31887475e3:43921]
 disassociated! Shutting down.