Re: Kafka client - specify offsets?
Hi, there are apparently helpers to tell you the offsets https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example#id-0.8.0SimpleConsumerExample-FindingStartingOffsetforReads, but I have no idea how to pass that to the Kafka stream consumer. I am interested in that as well. Tobias On Thu, Jun 12, 2014 at 5:53 AM, Michael Campbell michael.campb...@gmail.com wrote: Is there a way in the Apache Spark Kafka Utils to specify an offset to start reading? Specifically, from the start of the queue, or failing that, a specific point?
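For reference, a minimal sketch of starting a Kafka stream from the beginning of the topic, assuming Spark 1.0's KafkaUtils and the Kafka 0.8 high-level consumer: setting auto.offset.reset to "smallest" makes a consumer group with no committed offsets start at the earliest available offset (use a fresh group.id to force a re-read). Arbitrary starting offsets are not supported by the high-level consumer that KafkaUtils wraps; that requires the SimpleConsumer API from the link above.

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // consumer settings passed straight through to Kafka; hosts, group and topic names are placeholders
    val kafkaParams = Map(
      "zookeeper.connect" -> "zk-host:2181",
      "group.id" -> "my-fresh-group",
      "auto.offset.reset" -> "smallest")   // "largest" (the default) starts at the tail instead

    val lines = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)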
Is There Any Benchmarks Comparing C++ MPI with Spark
Hi guys, We are deciding between C++ MPI and Spark. Is there any official comparison between them? Thanks a lot! Wei
Spark streaming with Redis? Working with large number of model objects at spark compute nodes.
We are creating a real-time stream processing system with Spark Streaming which applies a large number (millions) of analytic models to RDDs in many different types of streams. Since we do not know which Spark node will process a specific RDD, we need to make these models available at each Spark compute node. We are planning to use Redis as an in-memory cache over the Spark cluster to feed these models to the compute nodes. Is this the right approach? We cannot cache all models locally on every Spark compute node. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-with-Redis-Working-with-large-number-of-model-objects-at-spark-compute-nodes-tp7663.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
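When the full model set cannot be held on every node, the usual pattern is to fetch only the models a partition actually needs, opening one cache connection per partition rather than per record. A rough sketch, assuming the Jedis Redis client; modelKeyFor, deserializeModel and score stand in for application code:

    import redis.clients.jedis.Jedis

    rdd.mapPartitions { records =>
      val redis = new Jedis("redis-host", 6379)        // one connection per partition, not per record
      records.map { rec =>
        val serialized = redis.get(modelKeyFor(rec))   // pull only the model this record needs
        score(deserializeModel(serialized), rec)
      }
      // note: with a lazy iterator the connection stays open until the partition is fully consumed;
      // production code would arrange cleanup (e.g. a TaskContext completion callback)
    }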
What is the best way to handle transformations or actions that takes forever?
My transformations and actions have some external toolset dependencies, and sometimes they just get stuck somewhere with no way for me to fix them. If I don't want the job to run forever, do I need to implement monitor threads that throw an exception when they get stuck, or can the framework already handle that? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-best-way-to-handle-transformations-or-actions-that-takes-forever-tp7664.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
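One way to bound a flaky external call is to wrap it in a Future inside the transformation and give up after a fixed timeout; Spark itself will not interrupt a task that is merely hung. A rough sketch, where callExternalTool is the user's own function and the five-minute limit is arbitrary:

    import java.util.concurrent.TimeoutException
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val results = rdd.map { record =>
      val work = Future { callExternalTool(record) }   // run the external call off-thread
      try Some(Await.result(work, 5.minutes))
      catch {
        case _: TimeoutException => None   // skip this record; the stuck thread may linger,
      }                                    // so the external tool itself should also be killable
    }

Speculative execution (spark.speculation=true) can also help when only a few straggler tasks hang, since Spark will relaunch slow tasks elsewhere.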
Re: wholeTextFiles not working with HDFS
Hi, I have the same exception. Can you tell me how you fixed it? Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p7665.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Is There Any Benchmarks Comparing C++ MPI with Spark
Hello Wei, I talk from experience of writing many HPC distributed application using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel Virtual Machine (PVM) way before that back in the 90's. I can say with absolute certainty: *Any gains you believe there are because C++ is faster than Java/Scala will be completely blown by the inordinate amount of time you spend debugging your code and/or reinventing the wheel to do even basic tasks like linear regression.* There are undoubtably some very specialised use-cases where MPI and its brethren still dominate for High Performance Computing tasks -- like for example the nuclear decay simulations run by the US Department of Energy on supercomputers where they've invested billions solving that use case. Spark is part of the wider Big Data ecosystem, and its biggest advantages are traction amongst internet scale companies, hundreds of developers contributing to it and a community of thousands using it. Need a distributed fault-tolerant file system? Use HDFS. Need a distributed/fault-tolerant message-queue? Use Kafka. Need to co-ordinate between your worker processes? Use Zookeeper. Need to run it on a flexible grid of computing resources and handle failures? Run it on Mesos! The barrier to entry to get going with Spark is very low, download the latest distribution and start the Spark shell. Language bindings for Scala / Java / Python are excellent meaning you spend less time writing boilerplate code, and more time solving problems. Even if you believe you *need* to use native code to do something specific, like fetching HD video frames from satellite video capture cards -- wrap it in a small native library and use the Java Native Access interface to call it from your Java/Scala code. Have fun, and if you get stuck we're here to help! MC On 16 June 2014 08:17, Wei Da xwd0...@gmail.com wrote: Hi guys, We are making choices between C++ MPI and Spark. Is there any official comparation between them? Thanks a lot! Wei
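The JNA route Michael describes needs only an interface declaration per native function. A tiny sketch calling a libc function, purely to show the shape of it; a real wrapper would declare its own library and methods:

    import com.sun.jna.{Library, Native}

    trait CLib extends Library {
      def getpid(): Int   // any exported C function can be declared like this
    }

    object NativeDemo {
      def main(args: Array[String]): Unit = {
        // load libc and bind it to the trait above
        val libc = Native.loadLibrary("c", classOf[CLib]).asInstanceOf[CLib]
        println("pid from libc: " + libc.getpid())
      }
    }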
Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String
Hi, I'm trying to use Accumulo with Spark by writing to AccumuloOutputFormat. It went all well on my laptop (Accumulo MockInstance + Spark Local mode). But when I try to submit it to the yarn cluster, the yarn logs shows the following error message: 14/06/16 02:01:44 INFO cluster.YarnClientClusterScheduler: YarnClientClusterScheduler.postStartHook done Exception in thread main java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String([B)Ljava/lang/String; at org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.setConnectorInfo(ConfiguratorBase.java:127) at org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.setConnectorInfo(AccumuloOutputFormat.java:92) at com.paypal.rtgraph.demo.MapReduceWriter$.main(MapReduceWriter.scala:44) at com.paypal.rtgraph.demo.MapReduceWriter.main(MapReduceWriter.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Looks like Accumulo's dependency has got problems. Does Anyone know what's wrong with my code or settings? I've added all needed jars to spark's classpath. I confirmed that commons-codec-1.7.jar has been uploaded to hdfs. 14/06/16 04:36:02 INFO yarn.Client: Uploading file:/x/home/jianshuang/tmp/lib/commons-codec-1.7.jar to hdfs://manny-lvs/user/jianshuang/.sparkStaging/application_1401752249873_12662/commons-codec-1.7.jar And here's my spark-submit cmd (all JARs needed are concatenated after --jars): ~/spark/spark-1.0.0-bin-hadoop2/bin/spark-submit --name 'rtgraph' --class com.paypal.rtgraph.demo.Tables --master yarn --deploy-mode cluster --jars `find lib -type f | tr '\n' ','` --driver-memory 4G --driver-cores 4 --executor-memory 20G --executor-cores 8 --num-executors 2 rtgraph.jar I've tried both cluster mode and client mode and neither worked. BTW, I tried to use sbt-assembly to created a bundled jar, however I always got the following error: [error] (*:assembly) deduplicate: different file contents found in the following: [error] /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA [error] /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA [error] /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA [error] /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA I googled it and looks like I need to exclude some JARs. Anyone has done that? Your help is really appreciated. Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String
Hi, Check your driver program's Environment page (e.g. http://192.168.1.39:4040/environment/). If you don't see the commons-codec-1.7.jar jar there, then that's the issue. Thanks Best Regards On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I'm trying to use Accumulo with Spark by writing to AccumuloOutputFormat. It went all well on my laptop (Accumulo MockInstance + Spark Local mode). But when I try to submit it to the yarn cluster, the yarn logs show the following error message: [...]
Re: guidance on simple unit testing with Spark
If you don't want to refactor your code, you can put your input into a test file. After the test runs, read the data from the output file you specified (you probably want this to be a temp file, deleted on exit). Of course, that is not really a unit test - Matei's suggestion is preferable (this is how we test). However, if you have a long and complex flow, you might unit test different parts, and then have an integration test which reads from the files and tests the whole flow together (I do this as well). On Fri, Jun 13, 2014 at 10:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You need to factor your program so that it's not just a main(). This is not a Spark-specific issue, it's about how you'd unit test any program in general. In this case, your main() creates a SparkContext, so you can't pass one from outside, and your code has to read data from a file and write it to a file. It would be better to move your code for transforming data into a new function:

def processData(lines: RDD[String]): RDD[String] = {
  // build and return your "res" variable
}

Then you can unit-test this directly on data you create in your program:

val myLines = sc.parallelize(Seq("line 1", "line 2"))
val result = GetInfo.processData(myLines).collect()
assert(result.toSet === Set("res 1", "res 2"))

Matei On Jun 13, 2014, at 2:42 PM, SK skrishna...@gmail.com wrote: Hi, I have looked through some of the test examples and also the brief documentation on unit testing at http://spark.apache.org/docs/latest/programming-guide.html#unit-testing, but still don't have a good understanding of writing unit tests using the Spark framework. Previously, I have written unit tests using the specs2 framework and have got them to work in Scalding. I tried to use the specs2 framework with Spark, but could not find any simple examples I could follow. I am open to specs2 or FunSuite, whichever works best with Spark. I would like some additional guidance, or some simple sample code using specs2 or FunSuite. My code is provided below. I have the following code in src/main/scala/GetInfo.scala. It reads a JSON file and extracts some data. It takes the input file (args(0)) and output file (args(1)) as arguments.

object GetInfo {
  def main(args: Array[String]) {
    val inp_file = args(0)
    val conf = new SparkConf().setAppName("GetInfo")
    val sc = new SparkContext(conf)
    val res = sc.textFile(inp_file)
      .map(line => { parse(line) })
      .map(json => {
        implicit lazy val formats = org.json4s.DefaultFormats
        val aid = (json \ "d" \ "TypeID").extract[Int]
        val ts = (json \ "d" \ "TimeStamp").extract[Long]
        val gid = (json \ "d" \ "ID").extract[String]
        (aid, ts, gid)
      })
      .groupBy(tup => tup._3)
      .sortByKey(true)
      .map(g => (g._1, g._2.map(_._2).max))
    res.map(tuple => "%s, %d".format(tuple._1, tuple._2)).saveAsTextFile(args(1))
  }
}

I would like to test the above code. My unit test is in src/test/scala. The code I have so far for the unit test appears below:

import org.apache.spark._
import org.specs2.mutable._

class GetInfoTest extends Specification with java.io.Serializable {
  val data = List(
    ("d: {TypeID = 10, Timestamp: 1234, ID: ID1}"),
    ("d: {TypeID = 11, Timestamp: 5678, ID: ID1}"),
    ("d: {TypeID = 10, Timestamp: 1357, ID: ID2}"),
    ("d: {TypeID = 11, Timestamp: 2468, ID: ID2}")
  )
  val expected_out = List(
    ("ID1", 5678),
    ("ID2", 2468)
  )

  "A GetInfo job" should {
    // *** How do I pass the data defined above as the input, and the output, which GetInfo expects as arguments? ***
    val sc = new SparkContext("local", "GetInfo")

    // *** how do I get the output? ***
    // assuming out_buffer has the output, I want to match it to the expected output
    "match expected output" in {
      (out_buffer == expected_out) must beTrue
    }
  }
}

I would like some help with the tasks marked with *** in the unit test code above. If specs2 is not the right way to go, I am also open to FunSuite. I would like to know how to pass the input while calling my program from the unit test and get the output. Thanks for your help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/guidance-on-simple-unit-testing-with-Spark-tp7604.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
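Following Matei's suggestion, a minimal FunSuite sketch of the same test, assuming ScalaTest is on the test classpath and GetInfo has been refactored so that processData(lines: RDD[String]): RDD[String] holds the transformation; the input strings and expected output here are illustrative only:

    import org.apache.spark.SparkContext
    import org.scalatest.FunSuite

    class GetInfoSuite extends FunSuite {
      test("keeps the latest timestamp per ID") {
        val sc = new SparkContext("local", "GetInfoTest")   // a local context is enough for unit tests
        try {
          val input = sc.parallelize(Seq(
            """{"d": {"TypeID": 10, "TimeStamp": 1234, "ID": "ID1"}}""",
            """{"d": {"TypeID": 11, "TimeStamp": 5678, "ID": "ID1"}}"""))
          val result = GetInfo.processData(input).collect().toSet
          assert(result === Set("ID1, 5678"))
        } finally {
          sc.stop()   // stop the context so later tests can create their own
        }
      }
    }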
Re: long GC pause during file.cache()
Thanks you all for advice including (1) using CMS GC (2) use multiple worker instance and (3) use Tachyon. I will try (1) and (2) first and report back what I found. I will also try JDK 7 with G1 GC. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From: Aaron Davidson ilike...@gmail.com To: user@spark.apache.org, Date: 06/15/2014 09:06 PM Subject:Re: long GC pause during file.cache() Note also that Java does not work well with very large JVMs due to this exact issue. There are two commonly used workarounds: 1) Spawn multiple (smaller) executors on the same machine. This can be done by creating multiple Workers (via SPARK_WORKER_INSTANCES in standalone mode[1]). 2) Use Tachyon for off-heap caching of RDDs, allowing Spark executors to be smaller and avoid GC pauses [1] See standalone documentation here: http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts On Sun, Jun 15, 2014 at 3:50 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Yes, I think in the spark-env.sh.template, it is listed in the comments (didn’t check….) Best, -- Nan Zhu On Sunday, June 15, 2014 at 5:21 PM, Surendranauth Hiraman wrote: Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0? On Sun, Jun 15, 2014 at 4:59 PM, Nan Zhu zhunanmcg...@gmail.com wrote: SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t mind the WARNING in the logs you can set spark.executor.extraJavaOpts in your SparkConf obj Best, -- Nan Zhu On Sunday, June 15, 2014 at 12:13 PM, Hao Wang wrote: Hi, Wei You may try to set JVM opts in spark-env.sh as follow to prevent or mitigate GC pause: export SPARK_JAVA_OPTS=-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m There are more options you could add, please just Google :) Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com On Sun, Jun 15, 2014 at 10:24 AM, Wei Tan w...@us.ibm.com wrote: Hi, I have a single node (192G RAM) stand-alone spark, with memory configuration like this in spark-env.sh SPARK_WORKER_MEMORY=180g SPARK_MEM=180g In spark-shell I have a program like this: val file = sc.textFile(/localpath) //file size is 40G file.cache() val output = file.map(line = extract something from line) output.saveAsTextFile (...) When I run this program again and again, or keep trying file.unpersist() -- file.cache() -- output.saveAsTextFile(), the run time varies a lot, from 1 min to 3 min to 50+ min. Whenever the run-time is more than 1 min, from the stage monitoring GUI I observe big GC pause (some can be 10+ min). Of course when run-time is normal, say ~1 min, no significant GC is observed. The behavior seems somewhat random. Is there any JVM tuning I should do to prevent this long GC pause from happening? 
I used java-1.6.0-openjdk.x86_64, and my spark-shell process is something like this: root 10994 1.7 0.6 196378000 1361496 pts/51 Sl+ 22:06 0:12 /usr/lib/jvm/java-1.6.0-openjdk.x86_64/bin/java -cp ::/home/wtan/scala/spark-1.0.0-bin-hadoop1/conf:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-core-3.2.2.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-rdbms-3.2.1.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-api-jdo-3.2.1.jar -XX:MaxPermSize=128m -Djava.library.path= -Xms180g -Xmx180g org.apache.spark.deploy.SparkSubmit spark-shell --class org.apache.spark.repl.Main Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hira...@velos.io W: www.velos.io
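A concrete sketch of workaround (2), splitting one large box into several smaller standalone workers so that no single JVM heap is huge, together with the CMS flags from (1); the sizes and counts below are illustrative only, and SPARK_JAVA_OPTS is deprecated in 1.0 in favour of spark.executor.extraJavaOptions:

    # conf/spark-env.sh (standalone mode)
    export SPARK_WORKER_INSTANCES=4      # four workers on the one machine
    export SPARK_WORKER_CORES=8          # cores per worker
    export SPARK_WORKER_MEMORY=45g       # heap per worker instead of one 180g JVM

    # executor GC settings, e.g. in spark-defaults.conf or on the SparkConf:
    # spark.executor.extraJavaOptions  -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps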
Re: long GC pause during file.cache()
BTW: nowadays a single machine with huge RAM (200G to 1T) is really common. With virtualization you lose some performance. It would be ideal to see some best practice on how to use Spark in these state-of-art machines... Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan
pyspark regression results way off
Hi all, I am testing the regression methods (SGD) using pyspark. I tried to tune the parameters, but the results are far off from those obtained using R. Is there a better way to set these parameters? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: pyspark regression results way off
forgot to mention that I'm running spark 1.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7673.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Memory footprint of Calliope: Spark - Cassandra writes
Hi, I've been doing some testing with Calliope as a way to batch-load data from Spark into Cassandra. My initial results are promising on the performance side, but worrisome on the memory footprint side. I'm generating N records of about 50 bytes each and using the UPDATE mutator to insert them into C*. I get an OOM if my memory is below 1 GB per million records, or about 50 MB of raw data (without counting any RDD/structural overhead). (See code [1].) (So, to avoid confusion: I need 4 GB of RAM to save 4M 50-byte records to Cassandra.) That's an order of magnitude more than the raw data. I understand that Calliope builds on top of Cassandra's Hadoop support, which builds on top of SSTables and sstableloader. I would like to know what the memory usage factor of Calliope is and what parameters I could use to control/tune it. Any experience/advice on that? -kr, Gerard. [1] https://gist.github.com/maasg/68de6016bffe5e71b78c
RE: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression
Hi Xiangrui, Thank you for the reply! I have tried customizing LogisticRegressionSGD.optimizer as in the example you mentioned, but the source code reveals that the intercept is also penalized if one is included, which is usually inappropriate. The developer should fix this problem. Best, Congrui -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Friday, June 13, 2014 11:50 PM To: user@spark.apache.org Cc: user Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression 1. examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala contains example code that shows how to set regParam. 2. A static method with more than 3 parameters becomes hard to remember and hard to maintain. Please use LogistricRegressionWithSGD's default constructor and setters. -Xiangrui
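For reference, a short sketch of the constructor-plus-setters route Xiangrui refers to, assuming Spark 1.0 MLlib and training: RDD[LabeledPoint]; note that Congrui's caveat still applies, i.e. the current updaters also penalize the intercept when one is fit:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.optimization.L1Updater

    val algorithm = new LogisticRegressionWithSGD()   // default constructor
    algorithm.setIntercept(true)                      // also fit an intercept term
    algorithm.optimizer
      .setNumIterations(200)
      .setStepSize(1.0)
      .setRegParam(0.1)                               // regularization parameter
      .setUpdater(new L1Updater)                      // or SquaredL2Updater for L2
    val model = algorithm.run(training)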
Need some Streaming help
Like many people, I'm trying to do hourly counts. The twist is that I don't want to count per hour of streaming, but per hour of the actual occurrence of the event (wall clock, say yyyy-mm-dd HH). My thought is to make the streaming window large enough that a full hour of streaming data would fit inside it. Since my window slides in small increments, I want to drop the lowest hour from the stream before persisting the results (since it would have been reduced during the previous batch and would be a partial count in the current one). I have gotten this far. Every line of the input files is parsed into Event(type, hour), and stream is a DStream[Event]:

val evtCountsByHour = stream.map(evt => (evt, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(secondsInWindow))   // hourly counts per event
  .mapPartitions(iter => iter.map(x => (x._1.hour, x)))

My understanding is that at this point, the event counts are keyed by hour. 1. How do I detect the smallest key? I have seen some examples of partitionBy + mapPartitionsWithIndex and dropping the lowest index, but can't figure out how to do it with a DStream. My gut feeling is that the first RDD in the stream has to contain the oldest data, but that doesn't seem to be the case (printed from inside evtCountsByHour.foreachRDD). 2. If someone is further ahead with this type of problem, could you give some insight on how you approached it? -- I think Streaming would be the correct approach, since I don't really want to worry about data that was already processed and I want to process it continuously. I opted for reduceByKeyAndWindow with a large window as opposed to updateStateByKey, as the hour the event occurred in is part of the key and I don't care to keep that key around once the next hour's events are coming in (I'm assuming RDDs outside the window are considered unreferenced). But I'd love to hear other suggestions if my logic is off.
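A rough sketch of one way to persist only the hours fully covered by the window, assuming Event(eventType: String, hour: String), stream: DStream[Event], and counts keyed by (hour, eventType); all names are illustrative:

    val countsByHour = stream
      .map(evt => ((evt.hour, evt.eventType), 1))
      .reduceByKeyAndWindow(_ + _, Seconds(secondsInWindow))

    countsByHour.foreachRDD { rdd =>
      val hours = rdd.map { case ((hour, _), _) => hour }.distinct().collect()
      if (hours.nonEmpty) {
        val oldestHour = hours.min   // the partially covered hour at the trailing edge of the window
        val complete = rdd.filter { case ((hour, _), _) => hour != oldestHour }
        // persist `complete` here (database, HDFS, ...)
      }
    }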
Fwd: spark streaming questions
Hey, I am new to Spark Streaming and apologize if these questions have been asked. * In StreamingContext, reduceByKey() seems to only work on the RDDs of the current batch interval, not including RDDs of previous batches. Is my understanding correct? * If the above statement is correct, what functions should one use to do processing across the continuous stream of batches? I see two functions, reduceByKeyAndWindow and updateStateByKey, which serve this purpose. My use case is an aggregation and doesn't fit a windowing scenario. * As for updateStateByKey, I have a few questions. ** Over time, will Spark stage the original data somewhere to replay it in case of failures? Say the Spark job runs for weeks; I am wondering how that is sustained. ** Say my reduce key space is partitioned by some date field and I would like to stop processing old dates after a period of time (this is not simply a windowing scenario, as the date the data belongs to is not the same as the time the data arrives). How can I tell Spark to discard data for old dates? Thank you, Best Chen -- Chen Song
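On the last point, a rough sketch of the usual expiry trick: updateStateByKey drops a key as soon as the update function returns None, so state for stale dates can be expired by tracking a last-updated timestamp in the state itself. Here pairs is assumed to be a DStream[(String, Long)] keyed by date, and maxAgeMs is an application-chosen cutoff:

    val running = pairs.updateStateByKey[(Long, Long)] { (newValues: Seq[Long], state: Option[(Long, Long)]) =>
      val now = System.currentTimeMillis
      val (sum, lastSeen) = state.getOrElse((0L, now))
      if (newValues.isEmpty && now - lastSeen > maxAgeMs) {
        None   // the key has gone quiet: discard its state
      } else {
        Some((sum + newValues.sum, if (newValues.isEmpty) lastSeen else now))
      }
    }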
Worker dies while submitting a job
I'm playing with a modified version of the TwitterPopularTags example and when I tried to submit the job to my cluster, workers keep dying with this message: 14/06/16 17:11:16 INFO DriverRunner: Launch Command: java -cp /opt/spark-1.0.0-bin-hadoop1/work/driver-20140616171115-0014/spark-test-0.1-SNAPSHOT.jar:::/opt/spark-1.0.0-bin-hadoop1/conf:/opt/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar -XX:MaxPermSize=128m -Xms512M -Xmx512M org.apache.spark.deploy.worker.DriverWrapper akka.tcp://sparkWorker@int-spark-worker:51676/user/Worker org.apache.spark.examples.streaming.TwitterPopularTags 14/06/16 17:11:17 ERROR OneForOneStrategy: FAILED (of class scala.Enumeration$Val) scala.MatchError: FAILED (of class scala.Enumeration$Val) at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:317) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/06/16 17:11:17 INFO Worker: Starting Spark worker int-spark-app-ie005d6a3.mclabs.io:51676 with 2 cores, 6.5 GB RAM 14/06/16 17:11:17 INFO Worker: Spark home: /opt/spark-1.0.0-bin-hadoop1 14/06/16 17:11:17 INFO WorkerWebUI: Started WorkerWebUI at http://int-spark-app-ie005d6a3.mclabs.io:8081 14/06/16 17:11:17 INFO Worker: Connecting to master spark://int-spark-app-ie005d6a3.mclabs.io:7077... 14/06/16 17:11:17 ERROR Worker: Worker registration failed: Attempted to re-register worker at same address: akka.tcp:// sparkwor...@int-spark-app-ie005d6a3.mclabs.io:51676 This happens when the worker receive a DriverStateChanged(driverId, state, exception) message. To deploy the job I copied the jar file to the temporary folder of master node and execute the following command: ./spark-submit \ --class org.apache.spark.examples.streaming.TwitterPopularTags \ --master spark://int-spark-master:7077 \ --deploy-mode cluster \ file:///tmp/spark-test-0.1-SNAPSHOT.jar I don't really know what the problem could be as there is a 'case _' that should avoid that problem :S
Re: pyspark regression results way off
Is your data normalized? Sometimes GD doesn't work well if the data has a wide range. If you are willing to write Scala code, you can try the LBFGS optimizer, which converges better than GD. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Jun 16, 2014 at 8:14 AM, jamborta jambo...@gmail.com wrote: forgot to mention that I'm running spark 1.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7673.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Accessing the per-key state maintained by updateStateByKey for transformation of JavaPairDStream
Hello Spark Streaming Experts I have a use-case, where I have a bunch of log-entries coming in, say every 10 seconds (Batch-interval). I create a JavaPairDStream[K,V] from these log-entries. Now, there are two things I want to do with this JavaPairDStream: 1. Use key-dependent state (updated by updateStateByKey) to apply a transformation function on the JavaPairDStream[K, V]. I know that we get a JavaPairDStream[K, S] as return value of updateStateByKey. However, I can't possibly pass a JavaPairDStream to a transformation function, nor can I convert JavaPairDStream[K,S] to, let's say, a HashMap<K, S> (Or is there a way to do this?). Even if I could convert it to a HashMap<K, S>, could I really pass it to a transformation function, since this HashMap<K, S> changes after every batch computation? 2. Update key-dependent state using Iterable<V>: This should be easily doable using updateStateByKey. In a nutshell, how do I access the very state updated by updateStateByKey for applying, let's say, a map function on the JavaPairDStream[K,V]? Note that I am not using any sliding windows at all. Just plain batches. Thanks Gaurav -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-the-per-key-state-maintained-by-updateStateByKey-for-transformation-of-JavaPairDStream-tp7680.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
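One common answer is to avoid materializing the state as a HashMap at all: both the incoming pairs and the state produced by updateStateByKey are DStreams keyed by K with the same batch interval, so they can simply be joined each batch. A sketch in Scala (the Java API has the equivalent join and mapValues operations on JavaPairDStream); here pairs stands for the DStream[(String, String)] of log entries, and MyState, updateFunc and applyModel are placeholders for application code:

    val state: DStream[(String, MyState)] = pairs.updateStateByKey(updateFunc)   // the (K, S) stream

    // each record is paired with the current state for its key; no global HashMap is needed
    val transformed: DStream[(String, String)] =
      pairs.join(state).mapValues { case (v, s) => applyModel(v, s) }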
Re: pyspark serializer can't handle functions?
It’s true that it can’t. You can try to use the CloudPickle library instead, which is what we use within PySpark to serialize functions (see python/pyspark/cloudpickle.py). However I’m also curious, why do you need an RDD of functions? Matei On Jun 15, 2014, at 4:49 PM, madeleine madeleine.ud...@gmail.com wrote: It seems that the default serializer used by pyspark can't serialize a list of functions. I've seen some posts about trying to fix this by using dill to serialize rather than pickle. Does anyone know what the status of that project is, or whether there's another easy workaround? I've pasted a sample error message below. Here, regs is a function defined in another file myfile.py that has been included on all workers via the pyFiles argument to SparkContext: sc = SparkContext(local, myapp,pyFiles=[myfile.py]). File runfile.py, line 45, in __init__ regsRDD = sc.parallelize([regs]*self.n) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/context.py, line 223, in parallelize serializer.dump_stream(c, tempFile) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 182, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 118, in dump_stream self._write_with_length(obj, stream) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 128, in _write_with_length serialized = self.dumps(obj) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 270, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) cPickle.PicklingError: Can't pickle type 'function': attribute lookup __builtin__.function failed -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression
Someone is working on weighted regularization. Stay tuned. -Xiangrui On Mon, Jun 16, 2014 at 9:36 AM, FIXED-TERM Yi Congrui (CR/RTC1.3-NA) fixed-term.congrui...@us.bosch.com wrote: Hi Xiangrui, Thank you for the reply! I have tried customizing LogisticRegressionSGD.optimizer as in the example you mentioned, but the source code reveals that the intercept is also penalized if one is included, which is usually inappropriate. The developer should fix this problem. Best, Congrui -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Friday, June 13, 2014 11:50 PM To: user@spark.apache.org Cc: user Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression 1. examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala contains example code that shows how to set regParam. 2. A static method with more than 3 parameters becomes hard to remember and hard to maintain. Please use LogistricRegressionWithSGD's default constructor and setters. -Xiangrui
Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression
Hi Congrui, We're working on weighted regularization, so the regularization weight for the intercept can simply be set to 0. It's also useful when the data is normalized but you want to solve the regularization against the original data. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Jun 16, 2014 at 11:18 AM, Xiangrui Meng men...@gmail.com wrote: Someone is working on weighted regularization. Stay tuned. -Xiangrui
RE: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression
Thank you! I'm really looking forward to that. Best, Congrui -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Monday, June 16, 2014 11:19 AM To: user@spark.apache.org Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression Someone is working on weighted regularization. Stay tuned. -Xiangrui On Mon, Jun 16, 2014 at 9:36 AM, FIXED-TERM Yi Congrui (CR/RTC1.3-NA) fixed-term.congrui...@us.bosch.com wrote: Hi Xiangrui, Thank you for the reply! I have tried customizing LogisticRegressionSGD.optimizer as in the example you mentioned, but the source code reveals that the intercept is also penalized if one is included, which is usually inappropriate. The developer should fix this problem. Best, Congrui -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Friday, June 13, 2014 11:50 PM To: user@spark.apache.org Cc: user Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression 1. examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala contains example code that shows how to set regParam. 2. A static method with more than 3 parameters becomes hard to remember and hard to maintain. Please use LogistricRegressionWithSGD's default constructor and setters. -Xiangrui
Re: MLlib-a problem of example code for L-BFGS
Hi Congrui, I mean create your own TrainMLOR.scala with all the code provided in the example, and have it under package org.apache.spark.mllib Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Jun 13, 2014 at 1:50 PM, Congrui Yi fixed-term.congrui...@us.bosch.com wrote: Hi DB, Thank you for the help! I'm new to this, so could you give a bit more details how this could be done? Sincerely, Congrui Yi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-a-problem-of-example-code-for-L-BFGS-tp7589p7596.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression
Hi DB, Thank you for the reply! I'm looking forward to this change, which surely adds much more flexibility to the optimizers, including whether or not the intercept should be penalized. Sincerely, Congrui Yi From: DB Tsai-2 [via Apache Spark User List] Sent: Monday, June 16, 2014 11:31 AM Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Missing-Regularization-Parameter-and-Intercept-for-Logistic-Regression-tp7588p7687.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Is There Any Benchmarks Comparing C++ MPI with Spark
I guess you have to understand the difference of architecture. I don't know much about C++ MPI but it is basically MPI whereas Spark is inspired from Hadoop MapReduce and optimised for reading/writing large amount of data with a smart caching and locality strategy. Intuitively, if you have a high ratio CPU/message then MPI might be better. But what is the ratio is hard to say and in the end this ratio will depend on your specific application. Finally, in real life, this difference of performance due to the architecture may not be the only or the most important factor of choice like Michael already explained. Bertrand On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler mich...@tumra.com wrote: Hello Wei, I talk from experience of writing many HPC distributed application using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel Virtual Machine (PVM) way before that back in the 90's. I can say with absolute certainty: *Any gains you believe there are because C++ is faster than Java/Scala will be completely blown by the inordinate amount of time you spend debugging your code and/or reinventing the wheel to do even basic tasks like linear regression.* There are undoubtably some very specialised use-cases where MPI and its brethren still dominate for High Performance Computing tasks -- like for example the nuclear decay simulations run by the US Department of Energy on supercomputers where they've invested billions solving that use case. Spark is part of the wider Big Data ecosystem, and its biggest advantages are traction amongst internet scale companies, hundreds of developers contributing to it and a community of thousands using it. Need a distributed fault-tolerant file system? Use HDFS. Need a distributed/fault-tolerant message-queue? Use Kafka. Need to co-ordinate between your worker processes? Use Zookeeper. Need to run it on a flexible grid of computing resources and handle failures? Run it on Mesos! The barrier to entry to get going with Spark is very low, download the latest distribution and start the Spark shell. Language bindings for Scala / Java / Python are excellent meaning you spend less time writing boilerplate code, and more time solving problems. Even if you believe you *need* to use native code to do something specific, like fetching HD video frames from satellite video capture cards -- wrap it in a small native library and use the Java Native Access interface to call it from your Java/Scala code. Have fun, and if you get stuck we're here to help! MC On 16 June 2014 08:17, Wei Da xwd0...@gmail.com wrote: Hi guys, We are making choices between C++ MPI and Spark. Is there any official comparation between them? Thanks a lot! Wei
RE: MLlib-a problem of example code for L-BFGS
Thank you! I'll try it out. From: DB Tsai-2 [via Apache Spark User List] Sent: Monday, June 16, 2014 11:32 AM Subject: Re: MLlib-a problem of example code for L-BFGS -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-a-problem-of-example-code-for-L-BFGS-tp7589p7689.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark streaming, kafka, SPARK_CLASSPATH
Did you manage to make it work? I'm facing similar problems and this a serious blocker issue. spark-submit seems kind of broken to me if you can use it for spark-streaming. Regards, Luis 2014-06-11 1:48 GMT+01:00 lannyripple lanny.rip...@gmail.com: I am using Spark 1.0.0 compiled with Hadoop 1.2.1. I have a toy spark-streaming-kafka program. It reads from a kafka queue and does stream .map {case (k, v) = (v, 1)} .reduceByKey(_ + _) .print() using a 1 second interval on the stream. The docs say to make Spark and Hadoop jars 'provided' but this breaks for spark-streaming. Including spark-streaming (and spark-streaming-kafka) as 'compile' to sweep them into our assembly gives collisions on javax.* classes. To work around this I modified $SPARK_HOME/bin/compute-classpath.sh to include spark-streaming, spark-streaming-kafka, and zkclient. (Note that kafka is included as 'compile' in my project and picked up in the assembly.) I have set up conf/spark-env.sh as needed. I have copied my assembly to /tmp/myjar.jar on all spark hosts and to my hdfs /tmp/jars directory. I am running spark-submit from my spark master. I am guided by the information here https://spark.apache.org/docs/latest/submitting-applications.html Well at this point I was going to detail all the ways spark-submit fails to follow it's own documentation. If I do not invoke sparkContext.setJars() then it just fails to find the driver class. This is using various combinations of absolute path, file:, hdfs: (Warning: Skip remote jar)??, and local: prefixes on the application-jar and --jars arguments. If I invoke sparkContext.setJars() and include my assembly jar I get further. At this point I get a failure from kafka.consumer.ConsumerConnector not being found. I suspect this is because spark-streaming-kafka needs the Kafka dependency it but my assembly jar is too late in the classpath. At this point I try setting spark.files.userClassPathfirst to 'true' but this causes more things to blow up. I finally found something that works. Namely setting environment variable SPARK_CLASSPATH=/tmp/myjar.jar But silly me, this is deprecated and I'm helpfully informed to Please instead use: - ./spark-submit with --driver-class-path to augment the driver classpath - spark.executor.extraClassPath to augment the executor classpath which when put into a file and introduced with --properties-file does not work. (Also tried spark.files.userClassPathFirst here.) These fail with the kafka.consumer.ConsumerConnector error. At a guess what's going on is that using SPARK_CLASSPATH I have my assembly jar in the classpath at SparkSubmit invocation Spark Command: java -cp /tmp/myjar.jar::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.0.0-hadoop1.2.1.jar:/opt/spark/lib/spark-streaming_2.10-1.0.0.jar:/opt/spark/lib/spark-streaming-kafka_2.10-1.0.0.jar:/opt/spark/lib/zkclient-0.4.jar -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class me.KafkaStreamingWC /tmp/myjar.jar but using --properties-file then the assembly is not available for SparkSubmit. 
I think the root cause is either spark-submit not handling the spark-streaming libraries so they can be 'provided', or the inclusion of org.eclipse.jetty.orbit in the streaming libraries, which causes [error] (*:assembly) deduplicate: different file contents found in the following: [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA I've tried applying mergeStrategy in assembly for my assembly.sbt, but then I get "Invalid signature file digest for Manifest main attributes". If anyone knows the magic to get this working, a reply would be greatly appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafka-SPARK-CLASSPATH-tp7356.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
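For what it's worth, the deduplicate / "Invalid signature file digest" pair is usually resolved by discarding jar signature files rather than merging them. A sketch for build.sbt, assuming the sbt-assembly 0.11.x plugin (the setting's shape differs slightly across plugin versions):

    import sbtassembly.Plugin._
    import AssemblyKeys._

    assemblySettings

    // discard .RSA/.SF/.DSA signature entries so the jetty-orbit jars no longer collide
    mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
      {
        case x if x.toLowerCase.endsWith(".rsa") || x.toLowerCase.endsWith(".sf") || x.toLowerCase.endsWith(".dsa") =>
          MergeStrategy.discard
        case x => old(x)
      }
    }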
pyspark-Failed to run first
Hi All, I am just trying to compare the Scala and Python APIs on my local machine. I tried to import a local matrix (1000 by 10, created in R) stored in a text file via textFile in pyspark. When I run data.first() it fails to return the line and gives error messages including the following: Then I did nothing except change the number of rows to 500 and import the file again; data.first() then runs correctly. I also tried the same thing in Scala using spark-shell, which runs correctly for both cases and for larger matrices. Could somebody help me with this problem? I couldn't find an answer on the internet. It looks like pyspark has a problem with even this simplest of steps? Best, Congrui Yi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Failed-to-run-first-tp7691.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
No Intercept for Python
Hi everyone, The Python LogisticRegressionWithSGD does not appear to estimate an intercept. When I run the following, the returned weights and intercept are both 0.0:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

def main():
    sc = SparkContext(appName="NoIntercept")
    train = sc.parallelize([LabeledPoint(0, [0]), LabeledPoint(1, [0]), LabeledPoint(1, [0])])
    model = LogisticRegressionWithSGD.train(train, iterations=500, step=0.1)
    print "Final weights: " + str(model.weights)
    print "Final intercept: " + str(model.intercept)

if __name__ == "__main__":
    main()

Of course, one can fit an intercept with the simple expedient of adding a column of ones, but that's kind of annoying. Moreover, it looks like the Scala version has an intercept option. Am I missing something? Should I just add the column of ones? If I submitted a PR doing that, is that the sort of thing you guys would accept? Thanks! :-) Naftali
Re: Master not seeing recovered nodes(Got heartbeat from unregistered worker ....)
We are having the same problem. We're running Spark 0.9.1 in standalone mode, and on some heavy jobs workers become unresponsive and are marked by the master as dead, even though the worker process is still running. Then they never join the cluster again, and the cluster becomes essentially unusable until we restart each worker. We'd like to know: 1. Why can a worker become unresponsive? Are there any well-known config / usage pitfalls that we could have fallen into? We're still investigating the issue, but maybe there are some hints? 2. Is there an option to auto-recover a worker, e.g. automatically start a new one if the old one failed, or at least some hooks to implement functionality like that? Thanks, Piotr 2014-06-13 22:58 GMT+02:00 Gino Bustelo g...@bustelos.com: I get the same problem, but I'm running in a dev environment based on docker scripts. The additional issue is that the worker processes do not die and so the docker container does not exit. So I end up with worker containers that are not participating in the cluster. On Fri, Jun 13, 2014 at 9:44 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: I have also had trouble in worker joining the working set. I have typically moved to Mesos based setup. Frankly for high availability you are better off using a cluster manager. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 13, 2014 at 8:57 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, I see this has been asked before but has not gotten any satisfactory answer so I'll try again: (here is the original thread I found: http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E ) I have a set of workers dying and coming back again. The master prints the following warning: Got heartbeat from unregistered worker What is the solution to this -- rolling the master is very undesirable to me as I have a Shark context sitting on top of it (it's meant to be highly available). Insights appreciated -- I don't think an executor going down is very unexpected but it does seem odd that it won't be able to rejoin the working set. I'm running Spark 0.9.1 on CDH -- Piotr Kolaczkowski, Lead Software Engineer pkola...@datastax.com http://www.datastax.com/ 777 Mariners Island Blvd., Suite 510 San Mateo, CA 94404
Re: pyspark serializer can't handle functions?
Interesting! I'm curious why you use cloudpickle internally, but then use standard pickle to serialize RDDs? I'd like to create an RDD of functions because (I think) it's the most natural way to express my problem. I have a matrix of functions; I'm trying to find a low rank matrix that minimizes the sum of these functions evaluated on the entries on the low rank matrix. For example, the problem is PCA on the matrix A when the (i,j)th function is lambda z: (z-A[i,j])^2. In general, each of these functions is defined using a two argument base function lambda a,z: (z-a)^2 and the data A[i,j]; but it's somewhat cleaner just to express the minimization problem in terms of the one argument functions. One other wrinkle is that I'm using alternating minimization, so I'll be minimizing over the rows and columns of this matrix at alternating steps; hence I need to store both the matrix and its transpose to avoid data thrashing. On Mon, Jun 16, 2014 at 11:05 AM, Matei Zaharia [via Apache Spark User List] ml-node+s1001560n7682...@n3.nabble.com wrote: It’s true that it can’t. You can try to use the CloudPickle library instead, which is what we use within PySpark to serialize functions (see python/pyspark/cloudpickle.py). However I’m also curious, why do you need an RDD of functions? Matei On Jun 15, 2014, at 4:49 PM, madeleine [hidden email] http://user/SendEmail.jtp?type=nodenode=7682i=0 wrote: It seems that the default serializer used by pyspark can't serialize a list of functions. I've seen some posts about trying to fix this by using dill to serialize rather than pickle. Does anyone know what the status of that project is, or whether there's another easy workaround? I've pasted a sample error message below. Here, regs is a function defined in another file myfile.py that has been included on all workers via the pyFiles argument to SparkContext: sc = SparkContext(local, myapp,pyFiles=[myfile.py]). File runfile.py, line 45, in __init__ regsRDD = sc.parallelize([regs]*self.n) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/context.py, line 223, in parallelize serializer.dump_stream(c, tempFile) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 182, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 118, in dump_stream self._write_with_length(obj, stream) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 128, in _write_with_length serialized = self.dumps(obj) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 270, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) cPickle.PicklingError: Can't pickle type 'function': attribute lookup __builtin__.function failed -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- If you reply to this email, your message will be added to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650p7682.html To unsubscribe from pyspark serializer can't handle functions?, click here http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=7650code=bWFkZWxlaW5lLnVkZWxsQGdtYWlsLmNvbXw3NjUwfC0yMDUyNTU5NTk5 . 
-- Madeleine Udell PhD Candidate in Computational and Mathematical Engineering Stanford University www.stanford.edu/~udell -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650p7694.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Can't get Master Kerberos principal for use as renewer
Hi, I'm a new user to Spark and I'm trying to integrate it in my cluster. It's a small set of nodes running CDH 4.7 with kerberos. The other services are fine with the authentication but I've some troubles with spark. First, I used the parcel available in cloudera manager (SPARK 0.9.0-1.cdh4.6.0.p0.98) Since the cluster has CDH4.7 (not 4.6) I'm not sure if this can create problems. I've also tried with the new spark 1.0.0 with no luck ... I've configured the environment as reported in http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html I'm using a standalone deployment. When launching spark-shell (for testing), everything seems fine (the process got registered with master) But when I try to execute the example reported in the installation page, Kerberos blocks the access to HDFS scala val file = sc.textFile(hdfs:// m1hadoop.polito.it:8020/user/finamore/data) 14/06/16 22:28:36 INFO storage.MemoryStore: ensureFreeSpace(135653) called with curMem=0, maxMem=308713881 14/06/16 22:28:36 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 132.5 KB, free 294.3 MB) file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at console:12 scala val counts = file.flatMap(line = line.split( )).map(word = (word, 1)).reduceByKey(_ + _) java.io.IOException: Can't get Master Kerberos principal for use as renewer at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:187) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:251) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58) at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:354) at $iwC$$iwC$$iwC$$iwC.init(console:14) at $iwC$$iwC$$iwC.init(console:19) at $iwC$$iwC.init(console:21) at $iwC.init(console:23) at init(console:25) at .init(console:29) at .clinit(console) at .init(console:7) at .clinit(console) 
at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876) at
Set comparison
Hi, I have a Spark method that returns RDD[String], which I am converting to a set and then comparing to the expected output as shown in the following code. 1. val expected_res = Set("ID1", "ID2", "ID3") // expected output 2. val result: RDD[String] = getData(input) // method returns RDD[String] 3. val set_val = result.collect().toSet // convert to set 4. println(set_val) // prints: Set("ID1", "ID2", "ID3") 5. println(expected_res) // prints: Set(ID1, ID2, ID3) // verify output 6. if (set_val == expected_res) 7. println(true) // this does not get printed The value returned by the method is almost the same as the expected output, but the verification is failing. I am not sure why the expected_res in Line 5 does not print the quotes even though Line 1 has them. Could that be the reason the comparison is failing? What is the right way to do the above comparison? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark master UI does not keep detailed application history
Are you referring to accessing the SparkUI for an application that has finished? First, you need to enable event logging while the application is still running. In Spark 1.0, you set this by adding a line to $SPARK_HOME/conf/spark-defaults.conf: spark.eventLog.enabled true Other than that, the content served on the master UI is largely the same as it was before 1.0 was introduced. 2014-06-14 16:43 GMT-07:00 wxhsdp wxh...@gmail.com: hi, zhen I met the same problem on ec2: application details cannot be accessed, but I can read stdout and stderr. The problem has not been solved yet -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-master-UI-does-not-keep-detailed-application-history-tp7608p7635.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Set comparison
On Mon, Jun 16, 2014 at 10:09 PM, SK skrishna...@gmail.com wrote: The value returned by the method is almost the same as the expected output, but the verification is failing. I am not sure why the expected_res in Line 5 does not print the quotes even though Line 1 has them. Could that be the reason the comparison is failing? What is the right way to do the above comparison? Most likely because the strings in set_val do actually start and end with double-quotes? That is just what the output suggests. It has a string like "ID1" (with literal quote characters), not ID1.
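To make the diagnosis above concrete, a small Scala sketch (values are illustrative) of what the comparison sees when the collected strings still carry literal quote characters:

    // What collect() apparently returned: strings whose first and last
    // characters are literal double quotes.
    val fromRdd  = Set("\"ID1\"", "\"ID2\"", "\"ID3\"")
    val expected = Set("ID1", "ID2", "ID3")

    println(fromRdd)             // Set("ID1", "ID2", "ID3")
    println(expected)            // Set(ID1, ID2, ID3)
    println(fromRdd == expected) // false

    // Stripping the quotes before comparing is one way to express the intent:
    val cleaned = fromRdd.map(_.stripPrefix("\"").stripSuffix("\""))
    println(cleaned == expected) // true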
Re: Set comparison
In Line 1, I have expected_res as a set of strings with quotes, so I thought it would include the quotes during the comparison. Anyway, I modified it to expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and that seems to work. thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696p7699.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: pyspark serializer can't handle functions?
Ah, I see, interesting. CloudPickle is slower than the cPickle library, so that’s why we didn’t use it for data, but it should be possible to write a Serializer that uses it. Another thing you can do for this use case though is to define a class that represents your functions: class MyFunc(object): def __call__(self, argument): return argument f = MyFunc() f(5) Instances of a class like this should be pickle-able using the standard pickle serializer, though you may have to put the class in a separate .py file and include that in the list of .py files passed to SparkContext. And then in the code you can still use them as functions. Matei On Jun 16, 2014, at 1:12 PM, madeleine madeleine.ud...@gmail.com wrote: Interesting! I'm curious why you use cloudpickle internally, but then use standard pickle to serialize RDDs? I'd like to create an RDD of functions because (I think) it's the most natural way to express my problem. I have a matrix of functions; I'm trying to find a low rank matrix that minimizes the sum of these functions evaluated on the entries on the low rank matrix. For example, the problem is PCA on the matrix A when the (i,j)th function is lambda z: (z-A[i,j])^2. In general, each of these functions is defined using a two argument base function lambda a,z: (z-a)^2 and the data A[i,j]; but it's somewhat cleaner just to express the minimization problem in terms of the one argument functions. One other wrinkle is that I'm using alternating minimization, so I'll be minimizing over the rows and columns of this matrix at alternating steps; hence I need to store both the matrix and its transpose to avoid data thrashing. On Mon, Jun 16, 2014 at 11:05 AM, Matei Zaharia [via Apache Spark User List] [hidden email] wrote: It’s true that it can’t. You can try to use the CloudPickle library instead, which is what we use within PySpark to serialize functions (see python/pyspark/cloudpickle.py). However I’m also curious, why do you need an RDD of functions? Matei On Jun 15, 2014, at 4:49 PM, madeleine [hidden email] wrote: It seems that the default serializer used by pyspark can't serialize a list of functions. I've seen some posts about trying to fix this by using dill to serialize rather than pickle. Does anyone know what the status of that project is, or whether there's another easy workaround? I've pasted a sample error message below. Here, regs is a function defined in another file myfile.py that has been included on all workers via the pyFiles argument to SparkContext: sc = SparkContext(local, myapp,pyFiles=[myfile.py]). 
File runfile.py, line 45, in __init__ regsRDD = sc.parallelize([regs]*self.n) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/context.py, line 223, in parallelize serializer.dump_stream(c, tempFile) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 182, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 118, in dump_stream self._write_with_length(obj, stream) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 128, in _write_with_length serialized = self.dumps(obj) File /Applications/spark-0.9.1-bin-hadoop2/python/pyspark/serializers.py, line 270, in dumps def dumps(self, obj): return cPickle.dumps(obj, 2) cPickle.PicklingError: Can't pickle type 'function': attribute lookup __builtin__.function failed -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650.html Sent from the Apache Spark User List mailing list archive at Nabble.com. If you reply to this email, your message will be added to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-serializer-can-t-handle-functions-tp7650p7682.html To unsubscribe from pyspark serializer can't handle functions?, click here. NAML -- Madeleine Udell PhD Candidate in Computational and Mathematical Engineering Stanford University www.stanford.edu/~udell View this message in context: Re: pyspark serializer can't handle functions? Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Need help. Spark + Accumulo = Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String
With the help from the Accumulo guys, I probably know why. I'm using the binary distro of Spark and Base64 is from spark-assembly.jar and it probably uses an older version of commons-codec. I'll need to reinstall spark from source. Jianshi On Mon, Jun 16, 2014 at 9:18 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Check in your driver programs Environment, (eg: http://192.168.1.39:4040/environment/). If you don't see this commons-codec-1.7.jar jar then that's the issue. Thanks Best Regards On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I'm trying to use Accumulo with Spark by writing to AccumuloOutputFormat. It went all well on my laptop (Accumulo MockInstance + Spark Local mode). But when I try to submit it to the yarn cluster, the yarn logs shows the following error message: 14/06/16 02:01:44 INFO cluster.YarnClientClusterScheduler: YarnClientClusterScheduler.postStartHook done Exception in thread main java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String([B)Ljava/lang/String; at org.apache.accumulo.core.client.mapreduce.lib.impl.ConfiguratorBase.setConnectorInfo(ConfiguratorBase.java:127) at org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.setConnectorInfo(AccumuloOutputFormat.java:92) at com.paypal.rtgraph.demo.MapReduceWriter$.main(MapReduceWriter.scala:44) at com.paypal.rtgraph.demo.MapReduceWriter.main(MapReduceWriter.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Looks like Accumulo's dependency has got problems. Does Anyone know what's wrong with my code or settings? I've added all needed jars to spark's classpath. I confirmed that commons-codec-1.7.jar has been uploaded to hdfs. 14/06/16 04:36:02 INFO yarn.Client: Uploading file:/x/home/jianshuang/tmp/lib/commons-codec-1.7.jar to hdfs://manny-lvs/user/jianshuang/.sparkStaging/application_1401752249873_12662/commons-codec-1.7.jar And here's my spark-submit cmd (all JARs needed are concatenated after --jars): ~/spark/spark-1.0.0-bin-hadoop2/bin/spark-submit --name 'rtgraph' --class com.paypal.rtgraph.demo.Tables --master yarn --deploy-mode cluster --jars `find lib -type f | tr '\n' ','` --driver-memory 4G --driver-cores 4 --executor-memory 20G --executor-cores 8 --num-executors 2 rtgraph.jar I've tried both cluster mode and client mode and neither worked. 
BTW, I tried to use sbt-assembly to create a bundled jar; however, I always got the following error: [error] (*:assembly) deduplicate: different file contents found in the following: [error] /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA [error] /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA [error] /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA [error] /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA I googled it and it looks like I need to exclude some JARs. Has anyone done that? Your help is really appreciated. Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
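One commonly suggested way out of these deduplicate errors (a sketch only, assuming the sbt-assembly plugin; the key name differs between plugin versions) is to discard the conflicting META-INF signature files in the merge strategy rather than excluding whole JARs:

    // build.sbt sketch: signature files (ECLIPSEF.RSA, *.SF, *.DSA) cannot be
    // merged meaningfully into a fat jar, so discard them. Leaving them in is
    // also what typically produces "Invalid signature file digest for
    // Manifest main attributes" at runtime.
    // Note: discarding all of META-INF is blunt; narrow the pattern if you
    // rely on META-INF/services entries.
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }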
spark with docker: errors with akka, NAT?
Hi Folks, I am having trouble getting spark driver running in docker. If I run a pyspark example on my mac it works but the same example on a docker image (Via boot2docker) fails with following logs. I am pointing the spark driver (which is running the example) to a spark cluster (driver is not part of the cluster). I guess this has something to do with docker's networking stack (it may be getting NAT'd) but I am not sure why (if at all) the spark-worker or spark-master is trying to create a new TCP connection to the driver, instead of responding on the connection initiated by the driver. I would appreciate any help in figuring this out. Thanks, Mohit. logs Spark Executor Command: java -cp ::/home/ayasdi/spark/conf:/home//spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar -Xms2g -Xmx2g -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler 1 cobalt 24 akka.tcp://sparkWorker@:33952/user/Worker app-20140616152201-0021 log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. 14/06/16 15:22:05 INFO SparkHadoopUtil: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/06/16 15:22:05 INFO SecurityManager: Changing view acls to: ayasdi,root 14/06/16 15:22:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(xxx, xxx) 14/06/16 15:22:05 INFO Slf4jLogger: Slf4jLogger started 14/06/16 15:22:05 INFO Remoting: Starting remoting 14/06/16 15:22:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@:33536] 14/06/16 15:22:06 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@:33536] 14/06/16 15:22:06 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler 14/06/16 15:22:06 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@:33952/user/Worker 14/06/16 15:22:06 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://spark@fc31887475e3:43921]. Address is now gated for 6 ms, all messages to this address will be delivered to dead letters. 14/06/16 15:22:06 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@:33536] - [akka.tcp://spark@fc31887475e3:43921] disassociated! Shutting down.
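On the question of why a new connection is made: the log above shows the executor itself connecting back to the driver ("Connecting to driver: akka.tcp://spark@fc31887475e3:43921"), so a NAT'd driver address like the container hostname fc31887475e3 is unreachable from the workers. A possible direction, sketched below with purely illustrative values (not a verified fix), is to make the driver advertise a routable address and a fixed, published port:

    // Sketch only; spark.driver.host / spark.driver.port are standard Spark
    // settings, but the actual reachable address and port mapping depend on
    // your docker / boot2docker network setup (assumptions, not verified).
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("docker-driver")
      .setMaster("spark://spark-master:7077")     // illustrative master URL
      .set("spark.driver.host", "192.168.59.3")   // an address the workers can route to
      .set("spark.driver.port", "43921")          // fixed port, published from the container
    val sc = new SparkContext(conf)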
Re: Is There Any Benchmarks Comparing C++ MPI with Spark
Spark gives you four of the classical collectives: broadcast, reduce, scatter, and gather. There are also a few additional primitives, mostly based on a join. Spark is certainly less optimized than MPI for these, but maybe that isn't such a big deal. Spark has one theoretical disadvantage compared to MPI: every collective operation requires the task closures to be distributed, and---to my knowledge---this is an O(p) operation. (Perhaps there has been some progress on this??) That O(p) term spoils any parallel isoefficiency analysis. In MPI, binaries are distributed once, and wireup is a O(log p). In practice, it prevents reasonable-looking strong scaling curves; with MPI, the overall runtime will stop declining and level off with increasing p, but with Spark it can go up sharply. So, Spark is great for a small cluster. For a huge cluster, or a job with a lot of collectives, it isn't so great. On Mon, Jun 16, 2014 at 1:36 PM, Bertrand Dechoux decho...@gmail.com wrote: I guess you have to understand the difference of architecture. I don't know much about C++ MPI but it is basically MPI whereas Spark is inspired from Hadoop MapReduce and optimised for reading/writing large amount of data with a smart caching and locality strategy. Intuitively, if you have a high ratio CPU/message then MPI might be better. But what is the ratio is hard to say and in the end this ratio will depend on your specific application. Finally, in real life, this difference of performance due to the architecture may not be the only or the most important factor of choice like Michael already explained. Bertrand On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler mich...@tumra.com wrote: Hello Wei, I talk from experience of writing many HPC distributed application using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel Virtual Machine (PVM) way before that back in the 90's. I can say with absolute certainty: *Any gains you believe there are because C++ is faster than Java/Scala will be completely blown by the inordinate amount of time you spend debugging your code and/or reinventing the wheel to do even basic tasks like linear regression.* There are undoubtably some very specialised use-cases where MPI and its brethren still dominate for High Performance Computing tasks -- like for example the nuclear decay simulations run by the US Department of Energy on supercomputers where they've invested billions solving that use case. Spark is part of the wider Big Data ecosystem, and its biggest advantages are traction amongst internet scale companies, hundreds of developers contributing to it and a community of thousands using it. Need a distributed fault-tolerant file system? Use HDFS. Need a distributed/fault-tolerant message-queue? Use Kafka. Need to co-ordinate between your worker processes? Use Zookeeper. Need to run it on a flexible grid of computing resources and handle failures? Run it on Mesos! The barrier to entry to get going with Spark is very low, download the latest distribution and start the Spark shell. Language bindings for Scala / Java / Python are excellent meaning you spend less time writing boilerplate code, and more time solving problems. Even if you believe you *need* to use native code to do something specific, like fetching HD video frames from satellite video capture cards -- wrap it in a small native library and use the Java Native Access interface to call it from your Java/Scala code. Have fun, and if you get stuck we're here to help! 
MC On 16 June 2014 08:17, Wei Da xwd0...@gmail.com wrote: Hi guys, We are making choices between C++ MPI and Spark. Is there any official comparation between them? Thanks a lot! Wei
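For readers less familiar with how the collectives mentioned above surface in the Spark API, a small illustrative Scala sketch (names and data are made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("collectives").setMaster("local[*]"))
    val data = sc.parallelize(1 to 1000, 8)        // "scatter" happens via partitioning

    val lookup = sc.broadcast(Map(0 -> "even", 1 -> "odd"))   // broadcast
    val total  = data.reduce(_ + _)                            // reduce
    val sample = data.filter(_ % 100 == 0).collect()           // "gather" back to the driver

    println(s"total=$total sample=${sample.mkString(",")} tag=${lookup.value(total % 2)}")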
Spark sql unable to connect to db2 hive metastore
Hi, my hive configuration use db2 as it's metastore database, I have built spark with the extra step sbt/sbt assembly/assembly to include the dependency jars. and copied HIVE_HOME/conf/hive-site.xml under spark/conf. when I ran : hql(CREATE TABLE IF NOT EXISTS src (key INT, value STRING)) got following exception, pasted portion of the stack trace here, looking at the stack, this made me wondering if Spark supports remote metastore configuration, it seems spark doesn't talk to hiveserver2 directly? the driver jars: db2jcc-10.5.jar, db2jcc_license_cisuz-10.5.jar both are included in the classpath, otherwise, it will complain it couldn't find the driver. Appreciate any help to resolve it. Thanks! Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:db2://localhost:50001/BIDB, username = catalog. Terminating connection pool. Original Exception: -- java.sql.SQLException: No suitable driver at java.sql.DriverManager.getConnection(DriverManager.java:422) at java.sql.DriverManager.getConnection(DriverManager.java:374) at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:254) at com.jolbox.bonecp.BoneCP.init(BoneCP.java:305) at com.jolbox.bonecp.BoneCPDataSource.maybeInit(BoneCPDataSource.java:150) at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:112) at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:479) at org.datanucleus.store.rdbms.RDBMSStoreManager.init(RDBMSStoreManager.java:304) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:56) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:39) at java.lang.reflect.Constructor.newInstance(Constructor.java:527) at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1069) at org.datanucleus.NucleusContext.initialise(NucleusContext.java:359) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:768) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:326) at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:195) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:611) at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965) at java.security.AccessController.doPrivileged(AccessController.java:277) at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960) at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:275) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:304) at 
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:234) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:209) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.hive.metastore.RetryingRawStore.init(RetryingRawStore.java:64) at org.apache.hadoop.hive.metastore.RetryingRawStore.getProxy(RetryingRawStore.java:73) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:415) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:402) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:286) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.init(RetryingHMSHandler.java:54) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
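One thing worth checking for the "No suitable driver" part (a guess, not a confirmed fix): make sure the DB2 JDBC driver class is actually loadable and registered in the JVM that creates the HiveContext, and that hive-site.xml names it explicitly. A small sketch, assuming the standard DB2 driver class name:

    // Sketch: fails fast if db2jcc is not really on the driver's classpath,
    // and registers the driver with java.sql.DriverManager so the
    // DriverManager.getConnection call made by the connection pool can find it.
    // hive-site.xml would also set, e.g.:
    //   javax.jdo.option.ConnectionDriverName = com.ibm.db2.jcc.DB2Driver
    Class.forName("com.ibm.db2.jcc.DB2Driver")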
Re: spark streaming, kafka, SPARK_CLASSPATH
+1 for this issue. Documentation for spark-submit are misleading. Among many issues, the jar support is bad. HTTP urls do not work. This is because spark is using hadoop's FileSystem class. You have to specify the jars twice to get things to work. Once for the DriverWrapper to laid your classes and a 2nd time in the Context to distribute to workers. I would like to see some contrib response to this issue. Gino B. On Jun 16, 2014, at 1:49 PM, Luis Ángel Vicente Sánchez langel.gro...@gmail.com wrote: Did you manage to make it work? I'm facing similar problems and this a serious blocker issue. spark-submit seems kind of broken to me if you can use it for spark-streaming. Regards, Luis 2014-06-11 1:48 GMT+01:00 lannyripple lanny.rip...@gmail.com: I am using Spark 1.0.0 compiled with Hadoop 1.2.1. I have a toy spark-streaming-kafka program. It reads from a kafka queue and does stream .map {case (k, v) = (v, 1)} .reduceByKey(_ + _) .print() using a 1 second interval on the stream. The docs say to make Spark and Hadoop jars 'provided' but this breaks for spark-streaming. Including spark-streaming (and spark-streaming-kafka) as 'compile' to sweep them into our assembly gives collisions on javax.* classes. To work around this I modified $SPARK_HOME/bin/compute-classpath.sh to include spark-streaming, spark-streaming-kafka, and zkclient. (Note that kafka is included as 'compile' in my project and picked up in the assembly.) I have set up conf/spark-env.sh as needed. I have copied my assembly to /tmp/myjar.jar on all spark hosts and to my hdfs /tmp/jars directory. I am running spark-submit from my spark master. I am guided by the information here https://spark.apache.org/docs/latest/submitting-applications.html Well at this point I was going to detail all the ways spark-submit fails to follow it's own documentation. If I do not invoke sparkContext.setJars() then it just fails to find the driver class. This is using various combinations of absolute path, file:, hdfs: (Warning: Skip remote jar)??, and local: prefixes on the application-jar and --jars arguments. If I invoke sparkContext.setJars() and include my assembly jar I get further. At this point I get a failure from kafka.consumer.ConsumerConnector not being found. I suspect this is because spark-streaming-kafka needs the Kafka dependency it but my assembly jar is too late in the classpath. At this point I try setting spark.files.userClassPathfirst to 'true' but this causes more things to blow up. I finally found something that works. Namely setting environment variable SPARK_CLASSPATH=/tmp/myjar.jar But silly me, this is deprecated and I'm helpfully informed to Please instead use: - ./spark-submit with --driver-class-path to augment the driver classpath - spark.executor.extraClassPath to augment the executor classpath which when put into a file and introduced with --properties-file does not work. (Also tried spark.files.userClassPathFirst here.) These fail with the kafka.consumer.ConsumerConnector error. 
At a guess, what's going on is that using SPARK_CLASSPATH I have my assembly jar in the classpath at SparkSubmit invocation: Spark Command: java -cp /tmp/myjar.jar::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.0.0-hadoop1.2.1.jar:/opt/spark/lib/spark-streaming_2.10-1.0.0.jar:/opt/spark/lib/spark-streaming-kafka_2.10-1.0.0.jar:/opt/spark/lib/zkclient-0.4.jar -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class me.KafkaStreamingWC /tmp/myjar.jar but when using --properties-file the assembly is not available for SparkSubmit. I think the root cause is either spark-submit not handling the spark-streaming libraries so they can be 'provided', or the inclusion of org.eclipse.jetty.orbit in the streaming libraries, which causes [error] (*:assembly) deduplicate: different file contents found in the following: [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA I've tried applying mergeStrategy in assembly for my assembly.sbt but then I get Invalid signature file digest for Manifest main attributes If anyone knows the magic to get this working, a reply would be greatly appreciated. -- View this message in context:
Re: Set comparison
If you want a string containing quotes, you have to escape them with '\'. That's exactly what you did in the modified version. Sent from my iPhone On 2014年6月17日, at 5:43, SK skrishna...@gmail.com wrote: In Line 1, I have expected_res as a set of strings with quotes. So I thought it would include the quotes during comparison. Anyway I modified expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and that seems to work. thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-comparison-tp7696p7699.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
akka.FrameSize
Hi all, I have run into a very interesting bug which is not exactly the same as SPARK-1112. Here is how to reproduce it: I have one input csv file and use the partitionBy function to create an RDD, say repartitionedRDD. partitionBy takes the number of partitions as a parameter, so we can easily vary the serialized size per partition for the following experiments. At the end, I simply call repartitionedRDD.collect(). 1) spark.akka.frameSize = 10: If one of the partition sizes is very close to 10MB, say 9.97MB, the execution blocks without any exception or warning. The worker finishes the task and sends the serialized result, then throws an exception saying the hadoop IPC client connection stops (with logging at debug level). However, the master never receives the results and the program just hangs. But if the sizes of all partitions are less than some number between 9.96MB and 9.97MB, the program works fine. 2) spark.akka.frameSize = 9: when the partition size is just a little bit smaller than 9MB, it fails as well. This behavior is not exactly what SPARK-1112 is about; could you please guide me on how to open a separate bug for the case where the serialized size is very close to 10MB? I googled around and haven't found anything that relates to the behavior we have found. Any insights or suggestions would be greatly appreciated. Thanks! :-) -chen
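A rough sketch of the setup described above, for anyone trying to reproduce it (the file name, key choice and partition count here are illustrative, not taken from the report):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD functions such as partitionBy

    val conf = new SparkConf()
      .setAppName("frameSize-repro")
      .set("spark.akka.frameSize", "10")     // MB; the limit being probed here

    val sc = new SparkContext(conf)

    // Key the rows and repartition; fewer partitions means a larger serialized
    // result per partition, letting one partition approach the frame size.
    val rows = sc.textFile("input.csv").map(line => (line.hashCode, line))
    val repartitionedRDD = rows.partitionBy(new HashPartitioner(4))

    repartitionedRDD.collect()               // reported to hang when a partition is ~10MB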
Re: spark with docker: errors with akka, NAT?
Hi, are you using the amplab/spark-1.0.0 images from the global registry? Andre On 06/17/2014 01:36 AM, Mohit Jaggi wrote: Hi Folks, I am having trouble getting spark driver running in docker. If I run a pyspark example on my mac it works but the same example on a docker image (Via boot2docker) fails with following logs. I am pointing the spark driver (which is running the example) to a spark cluster (driver is not part of the cluster). I guess this has something to do with docker's networking stack (it may be getting NAT'd) but I am not sure why (if at all) the spark-worker or spark-master is trying to create a new TCP connection to the driver, instead of responding on the connection initiated by the driver. I would appreciate any help in figuring this out. Thanks, Mohit. logs Spark Executor Command: java -cp ::/home/ayasdi/spark/conf:/home//spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar -Xms2g -Xmx2g -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler 1 cobalt 24 akka.tcp://sparkWorker@:33952/user/Worker app-20140616152201-0021 log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. 14/06/16 15:22:05 INFO SparkHadoopUtil: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/06/16 15:22:05 INFO SecurityManager: Changing view acls to: ayasdi,root 14/06/16 15:22:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(xxx, xxx) 14/06/16 15:22:05 INFO Slf4jLogger: Slf4jLogger started 14/06/16 15:22:05 INFO Remoting: Starting remoting 14/06/16 15:22:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@:33536] 14/06/16 15:22:06 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@:33536] 14/06/16 15:22:06 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler 14/06/16 15:22:06 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@:33952/user/Worker 14/06/16 15:22:06 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://spark@fc31887475e3:43921]. Address is now gated for 6 ms, all messages to this address will be delivered to dead letters. 14/06/16 15:22:06 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@:33536] - [akka.tcp://spark@fc31887475e3:43921] disassociated! Shutting down.