Spark S3 LZO input files
I'm trying to read input files from S3. The files are compressed using LZO: from spark-shell, sc.textFile("s3n://path/xx.lzo").first returns 'String = �LZO?', so Spark is not uncompressing the data from the file. I am using Cloudera Manager 5, with CDH 5.0.2. I've already installed the 'GPLEXTRAS' parcel and have included 'opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar' and '/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/' in SPARK_CLASS_PATH. What am I missing?
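A common cause of exactly this symptom is that the LZO codec classes are not registered in the Hadoop configuration used by the SparkContext, even when the jar and native libraries are on the classpath. Below is a minimal sketch of that registration from spark-shell; the com.hadoop.compression.lzo class names are the usual hadoop-lzo ones and are assumed to match the GPLEXTRAS parcel.
{code}
// Sketch for spark-shell (the sc variable already exists there); assumes the
// hadoop-lzo jar and its native libraries are visible to the executors as
// described above.
sc.hadoopConfiguration.set("io.compression.codecs",
  "org.apache.hadoop.io.compress.DefaultCodec," +
  "com.hadoop.compression.lzo.LzoCodec," +
  "com.hadoop.compression.lzo.LzopCodec")
sc.hadoopConfiguration.set("io.compression.codec.lzo.class",
  "com.hadoop.compression.lzo.LzoCodec")

// With the codecs registered, textFile should decompress the .lzo input:
sc.textFile("s3n://path/xx.lzo").first()
{code}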
Re: MLLib : Math on Vector and Matrix
Hi Thunder, Please understand that both MLlib and breeze are in active development. Before v1.0, we used jblas but in the public APIs we only exposed Array[Double]. In v1.0, we introduced Vector that supports both dense and sparse data and switched the backend to breeze/netlib-java (except ALS). We only used a few breeze methods in our implementation and we benchmarked them one by one. It was hard to foresee problems caused by including breeze at that time, for example, https://issues.apache.org/jira/browse/SPARK-1520. Being conservative in v1.0 was not a bad choice. We should benchmark breeze v0.8.1 for v1.1 and consider making toBreeze a developer API. For now, if you are migrating code from v0.9, you can use `Vector.toArray` to get the value array. Sorry for the inconvenience! Best, Xiangrui On Wed, Jul 2, 2014 at 2:42 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: in my humble opinion Spark should've supported linalg a la [1] before it even started dumping methodologies into mllib. [1] http://mahout.apache.org/users/sparkbindings/home.html On Wed, Jul 2, 2014 at 2:16 PM, Thunder Stumpges thunder.stump...@gmail.com wrote: Thanks. I always hate having to do stuff like this. It seems like they went a bit overboard with all the private[mllib] declarations... possibly all in the name of thou shalt not change your public API. If you don't make your public API usable, we end up having to work around it anyway... Oh well. Thunder On Wed, Jul 2, 2014 at 2:05 PM, Koert Kuipers ko...@tresata.com wrote: i did the second option: re-implemented .toBreeze as .breeze using pimp classes On Wed, Jul 2, 2014 at 5:00 PM, Thunder Stumpges thunder.stump...@gmail.com wrote: I am upgrading from Spark 0.9.0 to 1.0 and I had a pretty good amount of code working with the internals of MLlib. One of the big changes was the move from the old jblas.Matrix to the Vector/Matrix classes included in MLlib. However I don't see how we're supposed to use them for ANYTHING other than a container for passing data to the included APIs... how do we do any math on them? Looking at the internal code, there are quite a number of private[mllib] declarations, including access to the Breeze representations of the classes. Was there a good reason this was not exposed? I could see maybe not wanting to expose the 'toBreeze' function, which would tie it to the breeze implementation; however, it would be nice to have at least the various mathematical operations wrapped. Right now I see no way to code any vector/matrix math without moving my code namespaces into org.apache.spark.mllib or duplicating the code in 'toBreeze' in my own util functions. Not very appealing. What are others doing? thanks, Thunder
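For reference, a minimal sketch of the "pimp classes" approach Koert mentions is below; the object and method names are illustrative, not part of MLlib's public API.
{code}
// Illustrative implicit ("pimp my library") conversion from an MLlib Vector
// to a Breeze vector; the names here are made up for the example.
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

object BreezeConversions {
  implicit class RichMLlibVector(v: Vector) {
    def breeze: BV[Double] = v match {
      case dv: DenseVector  => new BDV[Double](dv.values)
      case sv: SparseVector => new BSV[Double](sv.indices, sv.values, sv.size)
    }
  }
}

// Usage: import BreezeConversions._ and then write v1.breeze + v2.breeze, etc.
{code}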
Re: MLLib : Math on Vector and Matrix
Hi Dmitriy, It is sweet to have the bindings, but it is very easy to degrade performance with them. The BLAS/LAPACK APIs have been there for more than 20 years and they are still the top choice for high-performance linear algebra. I'm thinking about whether it is possible to make the evaluation lazy in bindings. For example, y += a * x can be translated to an AXPY call instead of creating a temporary vector for a*x. There was some work on this in C++ but none of it achieved good performance. I'm not sure whether this is a good direction to explore. Best, Xiangrui
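To illustrate the point with Breeze itself (a sketch, not MLlib code): the first update below materializes a temporary vector for a * x before adding, while the second performs a single in-place AXPY-style call.
{code}
import breeze.linalg.{DenseVector, axpy}

val a = 2.0
val x = DenseVector(1.0, 2.0, 3.0)
val y = DenseVector(4.0, 5.0, 6.0)

y += x * a      // allocates a temporary vector holding a * x, then adds it
axpy(a, x, y)   // y += a * x in place, without the temporary
{code}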
Re: Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer
That's good to know. I will try it out. Thanks Romain On Friday, June 27, 2014, Romain Rigaux romain.rig...@gmail.com wrote: So far Spark Job Server does not work with Spark 1.0: https://github.com/ooyala/spark-jobserver So this works only with Spark 0.9 currently: http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ Romain On Tue, Jun 24, 2014 at 9:04 AM, Sunita Arvind sunitarv...@gmail.com wrote: Hello Experts, I am attempting to integrate the Spark Editor with Hue on CDH 5.0.1. I have a Spark installation built manually from the sources for Spark 1.0.0, and I am able to integrate it with Cloudera Manager. Background: --- We have a 3 node VM cluster with CDH 5.0.1. We required Spark 1.0.0 due to some features in it, so I did a yum remove spark-core spark-master spark-worker spark-python of the default Spark 0.9.0 and compiled Spark 1.0.0 from source: Downloaded the spark trunk with git clone https://github.com/apache/spark.git cd spark SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true ./sbt/sbt assembly The spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar was built and Spark by itself seems to work well; I was even able to run a text file count. Current attempt: Referring to this article - http://gethue.com/a-new-spark-web-ui-spark-app/ Now I am trying to add the Spark editor to Hue. AFAIK, this requires git clone https://github.com/ooyala/spark-jobserver.git cd spark-jobserver sbt re-start This was successful after a lot of struggle with the proxy settings. However, is this the job server itself? Does that mean the job server has to be started manually? I intend to have the Spark editor show up in the Hue web UI and I am nowhere close. Can someone please help? Note, the 3 VMs are Linux CentOS. Not sure if a setting like the following can be expected to work: [desktop] app_blacklist= Also, I have made the changes to job-server/src/main/resources/application.conf as recommended; however, I do not expect this to impact Hue in any way. Also, I intend to let the editor stay available, not spawn it every time it is required. Thanks in advance. regards
Re: Run spark unit test on Windows 7
It sounds really strange... I guess it is a bug, a critical bug, and must be fixed... at least some flag must be added (unable.hadoop). I found the following workaround: 1) download a compiled winutils.exe from http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight 2) put this file into d:\winutil\bin 3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\") after that the test runs. Thank you, Konstantin Kudryavtsev On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee denny.g@gmail.com wrote: You don't actually need it per se - it's just that some of the Spark libraries are referencing Hadoop libraries even if they ultimately do not call them. When I was doing some early builds of Spark on Windows, I admittedly had Hadoop on Windows running as well and had not run into this particular issue. On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev kudryavtsev.konstan...@gmail.com wrote: No, I don't. Why do I need to have HDP installed? I don't use Hadoop at all and I'd like to read data from the local filesystem. On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote: By any chance do you have HDP 2.1 installed? You may need to install the utils and update the env variables per http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi Andrew, it's Windows 7 and I didn't set up any env variables here. The full stack trace: 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. 
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) at org.apache.hadoop.util.Shell.clinit(Shell.java:326) at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76) at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) at org.apache.hadoop.security.Groups.init(Groups.java:77) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) at org.apache.spark.SparkContext.init(SparkContext.scala:228) at org.apache.spark.SparkContext.init(SparkContext.scala:97) at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at junit.framework.TestCase.runTest(TestCase.java:168) at junit.framework.TestCase.runBare(TestCase.java:134) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:124) at junit.framework.TestSuite.runTest(TestSuite.java:232) at junit.framework.TestSuite.run(TestSuite.java:227) at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81) at org.junit.runner.JUnitCore.run(JUnitCore.java:130) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120) Thank you, Konstantin Kudryavtsev On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or and...@databricks.com wrote: Hi Konstatin, We use hadoop as a library in a few places in Spark. I wonder why the path includes null though. Could you provide the full stack trace? Andrew 2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com: Hi all, I'm trying to run some transformation on *Spark*, it works
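For reference, the workaround described above can be wired into the test setup itself; the sketch below assumes a JUnit 4-style test and a local winutils.exe under d:\winutil\bin.
{code}
// Sketch of the workaround: set hadoop.home.dir before the first SparkContext
// is created, because Hadoop's Shell class looks up winutils.exe in a static
// initializer. The d:\winutil path is whatever directory holds bin\winutils.exe.
import org.junit.{Before, Test}
import org.apache.spark.{SparkConf, SparkContext}

class EtlTest {
  @Before def setHadoopHome(): Unit = {
    System.setProperty("hadoop.home.dir", "d:\\winutil\\")
  }

  @Test def testETL(): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("etl-test"))
    try {
      // ... run the transformations under test against local files ...
    } finally {
      sc.stop()
    }
  }
}
{code}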
Re: installing spark 1 on hadoop 1
Do you have an sbt directory inside your Spark directory? Thanks Best Regards On Wed, Jul 2, 2014 at 10:17 PM, Imran Akbar im...@infoscoutinc.com wrote: Hi, I'm trying to install Spark 1 on my Hadoop cluster running on EMR. I didn't have any problem installing the previous versions, but in this version I couldn't find any 'sbt' folder. However, the README still suggests using this to install Spark: ./sbt/sbt assembly which fails: ./sbt/sbt: No such file or directory any suggestions? thanks, imran
Re: Spark SQL - groupby
Hi, Can someone provide me pointers for this issue? Thanks Subacini On Wed, Jul 2, 2014 at 3:34 PM, Subacini B subac...@gmail.com wrote: Hi, The code below throws a compilation error, "not found: value Sum". Can someone help me with this? Do I need to add any jars or imports? Even for Count, the same error is thrown. val queryResult = sql("select * from Table") queryResult.groupBy('colA)('colA, Sum('colB) as 'totB).aggregate(Sum('totB)).collect().foreach(println) Thanks subacini
Re: installing spark 1 on hadoop 1
If you have downloaded the pre-compiled binary, it will not have an sbt directory inside it. Thanks Best Regards On Thu, Jul 3, 2014 at 12:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Do you have an sbt directory inside your Spark directory? Thanks Best Regards On Wed, Jul 2, 2014 at 10:17 PM, Imran Akbar im...@infoscoutinc.com wrote: Hi, I'm trying to install Spark 1 on my Hadoop cluster running on EMR. I didn't have any problem installing the previous versions, but in this version I couldn't find any 'sbt' folder. However, the README still suggests using this to install Spark: ./sbt/sbt assembly which fails: ./sbt/sbt: No such file or directory any suggestions? thanks, imran
Re: RDD join: composite keys
Hi Sameer, If you set those two IDs to be a Tuple2 in the key of the RDD, then you can join on that tuple. Example: val rdd1: RDD[Tuple3[Int, Int, String]] = ... val rdd2: RDD[Tuple3[Int, Int, String]] = ... val resultRDD = rdd1.map(k => ((k._1, k._2), k._3)).join(rdd2.map(k => ((k._1, k._2), k._3))) Note that when using .join though, that is an inner join, so you only get results from (id1, id2) pairs that have BOTH a score1 and a score2. Andrew On Wed, Jul 2, 2014 at 5:12 PM, Sameer Tilak ssti...@live.com wrote: Hi everyone, Is it possible to join RDDs using composite keys? I would like to join these two RDDs with RDD1.id1 = RDD2.id1 and RDD1.id2 = RDD2.id2: RDD1 (id1, id2, scoretype1) RDD2 (id1, id2, scoretype2) I want the result to be ResultRDD = (id1, id2, (score1, score2)) Would really appreciate if you can point me in the right direction.
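A self-contained version of the snippet above (a sketch; the sample data just stands in for RDDs of (id1, id2, score) triples):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit PairRDDFunctions for join()

def joinOnCompositeKey(sc: SparkContext): Unit = {
  val rdd1 = sc.parallelize(Seq((1, 10, "score1-a"), (2, 20, "score1-b")))
  val rdd2 = sc.parallelize(Seq((1, 10, "score2-a"), (3, 30, "score2-b")))

  // Key both RDDs by the (id1, id2) tuple, then inner-join on that composite key.
  val result = rdd1.map(t => ((t._1, t._2), t._3))
    .join(rdd2.map(t => ((t._1, t._2), t._3)))

  result.collect().foreach(println)  // prints ((1,10),(score1-a,score2-a))
}
{code}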
Re: Spark SQL - groupby
Hi, You need to import Sum and Count like: import org.apache.spark.sql.catalyst.expressions.{Sum,Count} // or with wildcard _ Or, if you use the current master branch build, you can use sum('colB) instead of Sum('colB). Thanks. 2014-07-03 16:09 GMT+09:00 Subacini B subac...@gmail.com: Hi, Can someone provide me pointers for this issue? Thanks Subacini On Wed, Jul 2, 2014 at 3:34 PM, Subacini B subac...@gmail.com wrote: Hi, The code below throws a compilation error, "not found: value Sum". Can someone help me with this? Do I need to add any jars or imports? Even for Count, the same error is thrown. val queryResult = sql("select * from Table") queryResult.groupBy('colA)('colA, Sum('colB) as 'totB).aggregate(Sum('totB)).collect().foreach(println) Thanks subacini -- Takuya UESHIN Tokyo, Japan http://twitter.com/ueshin
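Putting the pieces together, a hedged sketch of the fixed code (Spark 1.0-style SchemaRDD DSL; the table and column names are the placeholders from the original question, and sc is an existing SparkContext):
{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.expressions.{Sum, Count}

val sqlContext = new SQLContext(sc)
import sqlContext._  // enables sql(...) and the 'symbol column syntax

val queryResult = sql("select * from Table")
queryResult
  .groupBy('colA)('colA, Sum('colB) as 'totB)
  .aggregate(Sum('totB))
  .collect()
  .foreach(println)
{code}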
Re: Shark Vs Spark SQL
add MASTER=yarn-client then the JDBC / Thrift server will run on yarn 2014-07-02 16:57 GMT-07:00 田毅 tia...@asiainfo.com: hi, Matei Do you know how to run the JDBC / Thrift server on Yarn? I did not find any suggestion in docs. 2014-07-02 16:06 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com: Spark SQL in Spark 1.1 will include all the functionality in Shark; take a look at http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html. We decided to do this because at the end of the day, the only code left in Shark was the JDBC / Thrift server, which is a very small amount of code. There’s also a branch of Spark 1.0 that includes this server if you want to replace Shark on Spark 1.0: https://github.com/apache/spark/tree/branch-1.0-jdbc. The server runs in a very similar way to how Shark did. Matei On Jul 2, 2014, at 3:57 PM, Shrikar archak shrika...@gmail.com wrote: As of the spark summit 2014 they mentioned that there will be no active development on shark. Thanks, Shrikar On Wed, Jul 2, 2014 at 3:53 PM, Subacini B subac...@gmail.com wrote: Hi, http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3cb75376b8-7a57-4161-b604-f919886cf...@gmail.com%3E This talks about Shark backend will be replaced with Spark SQL engine in future. Does that mean Spark will continue to support Shark + Spark SQL for long term? OR After some period, Shark will be decommissioned ?? Thanks Subacini
Case class in java
Hi, I'm trying to convert a Scala Spark job into Java. In Scala, I typically use a 'case class' to apply a schema to an RDD. It can be converted into a POJO class in Java, but what I really want to do is dynamically create POJO classes like the Scala REPL does. For this reason, I use javassist to create POJO classes at runtime easily. But the problem is that worker nodes can't find these classes. The error message is: host workernode2.com: java.lang.ClassNotFoundException: GeneratedClass_no1 java.net.URLClassLoader$1.run(URLClassLoader.java:366) java.net.URLClassLoader$1.run(URLClassLoader.java:355) java.security.AccessController.doPrivileged(Native Method) java.net.URLClassLoader.findClass(URLClassLoader.java:354) java.lang.ClassLoader.loadClass(ClassLoader.java:423) java.lang.ClassLoader.loadClass(ClassLoader.java:356) java.lang.Class.forName0(Native Method) java.lang.Class.forName(Class.java:266) The generated class's classloader is 'Thread.currentThread().getContextClassLoader()'. I expect it to be visible to the driver node, but the worker nodes' executors cannot see it. Would changing the classloader used to load the generated class, or broadcasting the generated class via the Spark context, be effective?
Re: write event logs with YARN
Hi Andrew, This does not work (the application failed), I have the following error when I put 3 slashes in the hdfs scheme: (...) Caused by: java.lang.IllegalArgumentException: Pathname /dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442 from hdfs:/dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442 is not a valid DFS filename. (...) Besides, I do not think that there is an issue with the hdfs path name since only the empty APPLICATION_COMPLETE file is missing (with spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events), all other files are correctly created, e.g.: hdfs dfs -ls spark-events/kelkoo.searchkeywordreport-1404376178470 Found 3 items -rwxrwx--- 1 kookel supergroup 0 2014-07-03 08:29 spark-events/kelkoo.searchkeywordreport-1404376178470/COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec -rwxrwx--- 1 kookel supergroup 137948 2014-07-03 08:32 spark-events/kelkoo.searchkeywordreport-1404376178470/EVENT_LOG_2 -rwxrwx--- 1 kookel supergroup 0 2014-07-03 08:29 spark-events/kelkoo.searchkeywordreport-1404376178470/SPARK_VERSION_1.0.0 You help is appreciated though, do not hesitate if you have any other idea on how to fix this. Thanks, Christophe. On 03/07/2014 01:49, Andrew Lee wrote: Hi Christophe, Make sure you have 3 slashes in the hdfs scheme. e.g. hdfs:///server_name:9000/user/user_name/spark-events and in the spark-defaults.conf as well. spark.eventLog.dir=hdfs:///server_name:9000/user/user_name/spark-events Date: Thu, 19 Jun 2014 11:18:51 +0200 From: christophe.pre...@kelkoo.commailto:christophe.pre...@kelkoo.com To: user@spark.apache.orgmailto:user@spark.apache.org Subject: write event logs with YARN Hi, I am trying to use the new Spark history server in 1.0.0 to view finished applications (launched on YARN), without success so far. Here are the relevant configuration properties in my spark-defaults.conf: spark.yarn.historyServer.address=server_name:18080 spark.ui.killEnabled=false spark.eventLog.enabled=true spark.eventLog.compress=true spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events And the history server has been launched with the command below: /opt/spark/sbin/start-history-server.sh hdfs://server_name:9000/user/user_name/spark-events However, the finished application do not appear in the history server UI (though the UI itself works correctly). Apparently, the problem is that the APPLICATION_COMPLETE file is not created: hdfs dfs -stat %n spark-events/application_name-1403166516102/* COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec EVENT_LOG_2 SPARK_VERSION_1.0.0 Indeed, if I manually create an empty APPLICATION_COMPLETE file in the above directory, the application can now be viewed normally in the history server. Finally, here is the relevant part of the YARN application log, which seems to imply that the DFS Filesystem is already closed when the APPLICATION_COMPLETE file is created: (...) 14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED 14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be successfully unregistered. 14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal. 
14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1397477394591_0798 14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from shutdown hook 14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877 14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Shutting down all executors 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Asking each executor to shut down 14/06/19 08:29:30 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 14/06/19 08:29:30 INFO ConnectionManager: Selector thread was interrupted! 14/06/19 08:29:30 INFO ConnectionManager: ConnectionManager stopped 14/06/19 08:29:30 INFO MemoryStore: MemoryStore cleared 14/06/19 08:29:30 INFO BlockManager: BlockManager stopped 14/06/19 08:29:30 INFO BlockManagerMasterActor: Stopping BlockManagerMaster 14/06/19 08:29:30 INFO BlockManagerMaster: BlockManagerMaster stopped Exception in thread Thread-44 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1365) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1307) at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:384) at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:380) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at
Re: java options for spark-1.0.0
With spark-1.0.0 this is the cmdline from /proc/#pid: (with the export line export _JAVA_OPTIONS=...) /usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms512m-Xmx512morg.apache.spark.deploy.SparkSubmit--classSparkKMeans--verbose--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001 This is the cmdline from /proc/#pid with spark-0.8.0 and launching KMeans with scala -J-Xms16g -J-Xms16g . The export line from bashrc is ignored here also (If I do launch without specifying the java options after the scala command , the heap will have the default value) - the results below are from launching it with the java options specified after the scala command: /usr/java/jdk1.7.0_51/bin/java-Xmx256M-Xms32M-Xms16g-Xmx16g-Xbootclasspath/a:/home/spark2013/scala-2.9.3/lib/jline.jar:/home/spark2013/scala-2.9.3/lib/scalacheck.jar:/home/spark2013/scala-2.9.3/lib/scala-compiler.jar:/home/spark2013/scala-2.9.3/lib/scala-dbc.jar:/home/spark2013/scala-2.9.3/lib/scala-library.jar:/home/spark2013/scala-2.9.3/lib/scala-partest.jar:/home/spark2013/scala-2.9.3/lib/scalap.jar:/home/spark2013/scala-2.9.3/lib/scala-swing.jar-Dscala.usejavacp=true-Dscala.home=/home/spark2013/scala-2.9.3-Denv.emacs=scala.tools.nsc.MainGenericRunner-J-Xms16g-J-Xmx16g-cp/home/spark2013/Runs/KMeans/GC/classesSparkKMeanslocal[24]/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001 Launching spark-1.0.0 with spark-submit and --driver-memory-10g gets picked up, but the results in the execution are the same, a lot of alocation failures /usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms10g-Xmx10gorg.apache.spark.deploy.SparkSubmit--driver-memory10g--classSparkKMeans--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001 Adding --executor-memory 11g will not change the outcome: cat /proc/13286/cmdline /usr/java/jdk1.8.0_05/bin/java-cp::/home/spark2013/spark-1.0.0/conf:/home/spark2013/spark-1.0.0/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-core-3.2.2.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-rdbms-3.2.1.jar:/home/spark2013/spark-1.0.0/lib/datanucleus-api-jdo-3.2.1.jar-XX:MaxPermSize=128m-Djava.library.path=-Xms10g-Xmx10gorg.apache.spark.deploy.SparkSubmit--driver-memory10g--executor-memory11g--classSparkKMeans--masterlocal[24]/home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar/home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt10240.001 So the Xmx and Xms can be altered, but the execution is rubbish in performance compared to spark 0.8.0. How can I improve it ? 
Thanks On Wednesday, July 2, 2014 9:34 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try looking at the running processes with “ps” to see their full command line and see whether any options are different. It seems like in both cases, your young generation is quite large (11 GB), which doesn’t make lot of sense with a heap of 15 GB. But maybe I’m misreading something. Matei On Jul 2, 2014, at 4:50 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: I ran SparkKMeans with a big file (~ 7 GB of data) for one iteration with spark-0.8.0 with this line in bash.rc export _JAVA_OPTIONS=-Xmx15g -Xms15g -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails . It finished in a decent time, ~50 seconds, and I had only a few Full GC messages from Java. (a max of 4-5) Now, using the same export in bash.rc but with spark-1.0.0 (and running it with spark-submit) the first loop never finishes and I get a lot of: 18.537: [GC (Allocation Failure) --[PSYoungGen: 11796992K-11796992K(13762560K)] 11797442K-11797450K(13763072K), 2.8420311 secs] [Times: user=5.81 sys=2.12, real=2.85 secs] or 31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K-3177967K(13762560K)] [ParOldGen: 505K-505K(512K)] 11797497K-3178473K(13763072K), [Metaspace: 37646K-37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, real=2.31 secs] I tried passing different parameters for the JVM through spark-submit, but the results are the same This happens with java 1.7 and also with java 1.8. I do not know what the Ergonomics stands for ... How can I get a decent performance from spark-1.0.0 considering that
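For comparison, heap size and GC logging are usually passed through spark-submit itself rather than via _JAVA_OPTIONS; a sketch using the values from this thread is below (note that in local[*] mode everything runs in the driver JVM, so --executor-memory has no effect there, which matches what was observed above). The k and convergeDist arguments are whatever the application expects.
{code}
./bin/spark-submit \
  --class SparkKMeans \
  --master local[24] \
  --driver-memory 15g \
  --driver-java-options "-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails" \
  /home/spark2013/KMeansWorkingDirectory/target/scala-2.10/sparkkmeans_2.10-1.0.jar \
  /home/spark2013/sparkRun/fisier_16mil_30D_R10k.txt <k> <convergeDist>
{code}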
Which version of Hive support Spark Shark
Hi, Can anyone please help me understand which versions of Hive are supported by Spark and Shark? -- -- Regards, RAVI PRASAD. T
Re: Case class in java
I found a web page with a hint: http://ardoris.wordpress.com/2014/03/30/how-spark-does-class-loading/ I learned that SparkIMain has an internal HTTP server to publish class objects, but I can't figure out how to use it from Java. Any ideas? Thanks, Kevin
hdfs short circuit
Can I enable Spark to use the dfs.client.read.shortcircuit property to improve performance and read natively on local nodes instead of going through the HDFS API?
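Short-circuit reads are an HDFS client feature rather than a Spark setting: Spark's HDFS reads go through the Hadoop client, so if the Hadoop configuration on the executors' classpath enables it (and the DataNodes do too), Spark benefits from it automatically. A sketch of the usual hdfs-site.xml entries follows; the socket path value is illustrative.
{code}
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
{code}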
Re: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN mode)
I've had some odd behavior with jobs showing up in the history server in 1.0.0. Failed jobs do show up, but it seems they can show up minutes or hours later. I see messages in the history server logs about bad task ids. But then eventually the jobs show up. This might be your situation. Anecdotally, if you click on the job in the Spark Master GUI after it is done, this may help it show up in the history server faster. Haven't reliably tested this though. May just be a coincidence of timing. -Suren On Wed, Jul 2, 2014 at 8:01 PM, Andrew Lee alee...@hotmail.com wrote: Hi All, I have the HistoryServer up and running, and it is great. Is it possible to also have the HistoryServer parse failed jobs' events by default as well? I get "No Completed Applications Found" if a job fails: Event Log Location: hdfs:///user/test01/spark/logs/ No Completed Applications Found The reason is that it is good to run the HistoryServer to keep track of performance and resource usage for each completed job, but I find it more useful when a job fails. I can identify which stage it failed in, etc., instead of sifting through the logs from the Resource Manager. The same event log is only available when the Application Master is still active; once the job fails, the Application Master is killed and I lose the GUI access, and even though I have the event log in JSON format, I can't open it with the HistoryServer. This is very helpful especially for long running jobs that last for 2-18 hours and generate gigabytes of logs. So I have 2 questions: 1. Any reason why we only render completed jobs? Why can't we bring in all jobs and choose from the GUI? Like a time machine to restore the status from the Application Master? From ./core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala: val logInfos = logDirs .sortBy { dir => getModificationTime(dir) } .map { dir => (dir, EventLoggingListener.parseLoggingInfo(dir.getPath, fileSystem)) } .filter { case (dir, info) => info.applicationComplete } 2. If I force-touch an APPLICATION_COMPLETE file in the failed job's event log folder, will this cause any problem? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hira...@velos.io W: www.velos.io
Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector
I have given this a try in a spark-shell and I still get many Allocation Failures On Thursday, July 3, 2014 9:51 AM, Xiangrui Meng men...@gmail.com wrote: The SparkKMeans is just an example code showing a barebone implementation of k-means. To run k-means on big datasets, please use the KMeans implemented in MLlib directly: http://spark.apache.org/docs/latest/mllib-clustering.html -Xiangrui On Wed, Jul 2, 2014 at 9:50 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: I can run it now with the suggested method. However, I have encountered a new problem that I have not faced before (sent another email with that one but here it goes again ...) I ran SparkKMeans with a big file (~ 7 GB of data) for one iteration with spark-0.8.0 with this line in bash.rc export _JAVA_OPTIONS=-Xmx15g -Xms15g -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails . It finished in a decent time, ~50 seconds, and I had only a few Full GC messages from Java. (a max of 4-5) Now, using the same export in bash.rc but with spark-1.0.0 (and running it with spark-submit) the first loop never finishes and I get a lot of: 18.537: [GC (Allocation Failure) --[PSYoungGen: 11796992K-11796992K(13762560K)] 11797442K-11797450K(13763072K), 2.8420311 secs] [Times: user=5.81 sys=2.12, real=2.85 secs] or 31.867: [Full GC (Ergonomics) [PSYoungGen: 11796992K-3177967K(13762560K)] [ParOldGen: 505K-505K(512K)] 11797497K-3178473K(13763072K), [Metaspace: 37646K-37646K(1081344K)], 2.3053283 secs] [Times: user=37.74 sys=0.11, real=2.31 secs] I tried passing different parameters for the JVM through spark-submit, but the results are the same This happens with java 1.7 and also with java 1.8. I do not know what the Ergonomics stands for ... How can I get a decent performance from spark-1.0.0 considering that spark-0.8.0 did not need any fine tuning on the gargage collection method (the default worked well) ? Thank you On Wednesday, July 2, 2014 4:45 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: The scripts that Xiangrui mentions set up the classpath...Can you run ./run-example for the provided example sucessfully? What you can try is set SPARK_PRINT_LAUNCH_COMMAND=1 and then call run-example -- that will show you the exact java command used to run the example at the start of execution. Assuming you can run examples succesfully, you should be able to just copy that and add your jar to the front of the classpath. If that works you can start removing extra jars (run-examples put all the example jars in the cp, which you won't need) As you said the error you see is indicative of the class not being available/seen at runtime but it's hard to tell why. On Wed, Jul 2, 2014 at 2:13 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: I want to make some minor modifications in the SparkMeans.scala so running the basic example won't do. I have also packed my code under a jar file with sbt. It completes successfully but when I try to run it : java -jar myjar.jar I get the same error: Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) If scalac -d classes/ SparkKMeans.scala can't see my classpath, why does it succeeds in compiling and does not give the same error ? 
The error itself, NoClassDefFoundError, means that the class was available at compile time but, for some reason I cannot figure out, is not available at run time. Does anyone know why? Thank you On Tuesday, July 1, 2014 7:03 PM, Xiangrui Meng men...@gmail.com wrote: You can use either bin/run-example or bin/spark-submit to run example code. scalac -d classes/ SparkKMeans.scala doesn't recognize the Spark classpath. There are examples in the official doc: http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here -Xiangrui On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: Hello, I have installed spark-1.0.0 with scala 2.10.3. I have built spark with sbt/sbt assembly and added /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar to my CLASSPATH variable. Then I went here ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples, created a new directory classes, and compiled SparkKMeans.scala with scalac -d classes/ SparkKMeans.scala Then I navigated to classes (I commented this line in the scala file: package org.apache.spark.examples) and tried to run it with java -cp . SparkKMeans and I get the following error: Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector at
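Following Xiangrui's suggestion, the usual way to run a modified copy of the example is to package it and let spark-submit supply the Spark (and breeze) runtime classpath; a sketch is below. The jar name is the one mentioned in the thread, and the application arguments are whatever the modified example expects.
{code}
sbt package
./bin/spark-submit \
  --class SparkKMeans \
  --master local[24] \
  myjar.jar <input-file> <k> <convergeDist>
{code}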
Reading text file vs streaming text files
Hi: I am working on a project where a few thousand text files (~20M in size) will be dropped into an hdfs directory every 15 minutes. Data from the files will be used to update counters in Cassandra (a non-idempotent operation). I was wondering what is the best way to deal with this: * Use text streaming and process the files as they are added to the directory * Use non-streaming text input and launch a spark driver every 15 minutes to process files from a specified directory (a new directory for every 15 minutes) * Use a message queue to ingest data from the files and then read data from the queue Also, is there a way to find which text file is being processed and when a file has been processed, for both the streaming and non-streaming RDDs? I believe the filename is available in WholeTextFileInputFormat, but is it available in standard or streaming text RDDs? Thanks Mans
matchError:null in ALS.train
Hi All, We are using ALS.train to generate a model for predictions. We are using a DStream[] to collect the predicted output and then trying to dump it to a text file using these two approaches, dstream.saveAsTextFiles() and dstream.foreachRDD(rdd => rdd.saveAsTextFile), but both these approaches are giving us the following error: Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:43) MyOperator$$anonfun$7.apply(MyOperator.scala:213) MyOperator$$anonfun$7.apply(MyOperator.scala:180) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) We tried it in both Spark 0.9.1 and 1.0.0; Scala 2.10.3. Can anybody help me with this issue? Thank You Regards Honey Joshi Ideata-Analytics
Re: Reading text file vs streaming text files
Hi Singh! For this use-case it's better to have a StreamingContext listening to the directory in hdfs where the files are being dropped; you can set the streaming interval to 15 minutes and let the driver program run continuously, so as soon as new files arrive they are picked up for processing every 15 minutes. In this way, you don't have to worry about the old files unless you are about to restart the driver program. Another option would be, after processing each batch, to simply move the processed files to another directory. Thanks Best Regards On Thu, Jul 3, 2014 at 6:34 PM, M Singh mans6si...@yahoo.com wrote: Hi: I am working on a project where a few thousand text files (~20M in size) will be dropped into an hdfs directory every 15 minutes. Data from the files will be used to update counters in Cassandra (a non-idempotent operation). I was wondering what is the best way to deal with this: - Use text streaming and process the files as they are added to the directory - Use non-streaming text input and launch a spark driver every 15 minutes to process files from a specified directory (a new directory for every 15 minutes) - Use a message queue to ingest data from the files and then read data from the queue Also, is there a way to find which text file is being processed and when a file has been processed, for both the streaming and non-streaming RDDs? I believe the filename is available in WholeTextFileInputFormat, but is it available in standard or streaming text RDDs? Thanks Mans
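A sketch of that approach (the directory path is a placeholder; the Cassandra update is left as a comment since the counter increments are not idempotent):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("file-counter")
val ssc = new StreamingContext(conf, Minutes(15))

// Each 15-minute batch contains the lines of files that appeared in that interval.
val lines = ssc.textFileStream("hdfs:///data/incoming")
lines.foreachRDD { rdd =>
  // update the Cassandra counters here; since the operation is not idempotent,
  // track processed batches so a restart does not double-count
}

ssc.start()
ssc.awaitTermination()
{code}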
Re: Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition
On Wed, July 2, 2014 2:00 am, Mayur Rustagi wrote: Two job contexts cannot share data; are you collecting the data to the master and then sending it to the other context? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 2, 2014 at 11:57 AM, Honey Joshi honeyjo...@ideata-analytics.com wrote: On Wed, July 2, 2014 1:11 am, Mayur Rustagi wrote: Ideally you should be converting the RDD to a SchemaRDD? You are creating a UnionRDD to join across DStream RDDs? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Jul 1, 2014 at 3:11 PM, Honey Joshi honeyjo...@ideata-analytics.com wrote: Hi, I am trying to run a project which takes data as a DStream and dumps the data into a Shark table after various operations. I am getting the following error: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 1 times (most recent failure: Exception failure: java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Can someone please explain the cause of this error? I am also using a SparkContext alongside the existing StreamingContext. I am using spark 0.9.0-incubating, so it doesn't have anything to do with SchemaRDD. This error is probably coming up because I am trying to use one Spark context and one Shark context in the same job. Is there any way to incorporate two contexts in one job? Regards Honey Joshi Ideata-Analytics Both of these contexts are executing independently but they were still giving us issues, mostly because of the lazy evaluation in Scala. The error was coming up when I tried to use one Spark context and one Shark context in the same job. Got it resolved by stopping the existing Spark context before calling the Shark context. 
Thanks for your help Mayur.
Re: [mllib] strange/buggy results with RidgeRegressionWithSGD
Printing the model shows the intercept is always 0 :( Should I open a bug for that? 2014-07-02 16:11 GMT+02:00 Eustache DIEMERT eusta...@diemert.fr: Hi list, I'm benchmarking MLlib for a regression task [1] and get strange results. Namely, using RidgeRegressionWithSGD it seems the predicted points miss the intercept: {code} val trainedModel = RidgeRegressionWithSGD.train(trainingData, 1000) ... valuesAndPreds.take(10).map(t => println(t)) {code} output: (2007.0,-3.784588726958493E75) (2003.0,-1.9562390324037716E75) (2005.0,-4.147413202985629E75) (2003.0,-1.524938024096847E75) ... If I change the parameters (step size, regularization and iterations) I get NaNs more often than not: (2007.0,NaN) (2003.0,NaN) (2005.0,NaN) ... On the other hand the DecisionTree model gives sensible results. I see there is a `setIntercept()` method in the abstract class GeneralizedLinearAlgorithm that seems to trigger the use of the intercept, but I'm unable to use it from the public interface :( Any help appreciated :) Eustache [1] https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD
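Until setIntercept() is reachable from the public API, one hedged workaround is to center the label yourself (standardizing the features in the same way also helps with the exploding/NaN predictions that come from a too-large SGD step size) and add the label mean back at prediction time. A sketch, with made-up function and parameter names:
{code}
import org.apache.spark.SparkContext._  // DoubleRDDFunctions for mean()
import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}
import org.apache.spark.rdd.RDD

def trainCentered(data: RDD[LabeledPoint], numIterations: Int,
                  stepSize: Double, regParam: Double) = {
  val labelMean = data.map(_.label).mean()
  // Train on labels with the mean removed, so the missing intercept matters less.
  val centered = data.map(p => LabeledPoint(p.label - labelMean, p.features))
  val model = RidgeRegressionWithSGD.train(centered, numIterations, stepSize, regParam)
  (model, labelMean)
}

// Prediction then becomes: model.predict(features) + labelMean
{code}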
Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster
Hello, I am running the BroadcastTest example in a standalone cluster using spark-submit. I have 8 host machines and made Host1 the master. Host2 to Host8 act as 7 workers to connect to the master. The connection was fine as I could see all 7 hosts on the master web ui. The BroadcastTest example with Http broadcast also works fine, I think, as there was no error msg and all workers EXITED at the end. But when I changed the third argument from Http to Torrent to use Torrent broadcast, all workers got a KILLED status once they reached sc.stop(). Below is the stderr on one of the workers when running Torrent broadcast (I masked the IP addresses): == 14/07/02 18:20:03 INFO SecurityManager: Changing view acls to: root 14/07/02 18:20:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root) 14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started 14/07/02 18:20:04 INFO Remoting: Starting remoting 14/07/02 18:20:04 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@dyn-xxx-xx-xx-xx:37771] 14/07/02 18:20:04 INFO Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@dyn-xxx-xx-xx-xx:37771] 14/07/02 18:20:04 INFO SecurityManager: Changing view acls to: root 14/07/02 18:20:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root) 14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started 14/07/02 18:20:04 INFO Remoting: Starting remoting 14/07/02 18:20:04 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@dyn-xxx-xx-xx-xx:53661] 14/07/02 18:20:04 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@dyn-xxx-xx-xx-xx:53661] 14/07/02 18:20:04 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@dyn-xxx-xx-xx-xx:42436/user/CoarseGrainedScheduler 14/07/02 18:20:04 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@dyn-xxx-xx-xx-xx:32818/user/Worker 14/07/02 18:20:04 INFO Remoting: Remoting shut down 14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 
14/07/02 18:20:04 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@dyn-xxx-xx-xx-xx:32818/user/Worker 14/07/02 18:20:04 INFO CoarseGrainedExecutorBackend: Successfully registered with driver 14/07/02 18:20:04 INFO SecurityManager: Changing view acls to: root 14/07/02 18:20:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root) 14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started 14/07/02 18:20:04 INFO Remoting: Starting remoting 14/07/02 18:20:05 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@dyn-xxx-xx-xx-xx:57883] 14/07/02 18:20:05 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@dyn-xxx-xx-xx-xx:57883] 14/07/02 18:20:05 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@dyn-xxx-xx-xx-xx:42436/user/MapOutputTracker 14/07/02 18:20:05 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@dyn-xxx-xx-xx-xx:42436/user/BlockManagerMaster 14/07/02 18:20:05 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140702182005-30bd 14/07/02 18:20:05 INFO ConnectionManager: Bound socket to port 60368 with id = ConnectionManagerId(dyn-xxx-xx-xx-xx,60368) 14/07/02 18:20:05 INFO MemoryStore: MemoryStore started with capacity 294.6 MB 14/07/02 18:20:05 INFO BlockManagerMaster: Trying to register BlockManager 14/07/02 18:20:05 INFO BlockManagerMaster: Registered BlockManager 14/07/02 18:20:05 INFO HttpFileServer: HTTP File server directory is /tmp/spark-35f65442-e0e8-4122-9359-ca8232ca97a6 14/07/02 18:20:05 INFO HttpServer: Starting HTTP Server 14/07/02 18:20:06 INFO CoarseGrainedExecutorBackend: Got assigned task 9 14/07/02 18:20:06 INFO Executor: Running task ID 9 14/07/02 18:20:06 INFO Executor: Fetching http://xxx.xx.xx.xx:54292/jars/broadcast-test_2.10-1.0.jar with timestamp 1404339601903 14/07/02 18:20:06 INFO Utils: Fetching http://xxx.xx.xx.xx:54292/jars/broadcast-test_2.10-1.0.jar to /tmp/fetchFileTemp5382215579021312284.tmp 14/07/02 18:20:06 INFO BlockManager: Removing broadcast 0 14/07/02 18:20:07 INFO Executor: Adding file:/home/lrl/Desktop/spark-master/work/app-20140702182002-0006/3/./broadcast-test_2.10-1.0.jar to class loader 14/07/02 18:20:07 INFO TorrentBroadcast: Started reading broadcast variable 1 14/07/02 18:20:07 INFO SendingConnection: Initiating connection to [dyn-xxx-xx-xx-xx:60179] 14/07/02 18:20:07 INFO SendingConnection: Connected to [dyn-xxx-xx-xx-xx:60179], 1
Re: Run spark unit test on Windows 7
Hi Konstantin, Could you please create a jira item at: https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked? Thanks, Denny On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev (kudryavtsev.konstan...@gmail.com) wrote: It sounds really strange... I guess it is a bug, critical bug and must be fixed... at least some flag must be add (unable.hadoop) I found the next workaround : 1) download compiled winutils.exe from http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight 2) put this file into d:\winutil\bin 3) add in my test: System.setProperty(hadoop.home.dir, d:\\winutil\\) after that test runs Thank you, Konstantin Kudryavtsev On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee denny.g@gmail.com wrote: You don't actually need it per se - its just that some of the Spark libraries are referencing Hadoop libraries even if they ultimately do not call them. When I was doing some early builds of Spark on Windows, I admittedly had Hadoop on Windows running as well and had not run into this particular issue. On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev kudryavtsev.konstan...@gmail.com wrote: No, I don’t why do I need to have HDP installed? I don’t use Hadoop at all and I’d like to read data from local filesystem On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote: By any chance do you have HDP 2.1 installed? you may need to install the utils and update the env variables per http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi Andrew, it's windows 7 and I doesn't set up any env variables here The full stack trace: 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. 
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) at org.apache.hadoop.util.Shell.clinit(Shell.java:326) at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76) at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) at org.apache.hadoop.security.Groups.init(Groups.java:77) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) at org.apache.spark.SparkContext.init(SparkContext.scala:228) at org.apache.spark.SparkContext.init(SparkContext.scala:97) at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at junit.framework.TestCase.runTest(TestCase.java:168) at junit.framework.TestCase.runBare(TestCase.java:134) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:124) at junit.framework.TestSuite.runTest(TestSuite.java:232) at junit.framework.TestSuite.run(TestSuite.java:227) at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81) at org.junit.runner.JUnitCore.run(JUnitCore.java:130) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120) Thank you, Konstantin Kudryavtsev On Wed, Jul 2, 2014 at 8:15 PM, Andrew Or and...@databricks.com wrote: Hi Konstatin, We use hadoop as a library in a few places in Spark. I wonder why the path includes null though. Could you provide the full stack trace?
Re: Kafka - streaming from multiple topics
That's an obvious workaround, yes, thank you Tobias. However, I'm prototyping a substitution for a real batch process, where I'd have to create six streams (and possibly more). It could be a bit messy. On the other hand, under the hood KafkaInputDStream, which is created with this KafkaUtils call, calls ConsumerConnector.createMessageStream, which returns a Map[String, List[KafkaStream]] keyed by topic. It is, however, not exposed. So Kafka does provide the capability of creating multiple streams based on topic, but Spark doesn't use it, which is unfortunate. Sergey From: Tobias Pfeiffer t...@preferred.jp Reply-To: user@spark.apache.org Date: Wednesday, July 2, 2014 at 9:54 PM To: user@spark.apache.org Subject: Re: Kafka - streaming from multiple topics Sergey, you might actually consider using two streams, like val stream1 = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("retarget" -> 2)) val stream2 = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("datapair" -> 2)) to achieve what you want. This has the additional advantage that there are actually two connections to Kafka and data is possibly received on different cluster nodes, already increasing parallelism in an early stage of processing. Tobias On Thu, Jul 3, 2014 at 6:47 AM, Sergey Malov sma...@collective.com wrote: Hi, I would like to set up streaming from a Kafka cluster, reading multiple topics and then processing each of them differently. So, I'd create a stream val stream = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("retarget" -> 2, "datapair" -> 2)) And then, based on whether it's the "retarget" topic or "datapair", set up a different filter function, map function, reduce function, etc. Is it possible? I'd assume it should be, since ConsumerConnector can create a map of KafkaStreams keyed on topic, but I can't find that it would be visible to Spark. Thank you, Sergey Malov
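For completeness, a sketch of the two-stream workaround with per-topic processing (the ZooKeeper address, group id and topic names are the ones from this thread; the per-topic pipelines are purely illustrative):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("multi-topic")
val ssc = new StreamingContext(conf, Seconds(10))

val retarget = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("retarget" -> 2))
val datapair = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("datapair" -> 2))

// Each stream gets its own filter/map/reduce pipeline:
retarget.map(_._2).filter(_.nonEmpty).count().print()
datapair.map(_._2).map(line => (line.split("\t")(0), 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
{code}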
Re: reduceByKey Not Being Called by Spark Streaming
Hi All, I was able to resolve this matter with a simple fix. It seems that in order to process the reduceByKey and flatMap operations at the same time, the fix was to increase the number of threads to more than one. Since I'm developing on my personal machine for speed, I simply updated the sparkURL argument to: private static String sparkURL = "local[2]"; // instead of "local" which is then used as a parameter to the JavaStreamingContext constructor. After I made this change, I was able to see the reduceByKey values properly aggregated and counted. Best Regards, D -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKey-Not-Being-Called-by-Spark-Streaming-tp8684p8739.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: write event logs with YARN
Hi Christophe, another Andrew speaking. Your configuration looks fine to me. From the stack trace it seems that we are in fact closing the file system prematurely elsewhere in the system, such that when it tries to write the APPLICATION_COMPLETE file it throws the exception you see. This does look like a potential bug in Spark. Tracing the source of this may take a little, but we will start looking into it. I'm assuming if you manually create your own APPLICATION_COMPLETE file then the entries should show up. Unfortunately I don't see another workaround for this, but we'll fix this as soon as possible. Andrew 2014-07-03 1:44 GMT-07:00 Christophe Préaud christophe.pre...@kelkoo.com: Hi Andrew, This does not work (the application failed), I have the following error when I put 3 slashes in the hdfs scheme: (...) Caused by: java.lang.IllegalArgumentException: Pathname /dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442 from hdfs:/dc1-ibd-corp-hadoop-01.corp.dc1.kelkoo.net:9000/user/kookel/spark-events/kelkoo.searchkeywordreport-1404374686442 is not a valid DFS filename. (...) Besides, I do not think that there is an issue with the hdfs path name since only the empty APPLICATION_COMPLETE file is missing (with spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events), all other files are correctly created, e.g.: hdfs dfs -ls spark-events/kelkoo.searchkeywordreport-1404376178470 Found 3 items -rwxrwx--- 1 kookel supergroup 0 2014-07-03 08:29 spark-events/kelkoo.searchkeywordreport-1404376178470/COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec -rwxrwx--- 1 kookel supergroup 137948 2014-07-03 08:32 spark-events/kelkoo.searchkeywordreport-1404376178470/EVENT_LOG_2 -rwxrwx--- 1 kookel supergroup 0 2014-07-03 08:29 spark-events/kelkoo.searchkeywordreport-1404376178470/SPARK_VERSION_1.0.0 Your help is appreciated though, do not hesitate if you have any other idea on how to fix this. Thanks, Christophe. On 03/07/2014 01:49, Andrew Lee wrote: Hi Christophe, Make sure you have 3 slashes in the hdfs scheme, e.g. hdfs:///server_name:9000/user/user_name/spark-events and in the spark-defaults.conf as well: spark.eventLog.dir=hdfs:///server_name:9000/user/user_name/spark-events Date: Thu, 19 Jun 2014 11:18:51 +0200 From: christophe.pre...@kelkoo.com To: user@spark.apache.org Subject: write event logs with YARN Hi, I am trying to use the new Spark history server in 1.0.0 to view finished applications (launched on YARN), without success so far. Here are the relevant configuration properties in my spark-defaults.conf: spark.yarn.historyServer.address=server_name:18080 spark.ui.killEnabled=false spark.eventLog.enabled=true spark.eventLog.compress=true spark.eventLog.dir=hdfs://server_name:9000/user/user_name/spark-events And the history server has been launched with the command below: /opt/spark/sbin/start-history-server.sh hdfs://server_name:9000/user/user_name/spark-events However, the finished applications do not appear in the history server UI (though the UI itself works correctly). Apparently, the problem is that the APPLICATION_COMPLETE file is not created: hdfs dfs -stat %n spark-events/application_name-1403166516102/* COMPRESSION_CODEC_org.apache.spark.io.LZFCompressionCodec EVENT_LOG_2 SPARK_VERSION_1.0.0 Indeed, if I manually create an empty APPLICATION_COMPLETE file in the above directory, the application can now be viewed normally in the history server. 
Finally, here is the relevant part of the YARN application log, which seems to imply that the DFS Filesystem is already closed when the APPLICATION_COMPLETE file is created: (...) 14/06/19 08:29:29 INFO ApplicationMaster: finishApplicationMaster with SUCCEEDED 14/06/19 08:29:29 INFO AMRMClientImpl: Waiting for application to be successfully unregistered. 14/06/19 08:29:29 INFO ApplicationMaster: AppMaster received a signal. 14/06/19 08:29:29 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1397477394591_0798 14/06/19 08:29:29 INFO ApplicationMaster$$anon$1: Invoking sc stop from shutdown hook 14/06/19 08:29:29 INFO SparkUI: Stopped Spark web UI at http://dc1-ibd-corp-hadoop-02.corp.dc1.kelkoo.net:54877 14/06/19 08:29:29 INFO DAGScheduler: Stopping DAGScheduler 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Shutting down all executors 14/06/19 08:29:29 INFO CoarseGrainedSchedulerBackend: Asking each executor to shut down 14/06/19 08:29:30 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 14/06/19 08:29:30 INFO ConnectionManager: Selector thread was interrupted! 14/06/19 08:29:30 INFO ConnectionManager: ConnectionManager stopped 14/06/19 08:29:30 INFO MemoryStore: MemoryStore cleared 14/06/19 08:29:30 INFO BlockManager:
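In the meantime, Andrew's manual workaround above can be scripted after each run, using the example path from this thread:

hdfs dfs -touchz spark-events/kelkoo.searchkeywordreport-1404376178470/APPLICATION_COMPLETE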
spark text processing
Hi: Is there a way to find out when Spark has finished processing a text file (both for streaming and non-streaming cases)? Also, after processing, can Spark copy the file to another directory? Thanks
Re: issue with running example code
Just to provide more information on this issue. It seems that the SPARK_HOME environment variable is causing the issue. If I unset the variable in the spark-class script and run in local mode, my code runs fine without the exception. But if I run with SPARK_HOME set, I get the exception mentioned below. I could run without setting SPARK_HOME, but that is not possible in a cluster setting, as it tells the worker nodes where Spark is installed. E.g. we are using Mesos as the cluster manager, so when we set the master to mesos we get an exception because SPARK_HOME is not set. Just to mention again: pyspark works fine, as does spark-shell; only when we run a compiled jar does SPARK_HOME seem to cause some Java runtime issue such that we get the class cast exception. Thanks, Gurvinder On 07/01/2014 09:28 AM, Gurvinder Singh wrote: Hi, I am having an issue running the Scala example code. I have tested and am able to run the Python example code successfully, but when I run the Scala code I get this error: java.lang.ClassCastException: cannot assign instance of org.apache.spark.examples.SparkPi$$anonfun$1 to field org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of org.apache.spark.rdd.MappedRDD I have compiled Spark from github directly and am running with the command spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar --class org.apache.spark.examples.SparkPi 5 --jars /usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar Any suggestions will be helpful. Thanks, Gurvinder
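A side note on the command line above (not a diagnosis of the ClassCastException): spark-submit treats anything after the application jar as arguments to the application itself, so spark-submit's own options usually go before the jar, e.g.:

spark-submit --class org.apache.spark.examples.SparkPi --jars /usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar 5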
Re: MLLib : Math on Vector and Matrix
On Wed, Jul 2, 2014 at 11:40 PM, Xiangrui Meng men...@gmail.com wrote: Hi Dmitriy, It is sweet to have the bindings, but it is very easy to downgrade the performance with them. The BLAS/LAPACK APIs have been there for more than 20 years and they are still the top choice for high-performance linear algebra. There's no such limitation there. In fact, LAPACK/jblas is low-hanging fruit to add there. An algebraic optimizer is not so much about in-core block-on-block techniques; it is about optimizing/simplifying algebraic expressions, especially on the distributed-plan side of things. Another side of the story is a consistent matrix representation for block-to-block in-memory computations and for passing data in and out, with an R-like look and feel. It is true that in-core-only computations are currently not deferred/optimized, nor do they have a LAPACK backend, but that is low-hanging fruit there. The main idea is consistency of the algebraic API/DSL, be it distributed or in-core, along with an algebraic optimizer and pluggable backends (both in-core backends and distributed engine backends). It just so happens that the only in-memory backend right now is Mahout's Colt derivation, but there's fairly little reason not to pick and plug Breeze/LAPACK, or, say, GPU-backed representations. In fact, that was my first attempt a year ago (Breeze), but it unfortunately was not where it needed to be (not sure about now). As for LAPACK, yes, it is easy to integrate. The only reason I (personally) haven't integrated it yet is that my problems tend to be sparse, not dense, and also fairly invasive in terms of custom matrix traversals (probabilistic fitting, for the most part). So the most specifically tweaked methodologies are really more quasi-algebraic than purely algebraic, unfortunately. Having LAPACK blockwise operators on dense matrices would not help me terribly there. But the architectural problem in terms of foundation, and more specifically customization of processes, IMO does exist here (in MLlib). This thread (and another one just like it a few threads below this one) read to me as manifestations of that lack of algebraic foundation APIs/optimizers.
Re: Run spark unit test on Windows 7
Hi Denny, just created https://issues.apache.org/jira/browse/SPARK-2356 On Jul 3, 2014, at 7:06 PM, Denny Lee denny.g@gmail.com wrote: Hi Konstantin, Could you please create a jira item at: https://issues.apache.org/jira/browse/SPARK/ so this issue can be tracked? Thanks, Denny On July 2, 2014 at 11:45:24 PM, Konstantin Kudryavtsev (kudryavtsev.konstan...@gmail.com) wrote: It sounds really strange... I guess it is a bug, a critical bug, and must be fixed... at least some flag must be added (unable.hadoop) I found the following workaround: 1) download the compiled winutils.exe from http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight 2) put this file into d:\winutil\bin 3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\") After that the test runs. Thank you, Konstantin Kudryavtsev On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee denny.g@gmail.com wrote: You don't actually need it per se - it's just that some of the Spark libraries reference Hadoop libraries even if they ultimately do not call them. When I was doing some early builds of Spark on Windows, I admittedly had Hadoop on Windows running as well and had not run into this particular issue. On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev kudryavtsev.konstan...@gmail.com wrote: No, I don't. Why do I need to have HDP installed? I don't use Hadoop at all and I'd like to read data from the local filesystem On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote: By any chance do you have HDP 2.1 installed? You may need to install the utils and update the env variables per http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi Andrew, it's Windows 7 and I didn't set up any env variables here. The full stack trace: 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. 
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) at org.apache.hadoop.util.Shell.clinit(Shell.java:326) at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76) at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) at org.apache.hadoop.security.Groups.init(Groups.java:77) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) at org.apache.spark.SparkContext.init(SparkContext.scala:228) at org.apache.spark.SparkContext.init(SparkContext.scala:97) at my.example.EtlTest.testETL(IxtoolsDailyAggTest.scala:13) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at junit.framework.TestCase.runTest(TestCase.java:168) at junit.framework.TestCase.runBare(TestCase.java:134) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:124) at junit.framework.TestSuite.runTest(TestSuite.java:232) at junit.framework.TestSuite.run(TestSuite.java:227) at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81) at org.junit.runner.JUnitCore.run(JUnitCore.java:130) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
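To illustrate Konstantin's workaround above in a test, a minimal Scala sketch (paths, app name, and test body are hypothetical):

// point Hadoop at the directory containing bin\winutils.exe before creating the SparkContext
System.setProperty("hadoop.home.dir", "d:\\winutil\\")
val sc = new SparkContext("local", "EtlTest")
try {
  assert(sc.parallelize(1 to 10).count() == 10)
} finally {
  sc.stop()
}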
Re: Run spark unit test on Windows 7
Thanks! will take a look at this later today. HTH! On Jul 3, 2014, at 11:09 AM, Kostiantyn Kudriavtsev kudryavtsev.konstan...@gmail.com wrote: Hi Denny, just created https://issues.apache.org/jira/browse/SPARK-2356
Anaconda Spark AMI
Hi All, I'm a dev at Continuum and we are developing a fair amount of tooling around Spark. A few days ago someone expressed interest in numpy+pyspark and Anaconda came up as a reasonable solution. I spent a number of hours yesterday trying to rework the base Spark AMI on EC2 but sadly was defeated by a number of errors. Aggregations seemed to choke, whereas small takes executed as expected (errors are linked to the gist):

sc.appName
u'PySparkShell'
sc._conf.getAll()
[(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'), (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''), (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath', u'/root/ephemeral-hdfs/conf'), (u'spark.master', u'spark://.compute-1.amazonaws.com:7077')]
file = sc.textFile("hdfs:///user/root/chekhov.txt")
file.take(2)
[u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov", u'']
lines = file.filter(lambda x: len(x) > 0)
lines.count()
VARIOUS ERRORS DISCUSSED BELOW

My first thought was that I could simply get away with including anaconda on the base AMI, point the path at /dir/anaconda/bin, and bake a new one. Doing so resulted in some strange py4j errors like the following: Py4JError: An error occurred while calling o17.partitions. Trace: py4j.Py4JException: Method partitions([]) does not exist At some point I also saw: SystemError: Objects/cellobject.c:24: bad argument to internal function which is really strange, possibly the result of a version mismatch? I had another thought of building spark from master on the AMI, leaving the spark directory in place, and removing the spark call from the modules list in the spark-ec2 launch script. Unfortunately, this resulted in the following errors: https://gist.github.com/quasiben/da0f4778fbc87d02c088 If a spark dev was willing to make some time in the near future, I'm sure she/he and I could sort out these issues and give the Spark community a python distro ready to go for numerical computing. For instance, I'm not sure how pyspark calls out to launching a python session on a slave? Is this done as root or as the hadoop user? (I believe I changed /etc/bashrc to point to my anaconda bin directory so it shouldn't really matter.) Is there something special about the py4j zip included in the spark dir compared with the py4j in PyPI? Thoughts? --Ben
Re: Spark Streaming Error Help - ERROR actor.OneForOneStrategy: key not found:
I think I have found my answers, but if anyone has thoughts please share. After testing for a while, I think the error doesn't have any effect on the process. I think it is the case that there must have been elements left in the window from the last run; otherwise my system is completely whack. Please let me know if any of this looks incorrect, thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Error-Help-ERROR-actor-OneForOneStrategy-key-not-found-tp8746p8750.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Spark logging strategy on YARN
Hi all, Could you please share your best practices for writing logs in Spark? I’m running it on YARN, so when I check the logs I’m a bit confused… Currently, I’m writing System.err.println to put a message in the log and access it via the YARN history server. But I don’t like this way… I’d like to use log4j/slf4j and write to a more specific place… any practices? Thank you in advance
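As a small illustration of the log4j/slf4j route mentioned above (class name and message are hypothetical; the output ends up in the driver/executor logs that YARN collects):

import org.slf4j.LoggerFactory

class DailyAggregationJob {
  private val log = LoggerFactory.getLogger(classOf[DailyAggregationJob])

  def run(): Unit = {
    log.info("starting daily aggregation")   // instead of System.err.println
  }
}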
Re: Anaconda Spark AMI
Hi Ben, Has the PYSPARK_PYTHON environment variable been set in spark/conf/spark-env.sh to the path of the new python binary? FYI, there's a /root/copy-dirs script that can be handy when updating files on an already-running cluster. You'll want to restart the spark cluster for the changes to take effect, as described at https://spark.apache.org/docs/latest/ec2-scripts.html Hope that helps, -Jey
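For example (a sketch, assuming Anaconda is installed at /root/anaconda on every node), spark/conf/spark-env.sh would contain a line like:

export PYSPARK_PYTHON=/root/anaconda/bin/python

followed by a cluster restart so the workers pick it up.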
Re: LIMIT with offset in SQL queries
Doing an offset is actually pretty expensive in a distributed query engine, so in many cases it probably makes sense to just collect and then perform the offset as you are doing now. This is unless the offset is very large. Another limitation here is that HiveQL does not support OFFSET. That said, if you want to open a JIRA we would consider implementing it. On Wed, Jul 2, 2014 at 1:37 PM, durin m...@simon-schaefer.net wrote: Hi, in many SQL DBMSs like MySQL, you can set an offset for the LIMIT clause, s.t. LIMIT 5, 10 will return 10 rows, starting from row 5. As far as I can see, this is not possible in Spark SQL. The best solution I have to imitate that (using Scala) is converting the RDD into an Array via collect() and then using a for-loop to return certain elements from that Array. Is there a better solution regarding performance, and are there plans to implement an offset for LIMIT? Kind regards, Simon -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LIMIT-with-offset-in-SQL-queries-tp8673.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
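A sketch of the collect-then-offset approach described above (table and ordering column are hypothetical):

val rows = sqlContext.sql("SELECT * FROM events ORDER BY id LIMIT 15").collect()
val page = rows.drop(5).take(10)   // emulates LIMIT 5, 10: skip 5 rows, keep the next 10

Note the ORDER BY: without a defined ordering, an offset is not well-defined in a distributed engine anyway.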
Re: Which version of Hive support Spark Shark
Spark SQL is based on Hive 0.12.0. On Thu, Jul 3, 2014 at 2:29 AM, Ravi Prasad raviprasa...@gmail.com wrote: Hi, Can anyone please help me to understand which version of Hive supports Spark and Shark -- -- Regards, RAVI PRASAD. T
Re: Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster
Hi Jack, 1. Several previous instances of the "key not valid?" error have been attributed to memory issues, either memory allocated per executor or per task, depending on the context. You can google it to see some examples. 2. I think your case is similar, even though it's happening due to broadcast. I suspect specifically this line: 14/07/02 18:20:09 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0 after the driver commanded a shutdown. It's happening only for TorrentBroadcast, because HttpBroadcast does not store intermediate chunks of a broadcast in memory. 3. You might want to allocate more memory to Spark executors to take advantage of in-memory processing. ~300MB caching space per machine is likely to be too small for most jobs. 4. Another common cause of disconnection is the spark.akka.frameSize parameter. You can try playing with it. While you don't have enough memory to crank it up much, you can try moving it up and down within reason. 5. There is one more curious line in your trace: 14/07/02 18:20:06 INFO BlockManager: Removing broadcast 0 Nothing should've been there to remove in the first place. 6. Finally, we found in our benchmarks that using TorrentBroadcast in smaller clusters (<10 nodes) with small data sizes (<10MB) has no benefit over HttpBroadcast, and is often worse. I'd suggest sticking to HttpBroadcast unless you have a gigantic broadcast (>=1GB) or too many nodes (many 10s or 100s). Hope it helps, Mosharaf -- Mosharaf Chowdhury http://www.mosharaf.com/ On Thu, Jul 3, 2014 at 7:48 AM, jackxucs jackx...@gmail.com wrote: Hello, I am running the BroadcastTest example in a standalone cluster using spark-submit. I have 8 host machines and made Host1 the master. Host2 to Host8 act as 7 workers to connect to the master. The connection was fine as I could see all 7 hosts on the master web UI. The BroadcastTest example with Http broadcast also works fine, I think, as there was no error msg and all workers EXITED at the end. But when I changed the third argument from Http to Torrent to use Torrent broadcast, all workers got a KILLED status once they reached sc.stop(). Below is the stderr on one of the workers when running Torrent broadcast (I masked the IP addresses): == 14/07/02 18:20:03 INFO SecurityManager: Changing view acls to: root 14/07/02 18:20:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root) 14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started 14/07/02 18:20:04 INFO Remoting: Starting remoting 14/07/02 18:20:04 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@dyn-xxx-xx-xx-xx:37771] 14/07/02 18:20:04 INFO Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@dyn-xxx-xx-xx-xx:37771] 14/07/02 18:20:04 INFO SecurityManager: Changing view acls to: root 14/07/02 18:20:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root) 14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 
14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started 14/07/02 18:20:04 INFO Remoting: Starting remoting 14/07/02 18:20:04 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@dyn-xxx-xx-xx-xx:53661] 14/07/02 18:20:04 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@dyn-xxx-xx-xx-xx:53661] 14/07/02 18:20:04 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@dyn-xxx-xx-xx-xx:42436/user/CoarseGrainedScheduler 14/07/02 18:20:04 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@dyn-xxx-xx-xx-xx:32818/user/Worker 14/07/02 18:20:04 INFO Remoting: Remoting shut down 14/07/02 18:20:04 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 14/07/02 18:20:04 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@dyn-xxx-xx-xx-xx:32818/user/Worker 14/07/02 18:20:04 INFO CoarseGrainedExecutorBackend: Successfully registered with driver 14/07/02 18:20:04 INFO SecurityManager: Changing view acls to: root 14/07/02 18:20:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root) 14/07/02 18:20:04 INFO Slf4jLogger: Slf4jLogger started 14/07/02 18:20:04 INFO Remoting: Starting remoting 14/07/02 18:20:05 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@dyn-xxx-xx-xx-xx:57883] 14/07/02 18:20:05 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@dyn-xxx-xx-xx-xx:57883] 14/07/02 18:20:05 INFO SparkEnv: Connecting to MapOutputTracker:
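Following up on Mosharaf's points 4 and 6 above, a sketch of forcing HTTP broadcast and adjusting the frame size from the application (property names as of Spark 0.9/1.0; the frame size value is just an example to tune):

val conf = new SparkConf()
  .setAppName("BroadcastTest")
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
  .set("spark.akka.frameSize", "20")   // in MB; adjust within available memory
val sc = new SparkContext(conf)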
Sample datasets for MLlib and Graphx
Hello! I want to play around with several different cluster settings and measure performance for MLlib and GraphX, and was wondering if anybody here could hit me up with datasets for these applications from 5GB onwards? I'm mostly interested in SVM and Triangle Count, but would be glad for any help. Best regards, Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Sample datasets for MLlib and Graphx
Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions For SVM there are a couple of ad click prediction datasets of pretty large size. For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/ — Sent from Mailbox On Thu, Jul 3, 2014 at 3:25 PM, AlexanderRiggers alexander.rigg...@gmail.com wrote: Hello! I want to play around with several different cluster settings and measure performance for MLlib and GraphX, and was wondering if anybody here could hit me up with datasets for these applications from 5GB onwards? I'm mostly interested in SVM and Triangle Count, but would be glad for any help. Best regards, Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Sample datasets for MLlib and Graphx
Nick Pentreath wrote Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions I was looking for files in LIBSVM format and never found anything of a larger size on Kaggle. Most competitions I've seen need data processing and feature generation, but maybe I have to take a second look. Nick Pentreath wrote For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/ Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760p8762.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Sample datasets for MLlib and Graphx
The Kaggle data is not in libsvm format so you'd have to do some transformation. The Criteo and KDD cup datasets are if I recall fairly large. Criteo ad prediction data is around 2-3GB compressed I think. To my knowledge these are the largest binary classification datasets I've come across which are easily publicly available (very happy to be proved wrong about this though :) — Sent from Mailbox On Thu, Jul 3, 2014 at 4:39 PM, AlexanderRiggers alexander.rigg...@gmail.com wrote: Nick Pentreath wrote Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions I was looking for files in LIBSVM format and never found something on Kaggle in bigger size. Most competitions I ve seen need data processing and feature generating, but maybe I ve to take a second look. Nick Pentreath wrote For graph stuff the SNAP has large network data: https://snap.stanford.edu/data/ Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760p8762.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Case class in java
This will load the listed jars when the SparkContext is created. In the case of the REPL, we define and import classes after the SparkContext is created. According to the above-mentioned site, the Executor installs a class loader in its 'addReplClassLoaderIfNeeded' method using the spark.repl.class.uri configuration. So I will try to make a class server that distributes *dynamically created classes* in my driver application to the Executors, as the Spark REPL does. Thanks, Kevin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Case-class-in-java-tp8720p8765.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Kafka - streaming from multiple topics
Sergey, On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov sma...@collective.com wrote: On the other hand, under the hood the KafkaInputDStream that this KafkaUtils call creates invokes ConsumerConnector.createMessageStreams, which returns a Map[String, List[KafkaStream]] keyed by topic. It is, however, not exposed. I wonder if this is a bug. After all, KafkaUtils.createStream() returns a DStream[(String, String)], which pretty much looks like it should be a (topic -> message) mapping. However, for me, the key is always null. Maybe you could consider filing a bug/wishlist report? Tobias
[no subject]
Folks, I have a program derived from the Kafka streaming wordcount example which works fine standalone. Running on Mesos is not working so well. For starters, I get the error below No FileSystem for scheme: hdfs. I've looked at lots of promising comments on this issue so now I have - * Every jar under hadoop in my classpath * Hadoop HDFS and Client in my pom.xml I find it odd that the app writes checkpoint files to HDFS successfully for a couple of cycles then throws this exception. This would suggest the problem is not with the syntax of the hdfs URL, for example. Any thoughts on what I'm missing? Thanks, Steve Mesos : 0.18.2 Spark : 0.9.1 14/07/03 21:14:20 WARN TaskSetManager: Lost TID 296 (task 1514.0:0) 14/07/03 21:14:20 WARN TaskSetManager: Lost TID 297 (task 1514.0:1) 14/07/03 21:14:20 WARN TaskSetManager: Lost TID 298 (task 1514.0:0) 14/07/03 21:14:20 ERROR TaskSetManager: Task 1514.0:0 failed 10 times; aborting job 14/07/03 21:14:20 ERROR JobScheduler: Error running job streaming job 140443646 ms.0 org.apache.spark.SparkException: Job aborted: Task 1514.0:0 failed 10 times (most recent failure: Exception failure: java.io.IOException: No FileSystem for scheme: hdfs) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
No FileSystem for scheme: hdfs
...and a real subject line.
Re: No FileSystem for scheme: hdfs
Are the hadoop configuration files on the classpath for your mesos executors?
RE: No FileSystem for scheme: hdfs
They weren't. They are now and the logs look a bit better - like perhaps some serialization is completing that wasn't before. But I still get the same error periodically. Other thoughts? From: Soren Macbeth [so...@yieldbot.com] Sent: Thursday, July 03, 2014 9:54 PM To: user@spark.apache.org Subject: Re: No FileSystem for scheme: hdfs Are the hadoop configuration files on the classpath for your mesos executors?
Re: graphx Joining two VertexPartitions with different indexes is slow.
A common reason for the Joining ... is slow message is that you're joining VertexRDDs without having cached them first. This will cause Spark to recompute unnecessarily, and as a side effect, the same index will get created twice and GraphX won't be able to do an efficient zip join. For example, the following code will counterintuitively produce the Joining ... is slow message: val a = VertexRDD(sc.parallelize((1 to 100).map(x => (x.toLong, x)))) a.leftJoin(a) { (id, a, b) => a + b.getOrElse(0) } The remedy is to call a.cache() before a.leftJoin(a). Ankur http://www.ankurdave.com/
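The cached form of the example, per the remedy above:

val a = VertexRDD(sc.parallelize((1 to 100).map(x => (x.toLong, x)))).cache()
a.leftJoin(a) { (id, a, b) => a + b.getOrElse(0) }   // with a cached, the join can reuse the index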
SparkSQL with Streaming RDD
Would appreciate help on: 1. How to convert a streaming RDD into a JavaSchemaRDD 2. How to structure the driver program to do interactive SparkSQL I am using Spark 1.0 with Java. I have streaming code that does updateStateByKey, resulting in a JavaPairDStream. I am using JavaDStream::compute(time) to get a JavaRDD. However I am getting the runtime exception: ERROR at runtime: org.apache.spark.streaming.dstream.StateDStream@18dc1b2 has not been initialized I know the code is executed before the stream is initialized. Does anyone have suggestions on how to design the code to accommodate async processing? Code fragment:

// Spark SQL for the N-second interval
SparkConf sparkConf = new SparkConf().setMaster(SPARK_MASTER).setAppName("SQLStream");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
final JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(ctx);

// convert JavaPairDStream to JavaDStream
JavaDStream<Tuple2<String, TestConnection.DiscoveryRecord>> javaDStream = statefullStream.toJavaDStream();

// convert Tuple2<K,U> to U
JavaDStream<DiscoveryRecord> javaRDD = javaDStream.map(
  new Function<Tuple2<String, TestConnection.DiscoveryRecord>, DiscoveryRecord>() {
    public DiscoveryRecord call(Tuple2<String, TestConnection.DiscoveryRecord> eachT) {
      return eachT._2;
    }
  }
);

// convert JavaDStream to JavaRDD
// ERROR next line at runtime: org.apache.spark.streaming.dstream.StateDStream@18dc1b2 has not been initialized
JavaRDD<DiscoveryRecord> computedJavaRDD = javaRDD.compute(new Time(10));
JavaSchemaRDD schemaStatefull = sqlCtx.applySchema(computedJavaRDD, DiscoveryRecord.class);
schemaStatefull.registerAsTable("statefull");

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-with-Streaming-RDD-tp8774.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
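A commonly used structure for this kind of job (a sketch in Scala, not from this thread; the case class, field names, and stream are hypothetical) is to do the conversion inside foreachRDD, so it only runs once the stream has started rather than calling compute() at driver setup time:

case class DiscoveryRecord(host: String, bytes: Long)

import sqlContext.createSchemaRDD          // implicit RDD[Product] -> SchemaRDD in Spark 1.0

stream.foreachRDD { rdd =>                 // rdd: RDD[DiscoveryRecord] for this batch interval
  rdd.registerAsTable("statefull")
  val agg = sqlContext.sql("SELECT host, SUM(bytes) FROM statefull GROUP BY host")
  agg.collect().foreach(println)
}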
Re: No FileSystem for scheme: hdfs
Most likely you are missing the hadoop configuration files (present in conf/*.xml). Thanks Best Regards On Fri, Jul 4, 2014 at 7:38 AM, Steven Cox s...@renci.org wrote: They weren't. They are now and the logs look a bit better - like perhaps some serialization is completing that wasn't before. But I still get the same error periodically. Other thoughts?
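A related cause that is often reported for this error (not raised in this thread, so treat it as an assumption) is an application fat jar whose META-INF/services/org.apache.hadoop.fs.FileSystem entries were overwritten during shading, so the hdfs scheme never gets registered. One workaround is to declare the implementation explicitly in the Hadoop configuration visible to the executors, e.g. in core-site.xml:

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>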