Apache Spark standalone mode: number of cores
I'm trying to understand the basics of Spark internals. The Spark documentation for submitting applications in local mode says, for the spark-submit --master setting:

    local[K]   Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
    local[*]   Run Spark locally with as many worker threads as logical cores on your machine.

Since all the data is stored on a single local machine, it does not benefit from distributed operations on RDDs. How does it benefit, then, and what is going on internally when Spark utilizes several logical cores?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-standalone-mode-number-of-cores-tp21342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
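A minimal sketch of what local[K] buys you (assumes Spark 1.x on the classpath; the numbers and names are illustrative, not from the original thread). Even with all data on one machine, Spark still splits an RDD into partitions and schedules one task per partition, running up to K of them concurrently, one thread per core:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalCoresSketch {
  def main(args: Array[String]): Unit = {
    // local[*] -> as many worker threads as logical cores
    val conf = new SparkConf().setMaster("local[*]").setAppName("local-cores")
    val sc = new SparkContext(conf)

    // 8 partitions -> 8 tasks; with local[4] Spark would run 4 at a time.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

    // Each partition is mapped and reduced in its own task/thread, then
    // the per-partition results are combined by the driver.
    println(rdd.map(_.toLong).reduce(_ + _))
    sc.stop()
  }
}
```

So the benefit in local mode is not distribution across machines but task-level parallelism across cores: the same partition/task machinery used on a cluster is driven by a thread pool of size K.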
Creating Apache Spark-powered “As Service” applications
The question is about ways to create a Windows desktop-based and/or web-based client application that can connect and talk, at run time, to a server running a Spark application (either local or on-premise cloud distributions). Any language/architecture may work. So far, I've seen two things that may help, but I'm not sure yet whether they are the best alternative or how they work:

1) Spark Job Server - https://github.com/spark-jobserver/spark-jobserver - defines a REST API for Spark
2) Hue - http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ - uses item 1)

Any advice would be appreciated. A simple toy example program (or steps) showing, e.g., how to build such a client that simply creates a SparkContext on a local machine, reads a text file, and returns basic stats would be the ideal answer!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Creating-Apache-Spark-powered-As-Service-applications-tp21193.html
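A hedged sketch of what such a service-side job could look like with Spark Job Server (trait names per spark-jobserver 0.5.x; check the project README for the exact signatures in your version — the job name, config key, and port below are assumptions for illustration):

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

object StatsJob extends SparkJob {
  // Called before runJob; here we accept any config.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  // Reads a text file and returns basic stats; the job server serializes
  // the returned value back to the HTTP client as JSON.
  override def runJob(sc: SparkContext, config: Config): Any = {
    val lines = sc.textFile(config.getString("input.path"))
    Map("lines" -> lines.count(), "chars" -> lines.map(_.length.toLong).sum())
  }
}
```

Any client (Windows desktop, web, curl) then talks plain HTTP, e.g. POST the jar to /jars/stats once, then POST "input.path = /tmp/some.txt" to /jobs?appName=stats&classPath=StatsJob on port 8090 and read the JSON result — no Spark dependency on the client side.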
Specifying AMI when using Spark EC-2 scripts
Hi there,

Is there a way to specify an AWS AMI with a particular OS (say, Ubuntu) when launching Spark on the Amazon cloud with the provided scripts? What are the default AMI and operating system launched by the EC2 script?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Specifying-AMI-when-using-Spark-EC-2-scripts-tp21658.html
Submitting jobs on Spark EC2 cluster: class not found, even if it's on CLASSPATH
Hi there,

I'm trying out Spark Job Server (REST) to submit jobs to a Spark cluster. I believe my problem is unrelated to this specific software and is otherwise the generic issue of missing jars on paths.

Every application implements the SparkJob trait:

    object LongPiJob extends SparkJob { ...

The SparkJob trait is available through the jar file built by the Spark Job Server Scala application. When I run all this with a local Spark cluster, everything works fine after I add the export line to spark-env.sh:

    export SPARK_CLASSPATH=$SPARK_HOME/job-server/spark-job-server.jar

However, when I do the same on a Spark cluster on EC2, I get the error:

    java.lang.NoClassDefFoundError: spark/jobserver/SparkJob

I've added the path in spark-env.sh (on the remote Spark master Amazon machine):

    export MASTER=`cat /root/spark-ec2/cluster-url`
    export SPARK_CLASSPATH=/root/spark/job-server/spark-job-server.jar
    export SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:/root/ephemeral-hdfs/lib/native/
    export SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/ephemeral-hdfs/conf/

Also, when I run ./bin/compute-classpath.sh, I can see the required jar, the one defining the missing class, in first place:

    bin]$ ./compute-classpath.sh
    Spark assembly has been built with Hive, including Datanucleus jars on classpath
    /root/spark/job-server/spark-job-server.jar:/root/spark/job-server/spark-job-server.jar::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar

What am I missing? I'd greatly appreciate your help.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-on-Spark-EC2-cluster-class-not-found-even-if-it-s-on-CLASSPATH-tp21864.html
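One thing worth checking (an assumption about the cause, not a confirmed fix): spark-env.sh edits on the master alone do not reach the executor JVMs that the workers launch, and a classpath entry is only useful if the jar exists at that same path on every node. A hedged sketch of an alternative that ships the jar explicitly (copy-dir is the helper that spark-ec2 installs under /root/spark-ec2; the flags are standard spark-submit options in Spark 1.x):

```sh
# Sync the job-server directory to all slaves so the path exists everywhere.
/root/spark-ec2/copy-dir /root/spark/job-server

# Reference the jar explicitly for both the driver and the executors,
# instead of relying on SPARK_CLASSPATH.
./bin/spark-submit \
  --jars /root/spark/job-server/spark-job-server.jar \
  --driver-class-path /root/spark/job-server/spark-job-server.jar \
  ...
```

This is a configuration fragment, not a complete command; the trailing options depend on the application being submitted.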
Submitting jobs to Spark EC2 cluster remotely
I've set up an EC2 cluster with Spark. Everything works: all masters/slaves are up and running. I'm trying to submit a sample job (SparkPi). When I ssh to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for `--deploy-mode`:

**`--deploy-mode=client`:**

From my laptop:

    ./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

This results in the following indefinite warnings/errors:

    WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
    15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
    15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1

...and failed drivers: in the Spark Web UI, entries appear under "Completed Drivers" with State=ERROR. I've tried to pass limits for cores and memory to the submit script, but it didn't help...

**`--deploy-mode=cluster`:**

From my laptop:

    ./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

The result is:

    Driver successfully submitted as driver-20150223023734-0007
    ... waiting before polling master for driver state
    ... polling master for driver state
    State of driver-20150223023734-0007 is ERROR
    Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
    java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
        at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
        at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)

So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-to-Spark-EC2-cluster-remotely-tp21762.html
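A likely reading of the cluster-mode failure (an interpretation of the stack trace, not from the original thread): DriverRunner.downloadUserJar runs on an EC2 worker, which resolves the jar path locally, so a path that only exists on the laptop (file:/home/oleg/...) fails with FileNotFoundException. A hedged fix is to stage the jar somewhere every node can read and submit that URL instead (the bucket name below is hypothetical):

```sh
# Upload the application jar to a location reachable from all cluster nodes
# (S3 here; HDFS on the cluster would work the same way), then submit the URL.
./bin/spark-submit \
  --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
  --deploy-mode cluster \
  --class SparkPi \
  s3n://my-bucket/ec2test_2.10-0.0.1.jar
```

The client-mode symptoms ("Initial job has not accepted any resources" plus removed executors) are consistent with a different problem: executors must connect back to the driver on the laptop, which typically fails across NAT/firewalls unless the driver host and its ports are reachable from the EC2 workers.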
Launching Spark cluster on EC2 with Ubuntu AMI
I'm trying to launch a Spark cluster on AWS EC2 with a custom AMI (Ubuntu) using the following:

    ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem' --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2 --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch spark-ubuntu-cluster

Everything starts OK and the instances are launched:

    Found 1 master(s), 2 slaves
    Waiting for all instances in cluster to enter 'ssh-ready' state.
    Generating cluster's SSH key on master.

But then I get the following SSH errors until it stops retrying and quits:

    bash: git: command not found
    Connection to ***.us-west-2.compute.amazonaws.com closed.
    Error executing remote command, retrying after 30 seconds: Command '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o', 'UserKnownHostsFile=/dev/null', '-t', '-t', u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 git clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero exit status 127

I know that the Spark EC2 scripts are not guaranteed to work with custom AMIs, but still, it should work... Any advice would be greatly appreciated!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html
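The "bash: git: command not found" line points at the likely cause: spark-ec2 assumes the AMI already ships the tools it needs (the stock spark-ec2 AMIs are Amazon Linux based and include git), and the bootstrap step that clones mesos/spark-ec2 fails with exit status 127 (command not found) on a bare Ubuntu image. A hedged workaround is to bake the prerequisites into the custom AMI before launching (package names are the usual Ubuntu ones for that era, an assumption on my part):

```sh
# Run on the Ubuntu instance before creating the AMI image from it:
sudo apt-get update
sudo apt-get install -y git openjdk-7-jdk rsync
```

With git present, the remote `git clone https://github.com/mesos/spark-ec2.git -b v4` step should at least get past status 127, though other Amazon Linux assumptions in the setup scripts may still surface afterwards.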
Create DataFrame from textFile with unknown columns
Assuming there is a text file with an unknown number of columns, how would one create a DataFrame? I have followed the example in the Spark docs where one first creates an RDD of Rows, but it seems that you have to know the exact number of columns in the file and can't just do this:

    val rowRDD = sc.textFile("path/file").map(_.split(" |\\,")).map(org.apache.spark.sql.Row(_))

The above would work if I did ...Row(_(0), _(1), ...), but the number of columns is unknown. Also, given that one has an RDD[Row], why is .toDF() not defined on this RDD type? Is calling the .createDataFrame(...) method the only way to create a DataFrame out of an RDD[Row]?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Create-DataFrame-from-textFile-with-unknown-columns-tp22386.html
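A sketch of one way to do this (Spark 1.3+ API assumed; the generated column names C0, C1, ... are my own convention): build the schema at run time by inspecting one line, and use Row.fromSeq, which accepts a Seq of any length:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val raw = sc.textFile("path/file").map(_.split(" |\\,"))

// Peek at the first line to learn the column count.
val nCols = raw.first().length
val schema = StructType((0 until nCols).map(i => StructField(s"C$i", StringType)))

// Row.fromSeq builds a Row from a Seq of any length, so no Row(_(0), _(1), ...)
val rowRDD = raw.map(fields => Row.fromSeq(fields))
val df = sqlContext.createDataFrame(rowRDD, schema)
```

This also suggests why .toDF() is not defined on RDD[Row]: a Row carries no schema, so Spark cannot infer column names/types from it (unlike an RDD of case classes or tuples); createDataFrame with an explicit StructType is the intended path.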
Add row IDs column to data frame
What would be the most efficient, neat method to add a column with row IDs to a DataFrame? I can think of something like the code below, but it completes with errors (at line 3), and anyway it doesn't look like the best route possible:

    var dataDF = sc.textFile("path/file").toDF()
    val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
    dataDF = dataDF.withColumn("ID", rowDF("ID"))

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Add-row-IDs-column-to-data-frame-tp22385.html
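The line-3 failure is expected: withColumn cannot take a column from a different DataFrame. A hedged alternative sketch (Spark 1.3+ API assumed): RDD.zipWithIndex assigns a stable 0-based Long id per row in one pass, after which the DataFrame is rebuilt with one extra field:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Pair each row with its index, then append the index as a new field.
val withId = dataDF.rdd.zipWithIndex().map { case (row, id) =>
  Row.fromSeq(row.toSeq :+ id)
}

// Extend the old schema with the new Long column and rebuild the DataFrame.
val schema = StructType(dataDF.schema.fields :+ StructField("ID", LongType))
val dataWithId = sqlContext.createDataFrame(withId, schema)
```

Compared with the parallelize-and-count approach, this avoids a separate count() pass and guarantees the id actually lines up with each row rather than relying on two RDDs happening to zip in the same order.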
Re: Add row IDs column to data frame
A more generic version of the question below: is it possible to append a column to an existing DataFrame at all? I understand that this is not an easy task in a Spark environment, but is there any workaround?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Add-row-IDs-column-to-data-frame-tp22385p22427.html
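A hedged summary of the usual workarounds (Spark 1.3+ API; the column names "value" and "key" and the second DataFrame otherDF are hypothetical): withColumn works whenever the new column is an expression over the same DataFrame's columns; gluing on a genuinely independent column requires a join on a shared key.

```scala
import org.apache.spark.sql.functions.{col, lit}

// Case 1: new column derived from existing columns of the same DataFrame.
val derived = dataDF.withColumn("valuePlusOne", col("value") + lit(1))

// Case 2: column coming from another DataFrame - needs a join key,
// because two DataFrames have no inherent row-by-row alignment.
val joined = dataDF.join(otherDF, dataDF("key") === otherDF("key"))
```

The underlying reason appending is awkward: a DataFrame is a distributed, unordered collection, so "the same row" in two DataFrames is only well-defined through a key (or an explicitly assigned index, as in the zipWithIndex approach).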