Apache Spark standalone mode: number of cores

2015-01-23 Thread olegshirokikh
I'm trying to understand the basics of Spark internals. The Spark
documentation for submitting applications in local mode says, for the
spark-submit --master setting:

local[K] Run Spark locally with K worker threads (ideally, set this to the
number of cores on your machine).

local[*] Run Spark locally with as many worker threads as logical cores on
your machine.
Since all the data is stored on a single local machine, it would seem that it
cannot benefit from distributed operations on RDDs.

How does it benefit, then, and what is going on internally when Spark utilizes
several logical cores?
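
For concreteness, here is roughly the kind of thing I run (a minimal sketch;
the app name, thread count, and partition count are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // "local[4]" asks for 4 worker threads; "local[*]" uses as many threads as
  // the machine has logical cores.
  val conf = new SparkConf().setAppName("LocalCoresExample").setMaster("local[4]")
  val sc = new SparkContext(conf)

  // Even on one machine the RDD is split into partitions, and each worker
  // thread processes one partition (task) at a time.
  val n = sc.parallelize(1 to 1000000, 8).map(_ * 2).count()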



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-standalone-mode-number-of-cores-tp21342.html



Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread olegshirokikh
The question is about ways to create a Windows desktop-based and/or web-based
client application that can connect and talk, at runtime, to a server running
a Spark application (either local or an on-premise cloud distribution).

Any language/architecture may work. So far I've seen two things that might
help, but I'm not yet sure whether they would be the best alternative, or how
they work:

1. Spark Job Server - https://github.com/spark-jobserver/spark-jobserver -
   defines a REST API for Spark
2. Hue -
   http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/
   - uses item 1

Any advice would be appreciated. A simple toy example program (or steps)
showing, e.g., how to build such a client that simply creates a SparkContext
on a local machine, reads a text file, and returns basic stats would be the
ideal answer!
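
For concreteness, a rough sketch of the server-side piece I have in mind (the
file path and app name are placeholders); the missing part is the client/REST
layer around it:

  import org.apache.spark.{SparkConf, SparkContext}

  // Create a context, read a text file, and return basic stats. A service
  // such as Spark Job Server would wrap something like this behind a REST API.
  object TextStats {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("TextStats").setMaster("local[*]"))
      val lines = sc.textFile("data/sample.txt")
      val lineCount = lines.count()
      val wordCount = lines.flatMap(_.split("\\s+")).count()
      println(s"lines=$lineCount, words=$wordCount")
      sc.stop()
    }
  }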



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Creating-Apache-Spark-powered-As-Service-applications-tp21193.html



Specifying AMI when using Spark EC-2 scripts

2015-02-15 Thread olegshirokikh
Hi there,

Is there a way to specify an AWS AMI with a particular OS (say, Ubuntu) when
launching Spark on the Amazon cloud with the provided scripts?

What are the default AMI and operating system that the EC2 script launches?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Specifying-AMI-when-using-Spark-EC-2-scripts-tp21658.html



Submitting jobs on Spark EC2 cluster: class not found, even if it's on CLASSPATH

2015-03-01 Thread olegshirokikh
Hi there,

I'm trying out Spark Job Server (REST) to submit jobs to a Spark cluster. I
believe my problem is unrelated to this specific software and is rather a
generic issue with missing jars on the classpath. Every application
implements the SparkJob trait:

  object LongPiJob extends SparkJob {
    ...
  }
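
For reference, the full skeleton of such a job looks roughly like the sketch
below (based on the legacy spark-jobserver API; exact package names and
signatures may differ by version):

  import com.typesafe.config.Config
  import org.apache.spark.SparkContext
  import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

  object LongPiJob extends SparkJob {
    // Sanity-check the job configuration before running.
    override def validate(sc: SparkContext, config: Config): SparkJobValidation =
      SparkJobValid

    // The actual work; the return value goes back to the REST client.
    override def runJob(sc: SparkContext, config: Config): Any =
      sc.parallelize(1 to 1000).map(_ * 2).count()
  }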

The SparkJob trait is provided by the jar file built by the Spark Job Server
Scala application. When I run all this with a local Spark cluster, everything
works fine after I add the following export line to spark-env.sh:

  export SPARK_CLASSPATH=$SPARK_HOME/job-server/spark-job-server.jar

However, when I do the same on a Spark cluster on EC2, I get this error:

  java.lang.NoClassDefFoundError: spark/jobserver/SparkJob

I've added the path in spark-env.sh (on the remote Spark master Amazon machine):

  export MASTER=`cat /root/spark-ec2/cluster-url`
  export SPARK_CLASSPATH=/root/spark/job-server/spark-job-server.jar   # <- the newly added line
  export SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:/root/ephemeral-hdfs/lib/native/
  export SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/ephemeral-hdfs/conf/

Also, when I run ./bin/compute-classpath.sh, I can see the required jar, which
defines the missing class, in the first position on the classpath:

  bin]$ ./compute-classpath.sh
  Spark assembly has been built with Hive, including Datanucleus jars on classpath
  /root/spark/job-server/spark-job-server.jar:/root/spark/job-server/spark-job-server.jar::/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar


What am I missing? I'd greatly appreciate your help




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-on-Spark-EC2-cluster-class-not-found-even-if-it-s-on-CLASSPATH-tp21864.html



Submitting jobs to Spark EC2 cluster remotely

2015-02-22 Thread olegshirokikh
I've set up an EC2 cluster with Spark. Everything works; the master and all
slaves are up and running.

I'm trying to submit a sample job (SparkPi). When I ssh to the cluster and
submit it from there, everything works fine. However, when the driver is
created on a remote host (my laptop), it doesn't work. I've tried both modes
for --deploy-mode:

--deploy-mode=client:

From my laptop:

  ./bin/spark-submit \
    --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
    --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

This results in the following warnings/errors repeating indefinitely:

  WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
  your cluster UI to ensure that workers are registered and have sufficient
  memory

  15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove
  non-existent executor 0

  15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove
  non-existent executor 1

...and failed drivers appear in the Spark Web UI under Completed Drivers with
State=ERROR.

I've tried passing core and memory limits to the submit script, but it didn't
help...

--deploy-mode=cluster:

From my laptop:

  ./bin/spark-submit \
    --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
    --deploy-mode cluster \
    --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

The result is:

  Driver successfully submitted as driver-20150223023734-0007
  ... waiting before polling master for driver state
  ... polling master for driver state
  State of driver-20150223023734-0007 is ERROR
  Exception from cluster was: java.io.FileNotFoundException: File
  file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
  does not exist.
  java.io.FileNotFoundException: File
  file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
  does not exist.
      at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
      at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
      at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
      at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
      at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)

So, I'd appreciate any pointers on what is going wrong and some guidance on
how to deploy jobs from a remote client. Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-jobs-to-Spark-EC2-cluster-remotely-tp21762.html



Launching Spark cluster on EC2 with Ubuntu AMI

2015-02-22 Thread olegshirokikh
I'm trying to launch a Spark cluster on AWS EC2 with a custom AMI (Ubuntu)
using the following command:

  ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem' \
    --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2 \
    --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu \
    launch spark-ubuntu-cluster

Everything starts OK and instances are launched:

Found 1 master(s), 2 slaves
Waiting for all instances in cluster to enter 'ssh-ready' state.
Generating cluster's SSH key on master.

But then I get the following SSH errors until it stops retrying and quits:

  bash: git: command not found
  Connection to ***.us-west-2.compute.amazonaws.com closed.
  Error executing remote command, retrying after 30 seconds: Command '['ssh',
  '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o',
  'UserKnownHostsFile=/dev/null', '-t', '-t',
  u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 && git
  clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero exit
  status 127

I know that the Spark EC2 scripts are not guaranteed to work with custom AMIs,
but still, it seems like it should work... Any advice would be greatly
appreciated!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html



Create DataFrame from textFile with unknown columns

2015-04-05 Thread olegshirokikh
Assuming there is a text file with an unknown number of columns, how would one
create a DataFrame? I have followed the example in the Spark docs where one
first creates an RDD of Rows, but it seems that you have to know the exact
number of columns in the file and can't just do this:

  val rowRDD = sc.textFile("path/file")
    .map(_.split(" |\\,"))
    .map(p => org.apache.spark.sql.Row(p))

The above would work if I did ...Row(p(0), p(1), ...), but the number of
columns is unknown.

Also, assuming that one has an RDD[Row], why is .toDF() not defined on this
RDD type? Is calling the .createDataFrame(...) method the only way to create a
DataFrame out of an RDD[Row]?
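
To make the question concrete, the closest I can get is the sketch below
(assuming a SQLContext named sqlContext; the generated column names are made
up), which still requires building the schema by hand:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  val raw = sc.textFile("path/file").map(_.split(" |\\,"))
  // Inspect one record to find out how many columns there are.
  val numCols = raw.first().length
  // Build a schema with generated column names, all typed as strings.
  val schema = StructType(
    (0 until numCols).map(i => StructField(s"c$i", StringType, nullable = true)))
  // Row.fromSeq works for any number of columns.
  val rowRDD = raw.map(fields => Row.fromSeq(fields))
  val df = sqlContext.createDataFrame(rowRDD, schema)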

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Create-DataFrame-from-textFile-with-unknown-columns-tp22386.html



Add row IDs column to data frame

2015-04-05 Thread olegshirokikh
What would be the most efficient and neat method to add a column with row IDs
to a DataFrame?

I can think of something like the code below, but it fails with errors (at
line 3), and anyway it doesn't look like the best possible route:

  var dataDF = sc.textFile("path/file").toDF()
  val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
  dataDF = dataDF.withColumn("ID", rowDF("ID"))
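
For reference, a workaround along these lines (a sketch assuming a SQLContext
named sqlContext) seems possible, but it feels roundabout:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{LongType, StructField, StructType}

  // Pair every row with its index, then rebuild the DataFrame with an extra
  // ID field appended to the original schema.
  val indexed = dataDF.rdd.zipWithIndex().map {
    case (row, id) => Row.fromSeq(row.toSeq :+ id)
  }
  val schemaWithId =
    StructType(dataDF.schema.fields :+ StructField("ID", LongType, nullable = false))
  val dataWithIdDF = sqlContext.createDataFrame(indexed, schemaWithId)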

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Add-row-IDs-column-to-data-frame-tp22385.html



Re: Add row IDs column to data frame

2015-04-08 Thread olegshirokikh
A more generic version of the question below:

Is it possible to append a column to an existing DataFrame at all? I
understand that this is not an easy task in the Spark environment, but is
there any workaround?
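
The best I can come up with is a full rebuild along the lines of the sketch
below (reusing dataDF from my original message, plus a SQLContext named
sqlContext and a made-up column of values); is there anything more direct?

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

  // Hypothetical new column, one value per row of dataDF.
  val newValues = sc.parallelize(Seq(1.0, 2.0, 3.0))

  // Index both sides, join on the index, and rebuild the DataFrame with the
  // extra field appended to the original schema.
  val left = dataDF.rdd.zipWithIndex().map(_.swap)
  val right = newValues.zipWithIndex().map(_.swap)
  val joined = left.join(right).sortByKey().values.map {
    case (row, v) => Row.fromSeq(row.toSeq :+ v)
  }
  val schema =
    StructType(dataDF.schema.fields :+ StructField("newCol", DoubleType, nullable = false))
  val appendedDF = sqlContext.createDataFrame(joined, schema)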



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Add-row-IDs-column-to-data-frame-tp22385p22427.html