Re: Add row IDs column to data frame

2015-04-08 Thread olegshirokikh
A more generic version of the question below: is it possible to append a column to an existing DataFrame at all? I understand that this is not an easy task in the Spark environment, but is there any workaround?
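
A sketch of one possible answer, assuming the Spark 1.3 DataFrame API: DataFrames are immutable, so "appending" really means deriving a new DataFrame, and withColumn covers the common case of a column computed from existing ones. The names df, value, label, and doubled below are hypothetical.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // DataFrames are immutable: withColumn returns a *new* DataFrame
    // whose extra column is computed from the existing columns.
    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("value", "label")
    val extended = df.withColumn("doubled", df("value") * 2)

A column coming from an unrelated source cannot be attached this way; that generally requires a join on a shared key, or dropping to the underlying RDD as in the row-ID thread below.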

Create DataFrame from textFile with unknown columns

2015-04-05 Thread olegshirokikh
Assuming there is a text file with an unknown number of columns, how would one create a data frame? I have followed the example in the Spark docs where one first creates an RDD of Rows, but it seems that you have to know the exact number of columns in the file and can't just do this: val rowRDD =
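
A sketch of the usual workaround, following the "programmatically specifying the schema" pattern from the Spark docs: read the first line to discover the column count at run time, build a StructType from it, and only then create the Rows. The path and the comma delimiter are assumptions, and ragged lines would need handling in real code.

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val sqlContext = new SQLContext(sc)
    val lines = sc.textFile("data.txt")   // hypothetical path

    // Derive the number of columns from the first line instead of hard-coding it.
    val numCols = lines.first().split(",").length
    val schema = StructType((1 to numCols).map(i => StructField(s"col$i", StringType, nullable = true)))

    val rowRDD = lines.map(_.split(",")).map(fields => Row.fromSeq(fields))
    val df = sqlContext.createDataFrame(rowRDD, schema)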

Add row IDs column to data frame

2015-04-05 Thread olegshirokikh
What would be the most efficient, neat method to add a column with row IDs to a DataFrame? I can think of something like the code below, but it fails with errors (at line 3) and in any case doesn't look like the best possible route: var dataDF = sc.textFile(path/file).toDF() val rowDF = sc.parallelize(1 to
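
One workaround, sketched under the assumption that Spark 1.3's createDataFrame is available: zip the underlying RDD with an index, append the index to each Row, and rebuild the DataFrame with an extended schema. df below is a stand-in for the real data.

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val df = sc.parallelize(Seq("a", "b", "c")).toDF("value")   // stand-in data

    // zipWithIndex runs a small Spark job to size the partitions, then
    // assigns consecutive Long ids without collecting anything to the driver.
    val withIds = df.rdd.zipWithIndex.map { case (row, id) => Row.fromSeq(row.toSeq :+ id) }
    val newSchema = StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
    val dfWithIds = sqlContext.createDataFrame(withIds, newSchema)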

Submitting jobs on Spark EC2 cluster: class not found, even if it's on CLASSPATH

2015-03-01 Thread olegshirokikh
Hi there, I'm trying out Spark Job Server (REST) to submit jobs to a Spark cluster. I believe my problem is unrelated to this specific software and is rather a generic issue with missing jars on the classpath. Every application implements the SparkJob trait: object LongPiJob extends
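
For context, here is the rough shape of such a job, sketched against the spark-jobserver API of that era (the package names and the SparkJobValid result are assumptions about that version). The class-not-found symptom usually means the jar uploaded to the server does not bundle everything the job references, so the fix is packaging rather than code.

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

    // Sketch of a job-server job; the usual fix for class-not-found is to
    // submit a fat/assembly jar bundling every class the job touches.
    object LongPiJob extends SparkJob {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
      override def runJob(sc: SparkContext, config: Config): Any =
        sc.parallelize(1 to 1000).map(_ => math.random).mean()   // toy workload
    }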

Submitting jobs to Spark EC2 cluster remotely

2015-02-22 Thread olegshirokikh
I've set up an EC2 cluster with Spark. Everything works; all masters/slaves are up and running. I'm trying to submit a sample job (SparkPi). When I SSH to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've
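
A hedged sketch of the usual suspect: with the driver on a laptop, the EC2 workers must be able to connect back to it, so the driver's externally visible address and port have to be pinned and opened in the security group. Host names and ports below are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SparkPi")
      .setMaster("spark://ec2-master-public-dns:7077")   // placeholder master URL
      .set("spark.driver.host", "laptop-public-ip")      // address workers dial back to
      .set("spark.driver.port", "50001")                 // pin the port so it can be opened
    val sc = new SparkContext(conf)

If opening the laptop to inbound traffic is not an option, running the driver inside the cluster (spark-submit with --deploy-mode cluster, or submitting from the master as described above) sidesteps the problem.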

Launching Spark cluster on EC2 with Ubuntu AMI

2015-02-22 Thread olegshirokikh
I'm trying to launch a Spark cluster on AWS EC2 with a custom AMI (Ubuntu) using the following: ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem' --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2 --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch

Specifying AMI when using Spark EC-2 scripts

2015-02-15 Thread olegshirokikh
Hi there, is there a way to specify an AWS AMI with a particular OS (say, Ubuntu) when launching Spark on the Amazon cloud with the provided scripts? What are the default AMI and operating system launched by the EC2 script? Thanks

Apache Spark standalone mode: number of cores

2015-01-23 Thread olegshirokikh
I'm trying to understand the basics of Spark internals. The Spark documentation for submitting applications in local mode says, for the spark-submit --master setting: local[K] "Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine)"; local[*] "Run Spark locally
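
A small illustration of what the two settings amount to (app name below is arbitrary): local[K] starts one JVM with K worker threads, and local[*] simply substitutes the number of logical cores the JVM reports.

    import org.apache.spark.{SparkConf, SparkContext}

    // local[*] resolves to the JVM's logical core count, i.e. the same
    // number you would get by asking the Runtime yourself.
    val cores = Runtime.getRuntime.availableProcessors()
    val conf = new SparkConf()
      .setAppName("local-mode-demo")
      .setMaster(s"local[$cores]")   // effectively the same as "local[*]"
    val sc = new SparkContext(conf)

Note that these are threads inside a single process, not separate executors, so "cores" here is about parallelism within one JVM.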

Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread olegshirokikh
The question is about ways to create a Windows desktop-based and/or web-based client application that can connect and talk at run time to a server hosting a Spark application (either local or on-premise cloud distributions). Any language/architecture may work. So far, I've seen
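
One common shape for this, sketched with no claim to being the thread's answer: a long-running server process owns a single SparkContext and exposes it to desktop or web clients over plain HTTP, which is essentially the pattern Spark Job Server packages up. The endpoint and toy workload below are hypothetical; the sketch uses the JDK's built-in HTTP server to stay dependency-free.

    import java.net.InetSocketAddress
    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkAsService {
      def main(args: Array[String]): Unit = {
        // One long-lived SparkContext shared by every request.
        val sc = new SparkContext(
          new SparkConf().setAppName("spark-as-service").setMaster("local[*]"))

        val server = HttpServer.create(new InetSocketAddress(8080), 0)
        server.createContext("/sum", new HttpHandler {
          override def handle(ex: HttpExchange): Unit = {
            val result = sc.parallelize(1L to 1000000L).reduce(_ + _).toString   // toy job
            val bytes = result.getBytes("UTF-8")
            ex.sendResponseHeaders(200, bytes.length)
            ex.getResponseBody.write(bytes)
            ex.getResponseBody.close()
          }
        })
        server.start()   // any client that speaks HTTP can now drive Spark jobs
      }
    }

In practice an existing layer such as the Spark Job Server mentioned above saves the plumbing, but the architecture is the same: the SparkContext outlives any individual client request.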