Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
I'm having trouble with that for pyspark, yarn and graphframes. I'm using: pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5 which starts and gives me a REPL, but when I try from graphframes import * I get No module named graphframes. Without '--master yarn' it

Re: installing packages with pyspark

2016-03-19 Thread Franc Carter
graphframes Python code when it is loaded as > a Spark package. > > To work around this, I extract the graphframes Python directory locally > where I run pyspark into a directory called graphframes. > > On Thu, Mar 17, 2016 at 10:11 PM -0700, "Franc Carte
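A minimal sketch of that workaround in the pyspark shell, assuming the graphframes Python sources from the package jar have been extracted into a local directory called graphframes (the paths and the GraphFrame import are illustrative):

    import sys, os

    # The --packages jar also bundles the Python sources; with them extracted
    # into ./graphframes, having the current directory on sys.path lets the
    # driver-side import succeed.
    sys.path.insert(0, os.getcwd())  # usually already the case in the REPL
    from graphframes import GraphFrame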

Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
Hi, I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am interested in how it might be best to scale it - e.g. more cpus per instance, more memory per instance, more instances etc. I'm currently using 32 m3.xlarge instances for a training set with 2.5 million rows, 1300 c

Re: Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
> Do you extract only the stuff needed? What are the algorithm parameters? > > On 07 Jun 2016, at 13:09, Franc Carter wrote: > > Hi, > > I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and > am interested in how it might
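For reference, a hedged sketch of how such a model is typically set up with pyspark.ml on Spark 1.6; the column names and parameter values are illustrative rather than taken from the thread, but numTrees, maxDepth and maxBins are the knobs that most directly drive per-executor CPU and memory cost:

    from pyspark.ml.regression import RandomForestRegressor

    # Illustrative settings only: more trees scale CPU roughly linearly,
    # while deeper trees and more bins raise memory pressure on each executor.
    rf = RandomForestRegressor(featuresCol="features", labelCol="label",
                               numTrees=100, maxDepth=10, maxBins=32)
    model = rf.fit(training_df)  # training_df holds assembled feature vectors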

number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
Hi, I'm having trouble working out how to get the number of executors set when using sparkR.init(). If I start sparkR with sparkR --master yarn --num-executors 6 then I get 6 executors. However, if I start sparkR with sparkR followed by sc <- sparkR.init(master="yarn-client", sparkEnvir

Re: number of executors in sparkR.init()

2015-12-25 Thread Franc Carter
t with sparkR.init()? > > _________ > From: Franc Carter > Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: > > Hi, > > I'm having trouble working out how to get the number of execut

pyspark: conditionals inside functions

2016-01-08 Thread Franc Carter
Hi, I'm trying to write a short function that returns the last Sunday of the week of a given date, code below: def getSunday(day): day = day.cast("date") sun = next_day(day, "Sunday") n = datediff(sun,day) if (n == 7): return day else: return sun this g

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
My Python is not particularly good, so I'm afraid I don't understand what that means. cheers On 9 January 2016 at 14:45, Franc Carter wrote: > Hi, > > I'm trying to write a short function that returns the last Sunday of the > week of a given date, co

pyspark: calculating row deltas

2016-01-09 Thread Franc Carter
Hi, I have a DataFrame with the columns ID,Year,Value. I'd like to create a new Column that is Value2-Value1 where the corresponding Year2=Year-1. At the moment I am creating a new DataFrame with renamed columns and doing DF.join(DF2, . . . .). This looks cumbersome to me, is there abt
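A sketch of the rename-and-join approach described here, using the ID/Year/Value column names from the question (the elided join condition is illustrative; a later reply points at window functions as the cleaner route):

    from pyspark.sql.functions import col

    # Second copy with renamed columns so the self-join is unambiguous.
    DF2 = (DF.withColumnRenamed("ID", "PrevID")
             .withColumnRenamed("Year", "PrevYear")
             .withColumnRenamed("Value", "PrevValue"))
    deltas = (DF.join(DF2, (DF.ID == DF2.PrevID) & (DF2.PrevYear == DF.Year - 1))
                .select("ID", "Year",
                        (col("Value") - col("PrevValue")).alias("Delta")))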

Re: pyspark: conditionals inside functions

2016-01-09 Thread Franc Carter
Got it, I needed to use the when/otherwise construct - code below def getSunday(day): day = day.cast("date") sun = next_day(day, "Sunday") n = datediff(sun,day) x = when(n==7,day).otherwise(sun) return x On 10 January 2016 at 08:41, Franc Carter w
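A self-contained version of that when/otherwise solution, with the imports it needs (a sketch, assuming the Spark 1.5+ column functions shown):

    from pyspark.sql.functions import next_day, datediff, when

    def getSunday(day):
        # 'day' is a Column. next_day returns the first Sunday strictly after
        # the date, so a difference of 7 days means the input was already a
        # Sunday and should be returned unchanged.
        day = day.cast("date")
        sun = next_day(day, "Sunday")
        return when(datediff(sun, day) == 7, day).otherwise(sun)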

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
13 101 > 32014 102 > > What's your desired output? > > Femi > > On Sat, Jan 9, 2016 at 4:55 PM, Franc Carter > wrote: >> Hi, >> >> I have a DataFrame with the columns >> >> ID,Year,Value >> >> I'd

Re: pyspark: calculating row deltas

2016-01-10 Thread Franc Carter
Thanks, cheers On 10 January 2016 at 22:35, Blaž Šnuderl wrote: > This can be done using spark.sql and window functions. Take a look at > https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html > > On Sun, Jan 10, 2016 at 11:07 AM, Franc Car
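A hedged sketch of that window-function approach, again using the ID/Year/Value column names from the question:

    from pyspark.sql import Window
    from pyspark.sql.functions import lag, col

    # Previous year's Value within each ID, ordered by Year; the delta is the
    # current Value minus the lagged one. (On Spark 1.x, window functions may
    # require a HiveContext.)
    w = Window.partitionBy("ID").orderBy("Year")
    deltas = DF.withColumn("Delta", col("Value") - lag("Value", 1).over(w))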

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
I had problems doing this as well - I ended up using 'withColumn', it's not particularly graceful but it worked (1.5.2 on AWS EMR) cheers On 3 February 2016 at 22:06, Devesh Raj Singh wrote: > Hi, > > I am trying to create dummy variables in sparkR by creating new columns > for categorical vari

Re: sparkR not able to create /append new columns

2016-02-03 Thread Franc Carter
end the last added column (in the loop) will be the added column, like in > my code above. > > On Wed, Feb 3, 2016 at 5:05 PM, Franc Carter > wrote: >> I had problems doing this as well - I ended up using 'withColumn', it's >> not particularly g

filter by dict() key in pySpark

2016-02-21 Thread Franc Carter
I have a DataFrame that has a Python dict() as one of the columns. I'd like to filter the DataFrame for those Rows where the dict() contains a specific value. e.g. something like this: DF2 = DF1.filter('name' in DF1.params) but that gives me this error ValueError: Cannot convert column i

Re: filter by dict() key in pySpark

2016-02-24 Thread Franc Carter
A colleague found how to do this; the approach was to use a udf() cheers On 21 February 2016 at 22:41, Franc Carter wrote: > I have a DataFrame that has a Python dict() as one of the columns. I'd > like to filter the DataFrame for those Rows where the dict() contains a
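A minimal sketch of that udf() approach, assuming 'params' is a map-typed column and 'name' is the key being looked for:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    # True for rows whose dict/map column contains the key of interest.
    has_name = udf(lambda params: params is not None and "name" in params,
                   BooleanType())
    DF2 = DF1.filter(has_name(DF1.params))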

subscribe

2015-08-05 Thread Franc Carter
subscribe

SparkR csv without headers

2015-08-18 Thread Franc Carter
-- *Franc Carter* | Systems Architect | RoZetta Technology L4, 55 Harrington Street, THE ROCKS, NSW, 2000 PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA *T* +61 2 8355 2515

Re: SparkR csv without headers

2015-08-20 Thread Franc Carter
t; integer), …) > > read.df ( …, schema = schema) > > *From:* Franc Carter [mailto:franc.car...@rozettatech.com] > *Sent:* Wednesday, August 19, 2015 1:48 PM > *To:* user@spark.apache.org > *Subject:* SparkR csv without headers > > H

Re: No AMI for Spark 1.2 using ec2 scripts

2015-01-26 Thread Franc Carter
he Spark User List mailing list archive at Nabble.com. -- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA

Re: How to create spark AMI in AWS

2015-02-09 Thread Franc Carter
ootstrap script to create the Spark AMI? Is it here( > https://github.com/mesos/spark-ec2/blob/branch-1.3/create_image.sh) ? > 2. What is the base image of the Spark AMI? Eg, the base image of this ( > https://github.com/mesos/spark-ec2/blob/branch-1.3/ami-list/us-west-1/hvm) > 3. Shall I install s

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
-- *Franc Carter* | Systems Architect | Rozetta Technology franc.car...@rozettatech.com | www.rozettatechnology.com Tel: +61 2 8355 2515 Level 4, 55 Harrington St, The Rocks NSW 2000 PO Box H58, Australia Square, Sydney NSW 1215 AUSTRALIA

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
> > Happy hacking > > Chris > > From: Franc Carter > Date: Wednesday, 11 February 2015 10:03 > To: Paolo Platter > Cc: Mike Trienis , "user@spark.apache.org" < > user@spark.apache.org> > Subject: Re: Datastore HDFS vs Cassandra > > One a

Re: spark, reading from s3

2015-02-12 Thread Franc Carter
ffset in >>> response to >>> RequestTimeTooSkewed error. Local machine and S3 server disagree on the >>> time by approximately 0 seconds. Retrying connection. >>> >>> After that there are tons of 403/forbidden errors and then job fails. >>> It's s

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Franc Carter
> >>> it didn't help... > >>> > >>> **`--deploy-mode=cluster`:** > >>> > >>> From my laptop: > >>> > >>> ./bin/spark-submit --master > >>> spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:707

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread Franc Carter
> blocks: 48. Algorithm and capacity permitting, you've just massively > boosted your load time. Downstream, if data can be thinned down, then you > can start looking more at things you can do on a single host: a machine > that can be in your Hadoop cluster. Ask YARN nicely

Reading from a centralized store

2015-01-05 Thread Franc Carter
Hi, I'm trying to understand how a Spark cluster behaves when the data it is processing resides on a centralized/remote store (S3, Cassandra, DynamoDB, RDBMS, etc.). Does every node in the cluster retrieve all the data from the central store? thanks -- *Franc Carter* | Systems Arch

Re: Reading from a centralized store

2015-01-05 Thread Franc Carter
can run an rdbms on the same nodes as spark, but JdbcRDD > doesn't implement preferred locations. > > On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter > wrote: > >> >> Hi, >> >> I'm trying to understand how a Spark Cluster behaves when the data it i

Re: Reading from a centralized store

2015-01-06 Thread Franc Carter
r implement preferred > locations. You can run an rdbms on the same nodes as spark, but JdbcRDD > doesn't implement preferred locations. > > On Mon, Jan 5, 2015 at 6:25 PM, Franc Carter > wrote: > >> >> Hi, >> >> I'm trying to understand how a Spark C

Re: Reading from a centralized store

2015-01-06 Thread Franc Carter
15 at 6:59 AM, Cody Koeninger wrote: > No, most rdds partition input data appropriately. > > On Tue, Jan 6, 2015 at 1:41 PM, Franc Carter > wrote: > >> One more question, to clarify: will every node pull in all the data? >> >> thanks >> >&