Re: NA value handling in sparkR

2016-01-26 Thread Deborah Siegel
…setosa, virginica would be created with 0 and 1 as values. On Mon, Jan 25, 2016 at 12:37 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote: > Maybe not ideal, but since read.df is inferring all columns from the csv containing "NA" as type string, one could filter…
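A minimal sketch of the filtering workaround described in the quote, assuming a Spark 1.6-era sqlContext, the spark-csv package, and an illustrative column name Sepal_Length:

    df <- read.df(sqlContext, "iris_na.csv",
                  source = "com.databricks.spark.csv", header = "true")
    # every column was inferred as string, so "NA" is just another string value;
    # drop those rows, then cast the survivors to the numeric type intended
    noNA <- filter(df, df$Sepal_Length != "NA")
    noNA$Sepal_Length <- cast(noNA$Sepal_Length, "double")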

Re: NA value handling in sparkR

2016-01-25 Thread Deborah Siegel
…you are right. > I think the problem is with the reading of csv files. read.df is not considering NAs in the CSV file. > So what would be a workable solution for dealing with NAs in csv files? On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail…
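As a hedged alternative to filtering after the fact, spark-csv also accepts a nullValue option, so the reader itself can turn the literal string "NA" into a real null (assumes the spark-csv package is on the classpath):

    df <- read.df(sqlContext, "data.csv",
                  source = "com.databricks.spark.csv",
                  header = "true", inferSchema = "true", nullValue = "NA")
    clean <- dropna(df)    # or fillna(df, value = 0) to impute instead of drop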

SparkR pca?

2015-09-18 Thread Deborah Siegel
Hi, Can PCA be implemented in a SparkR-MLlib integration? Perhaps these are 2 separate issues: 1) Having the methods in SparkRWrapper and RFormula which will send the right input types through the pipeline. MLlib PCA operates either on a RowMatrix or on the feature vector of an RDD[LabeledPoint]. The…
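Since SparkR exposed no PCA wrapper at the time of this thread, one hedged local workaround, viable only when the data fits in driver memory, is to collect and use base R:

    localDF <- collect(df)                        # df: a SparkR DataFrame
    pca <- prcomp(localDF[, 1:4], scale. = TRUE)  # assumes the first 4 columns are numeric
    summary(pca)                                  # variance explained per component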

Re: SparkR - can't create spark context - JVM not ready

2015-08-20 Thread Deborah Siegel
…` exists? The error message seems to indicate it is trying to pick up Spark from that location and can't find Spark installed there. Thanks, Shivaram. On Thu, Aug 20, 2015 at 3:30 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote: Hello, I have previously successfully run SparkR…
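A quick sanity check along the lines suggested here (the path is the one from the original post; substitute your own install location):

    Sys.setenv(SPARK_HOME = "~/software/spark-1.4.1-bin-hadoop2.4")
    file.exists(file.path(Sys.getenv("SPARK_HOME"), "bin", "spark-submit"))
    # FALSE means sparkR.init() cannot launch the backend JVM from that
    # location, which surfaces as the "JVM is not ready" error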

SparkR - can't create spark context - JVM not ready

2015-08-20 Thread Deborah Siegel
Hello, I have previously successfully run SparkR in RStudio, with:

    Sys.setenv(SPARK_HOME = "~/software/spark-1.4.1-bin-hadoop2.4")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    sc <- sparkR.init(master = "local[2]", appName = "SparkR-example")

Then I tried putting some…

Re: SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
I think I just answered my own question. The privatization of the RDD API might have resulted in my error, because this worked: randomMatBr <- SparkR:::broadcast(sc, randomMat). On Mon, Aug 3, 2015 at 4:59 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote: Hello, In looking at the SparkR…
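A slightly fuller sketch of that workaround, assuming the Spark 1.4-era internal RDD API (the ::: accessor is needed precisely because these functions were made private):

    randomMat <- matrix(nrow = 10, ncol = 10, data = rnorm(100))
    randomMatBr <- SparkR:::broadcast(sc, randomMat)
    rdd <- SparkR:::parallelize(sc, 1:4)
    # value() unwraps the broadcast inside closures shipped to the workers
    sums <- SparkR:::map(rdd, function(x) sum(SparkR:::value(randomMatBr)) + x)
    SparkR:::collect(sums)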

SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
Hello, In looking at the SparkR codebase, it seems as if broadcast variables ought to be working based on the tests. I have tried the following in the sparkR shell, and similar code in RStudio, but in both cases got the same message: randomMat <- matrix(nrow = 10, ncol = 10, data = rnorm(100))…

contributing code - how to test

2015-04-24 Thread Deborah Siegel
Hi, I selected a starter task in JIRA and made changes to my GitHub fork of the current code. I assumed I would be able to build and test. % mvn clean compile was fine, but % mvn package failed: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18:test (default-test)…
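For reference, a hedged sketch of a common flow for this situation in that era: skip tests while packaging, then run the tests you care about separately (the module name here is illustrative):

    % mvn -DskipTests clean package    # build without triggering surefire
    % mvn test -pl core                # then test just the module touched
    % ./dev/run-tests                  # or the full suite, the way CI runs it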

ec2 persistent-hdfs with ebs using spot instances

2015-03-10 Thread Deborah Siegel
Hello, I'm new to ec2. I've set up a spark cluster on ec2 and am using persistent-hdfs with the data nodes mounting ebs. I launched my cluster using spot instances: ./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 -z us-east-1c --spark-version=1.2.0 --spot-price=.0321…
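A hedged aside on the EBS piece: spark-ec2 of this era sized the persistent-hdfs volumes with the --ebs-vol-size flag, so a launch command along these lines (cluster name hypothetical):

    ./spark-ec2 -k mykeypair -i ~/aws/mykeypair.pem -t m3.xlarge -s 4 \
      -z us-east-1c --spark-version=1.2.0 --spot-price=.0321 \
      --ebs-vol-size=100 launch mycluster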

Re: Setting up Spark with YARN on EC2 cluster

2015-03-10 Thread Deborah Siegel
Harika, I think you can modify an existing Spark-on-EC2 cluster to run YARN MapReduce; not sure if this is what you are looking for. To try: 1) log on to the master, 2) go into either ephemeral-hdfs/conf/ or persistent-hdfs/conf/ and add this to mapred-site.xml: property…
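The property itself is cut off above, so the following is an assumption rather than the original text: the canonical mapred-site.xml entry for routing MapReduce jobs to YARN is

    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>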

Re: Number of cores per executor on Spark Standalone

2015-03-01 Thread Deborah Siegel
Hi, Someone else will have a better answer. I think that for standalone mode, executors will grab whatever cores they can, based on either configurations on the worker or application-specific configurations. Could be wrong, but I believe Mesos is similar to this, and that YARN is alone in the…
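A hedged illustration of the standalone-mode knobs being described (values arbitrary); at the time, each application got one executor per worker, which grabbed cores up to these limits:

    # worker-level cap, set in spark-env.sh on each worker
    SPARK_WORKER_CORES=8
    # application-level cap, set in spark-defaults.conf or on the SparkConf
    spark.cores.max    8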

documentation - graphx-programming-guide error?

2015-03-01 Thread Deborah Siegel
Hello, I am running through the examples given at http://spark.apache.org/docs/1.2.1/graphx-programming-guide.html. The section "Map Reduce Triplets Transition Guide (Legacy)" indicates that one can run the following aggregateMessages code: val graph: Graph[Int, Float] = ... def msgFun(triplet:…
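The snippet cuts off mid-example; a type-consistent version of that aggregateMessages pattern looks roughly like this (a hedged reconstruction, not the guide's exact text; the message type parameter must match the reducer's argument and return types):

    def msgFun(triplet: EdgeContext[Int, Float, String]): Unit = {
      triplet.sendToDst("Hi")
    }
    def reduceFun(a: String, b: String): String = a + " " + b
    val result = graph.aggregateMessages[String](msgFun, reduceFun)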

Re: Running spark function on parquet without sql

2015-02-27 Thread Deborah Siegel
Hi Michael, Would you help me understand the apparent difference here? The Spark 1.2.1 programming guide indicates: "Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will *not* be cached using the in-memory columnar format, and therefore…"
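For concreteness, the two caching paths being contrasted, in the 1.2-era SchemaRDD API (file name hypothetical; whether these now behave the same is exactly the question):

    val parquetData = sqlContext.parquetFile("events.parquet")
    parquetData.registerTempTable("events")
    sqlContext.cacheTable("events")   // documented in-memory columnar cache
    // versus caching the SchemaRDD directly:
    parquetData.cache()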

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Deborah Siegel
Hi Abe, I'm new to Spark as well, so someone else could answer better. A few thoughts which may or may not be the right line of thinking: 1) Spark properties can be set on the SparkConf and with flags in spark-submit, but settings on SparkConf take precedence. I think your --jars flag for…
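A hedged sketch of the two places a dependency jar can be named (paths hypothetical); per the precedence point above, the SparkConf setting wins when both are present:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("MyApp")
      .setJars(Seq("/path/to/myclasses.jar"))  // shipped to the executors
    // or, equivalently, at submit time:
    //   spark-submit --class com.example.Main --jars /path/to/dep.jar app.jar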