Re: calculate diff of value and median in a group

2017-07-14 Thread roni
I was using the percentile_approx function on 100 GB of compressed data and it just hangs. Any pointers? On Wed, Mar 22, 2017 at 6:09 PM, ayan guha wrote: > For median, use percentile_approx with 0.5 (50th percentile is the median) > > On Thu, Mar 23, 2017 at 11:01 AM, Yong Zhang wrote:
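
For reference, a minimal Scala sketch of the pattern being discussed: per-group median via percentile_approx(…, 0.5) and then the difference of each value from its group median. The DataFrame name, column names, and input path are assumptions (Spark 2.x, where percentile_approx is available as a SQL expression):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, expr}

    val spark = SparkSession.builder().appName("median-diff").getOrCreate()

    // Hypothetical input: any DataFrame with a grouping key and a numeric value.
    val df = spark.read.parquet("/path/to/data")

    // Approximate per-group median via percentile_approx with 0.5 (the 50th percentile).
    val medians = df.groupBy("group")
      .agg(expr("percentile_approx(value, 0.5)").as("median"))

    // Join the medians back and compute each value's difference from its group median.
    val withDiff = df.join(medians, Seq("group"))
      .withColumn("diff_from_median", col("value") - col("median"))

    withDiff.show()

On very large inputs the approximate percentile is sensitive to its accuracy parameter and to partitioning, so executor memory and the number of shuffle partitions are usually the first things to check when the aggregation appears to hang.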

support vector regression in spark

2016-12-01 Thread roni
Hi All, I want to know how I can do support vector regression in Spark. Thanks R

Re: SVM regression in Spark

2016-11-30 Thread roni
Hi Spark experts, Can anyone help with doing SVR (support vector machine regression) in Spark? Thanks R On Tue, Nov 29, 2016 at 6:50 PM, roni wrote: > Hi All, > I am trying to port my R code to Spark. I am using SVM regression in R. > It seems like Spark provides an SVM class
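
As far as I know, MLlib has no built-in support vector regression; its SVM implementation (SVMWithSGD) is classification-only. A hedged sketch of the closest built-in substitute, plain regularized linear regression, over an assumed DataFrame named training with label/features columns:

    import org.apache.spark.ml.regression.LinearRegression

    // NOTE: this is ordinary (regularized) linear regression used as a stand-in.
    // True SVR with an epsilon-insensitive loss is not part of MLlib and would
    // need an external library.
    val lr = new LinearRegression()
      .setMaxIter(100)
      .setRegParam(0.1)
      .setElasticNetParam(0.5)

    // "training" is an assumed DataFrame with "label" (Double) and "features" (Vector) columns.
    val model = lr.fit(training)
    println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")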

SVM regression in Spark

2016-11-29 Thread roni
can I do this in spark? Thanks in advance Roni

MLlib and R results do not match for SVD

2016-08-16 Thread roni
Hi All, Some time back I asked about PCA results not matching between R and MLlib. It was suggested that I use svd.V instead of PCA to match the uncentered PCA. But the results of MLlib and R for SVD do not match (I can understand the numbers not matching exactly), but the distributi
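
For reference, a hedged sketch of the MLlib side of the comparison (input path and shape are assumptions). One common source of apparent mismatch is that singular vectors are only defined up to sign, so columns of svd.V can legitimately differ from R's svd(x)$v by a factor of -1 even when the decompositions agree:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Assumed input: the same numeric matrix fed to R, one row per line, comma separated.
    val rows = sc.textFile("/path/to/matrix.csv")
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))

    val mat = new RowMatrix(rows)

    // Top-k SVD; svd.s corresponds to svd(x)$d in R and svd.V to svd(x)$v.
    val svd = mat.computeSVD(5, computeU = false)
    println(svd.s)
    println(svd.V)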

Re: bisecting kmeans model tree

2016-07-12 Thread roni
Hi Spark/MLlib experts, Can anyone shed light on this? Thanks _R On Thu, Apr 21, 2016 at 12:46 PM, roni wrote: > Hi, > I want to get the bisecting k-means tree structure to show a dendrogram on > the heatmap I am generating based on the hierarchical clustering of data. > How d

bisecting kmeans model tree

2016-04-21 Thread roni
Hi, I want to get the bisecting k-means tree structure to show a dendrogram on the heatmap I am generating based on the hierarchical clustering of the data. How do I get that using MLlib? Thanks -Roni
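
A minimal sketch of fitting bisecting k-means with MLlib (Spark 1.6+); the input path and k are assumptions. Note that the model's internal split tree is not exposed through the public API, so a dendrogram would have to be reconstructed from the cluster assignments or from repeated runs at increasing k:

    import org.apache.spark.mllib.clustering.BisectingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assumed input: one comma-separated feature row per line (hypothetical path).
    val data = sc.textFile("/path/to/features.csv")
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))

    val model = new BisectingKMeans().setK(8).run(data)

    // Leaf clusters: centers and per-row assignments are public; the split hierarchy is not.
    model.clusterCenters.zipWithIndex.foreach { case (c, i) => println(s"cluster $i: $c") }
    val assignments = model.predict(data)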

bisecting kmeans tree

2016-04-20 Thread roni
Hi, I want to get the bisecting k-means tree structure to show on the heatmap I am generating based on the hierarchical clustering of the data. How do I get that using MLlib? Thanks -R

Re: sparkR issues ?

2016-03-15 Thread roni
s.data.frame() > in SparkR to avoid such covering. > > > > *From:* Alex Kozlov [mailto:ale...@gmail.com] > *Sent:* Tuesday, March 15, 2016 2:59 PM > *To:* roni > *Cc:* user@spark.apache.org > *Subject:* Re: sparkR issues ? > > > > This seems to be a very unf

Re: sparkR issues ?

2016-03-15 Thread roni
es its > own DataFrame class which shadows what seems to be your own definition. > > Is DataFrame something you define? Can you rename it? > > On Mon, Mar 14, 2016 at 10:44 PM, roni wrote: > >> Hi, >> I am working with bioinformatics and trying to convert some s

sparkR issues ?

2016-03-14 Thread roni
Hi, I am working in bioinformatics and trying to convert some scripts to SparkR to fit into other Spark jobs. I tried a simple example from a bioinformatics library, and as soon as I start the SparkR environment it does not work. Code as follows - countData <- matrix(1:100,ncol=4) condition <- factor(c("A","A"

Re: cannot coerce class "data.frame" to a DataFrame - with spark R

2016-02-19 Thread roni
me" to a DataFrame On Thu, Feb 18, 2016 at 9:03 PM, Felix Cheung wrote: > Doesn't DESeqDataSetFromMatrix work with data.frame only? It wouldn't work > with Spark's DataFrame - try collect(countMat) and others to convert them > into data.frame? > > > __

cannot coerce class "data.frame" to a DataFrame - with spark R

2016-02-18 Thread roni
ntMat, colData = (colData), design = design) Error in DataFrame(colData, row.names = rownames(colData)) : cannot coerce class "data.frame" to a DataFrame I tried as.data.frame or using DataFrame to wrap the defs, but no luck. What can I do differently? Thanks Roni

Upgrade spark cluster to latest version

2015-11-03 Thread roni
Hi Spark experts, This may be a very naive question, but can you please point me to the proper way to upgrade the Spark version on an existing cluster? Thanks Roni > Hi, > I have a current cluster running Spark 1.4 and want to upgrade to the latest > version. > How can I do it without cr

upgrading from spark 1.4 to latest version

2015-11-02 Thread roni
Hi, I have a current cluster running Spark 1.4 and want to upgrade to the latest version. How can I do it without creating a new cluster, so that all my other settings don't get erased? Thanks _R

Re: connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-14 Thread roni
gards > > On Thu, Sep 10, 2015 at 11:20 PM, roni wrote: > >> I have spark installed on a EC2 cluster. Can I connect to that from my >> local sparkR in RStudio? if yes , how ? >> >> Can I read files which I have saved as parquet files on hdfs or s3 in >> sparkR ? If yes , How? >> >> Thanks >> -Roni >> >> >

reading files on HDFS/s3 in sparkR - failing

2015-09-10 Thread roni
read a file on s3, I get - java.io.IOException: No FileSystem for scheme: s3 Thanks in advance. -Roni
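
The "No FileSystem for scheme: s3" error means no filesystem implementation is registered for the bare s3:// scheme on that Hadoop build; switching to s3n:// (or s3a:// on newer Hadoop) and supplying credentials through the Hadoop configuration is the usual workaround. SparkR goes through the same Hadoop configuration, so here is a hedged Scala sketch of the equivalent spark-shell session; the bucket name and the way credentials are read are assumptions:

    // Credentials pulled from the environment here purely for illustration.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Hypothetical bucket and path; s3n:// rather than s3:// avoids the missing-scheme error.
    val df = sqlContext.read.parquet("s3n://my-bucket/path/to/parquet")
    df.printSchema()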

connecting to remote spark and reading files on HDFS or s3 in sparkR

2015-09-10 Thread roni
I have Spark installed on an EC2 cluster. Can I connect to that from my local SparkR in RStudio? If yes, how? Can I read files which I have saved as parquet files on HDFS or s3 in SparkR? If yes, how? Thanks -Roni

Re: Do I really need to build Spark for Hive/Thrift Server support?

2015-08-10 Thread roni
Hi All, Any explanation for this? As Reece said, I can do operations with Hive, but val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) gives an error. I have already created the Spark EC2 cluster with the spark-ec2 script. How can I build it again? Thanks _Roni On Tue, Jul 28, 2015 at

No suitable driver found for jdbc:mysql://

2015-07-22 Thread roni
er=spark://ec2-52-25-191-999.us-west-2.compute.amazonaws.com:7077 <http://ec2-52-25-191-999.us-west-2.compute.amazonaws.com:7077> --class "saveBedToDB" target/scala-2.10/adam-project_2.10-1.0.jar* *What else can I Do ?* *Thanks* *-Roni*
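
The "No suitable driver found" message usually means the MySQL connector jar is not on the driver/executor classpath or the driver class is never named explicitly, so it has to be shipped with --jars (or --driver-class-path) and spelled out in the connection properties. A hedged sketch of the write path (Spark 1.4+ writer API); bedDf, the host, database, table, and credentials are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")                     // placeholder credentials
    props.setProperty("password", "dbpass")
    props.setProperty("driver", "com.mysql.jdbc.Driver")    // name the driver explicitly

    // Hypothetical host/database/table; bedDf stands in for the DataFrame being saved.
    val url = "jdbc:mysql://db-host:3306/mydb"
    bedDf.write.jdbc(url, "bed_features", props)

The submit command would then carry the connector jar as well, e.g. spark-submit --jars mysql-connector-java-5.1.x.jar ..., so both the driver and the executors can load the class.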

Re: which database for gene alignment data ?

2015-06-09 Thread roni
to save something in an external database, so that the saved data can be re-used in multiple ways by multiple people. Any suggestions on the DB selection or keeping data centralized for use by multiple distinct groups? Thanks -Roni On Mon, Jun 8, 2015 at 12:47 PM, Frank Austin Nothaft wrote

Re: which database for gene alignment data ?

2015-06-08 Thread roni
with the other .bed files. The data is huge; .bed files can range from 0.5 GB to 5 GB (or more). I was thinking of using Cassandra, but I am not sure if the overlapping queries can be supported and will be fast enough. Thanks for the help -Roni On Sat, Jun 6, 2015 at 7:03 AM, Ted Yu wrote: > Can

which database for gene alignment data ?

2015-06-05 Thread roni
I want to use Spark to read compressed .bed files containing gene sequencing alignment data. I want to store the bed file data in a DB and then use external gene expression data to find overlaps etc. Which database is best for this? Thanks -Roni

Re: Is anyone using Amazon EC2? (second attempt!)

2015-05-29 Thread roni
Hi, Any update on this? I am not sure if the issue I am seeing is related. I have 8 slaves, and when I created the cluster I specified an EBS volume of 100G. On EC2 I see 8 volumes created, each attached to the corresponding slave. But when I try to copy data onto them, it complains that /root/ep

Storing data in MySQL from spark hive tables

2015-05-20 Thread roni
Hi, I am trying to set up the Hive metastore and MySQL DB connection. I have a Spark cluster, I ran some programs, and I have data stored in some Hive tables. Now I want to store this data in MySQL so that it is available for further processing. I set up the hive-site.xml file.

spark.sql.Row manipulation

2015-03-31 Thread roni
I have 2 parquet files with format e.g. name, age, town. I read them and then join them to get all the names which are in both towns. The resultant dataset is res4: Array[org.apache.spark.sql.Row] = Array([name1, age1, town1,name2,age2,town2]) Name 1 and name 2 are the same, as I am joining.
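
A hedged sketch of working around the duplicated columns, with hypothetical DataFrames g1 and g2 that both carry name/age/town; selecting and aliasing explicit columns right after the join keeps the resulting Rows unambiguous:

    val joined = g1.join(g2, g1("name") === g2("name"))
      .select(g1("name"),
              g1("age").as("age1"), g1("town").as("town1"),
              g2("age").as("age2"), g2("town").as("town2"))

    // If Rows are collected anyway, fields are positional in the selected order:
    joined.collect().foreach { row =>
      val name  = row.getString(0)
      val town1 = row.getString(2)
      val town2 = row.getString(4)
      println(s"$name: $town1 / $town2")
    }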

joining multiple parquet files

2015-03-31 Thread roni
) WHERE (DATE(TableC.date)=date(now())) I can do a 2-file join like - val joinedVal = g1.join(g2, g1.col("kmer") === g2.col("kmer")) But I am trying to find the kmer strings common to 4 files. Thanks Roni
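
A hedged sketch of one way to get the kmers shared by all four files: read each file, keep only the kmer column, and fold intersect across them. Paths and schema are assumptions; sqlContext.read.parquet is the 1.4+ API, and sqlContext.parquetFile is the older equivalent:

    val paths = Seq("/data/f1.parquet", "/data/f2.parquet",
                    "/data/f3.parquet", "/data/f4.parquet")   // hypothetical paths

    // One single-column DataFrame of kmers per file.
    val kmerSets = paths.map(p => sqlContext.read.parquet(p).select("kmer"))

    // intersect keeps rows present on both sides, so folding it across the list
    // leaves only the kmer strings that occur in every file.
    val commonKmers = kmerSets.reduce(_ intersect _)
    commonKmers.show()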

Re: Cannot run spark-shell "command not found".

2015-03-30 Thread roni
I think you must have downloaded the Spark source code gz file. It is a little confusing. You also have to select the Hadoop version, and the actual tgz file will have the Spark version and Hadoop version in its name. -R On Mon, Mar 30, 2015 at 10:34 AM, vance46 wrote: > Hi all, > > I'm a newbee try to se

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method) Is there no way to upgrade without creating a new cluster? Thanks Roni On Wed, Mar 25, 2015 at 1:18 PM, Dean Wampler wrote: > Yes, that's the problem. The RDD class exists in both binary jar files, &
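
Dean's diagnosis — the RDD class present both in the cluster's Spark jars and in the application's assembly — usually means the sbt build bundles its own copy of Spark. A hedged build.sbt sketch that keeps Spark out of the fat jar; versions and the dependency list are placeholders:

    // build.sbt (sketch): Spark artifacts marked "provided" so the assembly only
    // contains application code and the cluster's 1.3.0 classes are the single copy.
    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.3.0" % "provided"
      // application dependencies such as ADAM or H2O stay unscoped
    )

A clean rebuild after bumping the version (the quoted reply suggests sbt clean) is then needed so no stale 1.2.1 classes survive in the assembly.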

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
ramming Scala, 2nd Edition > <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) > Typesafe <http://typesafe.com> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > > On Wed, Mar 25, 2015 at 12:09 PM, roni wrote: > >&

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
2 == 1) val bedPair = bedFile.map(_.split (",")).map(a=> (a(0), a(1).trim().toInt)) * val joinRDD = bedPair.join(filtered) * Any idea what's going on? I have data on the EC2 cluster, so I am avoiding creating a new cluster and just upgrading and changing the co

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
hat version of Spark do the other dependencies rely on (Adam and H2O?) - > that could be it > > Or try sbt clean compile > > — > Sent from Mailbox <https://www.dropbox.com/mailbox> > > > On Wed, Mar 25, 2015 at 5:58 PM, roni wrote: > >> I have a EC2 cluste

upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
I have an EC2 cluster created using Spark version 1.2.1, and I have an SBT project. Now I want to upgrade to Spark 1.3 and use the new features. Below are the issues. Sorry for the long post; I appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using Spark 1.3? Here is

Re: difference in PCA of MLlib vs H2O in R

2015-03-24 Thread roni
MLlib implementation, since it > >> is computing the principal components by computing eigenvectors of the > >> covariance matrix. The means inherently don't matter either way in > >> this computation. > >> > >> On Tue, Mar 24, 2015 at 6:13 AM,

difference in PCA of MLlib vs H2O in R

2015-03-23 Thread roni
e sure that the settings for MLlib PCA are the same as I am using for H2O or prcomp. Thanks Roni
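
A hedged sketch of the MLlib side being compared (the input path is a placeholder). As the quoted reply points out, computePrincipalComponents works from the covariance matrix, so centering should not change the loadings; sign flips of individual components and prcomp's optional scaling (scale. = TRUE) are the usual causes of apparent differences:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Assumed input: the same matrix handed to prcomp/H2O, one row per line.
    val rows = sc.textFile("/path/to/matrix.csv")
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
    val mat = new RowMatrix(rows)

    // Loadings, comparable to prcomp(x)$rotation.
    val pc = mat.computePrincipalComponents(5)

    // Projection of the (uncentered) rows onto the components; prcomp projects the
    // centered data, so the scores differ by a constant shift per component.
    val projected = mat.multiply(pc)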

FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded

2015-03-19 Thread roni
I get 2 types of errors - org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 and FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded. Spark keeps re-trying to submit the code and keeps getting this error. My file on wh
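
For context, 2147483647 is Integer.MAX_VALUE: a single shuffle block has grown past the roughly 2 GB maximum a shuffle block may have. The usual workaround is more partitions so no single block is that large. A hedged sketch; the partition counts and bigRdd are placeholders to tune against the real data:

    // Raise default partitioning for RDD and DataFrame/SQL shuffles.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.default.parallelism", "2000")
      .set("spark.sql.shuffle.partitions", "2000")

    // Or repartition the offending dataset explicitly before the wide operation.
    val repartitioned = bigRdd.repartition(2000)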

saving or visualizing PCA

2015-03-18 Thread roni
Hi, I am generating a PCA using Spark, but I don't know how to save it to disk or visualize it. Can someone give me some pointers, please? Thanks -Roni
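
A hedged sketch of persisting the PCA so it can be plotted elsewhere, assuming mat is the RowMatrix the PCA was computed from and pc is the local matrix returned by computePrincipalComponents; paths are placeholders:

    import java.io.PrintWriter

    // Projected rows: what a 2-D/3-D PCA scatter plot is drawn from. They are
    // distributed, so write them through the RDD API (HDFS, s3n, or a local path).
    val projected = mat.multiply(pc)
    projected.rows
      .map(v => v.toArray.mkString(","))
      .saveAsTextFile("hdfs:///output/pca_projection")

    // Loadings: pc is a small local matrix, so plain JVM I/O is enough.
    val loadingsCsv = (0 until pc.numRows).map { i =>
      (0 until pc.numCols).map(j => pc(i, j)).mkString(",")
    }.mkString("\n")
    new PrintWriter("/tmp/pca_loadings.csv") { write(loadingsCsv); close() }

The saved CSVs can then be read back into R or a plotting library for the actual visualization, which Spark itself does not provide.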

Re: Setting up Spark with YARN on EC2 cluster

2015-03-10 Thread roni
Hi Harika, Did you get any solution for this? I want to use YARN, but the spark-ec2 script does not support it. Thanks -Roni -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setting-up-Spark-with-YARN-on-EC2-cluster-tp21818p21991.html Sent from the Apache

distcp problems on ec2 standalone spark cluster

2015-03-09 Thread roni
I got past the cluster-not-started problem by setting mapreduce.framework.name to yarn. But when I try to distcp, if I use a URI with s3://path-to-my-bucket I get an invalid path even though the bucket exists. If I use s3n:// it just hangs. Did anyone else face anything like that

Re: distcp on ec2 standalone spark cluster

2015-03-07 Thread roni
Did you get this to work? I got past the cluster-not-started problem. I am having a problem where distcp with an s3 URI says incorrect folder path and s3n:// hangs. Stuck for 2 days :( Thanks -R -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/dis

spark-ec2 script problems

2015-03-05 Thread roni
70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374) I tried running start-all.sh, start-dfs.sh, and start-yarn.sh. What should I do? Thanks -roni

Re: issue Running Spark Job on Yarn Cluster

2015-03-04 Thread roni
Look at the logs with yarn logs --applicationId . That should give the error. On Wed, Mar 4, 2015 at 9:21 AM, sachin Singh wrote: > Not yet, > Please let me know if you found a solution, > > Regards > Sachin On 4 Mar 2015 21:45, "mael2210 [via Apache Spark User List]" <[hidden > email]

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Ah!! I think I know what you mean. My job was just in the "accepted" state for a long time as it was processing a huge file. But now that it is in the running state, I can see it. I can see it at port 9046 though, instead of 4040, but I can see it. Thanks -roni On Tue, Mar 3, 2015 at 1:19 PM,

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
_0007 -containerId > container_1386639398517_0007_01_19 > > Cheers > > On Tue, Mar 3, 2015 at 9:50 AM, roni wrote: > >> Hi Ted, >> I used s3://support.elasticmapreduce/spark/install-spark to install >> spark on my EMR cluster. It is 1.2.0. >> When I click on the link for history or

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
>> 2. When I click on Application Monitoring or history, I get redirected >> to some link with an internal IP address. Even if I replace that address >> with the public IP, it still does not work. What kind of setup changes >> are needed for that? >> >> Thanks &g