OFF_HEAP storage level

2014-07-04 Thread Ajay Srivastava
Hi, I was checking the different storage levels of an RDD and found OFF_HEAP. Has anybody used this level? If I use this level, where will data be stored? If it is not in the heap, does it mean that we can avoid GC? How can I use this level? I did not find anything in the archive regarding this. Can someone

Re: Spark Streaming on top of Cassandra?

2014-07-04 Thread zarzyk
Hi, I bump this thread as I'm also interested in the answer. Can anyone help or point to the information on how to do Spark Streaming from/to Cassandra? Thanks! Zarzyk

Re: write event logs with YARN

2014-07-04 Thread Christophe Préaud
Hi Andrew, Thanks for your explanation, I confirm that the entries show up in the history server UI when I create empty APPLICATION_COMPLETE files for each of them. Christophe. On 03/07/2014 18:27, Andrew Or wrote: Hi Christophe, another Andrew speaking. Your configuration looks fine to me.

Re: Spark Streaming on top of Cassandra?

2014-07-04 Thread Cesar Arevalo
Hi Zarzyk: If I were you, just to start, I would look at the following: https://groups.google.com/forum/#!topic/spark-users/htQQA3KidEQ http://www.slideshare.net/planetcassandra/south-bay-cassandrealtime-analytics-using-cassandra-spark-and-shark-at-ooyala

RE: Spark with HBase

2014-07-04 Thread N . Venkata Naga Ravi
Hi, Any update on the solution? We are still facing this issue... We were able to connect to HBase with standalone code, but we are hitting this issue with the Spark integration. Thx, Ravi From: nvn_r...@hotmail.com To: u...@spark.incubator.apache.org; user@spark.apache.org Subject: RE: Spark with HBase

Re: No FileSystem for scheme: hdfs

2014-07-04 Thread Juan Rodríguez Hortalá
Hi, To cope with the META-INF issue that Sean is pointing out, my solution is replacing maven-assembly-plugin with maven-shade-plugin, using the ServicesResourceTransformer ( http://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer)

RE: OFF_HEAP storage level

2014-07-04 Thread Shao, Saisai
Hi Ajay, StorageLevel OFF_HEAP means you can cache your RDD in Tachyon; the prerequisite is that you deploy Tachyon alongside Spark. Yes, it can alleviate GC, since you offload JVM memory into system-managed memory. You can use rdd.persist(...) to use this level, details can be checked
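For reference, a minimal sketch of what persisting into Tachyon looks like with the Spark 1.0 API (it assumes a running Tachyon deployment and, if it is not at the default address, spark.tachyonStore.url pointing at the Tachyon master; the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input")
    // OFF_HEAP keeps the cached blocks in Tachyon rather than on the JVM heap,
    // so they are not scanned by the executor's garbage collector.
    rdd.persist(StorageLevel.OFF_HEAP)
    rdd.count()  // the first action materializes the cache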

Spark SQL user defined functions

2014-07-04 Thread Martin Gammelsæter
Hi! I have a Spark cluster running on top of a Cassandra cluster, using Datastax's new driver, and one of the fields of my RDDs is an XML string. In a normal Scala Spark job, parsing that data is no problem, but I would like to also make that information available through Spark SQL. So, is there

Graphx traversal and merge interesting edges

2014-07-04 Thread HHB
Hello Gurus, Pardon me, I am a noob at Spark GraphX (Scala) and I seek your wisdom here. I want to know how to do a graph traversal and do a selective merge on edges. Thanks to the documentation :-) I could create a simple graph of employees and their colleagues. The structure of the Graph is

spark and mesos issue

2014-07-04 Thread Gurvinder Singh
We are getting this issue when we are running jobs with close to 1000 workers. Spark is from the github version and mesos is 0.19.0 ERROR storage.BlockManagerMasterActor: Got two different block manager registrations on 201407031041-1227224054-5050-24004-0 Googling about it seems that mesos is

Spark memory optimization

2014-07-04 Thread Igor Pernek
Hi all! I have a folder with 150 GB of txt files (around 700 files, on average 200 MB each). I'm using Scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that: - manually loop through all the files, do the calculations per file and

Fwd: Graphx traversal and merge interesting edges

2014-07-04 Thread H Bolke
Hello Gurus, Pardon me, I am a noob at Spark GraphX (Scala) and I seek your wisdom here. I want to know how to do a graph traversal and do a selective merge on edges. Thanks to the documentation :-) I could create a simple graph of employees and their colleagues. The structure of the Graph is below,

Re: Spark SQL user defined functions

2014-07-04 Thread Takuya UESHIN
Hi, You can convert a standard RDD of a Product class (e.g. a case class) to a SchemaRDD via SQLContext. Load the data from Cassandra into an RDD of the case class, convert it to a SchemaRDD and register it; then you can use it in your SQL queries.
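For the archive, a minimal sketch of that conversion with the Spark 1.0 API. The case class, the cassandraRows RDD and its column names are placeholders standing in for whatever the Cassandra driver returns, not the poster's actual code:

    import org.apache.spark.sql.SQLContext

    case class Reading(id: String, xml: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion

    // Map the rows returned by the Cassandra driver into the case class...
    val readings = cassandraRows.map(r => Reading(r.getString("id"), r.getString("xml")))
    // ...then register the resulting SchemaRDD so SQL can see it.
    readings.registerAsTable("readings")
    sqlContext.sql("SELECT id FROM readings WHERE xml LIKE '%order%'").collect()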

Re: Spark SQL user defined functions

2014-07-04 Thread Martin Gammelsæter
Takuya, thanks for your reply :) I am already doing that, and it is working well. My question is, can I define arbitrary functions to be used in these queries? On Fri, Jul 4, 2014 at 11:12 AM, Takuya UESHIN ues...@happy-camper.st wrote: Hi, You can convert standard RDD of Product class (e.g.

Re: issue with running example code

2014-07-04 Thread Gurvinder Singh
In the end it turns out that the issue was caused by a config setting in spark-defaults.conf. After removing the setting spark.files.userClassPathFirst true, things are back to normal. Just reporting in case someone has the same issue. - Gurvinder On 07/03/2014 06:49 PM, Gurvinder

matchError:null in ALS.train

2014-07-04 Thread Honey Joshi
Original Message -- Subject: matchError:null in ALS.train; From: Honey Joshi honeyjo...@ideata-analytics.com; Date: Thu, July 3, 2014 8:12 am; To: user@spark.apache.org

Re: Spark SQL user defined functions

2014-07-04 Thread Takuya UESHIN
Ah, sorry for misreading. I don't think there is a way to use UDFs in your SQL queries with Spark SQL alone. You might be able to do it with Spark's Hive support, but I'm sorry, I don't know it well. I think you should apply the function before converting to a SchemaRDD if you can. Thanks. 2014-07-04 18:16 GMT+09:00 Martin

Re: Spark SQL user defined functions

2014-07-04 Thread Michael Armbrust
On Fri, Jul 4, 2014 at 1:59 AM, Martin Gammelsæter martingammelsae...@gmail.com wrote: is there any way to write user defined functions for Spark SQL? This is coming in Spark 1.1. There is a work in progress PR here: https://github.com/apache/spark/pull/1063 If you have a hive context, you
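Until that PR lands, the workaround hinted at here is to go through HiveContext, where Hive's existing UDFs can be called from HiveQL. A hedged sketch against the Spark 1.0 API; the table and column names are made up for illustration, and the query assumes the table is visible to Hive or has been registered:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // xpath_string is one of Hive's built-in UDFs; a native Spark SQL UDF API
    // only arrives with the 1.1 work referenced above.
    hiveContext.hql("SELECT xpath_string(xml, '/doc/title') FROM readings").collect()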

Spark stdout and stderr

2014-07-04 Thread aminn_524
I am running spark-1.0.0, connecting to a Spark standalone cluster which has one master and two slaves. I ran wordcount.py with spark-submit; it reads data from HDFS and also writes the results to HDFS. So far everything is fine and the results will correctly be

Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-04 Thread Konstantin Kudryavtsev
Hi all, I am stuck on an issue with running the Spark Pi example on HDP 2.0. I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html (for HDP2) and ran the example from the Spark web-site: ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3

RE: No FileSystem for scheme: hdfs

2014-07-04 Thread Steven Cox
Thanks for the help folks. Adding the config files was necessary but not sufficient. I also had Hadoop 1.0.4 classes on the classpath because of a bad jar: spark-0.9.1/jars/spark-assembly-0.9.1-hadoop1.0.4.jar was in my spark executor tar.gz (stored in HDFS). I believe this was due to a bit

Re: Spark memory optimization

2014-07-04 Thread Mayur Rustagi
I would go with Spark only if you are certain that you are going to scale out in the near future. You can change the default storage of the RDD to DISK_ONLY; that might remove issues around any RDD leveraging memory. There are some functions, particularly sortByKey, that require data to fit in memory to
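For reference, switching an RDD to disk-only caching is a one-liner; a minimal sketch (the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///logs/*.txt")
    // Cache partitions on local disk instead of in executor memory.
    lines.persist(StorageLevel.DISK_ONLY)
    lines.count()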

Re: Spark memory optimization

2014-07-04 Thread Surendranauth Hiraman
When using DISK_ONLY, keep in mind that disk I/O is pretty high. Make sure you are writing to multiple disks for best operation. And even with DISK_ONLY, we've found that there is a minimum threshold for executor RAM (spark.executor.memory), which for us seemed to be around 8 GB. If you find

window analysis with Spark and Spark streaming

2014-07-04 Thread alessandro finamore
Hi, I have a large dataset of text log files on which I need to implement window analysis. Say, extract per-minute data and compute aggregated stats on the last X minutes. I have to implement the windowing analysis with Spark. This is the workflow I'm currently using: - read a file and create a new RDD

Re: LIMIT with offset in SQL queries

2014-07-04 Thread Mayur Rustagi
What I typically do is use a row_number subquery and filter based on that. It works out pretty well and reduces the iteration. I think an offset solution based on windowing directly would be useful. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
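For the archive, a sketch of the shape of that subquery. Note that, as pointed out further down this thread, Spark SQL did not yet support window functions at this point, so this assumes an engine that does (Hive, for instance); the table, ordering column and offsets are made up:

    // Offset 100, limit 20, expressed with row_number() over an explicit ordering.
    val pageQuery = """
      SELECT * FROM (
        SELECT t.*, row_number() OVER (ORDER BY t.id) AS rn
        FROM events t
      ) numbered
      WHERE rn > 100 AND rn <= 120
    """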

SQL FIlter of tweets (json) running on Disk

2014-07-04 Thread Abel Coronado Iruegas
Hi everybody, Can someone tell me whether it is possible to read and filter a 60 GB file of tweets (JSON docs) in a standalone Spark deployment that runs on a single machine with 40 GB RAM and 8 cores? I mean, is it possible to configure Spark to work with some amount of memory (20 GB) and the rest

Re: SQL FIlter of tweets (json) running on Disk

2014-07-04 Thread Abel Coronado Iruegas
OK, I found these slides by Yin Huai ( http://spark-summit.org/wp-content/uploads/2014/07/Easy-json-Data-Manipulation-Yin-Huai.pdf ); to read a JSON file, the code seems pretty simple: sqlContext.jsonFile("data.json"). Is this already available in the master branch??? But the question about the

Re: Distribute data from Kafka evenly on cluster

2014-07-04 Thread Tobias Pfeiffer
Hi, unfortunately, when I go the above approach, I run into this problem: http://mail-archives.apache.org/mod_mbox/kafka-users/201401.mbox/%3ccabtfevyxvtaqvnmvwmh7yscfgxpw5kmrnw_gnq72cy4oa1b...@mail.gmail.com%3E That is, a NoNode error in Zookeeper when rebalancing. The Kafka receiver will retry

Re: Visualize task distribution in cluster

2014-07-04 Thread Mayur Rustagi
You'll get most of that information from the Mesos interface. You may not get data-transfer information in particular. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Jul 3, 2014 at 6:28 AM, Tobias Pfeiffer

Re: Spark job tracker.

2014-07-04 Thread Mayur Rustagi
The application server doesn't provide a JSON API, unlike the cluster interface (8080). If you are OK with patching Spark, you can use our patch to add a JSON API, or you can use the SparkListener interface in your application to get that info out. Mayur Rustagi Ph: +1 (760) 203 3257

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
You should be able to use DynamoDBInputFormat (I think this should be part of AWS libraries for Java) and create a HadoopRDD from that. On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote: Hi, I noticed mention of DynamoDB as input source in
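A hedged sketch of that approach follows. The input-format and writable class names and the JobConf property keys come from the EMR DynamoDB connector as best recalled and should be verified against whichever jar actually ends up on the classpath; the table name and endpoint are placeholders:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.JobConf

    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.input.tableName", "my-table")                   // assumed property key
    jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")  // assumed property key

    // Old-style (mapred) input format, hence sc.hadoopRDD rather than newAPIHadoopRDD.
    val rows = sc.hadoopRDD(
      jobConf,
      classOf[org.apache.hadoop.dynamodb.read.DynamoDBInputFormat],
      classOf[Text],
      classOf[org.apache.hadoop.dynamodb.DynamoDBItemWritable])
    println(rows.count())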

Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Hi Nick, I’m going to be working with python primarily. Are you aware of comparable boto support? ian On 4 Jul 2014, at 16:32, Nick Pentreath nick.pentre...@gmail.com wrote: You should be able to use DynamoDBInputFormat (I think this should be part of AWS libraries for Java) and create a

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
No boto support for that. In master there is Python support for loading a Hadoop InputFormat; not sure if it will be in 1.0.1 or 1.1. In the master docs, under the programming guide, there are instructions, and under the examples project there are PySpark examples of using Cassandra and HBase. These should

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
I should qualify by saying there is boto support for DynamoDB - but not for the InputFormat. You could roll your own python-based connection, but this involves figuring out how to split the data in Dynamo - the InputFormat takes care of this, so it should be the easier approach.

Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Trying to discover the source for DynamoDBInputFormat. It doesn't appear in: - https://github.com/aws/aws-sdk-java - https://github.com/apache/hive Then I came across http://stackoverflow.com/questions/1704/jar-containing-org-apache-hadoop-hive-dynamodb. Unsure whether this represents the latest

Re: DynamoDB input source

2014-07-04 Thread Nick Pentreath
Interesting - I would have thought they would make that available publicly. Unfortunately, unless you can use Spark on EMR, I guess your options are to hack it by spinning up an EMR cluster and getting the JAR, or maybe fall back to using boto and rolling your own :( On Fri, Jul 4, 2014 at 9:28

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread M Singh
Another alternative could be to use Spark Streaming's textFileStream with windowing capabilities. On Friday, July 4, 2014 9:52 AM, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: You should think about a custom receiver, in order to solve the problem of the “already collected”

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread alessandro finamore
Thanks for the replies. What is not completely clear to me is how time is managed. I can create a DStream from a file, but if I set the window property, that will be bound to the application time, right? If I got it right, with a receiver I can control the way DStreams are created. But how can

pyspark + yarn: how everything works.

2014-07-04 Thread Egor Pahomov
Hi, I want to use PySpark with YARN, but the documentation doesn't give me a full understanding of what's going on, and I simply don't understand the code. So: 1) How is Python shipped to the cluster? Should the machines in the cluster already have Python? 2) What happens when I write some Python code in a map function -

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread M Singh
The windowing capabilities of Spark Streaming determine the events in the RDD created for that time window. If the duration is 1s then all the events received in a particular 1s window will be a part of the RDD created for that window for that stream. On Friday, July 4, 2014 1:28 PM,
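To make the relationship between batch duration and window concrete, here is a minimal sketch of the textFileStream-plus-window approach suggested above (the input path, the 1-minute batch, the 10-minute window, and the key extraction are all placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))           // 1-minute batches
    val lines = ssc.textFileStream("hdfs:///incoming/logs")   // picks up files newly added to the directory

    // Per-minute counts, re-aggregated over a sliding 10-minute window.
    val perMinute = lines.map(line => (line.split(" ")(0), 1L)).reduceByKey(_ + _)
    val windowed = perMinute.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(600), Seconds(60))
    windowed.print()

    ssc.start()
    ssc.awaitTermination()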

Re: Spark SQL user defined functions

2014-07-04 Thread Michael Armbrust
Sweet. Any idea about when this will be merged into master? It is probably going to be a couple of weeks. There is a fair amount of cleanup that needs to be done. It works though and we used it in most of the demos at the spark summit. Mostly I just need to add tests and move it out of

Re: SQL FIlter of tweets (json) running on Disk

2014-07-04 Thread Michael Armbrust
sqlContext.jsonFile("data.json") Is this already available in the master branch??? Yes, and it will be available in the soon-to-come 1.0.1 release. But the question about using a combination of resources (memory processing / disk processing) still remains. This code should work
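Pulling the thread together, a minimal sketch of the suggested approach against the 1.0.1-era API (the HDFS paths and the tweet field names used in the filter are assumptions):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // jsonFile scans the JSON documents once to infer a schema.
    val tweets = sqlContext.jsonFile("hdfs:///data/tweets.json")
    tweets.registerAsTable("tweets")

    val filtered = sqlContext.sql("SELECT text FROM tweets WHERE lang = 'es'")
    filtered.map(_.toString).saveAsTextFile("hdfs:///data/tweets_es")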

Re: LIMIT with offset in SQL queries

2014-07-04 Thread Michael Armbrust
Though I'll note that window functions are not yet supported in Spark SQL. https://issues.apache.org/jira/browse/SPARK-1442 On Fri, Jul 4, 2014 at 6:59 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: What I typically do is use row_number subquery to filter based on that. It works out pretty

Re: SQL FIlter of tweets (json) running on Disk

2014-07-04 Thread Abel Coronado Iruegas
Thank you, DataBricks Rules On Fri, Jul 4, 2014 at 1:58 PM, Michael Armbrust mich...@databricks.com wrote: sqlContext.jsonFile(data.json) Is this already available in the master branch??? Yes, and it will be available in the soon to come 1.0.1 release. But the question about

RE: How to use groupByKey and CqlPagingInputFormat

2014-07-04 Thread Mohammed Guller
As far as I know, there is not much difference, except that the outer parenthesis is redundant. The problem with your original code was a mismatch in the opening and closing parentheses. Sometimes the error messages are misleading :-) Do you see any performance difference with the

Re: reading compress lzo files

2014-07-04 Thread Gurvinder Singh
An update on this issue: now Spark is able to read the LZO file, recognize that it has an index, and start multiple map tasks. You need to use the following function instead of textFile: csv =
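The message is cut off here; for the archive, a sketch of what that call typically looks like with the hadoop-lzo input format (it assumes the com.hadoop hadoop-lzo jar is on the classpath and that the file has been indexed; the path is a placeholder):

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // With an .lzo.index present, each indexed block becomes its own map task.
    val csv = sc.newAPIHadoopFile(
        "hdfs:///data/file.csv.lzo",
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map(_._2.toString)
    println(csv.count())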

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-04 Thread Thomas Robert
Hi all, I too am having some issues with *RegressionWithSGD algorithms. Concerning your issue Eustache, this could be due to the fact that these regression algorithms use a fixed step (that is divided by sqrt(iteration)). During my tests, quite often, the algorithm diverged to an infinite cost, I

Java sample for using cassandra-driver-spark

2014-07-04 Thread M Singh
Hi: Is there a Java sample fragment for using cassandra-driver-spark ? Thanks

classnotfound error due to groupByKey

2014-07-04 Thread Joe L
Hi, When I run the following piece of code, it throws a ClassNotFoundException. Any suggestion would be appreciated. I wanted to group an RDD by key: val t = rdd.groupByKey() Error message: java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$

Spark Streaming Kafka takes a long time at 'take at DStream.scala:586'

2014-07-04 Thread xiemeilong
I am using kafka 0.8.1 and spark-streaming-kafka_2.10-0.9.0-cdh5.0.2. My analysis is simple, so I am confused about why it spends so long at 'take at DStream.scala:586' - it takes 2 to 8 minutes or longer. I don't know how to find the reason. Hoping for your help. Sorry for my poor English.