Re: need someone to help clear some questions.

2014-03-07 Thread Mayur Rustagi
groups.google.com/forum/#!forum/shark-users Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Mar 6, 2014 at 8:08 PM, qingyang li liqingyang1...@gmail.comwrote: Hi, Yana, do you know if there is mailing list for shark

Re: how to get size of rdd in memory

2014-03-07 Thread qingyang li
Addition: 1. I have run LOAD DATA INPATH '/user/root/input/test.txt' INTO TABLE b; in Shark. I think this will create an RDD in memory, right? 2. When I run free -g, the result shows that something has been stored into memory. The file is almost 4 GB. [root@bigdata001

Re: Kryo serialization does not compress

2014-03-07 Thread pradeeps8
Hi Patrick, Thanks for your reply. I am guessing even an array type will be registered automatically. Is this correct? Thanks, Pradeep -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-serialization-does-not-compress-tp2042p2400.html Sent from the

java.lang.ClassNotFoundException in spark 0.9.0, shark 0.9.0 (pre-release) and hadoop 2.2.0

2014-03-07 Thread pradeeps8
Hi, We are currently trying to migrate to hadoop 2.2.0 and hence we have installed spark 0.9.0 and the pre-release version of shark 0.9.0. When we execute the script ( script.txt http://apache-spark-user-list.1001560.n3.nabble.com/file/n2401/script.txt ) we get the following error.

Re: major Spark performance problem

2014-03-07 Thread elyast
Hi, There is also an option to run Spark applications on top of Mesos in fine-grained mode; then fair scheduling is possible (applications will run in parallel and Mesos is responsible for scheduling all tasks), so in a sense all applications will progress in parallel, obviously in total in

Can anyone offer any insight at all?

2014-03-07 Thread Ognen Duzlevski
What is wrong with this code? A condensed set of this code works in the spark-shell. It does not work when deployed via a jar. def calcSimpleRetention(start:String,end:String,event1:String,event2:String):List[Double] = { val spd = new PipelineDate(start) val epd = new

Re: Can anyone offer any insight at all?

2014-03-07 Thread Ognen Duzlevski
Strike that. Figured it out. Don't you just hate it when you fire off an email and you figure it out as it is being sent? ;) Ognen On 3/7/14, 12:41 PM, Ognen Duzlevski wrote: What is wrong with this code? A condensed set of this code works in the spark-shell. It does not work when deployed

Re: Can anyone offer any insight at all?

2014-03-07 Thread Mayur Rustagi
The issue was with print? Printing on a worker? On Fri, Mar 7, 2014 at 10:43 AM, Ognen Duzlevski og...@plainvanillagames.com wrote: Strike that. Figured it out. Don't you just

Re: Setting properties in core-site.xml for Spark and Hadoop to access

2014-03-07 Thread Mayur Rustagi
Set them as environment variables at boot and configure both stacks to read them. On Fri, Mar 7, 2014 at 9:32 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On
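The advice above (one environment variable set at boot, read by both stacks) can be sketched in plain Python. This is only an illustration: `SHARED_FS_DEFAULT` and the config keys below are hypothetical names, not actual Spark or Hadoop configuration properties.

```python
import os

# Hypothetical shared setting, established once in the boot environment.
os.environ["SHARED_FS_DEFAULT"] = "hdfs://namenode:9000"

def spark_side_conf():
    # The Spark startup path reads the single source of truth...
    return {"spark.hadoop.fs.defaultFS": os.environ["SHARED_FS_DEFAULT"]}

def hadoop_side_conf():
    # ...and the Hadoop startup path reads the very same variable,
    # so the two stacks can never drift apart.
    return {"fs.defaultFS": os.environ["SHARED_FS_DEFAULT"]}

print(spark_side_conf())
print(hadoop_side_conf())
```

The point of the pattern is that neither stack hardcodes the value; both derive it from the boot-time environment.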

Re: Can anyone offer any insight at all?

2014-03-07 Thread Ognen Duzlevski
No. It was a logical error. val ev1rdd = f.filter(_.split(",")(0).split(":")(1).replace("\"", "") == event1).map(line => (line.split(",")(2).split(":")(1).replace("\"", ""), 1)).cache should have mapped to (…, 0), not (…, 1). I have had the most awful time figuring out these looped things. It seems like it is next to
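The split-on-comma, split-on-colon, strip-quotes parsing in the snippet above can be sketched in plain Python. The line format and the field indices here are assumptions for illustration only; the original message does not show the actual log format.

```python
# Hypothetical line format: '"event":"signup","ts":"2014-03-07","user":"u1"'
def parse_event(line):
    # Field 0 carries the event name, field 2 the user id (assumed indices);
    # each field is a quoted key:value pair, so split on ':' and drop quotes.
    fields = line.split(",")
    event = fields[0].split(":")[1].replace('"', '')
    user = fields[2].split(":")[1].replace('"', '')
    return event, user

line = '"event":"signup","ts":"2014-03-07","user":"u1"'
print(parse_event(line))  # -> ('signup', 'u1')
```

This style of index-based parsing is exactly where off-by-one mistakes like the one described tend to hide; a small unit test over a sample line catches them immediately.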

Re: Running actions in loops

2014-03-07 Thread Mayur Rustagi
Mostly the job you are executing is not serializable; this typically happens when you have a library that is not serializable. Are you using any library like Joda-Time, etc.? On

Re: Streaming JSON string from REST Api in Spring

2014-03-07 Thread Mayur Rustagi
Easiest is to use a queue, Kafka for example. So push your JSON request string into Kafka, connect Spark Streaming to Kafka, pull data from it, and execute it. Spark Streaming will split up the jobs and pipeline the data.
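The suggested architecture can be sketched in plain Python, with an in-process queue standing in for Kafka: the REST layer pushes raw JSON strings, and a consumer (playing the role of Spark Streaming) pulls, decodes, and processes them. Everything here is illustrative; no Kafka or Spark API is used.

```python
import json
import queue

q = queue.Queue()  # stand-in for a Kafka topic

def rest_endpoint(payload):
    # Producer side: the REST handler just enqueues the raw JSON string.
    q.put(json.dumps(payload))

def streaming_consumer():
    # Consumer side: drain the queue, decode each record, run the "job".
    results = []
    while not q.empty():
        record = json.loads(q.get())
        results.append(record["value"] * 2)  # stand-in for real processing
    return results

rest_endpoint({"value": 1})
rest_endpoint({"value": 2})
print(streaming_consumer())  # -> [2, 4]
```

The design point is decoupling: the REST side never blocks on processing, and the consumer can be scaled or restarted independently, which is exactly what a real Kafka + Spark Streaming pairing buys you.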

Help connecting to the cluster

2014-03-07 Thread Yana Kadiyska
Hi Spark users, could someone help me out? My company has a fully functioning Spark cluster with Shark running on top of it (as part of the same cluster, on the same LAN). I'm interested in running raw Spark code against it but am running into the following issue -- it seems like the machine

[BLOG] Spark on Cassandra w/ Calliope

2014-03-07 Thread Brian O'Neill
FWIW - I posted some notes to help people get started quickly with Spark on C*. http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html (tnx again to Rohit and team for all of their help) -brian -- Brian ONeill CTO, Health Market Science (http://healthmarketscience.com)

Re: [BLOG] Spark on Cassandra w/ Calliope

2014-03-07 Thread Ognen Duzlevski
Nice, thanks :) Ognen On 3/7/14, 2:48 PM, Brian O'Neill wrote: FWIW - I posted some notes to help people get started quickly with Spark on C*. http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html (tnx again to Rohit and team for all of their help) -brian -- Brian

Re: Running actions in loops

2014-03-07 Thread Ognen Duzlevski
Mayur, have not thought of that. Yes, I use jodatime. What is the scope that this serialization issue applies to? Only the method making a call into / using such a library? The whole class the method using such a library belongs to? Sorry if it is a dumb question :) Ognen On 3/7/14, 1:29 PM,

Re: Setting properties in core-site.xml for Spark and Hadoop to access

2014-03-07 Thread Nicholas Chammas
Mayur, So looking at the section on environment variables here: http://spark.incubator.apache.org/docs/latest/configuration.html#environment-variables, are you saying to set these options via SPARK_JAVA_OPTS -D? On a related note, in looking around I just discovered this command line tool for

Class not found in Kafka-Stream due to multi-thread without correct ClassLoader?

2014-03-07 Thread Aries Kong
Hi, I'm trying to run a kafka-stream and get a strange exception. The streaming is created by the following code: val lines = KafkaUtils.createStream[String, VtrRecord, StringDecoder, VtrRecordDeserializer](ssc, kafkaParams.toMap, topicpMap, StorageLevel.MEMORY_AND_DISK_SER_2) 'VtrRecord'

Re: Running actions in loops

2014-03-07 Thread Mayur Rustagi
So the whole function closure you want to apply on your RDD needs to be serializable, so that it can be serialized and sent to workers to operate on the RDD. So objects of Joda-Time cannot be serialized and sent, hence Joda-Time does not work. Two bad answers: 1. initialize Joda-Time for each row and complete the work
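The failure mode described above can be demonstrated in plain Python with `pickle` standing in for Spark's closure serializer: a task object holding a non-serializable member (a lock here stands in for a Joda-Time instance) cannot be serialized, while a task that builds the object inside the call can. This is a sketch of the concept, not Spark code.

```python
import pickle
import threading

class BadTask:
    def __init__(self):
        # Non-serializable member captured up front: a lock, like a file
        # handle or many library objects, cannot be pickled.
        self.helper = threading.Lock()

    def __call__(self, row):
        return row

class GoodTask:
    def __call__(self, row):
        # Built inside the call ("bad answer 1" above): nothing
        # non-serializable is ever part of the shipped task.
        helper = threading.Lock()
        return row

try:
    pickle.dumps(BadTask())
    bad_serializable = True
except TypeError:
    bad_serializable = False

good_bytes = pickle.dumps(GoodTask())  # serializes fine

print(bad_serializable)     # -> False
print(len(good_bytes) > 0)  # -> True
```

The per-row construction works but pays the initialization cost on every record, which is why it is called a bad answer.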

Re: Help connecting to the cluster

2014-03-07 Thread Mayur Rustagi
The driver contains the DAG scheduler, which manages the stages of jobs and needs to talk back and forth with workers. So you can run the driver on any machine that can reach the master and workers (even your laptop). But the driver will need to be reachable by all machines. I think 0.9.0 added an ability for the driver to

Re: Explain About Logs NetworkWordcount.scala

2014-03-07 Thread Tathagata Das
I am not sure how to debug this without any more information about the source. Can you monitor on the receiver side that data is being accepted by the receiver but not reported? TD On Wed, Mar 5, 2014 at 7:23 AM, eduardocalfaia e.costaalf...@unibs.itwrote: Hi TD, I have seen in the web UI

Re: Running actions in loops

2014-03-07 Thread Nick Pentreath
There is a #3, which is to use mapPartitions and init one Joda-Time object per partition, which is less overhead for large objects. Sent from Mailbox for iPhone On Sat, Mar 8, 2014 at 2:54 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: So the whole function closure you want to apply on your RDD needs
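The per-partition initialization pattern can be simulated in plain Python, with partitions modeled as lists of rows and a counter tracking how often the "expensive" setup actually runs. This is a conceptual sketch, not the Spark `mapPartitions` API.

```python
init_count = 0

def expensive_formatter():
    # Stand-in for building a costly object (e.g. a Joda-Time formatter).
    global init_count
    init_count += 1
    return lambda row: f"row={row}"

def map_partitions(partitions, fn):
    # Mimics RDD.mapPartitions: fn receives an iterator over one partition.
    return [list(fn(iter(p))) for p in partitions]

def per_partition(rows):
    fmt = expensive_formatter()  # one init per partition, not per row
    for row in rows:
        yield fmt(row)

partitions = [[1, 2, 3], [4, 5]]
out = map_partitions(partitions, per_partition)
print(out)         # -> [['row=1', 'row=2', 'row=3'], ['row=4', 'row=5']]
print(init_count)  # -> 2
```

Five rows are processed but the setup runs only twice, once per partition, which is the whole advantage over per-row initialization.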