Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
Hi, Thanks! I found out that I wasn't setting the SPARK_JAVA_OPTS correctly. I took a look at the process table and saw that the "org.apache.spark.executor.CoarseGrainedExecutorBackend" didn't have the -Dspark.local.dir set. On 28 Mar, 2014, at 1:05 pm, Matei Zaharia wrote: > I see, are

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
Oh sorry, that was a mistake, the default level is MEMORY_ONLY! My doubt was: between two different experiments, do the RDDs cached in memory need to be unpersisted, or does it not matter?

Re: Setting SPARK_MEM higher than available memory in driver

2014-03-27 Thread Tsai Li Ming
Thanks. Got it working. On 28 Mar, 2014, at 2:02 pm, Aaron Davidson wrote: > Assuming you're using a new enough version of Spark, you should use > spark.executor.memory to set the memory for your executors, without changing > the driver memory. See the docs for your version of Spark. > > > O

Re: Setting SPARK_MEM higher than available memory in driver

2014-03-27 Thread Aaron Davidson
Assuming you're using a new enough version of Spark, you should use spark.executor.memory to set the memory for your executors, without changing the driver memory. See the docs for your version of Spark. On Thu, Mar 27, 2014 at 10:48 PM, Tsai Li Ming wrote: > Hi, > > My worker nodes have more me
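
A minimal sketch of that advice, assuming a recent enough Spark; the master URL and memory size below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Executor memory is set as a configuration property; the driver heap
    // stays whatever the JVM running this program was launched with.
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")   // placeholder master URL
      .setAppName("executor-memory-example")
      .set("spark.executor.memory", "64g")     // memory per executor on the workers
    val sc = new SparkContext(conf)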

Re: Running a task once on each executor

2014-03-27 Thread Christopher Nguyen
Deenar, yes, you may indeed be overthinking it a bit, about how Spark executes maps/filters etc. I'll focus on the high-order bits so it's clear. Let's assume you're doing this in Java. Then you'd pass some MyMapper instance to JavaRDD#map(myMapper). So you'd have a class MyMapper extends Fu

Setting SPARK_MEM higher than available memory in driver

2014-03-27 Thread Tsai Li Ming
Hi, My worker nodes have more memory than the host that I’m submitting my driver program, but it seems that SPARK_MEM is also setting the Xmx of the spark shell? $ SPARK_MEM=100g MASTER=spark://XXX:7077 bin/spark-shell Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x7f

Re: Configuring shuffle write directory

2014-03-27 Thread Matei Zaharia
I see, are you sure that was in spark-env.sh instead of spark-env.sh.template? You need to copy it to just a .sh file. Also make sure the file is executable. Try doing println(sc.getConf.toDebugString) in your driver program and seeing what properties it prints. As far as I can tell, spark.local
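
For reference, the check Matei suggests can be run directly from the spark-shell driver:

    // Prints every property the driver's SparkConf actually picked up,
    // e.g. spark.local.dir, spark.executor.memory, spark.master, ...
    println(sc.getConf.toDebugString)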

Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
Yes, I have tried that by adding it to the Worker. I can see the "app-20140328124540-000" in the local spark directory of the worker. But the "spark-local" directories are always written to /tmp, since the default spark.local.dir is taken from java.io.tmpdir? On 28 Mar, 2014, at 12:42 pm,

Re: Configuring shuffle write directory

2014-03-27 Thread Matei Zaharia
Yes, the problem is that the driver program is overriding it. Have you set it manually in the driver? Or how did you try setting it in workers? You should set it by adding export SPARK_JAVA_OPTS="-Dspark.local.dir=whatever" to conf/spark-env.sh on those workers. Matei On Mar 27, 2014, at 9:04

Re: Configuring shuffle write directory

2014-03-27 Thread Tsai Li Ming
Anyone can help? How can I configure a different spark.local.dir for each executor? On 23 Mar, 2014, at 12:11 am, Tsai Li Ming wrote: > Hi, > > Each of my worker node has its own unique spark.local.dir. > > However, when I run spark-shell, the shuffle writes are always written to > /tmp des

Replicating RDD elements

2014-03-27 Thread David Thomas
How can we replicate RDD elements? Say I have 1 element and 100 nodes in the cluster. I need to replicate this one item on all the nodes i.e. effectively create an RDD of 100 elements.
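
One way to do this, sketched under the assumption that the single element fits in driver memory (names are illustrative):

    // Build an RDD with 100 copies of the element, one per partition,
    // so each of the 100 nodes can end up holding a copy.
    val item = "the-one-element"
    val replicated = sc.parallelize(Seq.fill(100)(item), 100)
    // replicated.partitions.size == 100, replicated.count() == 100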

Re: pySpark memory usage

2014-03-27 Thread Matei Zaharia
I see, did this also fail with previous versions of Spark (0.9 or 0.8)? We’ll try to look into these, seems like a serious error. Matei On Mar 27, 2014, at 7:27 PM, Jim Blomo wrote: > Thanks, Matei. I am running "Spark 1.0.0-SNAPSHOT built for Hadoop > 1.0.4" from GitHub on 2014-03-18. > > I

ArrayIndexOutOfBoundsException in ALS.implicit

2014-03-27 Thread bearrito
Usage of negative product ids causes the above exception. The cause is the use of the product ids as a mechanism to index into the in and out block structures. Specifically on 0.9.0 it occurs at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$make
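
A possible workaround sketch, assuming the ids indeed need to be non-negative: remap the product ids to a dense 0..N-1 range before training. The data and parameters below are placeholders, not part of the report.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    def remapAndTrain(sc: SparkContext, raw: Seq[(Int, Int, Double)]) = {
      // old (possibly negative) product id -> dense non-negative id
      val productIds = raw.map(_._2).distinct.zipWithIndex.toMap
      val ratings = sc.parallelize(raw.map { case (u, p, r) =>
        Rating(u, productIds(p).toInt, r)
      })
      ALS.trainImplicit(ratings, 10, 10)   // rank = 10, iterations = 10
    }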

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
I didn't mention anything, so by default it should be MEMORY_AND_DISK, right? My doubt was: between two different experiments, do the RDDs cached in memory need to be unpersisted, or does it not matter? On Fri, Mar 28, 2014 at 1:43 AM, Syed A. Hashmi wrote: > Which storage scheme are you using?

Re: pySpark memory usage

2014-03-27 Thread Jim Blomo
Thanks, Matei. I am running "Spark 1.0.0-SNAPSHOT built for Hadoop 1.0.4" from GitHub on 2014-03-18. I tried batchSizes of 512, 10, and 1 and each got me further but none have succeeded. I can get this to work -- with manual interventions -- if I omit `parsed.persist(StorageLevel.MEMORY_AND_DISK

Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-27 Thread Kanwaldeep
We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9 with a Kafka stream setup. I have Protocol Buffer 2.5 as part of the uber jar deployed on each of the Spark worker nodes. The message is compiled using 2.5, but at runtime it is being de-serialized by 2.4.1, as I'm getting the

Re: Run spark on mesos remotely

2014-03-27 Thread Wush Wu
Dear Rustagi, Thanks for your response. As far as I know, the DAG scheduler should be a part of Spark. Therefore, should I do something not mentioned in http://spark.incubator.apache.org/docs/0.8.1/running-on-mesos.html to launch the DAG scheduler? By the way, I also notice that the user of the s

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 02:10, Scott Clasen wrote: > Thanks everyone for the discussion. > > Just to note, I restarted the job yet again, and this time there are indeed > tasks being executed by both worker nodes. So the behavior does seem > inconsistent/broken atm. > > Then I added a third node to

Re: Running a task once on each executor

2014-03-27 Thread deenar.toraskar
Christopher Sorry, I might be missing the obvious, but how do I get my function called on all Executors used by the app? I don't want to use RDDs unless necessary. Once I start my shell or app, how do I get TaskNonce.getSingleton().doThisOnce() executed on each executor? @dmpour >>rdd.mapPartitio
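
A rough sketch of the pattern being discussed, assuming a spark-shell session where sc is available. TaskNonce is just the name used in this thread, not a Spark API.

    import java.util.concurrent.atomic.AtomicBoolean

    // A Scala object is instantiated once per JVM, so the flag below makes
    // the body run at most once per executor JVM.
    object TaskNonce {
      private val done = new AtomicBoolean(false)
      def doThisOnce(): Unit =
        if (done.compareAndSet(false, true)) {
          // one-time, per-executor initialization goes here
          println("initialized in this executor JVM")
        }
    }

    // Touch every partition so every executor runs the closure at least once;
    // the AtomicBoolean collapses that to once per JVM.
    val probe = sc.parallelize(1 to 1000, 48)
    probe.foreachPartition { _ => TaskNonce.doThisOnce() }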

RDD[Array] question

2014-03-27 Thread Walrus theCat
Sup y'all, if I have an RDD[Array] and I do some operation on the RDD, then each Array is going to get instantiated on some individual machine, correct (or does it spread it out)? Thanks

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Thanks everyone for the discussion. Just to note, I restarted the job yet again, and this time there are indeed tasks being executed by both worker nodes. So the behavior does seem inconsistent/broken atm. Then I added a third node to the cluster, and a third executor came up, and everything brok

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 01:38, Evgeny Shishkin wrote: > > On 28 Mar 2014, at 01:32, Tathagata Das wrote: > >> Yes, no one has reported this issue before. I just opened a JIRA on what I >> think is the main problem here >> https://spark-project.atlassian.net/browse/SPARK-1340 >> Some of the receiv

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 01:44, Tathagata Das wrote: > The more I think about it, the problem is not about /tmp, it's more about the > workers not having enough memory. Blocks of received data could be falling > out of memory before they get processed. > BTW, what is the storage level that you a

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Tathagata Das
The more I think about it, the problem is not about /tmp, it's more about the workers not having enough memory. Blocks of received data could be falling out of memory before they get processed. BTW, what is the storage level that you are using for your input stream? If you are using MEMORY_ONLY,
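
For reference, a sketch of creating the Kafka input stream with an explicit storage level that can spill to disk and is replicated; the host, group, and topic names are placeholders, and it assumes the spark-streaming-kafka artifact is on the classpath and ssc is an existing StreamingContext.

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaStream = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",                      // ZooKeeper quorum
      "my-consumer-group",                 // consumer group id
      Map("my-topic" -> 8),                // topic -> receiver threads
      StorageLevel.MEMORY_AND_DISK_SER_2)  // can spill to disk, replicated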

Re: spark streaming and the spark shell

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 01:37, Tathagata Das wrote: > I see! As I said in the other thread, no one reported these issues until now! > A good and not-too-hard fix is to add the functionality of limiting the > data rate at which the receivers receive. I have opened a JIRA. > Yes, actually you

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 01:32, Tathagata Das wrote: > Yes, no one has reported this issue before. I just opened a JIRA on what I > think is the main problem here > https://spark-project.atlassian.net/browse/SPARK-1340 > Some of the receivers don't get restarted. > I have a bunch of refactoring in the N

Re: spark streaming and the spark shell

2014-03-27 Thread Tathagata Das
I see! As I said in the other thread, no one reported these issues until now! A good and not-too-hard fix is to add the functionality of limiting the data rate at which the receivers receive. I have opened a JIRA. TD On Thu, Mar 27, 2014 at 3:28 PM, Evgeny Shishkin wrote: > > On 28 Mar 2014

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Tathagata Das
Yes, no one has reported this issue before. I just opened a JIRA on what I think is the main problem here https://spark-project.atlassian.net/browse/SPARK-1340 Some of the receivers don't get restarted. I have a bunch of refactoring in the NetworkReceiver ready to be posted as a PR that should fix this

Re: spark streaming and the spark shell

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 01:13, Tathagata Das wrote: > Seems like the configuration of the Spark worker is not right. Either the > worker has not been given enough memory or the allocation of the memory to > the RDD storage needs to be fixed. If configured correctly, the Spark workers > should not

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 01:11, Scott Clasen wrote: > Evgeniy Shishkin wrote >> So, at the bottom — kafka input stream just does not work. > > > That was the conclusion I was coming to as well. Are there open tickets > around fixing this up? > I am not aware of such. Actually nobody complained on

Re: spark streaming and the spark shell

2014-03-27 Thread Tathagata Das
Seems like the configuration of the Spark worker is not right. Either the worker has not been given enough memory or the allocation of the memory to the RDD storage needs to be fixed. If configured correctly, the Spark workers should not get OOMs. On Thu, Mar 27, 2014 at 2:52 PM, Evgeny Shishkin

Re: YARN problem using an external jar in worker nodes Inbox x

2014-03-27 Thread Sandy Ryza
That bug only appears to apply to spark-shell. Do things work in yarn-client mode or on a standalone cluster? Are you passing a path with parent directories to addJar? On Thu, Mar 27, 2014 at 3:01 PM, Sung Hwan Chung wrote: > Well, it says that the jar was successfully added but can't referenc

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Evgeniy Shishkin wrote > So, at the bottom — kafka input stream just does not work. That was the conclusion I was coming to as well. Are there open tickets around fixing this up? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KafkaInputDStream-mapping-of

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Evgeny Shishkin
On 28 Mar 2014, at 00:34, Scott Clasen wrote: Actually, looking closer, it is stranger than I thought: in the Spark UI, one executor has executed 4 tasks, and one has executed 1928. Can anyone explain the workings of a KafkaInputStream wrt kafka partitions and mapping to spark executors and ta

Re: YARN problem using an external jar in worker nodes Inbox x

2014-03-27 Thread Sung Hwan Chung
Well, it says that the jar was successfully added but can't reference classes from it. Does this have anything to do with this bug? http://stackoverflow.com/questions/22457645/when-to-use-spark-classpath-or-sparkcontext-addjar On Thu, Mar 27, 2014 at 2:57 PM, Sandy Ryza wrote: > I just tried t

Re: YARN problem using an external jar in worker nodes Inbox x

2014-03-27 Thread Sandy Ryza
I just tried this in CDH (only a few patches ahead of 0.9.0) and was able to include a dependency with --addJars successfully. Can you share how you're invoking SparkContext.addJar? Anything interesting in the application master logs? -Sandy On Thu, Mar 27, 2014 at 11:35 AM, Sung Hwan Chung

Re: spark streaming and the spark shell

2014-03-27 Thread Evgeny Shishkin
> >> 2. I notice that once I start ssc.start(), my stream starts processing and >> continues indefinitely...even if I close the socket on the server end (I'm >> using unix command "nc" to mimic a server as explained in the streaming >> programming guide .) Can I tell my stream to detect if it's

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
Actually, looking closer, it is stranger than I thought: in the Spark UI, one executor has executed 4 tasks, and one has executed 1928. Can anyone explain the workings of a KafkaInputStream wrt kafka partitions and mapping to spark executors and tasks? -- View this message in context: http:/

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Scott Clasen
Heh, sorry, that wasn't a clear question. I know 'how' to set it but don't know what value to use in a mesos cluster; since the processes are running in lxc containers they won't be sharing a filesystem (or machine for that matter). I can't use an s3n:// url for local dir, can I? -- View this message

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Tathagata Das
spark.local.dir should be specified in the same way as other configuration parameters. On Thu, Mar 27, 2014 at 10:32 AM, Scott Clasen wrote: > I think now that this is because spark.local.dir is defaulting to /tmp, and > since the tasks a
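
In code that would look something like the sketch below; the directory is a placeholder and must exist on every host running executors.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("local-dir-example")
      .set("spark.local.dir", "/mnt/spark-scratch")   // scratch space for shuffle files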

Re: spark streaming and the spark shell

2014-03-27 Thread Tathagata Das
Very good questions! Responses inline. TD On Thu, Mar 27, 2014 at 8:02 AM, Diana Carroll wrote: > I'm working with spark streaming using spark-shell, and hoping folks could > answer a few questions I have. > > I'm doing WordCount on a socket stream: > > import org.apache.spark.streaming.Streamin

how to create a DStream from bunch of RDDs

2014-03-27 Thread Adrian Mocanu
I create several RDDs by merging several consecutive RDDs from a DStream. Is there a way to add these new RDDs to a DStream? -Adrian
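
One possible approach, sketched below, is StreamingContext.queueStream, which turns a queue of pre-built RDDs into a DStream (assumes an existing SparkContext sc):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(2))
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val dstream = ssc.queueStream(rddQueue)    // each queued RDD becomes one batch

    rddQueue += sc.parallelize(1 to 100)
    rddQueue += sc.parallelize(101 to 200)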

Re: GC overhead limit exceeded

2014-03-27 Thread Syed A. Hashmi
Which storage scheme are you using? I am guessing it is MEMORY_ONLY. For large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER work better. You can call unpersist on an RDD to remove it from the cache, though. On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna wrote: > No, I am running on 0.8.1. > Yes, I
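
A small sketch of that advice; the input path is a placeholder.

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("/path/to/input.txt")
      .map(_.toUpperCase)
      .persist(StorageLevel.MEMORY_AND_DISK)   // spills to disk instead of failing

    data.count()       // experiment 1: computes and caches
    data.count()       // reuses the cached copy
    data.unpersist()   // free the cache before the next experiment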

Re: Configuring distributed caching with Spark and YARN

2014-03-27 Thread Mayur Rustagi
Is this equivalent to addJar? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Mar 27, 2014 at 3:58 AM, santhoma wrote: > Curious to know, were you able to do distributed caching for spark? > > I have done that for

Re:

2014-03-27 Thread Mayur Rustagi
You have to raise the global limit as root. Also you have to do that on the whole cluster. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Mar 27, 2014 at 4:07 AM, Hahn Jiang wrote: > I set "ulimit -n

Re: Run spark on mesos remotely

2014-03-27 Thread Mayur Rustagi
Yes but you have to maintain connection of that machine to the master cluster as the driver with DAG scheduler runs on that machine. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Mar 27, 2014 at 4:09

Re: spark streaming: what is awaitTermination()?

2014-03-27 Thread Tathagata Das
The execution of Spark Streaming (started with StreamingContext.start()) can stop in two ways. 1. streamingContext.stop() is called (could be from a different thread) 2. some exception occurs in the processing of data. awaitTermination is the right way for the main thread that started the context t
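
A minimal skeleton of that lifecycle (socket host and port are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()             // starts the receivers and job generation
    ssc.awaitTermination()  // blocks until ssc.stop() is called or an error is thrown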

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Patrick Wendell
If you call repartition() on the original stream you can set the level of parallelism after it's ingested from Kafka. I'm not sure how it maps kafka topic partitions to tasks for the ingest though. On Thu, Mar 27, 2014 at 11:09 AM, Scott Clasen wrote: > I have a simple streaming job that create
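
A sketch of that suggestion, assuming a Spark version where DStream.repartition is available and an existing kafkaStream DStream:

    // Spread each received batch over 8 partitions before further processing,
    // independent of how many Kafka receivers fed it.
    val repartitioned = kafkaStream.repartition(8)
    repartitioned.foreachRDD { rdd =>
      println("partitions in this batch: " + rdd.partitions.size)
    }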

Re: Announcing Spark SQL

2014-03-27 Thread Patrick Wendell
Hey Rohit, I think external tables based on Cassandra or other datastores will work out-of-the box if you build Catalyst with Hive support. Michael may have feelings about this but I'd guess the longer term design for having schema support for Cassandra/HBase etc likely wouldn't rely on hive exte

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
No, I am running on 0.8.1. Yes, I am caching a lot: I am benchmarking a simple code in Spark where 512mb, 1g and 2g text files are taken, some basic intermediate operations are done, while the intermediate results that will be used in subsequent operations are cached. I thought that, we need not m

Re: YARN problem using an external jar in worker nodes Inbox x

2014-03-27 Thread Sung Hwan Chung
Yea it's in a standalone mode and I did use SparkContext.addJar method and tried setting setExecutorEnv "SPARK_CLASSPATH", etc. but none of it worked. I finally made it work by modifying the ClientBase.scala code where I set 'appMasterOnly' to false before the addJars contents were added to distCa

KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Scott Clasen
I have a simple streaming job that creates a kafka input stream on a topic with 8 partitions, and does a forEachRDD The job and tasks are running on mesos, and there are two tasks running, but only 1 task doing anything. I also set spark.streaming.concurrentJobs=8 but still there is only 1 task

Re: GC overhead limit exceeded

2014-03-27 Thread Andrew Or
Are you caching a lot of RDDs? If so, maybe you should unpersist() the ones that you're not using. Also, if you're on 0.9, make sure spark.shuffle.spill is enabled (which it is by default). This allows your application to spill in-memory content to disk if necessary. How much memory are you givin

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-27 Thread Scott Clasen
I think now that this is because spark.local.dir is defaulting to /tmp, and since the tasks are not running on the same machine, the file is not found when the second task takes over. How do you set spark.local.dir appropriately when running on mesos? -- View this message in context: http://ap

Re: Running a task once on each executor

2014-03-27 Thread Christopher Nguyen
Deenar, dmpour is correct in that there's a many-to-many mapping between executors and partitions (an executor can be assigned multiple partitions, and a given partition can in principle move to a different executor). I'm not sure why you seem to require this problem statement to be solved with RDDs.

Re: GC overhead limit exceeded

2014-03-27 Thread Ognen Duzlevski
Look at the tuning guide on Spark's webpage for strategies to cope with this. I have run into quite a few memory issues like these, some are resolved by changing the StorageLevel strategy and employing things like Kryo, some are solved by specifying the number of tasks to break down a given ope

Re: Running a task once on each executor

2014-03-27 Thread dmpour23
How exactly does rdd.mapPartitions get executed once in each VM? I am running mapPartitions and the call function seems not to execute the code. JavaPairRDD twos = input.map(new Split()).sortByKey().partitionBy(new HashPartitioner(k)); twos.values().saveAsTextFile(args[2]); JavaRDD ls = twos.va

RE: StreamingContext.transform on a DStream

2014-03-27 Thread Adrian Mocanu
Please disregard, I didn't see the Seq wrapper. From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: March-27-14 11:57 AM To: u...@spark.incubator.apache.org Subject: StreamingContext.transform on a DStream Found this transform fn in StreamingContext which takes in a DStream[_] and a func

StreamingContext.transform on a DStream

2014-03-27 Thread Adrian Mocanu
Found this transform fn in StreamingContext which takes in a DStream[_] and a function which acts on each of its RDDs. Unfortunately I can't figure out how to transform my DStream[(String,Int)] into DStream[_] /*** Create a new DStream in which each RDD is generated by applying a function on RD
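
For reference, a sketch of calling that variant (note the Seq wrapper pointed out in the follow-up): the function receives the batch's RDDs as Seq[RDD[_]], so each one has to be cast back to its concrete element type.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.DStream

    def sumPerKey(ssc: StreamingContext, counts: DStream[(String, Int)]): DStream[(String, Int)] =
      ssc.transform(Seq(counts), (rdds: Seq[RDD[_]], time: Time) =>
        rdds.head.asInstanceOf[RDD[(String, Int)]].reduceByKey(_ + _))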

Re: GC overhead limit exceeded

2014-03-27 Thread Sean Owen
This is another way of Java saying "you ran out of heap space". As less and less room is available, the GC kicks in more often, freeing less each time. Before the very last byte of memory is gone, Java may declare defeat. That's why it's taking so long, and you simply need a larger heap in whatever

GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
"java.lang.OutOfMemoryError: GC overhead limit exceeded" What is the problem. The same code, i run, one instance it runs in 8 second, next time it takes really long time, say 300-500 seconds... I see the logs a lot of GC overhead limit exceeded is seen. What should be done ?? Please can someone t

Spark powered wikipedia analysis and exploration

2014-03-27 Thread Guillaume Pitel
Hi Spark users, I don't know if it's the right place to announce it, but Spark has a new visible use case through a demo we put online here: http://wikinsights.org It allows you to explore the English Wikipedia with a few added benefits from our propri

spark streaming and the spark shell

2014-03-27 Thread Diana Carroll
I'm working with spark streaming using spark-shell, and hoping folks could answer a few questions I have. I'm doing WordCount on a socket stream: import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.Seconds var s

function state lost when next RDD is processed

2014-03-27 Thread Adrian Mocanu
Is there a way to pass a custom function to spark to run it on the entire stream? For example, say I have a function which sums up values in each RDD and then across RDDs. I've tried with map, transform, reduce. They all apply my sum function on 1 RDD. When the next RDD comes the function start
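
One way to carry state across the RDDs of a stream, sketched under the assumption that the data is keyed and that checkpointing is enabled on the StreamingContext, is updateStateByKey:

    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    // Keeps a running sum per key across batches; requires ssc.checkpoint(dir).
    def runningSum(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
      pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + newValues.sum)
      }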

spark streaming: what is awaitTermination()?

2014-03-27 Thread Diana Carroll
The API docs for ssc.awaitTermination say simply "Wait for the execution to stop. Any exceptions that occurs during the execution will be thrown in this thread." Can someone help me understand what this means? What causes execution to stop? Why do we need to wait for that to happen? I tried rem

WikipediaPageRank Data Set

2014-03-27 Thread Niko Stahl
Hello, I would like to run the WikipediaPageRank example, but the Wikipedia dump XML files are no longer available on Freebase. Does anyone

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
when there is something new, it's also cool to let imagination fly far away ;) On Thu, Mar 27, 2014 at 2:20 PM, andy petrella wrote: > Yes it could, of course. I didn't say that there is no tool to do it, > though ;-). > > Andy > > > On Thu, Mar 27, 2014 at 12:49 PM, yana wrote: > >> Does Shark

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
Yes it could, of course. I didn't say that there is no tool to do it, though ;-). Andy On Thu, Mar 27, 2014 at 12:49 PM, yana wrote: > Does Shark not suit your needs? That's what we use at the moment and it's > been good > > > Sent from my Samsung Galaxy S®4 > > > Original message ---

Re: Announcing Spark SQL

2014-03-27 Thread yana
Does Shark not suit your needs? That's what we use at the moment and it's been good Sent from my Samsung Galaxy S®4 Original message From: andy petrella Date:03/27/2014 6:08 AM (GMT-05:00) To: user@spark.apache.org Subject: Re: Announcing Spark SQL nope (what I said :-

Re: Running a task once on each executor

2014-03-27 Thread deenar.toraskar
Hi Christopher >>which you would invoke as TaskNonce.getSingleton().doThisOnce() from within the map closure. Say I have a cluster with 24 workers (one thread per worker, SPARK_WORKER_CORES). My application would have 24 executors, each with its own VM. The RDDs I process have millions of rows and

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On Thu, Mar 27, 2014 at 11:08 AM, andy petrella wrote: > nope (what I said :-P) > That's also my answer to my own question :D but I didn't understand that in your sentence: "my2c is that this feature is also important to enable ad-hoc queries which is done at runtime." > > > On Thu, Mar 27, 20

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
nope (what I said :-P) On Thu, Mar 27, 2014 at 11:05 AM, Pascal Voitot Dev < pascal.voitot@gmail.com> wrote: > > > > On Thu, Mar 27, 2014 at 10:22 AM, andy petrella > wrote: > >> I just mean queries sent at runtime ^^, like for any RDBMS. >> In our project we have such requirement to have a

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On Thu, Mar 27, 2014 at 10:22 AM, andy petrella wrote: > I just mean queries sent at runtime ^^, like for any RDBMS. > In our project we have such requirement to have a layer to play with the > data (custom and low level service layer of a lambda arch), and something > like this is interesting. >

Re: Change print() in JavaNetworkWordCount

2014-03-27 Thread Eduardo Costa Alfaia
Thank you very much Sourav. BR On 3/26/14, 17:29, Sourav Chandra wrote: def print() { def foreachFunc = (rdd: RDD[T], time: Time) => { val total = rdd.collect().toList println("---") println("Time: " + time) println("

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
I just mean queries sent at runtime ^^, like for any RDBMS. In our project we have such requirement to have a layer to play with the data (custom and low level service layer of a lambda arch), and something like this is interesting. On Thu, Mar 27, 2014 at 10:15 AM, Pascal Voitot Dev < pascal.voi

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
On 27 March 2014 at 09:47, "andy petrella" wrote: > > I hijack the thread, but my2c is that this feature is also important to enable ad-hoc queries which are done at runtime. It doesn't remove interest in such a macro for precompiled jobs of course, but it may not be the first use case envisioned wi

java.lang.NoClassDefFoundError: org/apache/spark/util/Vector

2014-03-27 Thread Kal El
I am getting this error when I try to run K-Means in spark-0.9.0: "java.lang.NoClassDefFoundError: org/apache/spark/util/Vector         at java.lang.Class.getDeclaredMethods0(Native Method)         at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)         at java.lang.Class.getMethod0(C

Re: Announcing Spark SQL

2014-03-27 Thread andy petrella
I hijack the thread, but my2c is that this feature is also important to enable ad-hoc queries which are done at runtime. It doesn't remove interest in such a macro for precompiled jobs of course, but it may not be the first use case envisioned with this Spark SQL. Again, only my0.2c (ok I divided b

Re: Announcing Spark SQL

2014-03-27 Thread Pascal Voitot Dev
Hi, Quite interesting! Suggestion: why not go even fancier & parse SQL queries at compile-time with a macro? ;) Pascal On Wed, Mar 26, 2014 at 10:58 PM, Michael Armbrust wrote: > Hey Everyone, > > This already went out to the dev list, but I wanted to put a pointer here > as well to a new fe

Run spark on mesos remotely

2014-03-27 Thread Wush Wu
Dear all, We have a spark 0.8.1 cluster on mesos 0.15. It works if I submit the job from the master of mesos. That is to say, I spawn the spark shell or launch the scala application on the master of mesos. However, when I submit the job from another machine, the job will be lost. The logs show that

Re:

2014-03-27 Thread Hahn Jiang
I set "ulimit -n 10" in conf/spark-env.sh, is it too small? On Thu, Mar 27, 2014 at 3:36 PM, Sonal Goyal wrote: > Hi Hahn, > > What's the ulimit on your systems? Please check the following link for a > discussion on the too many files open. > > > http://mail-archives.apache.org/mod_mbox/spa

Re: Configuring distributed caching with Spark and YARN

2014-03-27 Thread santhoma
Curious to know, were you able to do distributed caching for spark? I have done that for hadoop and pig, but could not find a way to do it in spark -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-distributed-caching-with-Spark-and-YARN-tp1074p33

Re:

2014-03-27 Thread Sonal Goyal
Hi Hahn, What's the ulimit on your systems? Please check the following link for a discussion on the too many files open. http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3ccangvg8qpn_wllsrcjegdb7hmza2ux7myxzhfvtz+b-sdxdk...@mail.gmail.com%3E Sent from my iPad > On Mar 27, 2014

Re: How to set environment variable for a spark job

2014-03-27 Thread santhoma
Got it finally, pasting it here so that it will be useful for others: val conf = new SparkConf() .setJars(jarList); conf.setExecutorEnv("ORACLE_HOME", myOraHome) conf.setExecutorEnv("SPARK_JAVA_OPTS", "-Djava.library.path=/my/custom/path") -- View this message in context: http:/