Re: KMeans code is rubbish

2014-07-14 Thread Wanda Hawk
The problem is that I get the same results every time. On Friday, July 11, 2014 7:22 PM, Ameet Talwalkar atalwal...@gmail.com wrote: Hi Wanda, As Sean mentioned, K-means is not guaranteed to find an optimal answer, even for seemingly simple toy examples. A common heuristic to deal with this

Re: mapPartitionsWithIndex

2014-07-14 Thread Xiangrui Meng
You should return an iterator in mapPartitionsWithIndex. This is from the programming guide (http://spark.apache.org/docs/latest/programming-guide.html): mapPartitionsWithIndex(func): Similar to mapPartitions, but also provides func with an integer value representing the index of the partition,
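A minimal sketch of the pattern Xiangrui describes (not the poster's actual code, assuming a local SparkContext): the function passed to mapPartitionsWithIndex must itself return an Iterator.
import org.apache.spark.{SparkConf, SparkContext}
object MapPartitionsWithIndexExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mpwi-example").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 10, 4)
    // The function must return an Iterator; here we simply tag every element with its partition index.
    val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
      iter.map(x => (partitionIndex, x))
    }
    tagged.collect().foreach(println)
    sc.stop()
  }
}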

Error when testing with large sparse svm

2014-07-14 Thread crater
Hi, I encountered an error when testing SVM (the example one) on very large sparse data. The dataset I ran on was a toy dataset with only ten examples, but 13-million-dimensional sparse vectors with a few thousand non-zero entries. The error is shown below. I am wondering whether this is a bug or I am missing

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Yanbo Liang
Making Catalyst independent of Spark is the goal of Catalyst, but it may need time and evolution. I noticed that the package org.apache.spark.sql.catalyst.util imports org.apache.spark.util.{Utils => SparkUtils}, so that Catalyst has a dependency on Spark core. I'm not sure whether it will be replaced by

Re: Supported SQL syntax in Spark SQL

2014-07-14 Thread Martin Gammelsæter
I am very interested in the original question as well: is there any list (even if it is simply in the code) of all supported syntax for Spark SQL? On Mon, Jul 14, 2014 at 6:41 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Are you sure the code running on the cluster has been updated?

Re: Error when testing with large sparse svm

2014-07-14 Thread Xiangrui Meng
You need to set a larger `spark.akka.frameSize`, e.g., 128, for the serialized weight vector. There is a JIRA about switching automatically between sending through Akka or broadcast: https://issues.apache.org/jira/browse/SPARK-2361 . -Xiangrui On Mon, Jul 14, 2014 at 12:15 AM, crater
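A minimal sketch of setting the property before the context is created (a hedged illustration; the app name is made up, and the value is interpreted in MB):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("large-sparse-svm")
  .set("spark.akka.frameSize", "128")  // must be set before the SparkContext is created
val sc = new SparkContext(conf)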

Re: Graphx traversal and merge interesting edges

2014-07-14 Thread HHB
Hi Ankur, FYI - in a naive attempt to enhance your solution, I managed to create MergePatternPath. I think it works in the expected way (at least for the traversal problem in the last email). I modified your code a bit. Also, instead of EdgePattern I used a List of functions that match the whole edge

Re: mapPartitionsWithIndex

2014-07-14 Thread Madhura
It worked! I was struggling for a week. Thanks a lot! On Mon, Jul 14, 2014 at 12:31 PM, Xiangrui Meng [via Apache Spark User List] ml-node+s1001560n9591...@n3.nabble.com wrote: You should return an iterator in mapPartitionsWithIndex. This is from the programming guide

spark1.0.1 catalyst transform filter not push down

2014-07-14 Thread victor sheng
Hi, I encountered a weird problem in Spark SQL. I use sbt/sbt hive/console to go into the shell. I test the filter push-down by using Catalyst. scala> val queryPlan = sql("select value from (select key,value from src) a where a.key=86") scala> queryPlan.baseLogicalPlan res0:

sbt + idea + test

2014-07-14 Thread boci
Hi guys, I want to use Elasticsearch and HBase in my Spark project, and I want to create a test. I pulled up ES and ZooKeeper, but if I put val htest = new HBaseTestingUtility() into my app I get a strange exception (at compilation time, not runtime). https://gist.github.com/b0c1/4a4b3f6350816090c3b5

Running Spark on Microsoft Azure HDInsight

2014-07-14 Thread Niek Tax
Hi everyone, Currently I am working on parallelizing a machine learning algorithm using a Microsoft HDInsight cluster. I tried running my algorithm on Hadoop MapReduce, but since my algorithm is iterative the job scheduling overhead and data loading overhead severely limits the performance of my

Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-14 Thread Pei-Lun Lee
Hi, I am using spark-sql 1.0.1 to load parquet files generated from method described in: https://gist.github.com/massie/7224868 When I try to submit a select query with columns of type fixed length byte array, the following error pops up: 14/07/14 11:09:14 INFO scheduler.DAGScheduler: Failed

Error in spark: Exception in thread delete Spark temp dir

2014-07-14 Thread Rahul Bhojwani
I am getting an error saying: Exception in thread delete Spark temp dir C:\Users\shawn\AppData\Local\Temp\spark-b4f1105c-d67b-488c-83f9-eff1d1b95786 java.io.IOException: Failed to delete: C:\Users\shawn\AppData\Local\Temp\spark-b4f1105c-d67b-488c-83f9-eff1d1b95786\tmppr36zu at

Re: Problem reading in LZO compressed files

2014-07-14 Thread Ognen Duzlevski
Nicholas, thanks nevertheless! I am going to spend some time to try and figure this out and report back :-) Ognen On 7/13/14, 7:05 PM, Nicholas Chammas wrote: I actually never got this to work, which is part of the reason why I filed that JIRA. Apart from using |--jar| when starting the

Can we get a spark context inside a mapper

2014-07-14 Thread Rahul Bhojwani
Hey, My question is for this situation: Suppose we have 10 files each containing list of features in each row. Task is that for each file cluster the features in that file and write the corresponding cluster along with it in a new file. So we have to generate 10 more files by applying

Re: Spark Questions

2014-07-14 Thread Gonzalo Zarza
Thanks for your answers Shuo Xiang and Aaron Davidson! Regards, -- *Gonzalo Zarza* | PhD in High-Performance Computing | Big-Data Specialist | *GLOBANT* | AR: +54 11 4109 1700 ext. 15494 | US: +1 877 215 5230 ext. 15494 | Facebook: https://www.facebook.com/Globant | Twitter:

Re: spark1.0.1 catalyst transform filter not push down

2014-07-14 Thread Yin Huai
Hi, queryPlan.baseLogicalPlan is not the plan used for execution. Actually, the baseLogicalPlan of a SchemaRDD (queryPlan in your case) is just the parsed plan (the parsed plan will be analyzed and then optimized; finally, a physical plan will be created). The plan shows up after you execute val

Re: Running Spark on Microsoft Azure HDInsight

2014-07-14 Thread Marco Shaw
I'm a Spark and HDInsight novice, so I could be wrong... HDInsight is based on HDP2, so my guess here is that you have the option of installing/configuring Spark in cluster mode (YARN) or in standalone mode and package the Spark binaries with your job. Everything I seem to look at is related to

Re: Announcing Spark 1.0.1

2014-07-14 Thread Philip Ogren
Hi Patrick, This is great news but I nearly missed the announcement because it had scrolled off the folder view that I have Spark users list messages go to. 40+ new threads since you sent the email out on Friday evening. You might consider having someone on your team create a

Re: Running Spark on Microsoft Azure HDInsight

2014-07-14 Thread Marco Shaw
Looks like going with cluster mode is not a good idea: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-administer-use-management-portal/ Seems like a non-HDInsight VM might be needed to make it the Spark master node. Marco On Mon, Jul 14, 2014 at 12:43 PM, Marco Shaw

RE: writing FLume data to HDFS

2014-07-14 Thread Sundaram, Muthu X.
I am not sure how to write it…I tried writing to the local file system using FileWriter and PrintWriter. I tried it inside the while loop. I am able to get the text and print it, but it fails when I use regular Java classes. Shouldn't I use regular Java classes here? Can I write to only

Re: Error when testing with large sparse svm

2014-07-14 Thread crater
Hi xiangrui, Where can I set the spark.akka.frameSize ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-testing-with-large-sparse-svm-tp9592p9616.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Error when testing with large sparse svm

2014-07-14 Thread Srikrishna S
If you use Scala, you can do: val conf = new SparkConf().setMaster("yarn-client").setAppName("Logistic regression SGD fixed").set("spark.akka.frameSize", "100").setExecutorEnv("SPARK_JAVA_OPTS", "-Dspark.akka.frameSize=100"); var sc = new

Spark Streaming Json file groupby function

2014-07-14 Thread srinivas
Hi, I am new to Spark and Scala, and I am trying to do some aggregations on a JSON file stream using Spark Streaming. I am able to parse the JSON string and it is converted to map(id -> 123, name -> srini, mobile -> 12324214, score -> 123, test_type -> math). Now I want to use a GROUPBY function on each

Trouble with spark-ec2 script: --ebs-vol-size

2014-07-14 Thread Ben Horner
Hello, I'm using the spark-0.9.1-bin-hadoop1 distribution, and the ec2/spark-ec2 script within it to spin up a cluster. I tried running my processing just using the default (ephemeral) HDFS configuration, but my job errored out, saying that there was no space left. So now I'm trying to increase

Re: Supported SQL syntax in Spark SQL

2014-07-14 Thread Michael Armbrust
You can find the parser here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala In general the hive parser provided by HQL is much more complete at the moment. Long term we will likely stop using parser combinators and either

Gradient Boosted Machines

2014-07-14 Thread Daniel Bendavid
Hi, My company is strongly considering implementing a recommendation engine that is built off of statistical models using Spark. We attended the Spark Summit and were incredibly impressed with the technology and the entire community. Since then, we have been exploring the technology and

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Rahul Bhojwani
I understand that the question is very unprofessional, but I am a newbie. If this is not the right place, could you share a link to somewhere I can ask such questions? But please answer. On Mon, Jul 14, 2014 at 6:52 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Hey, My question is for this situation:

Re: Error when testing with large sparse svm

2014-07-14 Thread crater
Hi Krishna, Thanks for your help. Are you able to get your 29M data running yet? I fixed the previous problem by setting a larger spark.akka.frameSize, but now I get some other errors, shown below. Did you get these errors before? 14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Daniel Siegmann
Rahul, I'm not sure what you mean by your question being very unprofessional. You can feel free to answer such questions here. You may or may not receive an answer, and you shouldn't necessarily expect to have your question answered within five hours. I've never tried to do anything like your

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Matei Zaharia
You currently can't use SparkContext inside a Spark task, so in this case you'd have to call some kind of local K-means library. One example you can try to use is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then load your text files as an RDD of strings with SparkContext.wholeTextFiles
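A minimal sketch of this approach, with runLocalKMeans as a hypothetical stand-in for a call into a local library such as Weka, and paths invented for illustration (sc is an existing SparkContext):
def runLocalKMeans(fileContents: String): String = ???  // placeholder for e.g. Weka's SimpleKMeans run inside the task
val files = sc.wholeTextFiles("hdfs:///input/features")  // RDD[(path, contents)], one pair per file
val clustered = files.map { case (path, contents) =>
  (path, runLocalKMeans(contents))  // cluster each file locally, no nested SparkContext needed
}
clustered.saveAsTextFile("hdfs:///output/clusters")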

Re: Error when testing with large sparse svm

2014-07-14 Thread Srikrishna S
That is exactly the same error that I got. I am still having no success. Regards, Krishna On Mon, Jul 14, 2014 at 11:50 AM, crater cq...@ucmerced.edu wrote: Hi Krishna, Thanks for your help. Are you able to get your 29M data running yet? I fix the previous problem by setting larger

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Jerry Lam
Hi there, I think the question is interesting; a spark of sparks = spark I wonder if you can use the spark job server ( https://github.com/ooyala/spark-jobserver)? So in the spark task that requires a new spark context, instead of creating it in the task, contact the job server to create one and

Re: writing FLume data to HDFS

2014-07-14 Thread Tathagata Das
Stepping a bit back, if you just want to write Flume data to HDFS, you can use Flume's HDFS sink for that. Trying to do this using Spark Streaming and SparkFlumeEvent is unnecessarily complex. And I guess it is tricky to write the raw bytes from the SparkFlumeEvent into a file. If you want to do

Re: Ideal core count within a single JVM

2014-07-14 Thread lokesh.gidra
Thanks a lot for replying back. Actually, I am running the SparkPageRank example with 160GB heap (I am sure the problem is not GC because the excess time is being spent in java code only). What I have observed in Jprofiler and Oprofile outputs is that the amount of time spent in following 2

Re: Spark Streaming Json file groupby function

2014-07-14 Thread Tathagata Das
You have to import StreamingContext._ to enable groupByKey operations on DStreams. After importing that you can apply groupByKey on any DStream, that is a DStream of key-value pairs (e.g. DStream[(String, Int)]) . The data in each pair RDDs will be grouped by the first element in the tuple as the
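A minimal sketch of the import and a groupByKey over a pair DStream, assuming a local two-core master and a socket source (host, port, and key extraction are made up):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // enables groupByKey and other pair operations on DStreams
val conf = new SparkConf().setAppName("dstream-groupby").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.map(line => (line.split(",")(0), 1))  // DStream[(String, Int)]
val grouped = pairs.groupByKey()                        // grouped by the first element of each tuple
grouped.print()
ssc.start()
ssc.awaitTermination()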

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-14 Thread Tathagata Das
Seems like it is related. Possibly those PRs that Andrew mentioned are going to fix this issue. On Fri, Jul 11, 2014 at 5:51 AM, Haopu Wang hw...@qilinsoft.com wrote: I saw some exceptions like this in driver log. Can you shed some lights? Is it related with the behaviour? 14/07/11

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-14 Thread Tathagata Das
That depends on your requirements. If you want to process the 250 GB input file as a stream to emulate the stream of data, then it should be split into files (such that event ordering is maintained in those splits, if necessary). And then those splits should be moved one-by-one in the directory

Re: Ideal core count within a single JVM

2014-07-14 Thread Matei Zaharia
Are you increasing the number of parallel tasks with cores as well? With more tasks there will be more data communicated and hence more calls to these functions. Unfortunately contention is kind of hard to measure, since often the result is that you see many cores idle as they're waiting on a

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-14 Thread Tathagata Das
When you are sending data using simple socket code to send messages, are those messages \n delimited? If they are not, then the receiver of socketTextStream won't identify them as separate events, and will keep buffering them. TD On Sun, Jul 13, 2014 at 10:49 PM, kytay kaiyang@gmail.com wrote: Hi
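A minimal sketch of a newline-terminated test sender, assuming plain java.net sockets (the port and messages are made up):
import java.io.PrintWriter
import java.net.ServerSocket
val server = new ServerSocket(9999)
val socket = server.accept()                       // wait for socketTextStream to connect
val out = new PrintWriter(socket.getOutputStream, true)
Seq("event one", "event two", "event three").foreach { msg =>
  out.println(msg)                                 // println appends the '\n' the receiver needs
}
out.close(); socket.close(); server.close()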

Re: Error in JavaKafkaWordCount.java example

2014-07-14 Thread Tathagata Das
Are you compiling it within Spark using Spark's recommended way (see doc web page)? Or are you compiling it in your own project? In the latter case, make sure you are using Scala 2.10.4. TD On Sun, Jul 13, 2014 at 6:43 AM, Mahebub Sayyed mahebub...@gmail.com wrote: Hello, I am referring

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-14 Thread Michael Armbrust
This is not supported yet, but there is a PR open to fix it: https://issues.apache.org/jira/browse/SPARK-2446 On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am using spark-sql 1.0.1 to load parquet files generated from method described in:

Re: can't print DStream after reduce

2014-07-14 Thread Tathagata Das
The problem is not really for local[1] or local. The problem arises when there are more input streams than there are cores. But I agree, for people who are just beginning to use it by running it locally, there should be a check addressing this. I made a JIRA for this.

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Michael Armbrust
Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd

Re: Nested Query With Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
What sort of nested query are you talking about? Right now we only support nested queries in the FROM clause. I'd like to add support for other cases in the future. On Sun, Jul 13, 2014 at 4:11 AM, anyweil wei...@gmail.com wrote: Or is it supported? I know I could doing it myself with

Memory compute-intensive tasks

2014-07-14 Thread Ravi Pandya
I'm trying to run a job that includes an invocation of a memory- and compute-intensive multithreaded C++ program, and so I'd like to run one task per physical node. Using rdd.coalesce(# nodes) seems to just allocate one task per core, and so runs out of memory on the node. Is there any way to give the

Spark 1.0.1 EC2 - Launching Applications

2014-07-14 Thread Josh Happoldt
Hi All, I've used the spark-ec2 scripts to build a simple 1.0.1 Standalone cluster on EC2. It appears that the spark-submit script is not bundled with a spark-ec2 install. Given that: What is the recommended way to execute spark jobs on a standalone EC2 cluster? Spark-submit provides

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
Handling of complex types is somewhat limited in SQL at the moment. It'll be more complete if you use HiveQL. That said, the problem here is you are calling .name on an array. You need to pick an item from the array (using [..]) or use something like a lateral view explode. On Sat, Jul 12,
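A hedged sketch of both options through a HiveContext (table and column names people/addresses are invented; hql was the HiveQL entry point in the 1.0.x API):
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
// 1) pick a single element of the array column
hiveCtx.hql("SELECT addresses[0].city FROM people")
// 2) flatten the array with a lateral view explode
hiveCtx.hql("SELECT p.name, addr.city FROM people p LATERAL VIEW explode(p.addresses) a AS addr")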

Client application that calls Spark and receives an MLlib *model* Scala Object, not just result

2014-07-14 Thread Aris Vlasakakis
Hello Spark community, I would like to write an application in Scala that is a model server. It should have an MLlib Linear Regression model that is already trained on some big set of data, and then is able to repeatedly call myLinearRegressionModel.predict() many times and return the result.

Re: Client application that calls Spark and receives an MLlib *model* Scala Object, not just result

2014-07-14 Thread Soumya Simanta
Please look at the following. https://github.com/ooyala/spark-jobserver http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language https://github.com/EsotericSoftware/kryo You can train your model convert it to PMML and return that to your client OR You can train your model and write that

Re: Memory compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
I don't have a solution for you (sorry), but do note that rdd.coalesce(numNodes) keeps data on the same nodes where it was. If you set shuffle=true then it should repartition and redistribute the data. But it uses the hash partitioner according to the ScalaDoc - I don't know of any way to supply a

How to kill running spark yarn application

2014-07-14 Thread hsy...@gmail.com
Hi all, A newbie question: I start a Spark YARN application through spark-submit. How do I kill this app? I can kill the YARN app by yarn application -kill appid, but the application master is still running. What's the proper way to shut down the entire app? Best, Siyuan

Re: Number of executors change during job running

2014-07-14 Thread Bill Jay
Hi Tathagata, It seems repartition does not necessarily force Spark to distribute the data into different executors. I have launched a new job which uses repartition right after I received data from Kafka. For the first two batches, the reduce stage used more than 80 executors. Starting from the

Re: Memory compute-intensive tasks

2014-07-14 Thread Matei Zaharia
I think coalesce with shuffle=true will force it to have one task per node. Without that, it might be that due to data locality it decides to launch multiple ones on the same node even though the total # of tasks is equal to the # of nodes. If this is the *only* thing you run on the cluster,
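A minimal sketch of the coalesce-with-shuffle pattern, with an assumed node count and input path for illustration:
val data = sc.textFile("hdfs:///input/data")
val numNodes = 8                                           // assumed cluster size, for illustration
val repartitioned = data.coalesce(numNodes, shuffle = true)  // forces a real redistribution across nodes
val results = repartitioned.mapPartitions { records =>
  // run the external multithreaded C++ program once per partition,
  // e.g. by piping the partition's records through it
  records
}
results.count()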

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Matei Zaharia
Yeah, I'd just add a spark-util that has these things. Matei On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus

Re: Spark 1.0.1 EC2 - Launching Applications

2014-07-14 Thread Matei Zaharia
The script should be there, in the spark/bin directory. What command did you use to launch the cluster? Matei On Jul 14, 2014, at 1:12 PM, Josh Happoldt josh.happo...@trueffect.com wrote: Hi All, I've used the spark-ec2 scripts to build a simple 1.0.1 Standalone cluster on EC2. It

Re: Memory compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
Depending on how your C++ program is designed, maybe you can feed the data from multiple partitions into the same process? Getting the results back might be tricky. But that may be the only way to guarantee you're only using one invocation per node. On Mon, Jul 14, 2014 at 5:12 PM, Matei Zaharia

Re: pyspark sc.parallelize running OOM with smallish data

2014-07-14 Thread Mohit Jaggi
Continuing to debug with Scala, I tried this on local with enough memory (10g) and it is able to count the dataset. With more memory(for executor and driver) in a cluster it still fails. The data is about 2Gbytes. It is 30k * 4k doubles. On Sat, Jul 12, 2014 at 6:31 PM, Aaron Davidson

Re: Spark Streaming Json file groupby function

2014-07-14 Thread srinivas
Hi, Thanks for your reply... I imported StreamingContext and right now I am getting my DStream as something like map(id -> 123, name -> srini, mobile -> 12324214, score -> 123, test_type -> math) map(id -> 321, name -> vasu, mobile -> 73942090, score -> 324, test_type -> sci) map(id -> 432, name ->,

Re: SparkR failed to connect to the master

2014-07-14 Thread cjwang
I restarted Spark Master with spark-0.9.1 and SparkR was able to communicate with the Master. I am using the latest SparkR pkg-e1f95b6. Maybe it has problem communicating to Spark 1.0.0? -- View this message in context:

Re: Number of executors change during job running

2014-07-14 Thread Tathagata Das
Can you give me a screen shot of the stages page in the web ui, the spark logs, and the code that is causing this behavior. This seems quite weird to me. TD On Mon, Jul 14, 2014 at 2:11 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tathagata, It seems repartition does not necessarily

Re: Spark Streaming Json file groupby function

2014-07-14 Thread Tathagata Das
In general it may be a better idea to actually convert the records from hashmaps to a specific data structure. Say case class Record(id: Int, name: String, mobile: String, score: Int, test_type: String ... ) Then you should be able to do something like val records = jsonf.map(m =>
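A hedged sketch completing that idea, assuming jsonf is the DStream of parsed maps from this thread (field names mirror the sample records above):
import org.apache.spark.streaming.StreamingContext._  // for groupByKey on the resulting pair DStream
case class Record(id: Int, name: String, mobile: String, score: Int, testType: String)
val records = jsonf.map { m =>
  Record(
    m("id").toString.toInt,
    m("name").toString,
    m("mobile").toString,
    m("score").toString.toInt,
    m("test_type").toString)
}
val scoresByType = records.map(r => (r.testType, r.score)).groupByKey()  // scores grouped per test_type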

import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread durin
I'm using spark 1.0.0 (three weeks old build of latest). Along the lines of this tutorial http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html , I want to read some tweets from twitter. When trying to execute in the Spark-Shell, I get The tutorial

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
The twitter functionality is not available through the shell. 1) we separated these non-core functionality into separate subprojects so that their dependencies do not collide/pollute those of core spark 2) a shell is not really the best way to start a long running stream. Its best to use

Change when loading/storing String data using Parquet

2014-07-14 Thread Michael Armbrust
I just wanted to send out a quick note about a change in the handling of strings when loading / storing data using parquet and Spark SQL. Before, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as Strings. 9fe693

SQL + streaming

2014-07-14 Thread hsy...@gmail.com
Hi All, Couple days ago, I tried to integrate SQL and streaming together. My understanding is I can transform RDD from Dstream to schemaRDD and execute SQL on each RDD. But I got no luck Would you guys help me take a look at my code? Thank you very much! object KafkaSpark { def main(args:

Re: Stateful RDDs?

2014-07-14 Thread Tathagata Das
Trying to answer your questions as concisely as possible: 1. In the current implementation, the entire state RDD needs to be loaded for any update. It is a known limitation that we want to overcome in the future. Therefore the state DStream should not be persisted to disk, as all the data in the state
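A minimal sketch of the stateful pattern being discussed, assuming pairs is a DStream[(String, Int)] and ssc the StreamingContext (the checkpoint path is made up):
import org.apache.spark.streaming.StreamingContext._
val updateCount = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(runningCount.getOrElse(0) + newValues.sum)   // merge this batch's values into the running state
ssc.checkpoint("hdfs:///checkpoints/state")          // stateful operations require a checkpoint directory
val stateDStream = pairs.updateStateByKey[Int](updateCount)
stateDStream.print()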

Spark-Streaming collect/take functionality.

2014-07-14 Thread jon.burns
Hello everyone, I'm an undergrad working on a summarization project. I've created a summarizer in normal Spark and it works great; however, I want to write it for Spark Streaming to increase its functionality. Basically I take in a bunch of text and get the most popular words as well as most

Re: How to kill running spark yarn application

2014-07-14 Thread Jerry Lam
Then yarn application -kill appid should work. This is what I did 2 hours ago. Sorry I cannot provide more help. Sent from my iPhone On 14 Jul, 2014, at 6:05 pm, hsy...@gmail.com hsy...@gmail.com wrote: yarn-cluster On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam chiling...@gmail.com

Re: How to kill running spark yarn application

2014-07-14 Thread hsy...@gmail.com
Before yarn application -kill, if you do jps you'll have a list of SparkSubmit and ApplicationMaster processes. After you use yarn application -kill, you only kill the SparkSubmit. On Mon, Jul 14, 2014 at 4:29 PM, Jerry Lam chiling...@gmail.com wrote: Then yarn application -kill appid should work. This is

Re: SQL + streaming

2014-07-14 Thread Tathagata Das
Could you elaborate on what is the problem you are facing? Compiler error? Runtime error? Class-not-found error? Not receiving any data from Kafka? Receiving data but SQL command throwing error? No errors but no output either? TD On Mon, Jul 14, 2014 at 4:06 PM, hsy...@gmail.com

Re: Spark-Streaming collect/take functionality.

2014-07-14 Thread Tathagata Das
Why doesn't something like this work? If you want a continuously updated reference to the top counts, you can use a global variable. var topCounts: Array[(String, Int)] = null; sortedCounts.foreachRDD { rdd => val currentTopCounts = rdd.take(10) // print currentTopCounts or whatever
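A cleaned-up sketch of that suggestion, assuming sortedCounts is the DStream from the original code:
var topCounts: Array[(String, Int)] = Array.empty   // driver-side reference to the latest top 10
sortedCounts.foreachRDD { rdd =>
  val currentTopCounts = rdd.take(10)
  topCounts = currentTopCounts                      // update the global reference every batch
  currentTopCounts.foreach(println)                 // or whatever per-batch handling is needed
}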

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-14 Thread Tathagata Das
Oh yes, this was a bug and it has been fixed. Checkout from the master branch! https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC TD On Mon, Jul

Re: SQL + streaming

2014-07-14 Thread hsy...@gmail.com
No errors but no output either... Thanks! On Mon, Jul 14, 2014 at 4:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Could you elaborate on what is the problem you are facing? Compiler error? Runtime error? Class-not-found error? Not receiving any data from Kafka? Receiving data but

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread durin
Thanks. Can I see that a Class is not available in the shell somewhere in the API Docs or do I have to find out by trial and error? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/import-org-apache-spark-streaming-twitter-in-Shell-tp9665p9678.html Sent from

Re: Error when testing with large sparse svm

2014-07-14 Thread Xiangrui Meng
Is it on a standalone server? There are several settings worth checking: 1) number of partitions, which should match the number of cores 2) driver memory (you can see it from the executor tab of the Spark WebUI and set it with --driver-memory 10g) 3) the version of Spark you were running Best,

Re: SparkR failed to connect to the master

2014-07-14 Thread cjwang
I tried installing the latest Spark 1.0.1 and SparkR couldn't find the master either. I restarted with Spark 0.9.1 and SparkR was able to find the master. So, there seemed to be something that changed after Spark 1.0.0. -- View this message in context:

Re: SQL + streaming

2014-07-14 Thread Tathagata Das
Can you make sure you are running locally on more than 1 local core? You could set the master in the SparkConf as conf.setMaster("local[4]"). Then see if there are jobs running on every batch of data in the Spark web ui (running on localhost:4040). If you still don't get any output, try first simple
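A minimal sketch of the suggested local setup (the app name and batch interval are made up); with local[4], the receiver occupies one core and processing still has cores left:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setAppName("sql-streaming-test").setMaster("local[4]")
val ssc = new StreamingContext(conf, Seconds(5))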

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
I guess this is not clearly documented. At a high level, any class that is in the package org.apache.spark.streaming.XXX where XXX is in { twitter, kafka, flume, zeromq, mqtt } is not available in the Spark shell. I have added this to the larger JIRA of things-to-add-to-streaming-docs

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Nicholas Chammas
On Mon, Jul 14, 2014 at 6:52 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The twitter functionality is not available through the shell. I've been processing Tweets live from the shell, though not for a long time. That's how I uncovered the problem with the Twitter receiver not

Re: Error when testing with large sparse svm

2014-07-14 Thread Srikrishna S
I am running Spark 1.0.1 on a 5 node yarn cluster. I have set the driver memory to 8G and executor memory to about 12G. Regards, Krishna On Mon, Jul 14, 2014 at 5:56 PM, Xiangrui Meng men...@gmail.com wrote: Is it on a standalone server? There are several settings worthing checking: 1)

Re: Ideal core count within a single JVM

2014-07-14 Thread Matei Zaharia
BTW you can see the number of parallel tasks in the application UI (http://localhost:4040) or in the log messages (e.g. when it says progress: 17/20, that means there are 20 tasks total and 17 are done). Spark will try to use at least one task per core in local mode so there might be more of

SPARK_WORKER_PORT (standalone cluster)

2014-07-14 Thread jay vyas
Hi spark ! What is the purpose of the randomly assigned SPARK_WORKER_PORT? From the documentation it says it is used to join a cluster, but it's not clear to me how a random port could be used to communicate with other members of a spark pool. This question might be grounded in my ignorance ... if so

jsonRDD: NoSuchMethodError

2014-07-14 Thread SK
Hi, I am using Spark 1.0.1. I am using the following piece of code to parse a json file. It is based on the code snippet in the SparkSQL programming guide. However, the compiler outputs an error stating: java.lang.NoSuchMethodError:

Re: spark1.0.1 catalyst transform filter not push down

2014-07-14 Thread Victor Sheng
I used queryPlan.queryExecution.analyzed to get the logical plan, and it works. What you explained to me is very useful. Thank you very much. -- View this message in context:

Re: ---cores option in spark-shell

2014-07-14 Thread cjwang
They don't work in the new 1.0.1 either -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cores-option-in-spark-shell-tp6809p9690.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: jsonRDD: NoSuchMethodError

2014-07-14 Thread Michael Armbrust
Have you upgraded the cluster where you are running this 1.0.1 as well? A NoSuchMethodError almost always means that the class files available at runtime are different from those that were there when you compiled your program. On Mon, Jul 14, 2014 at 7:06 PM, SK skrishna...@gmail.com wrote:

Re: ---cores option in spark-shell

2014-07-14 Thread Andrew Or
Yes, the documentation is actually a little outdated. We will get around to fixing it shortly. Please use --driver-cores or --executor-cores instead. 2014-07-14 19:10 GMT-07:00 cjwang c...@cjwang.us: Neither do they work in new 1.0.1 either -- View this message in context:

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
Did you make any updates in Spark version recently, after which you noticed this problem? Because if you were using Spark 0.8 and below, then twitter would have worked in the Spark shell. In Spark 0.9, we moved those dependencies out of the core spark for those to update more freely without

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Nicholas Chammas
If we're talking about the issue you captured in SPARK-2464 https://issues.apache.org/jira/browse/SPARK-2464, then it was a newly launched EC2 cluster on 1.0.1. On Mon, Jul 14, 2014 at 10:48 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Did you make any updates in Spark version

RACK_LOCAL Tasks Failed to finish

2014-07-14 Thread 洪奇
Hi all, When running GraphX applications on Spark, the task scheduler may schedule some tasks to be executed on RACK_LOCAL executors, but those tasks hang in that case, repeatedly printing the following log information: 14-07-14 15:59:14 INFO [Executor task launch worker-6]

Re: Announcing Spark 1.0.1

2014-07-14 Thread Tobias Pfeiffer
Hi, congratulations on the release! I'm always pleased to see how features pop up in new Spark versions that I had added for myself in a very hackish way before (such as JSON support for Spark SQL). I am wondering if there is any good way to learn early about what is going to be in upcoming

branch-1.0-jdbc on EC2?

2014-07-14 Thread billk
I'm wondering if anyone has had success with an EC2 deployment of the https://github.com/apache/spark/tree/branch-1.0-jdbc branch that Michael Armbrust referenced in his Unified Data Access with Spark SQL

Re: hdfs replication on saving RDD

2014-07-14 Thread valgrind_girl
Eager to know about this issue too. Does anyone know how? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/hdfs-replication-on-saving-RDD-tp289p9700.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: hdfs replication on saving RDD

2014-07-14 Thread Matei Zaharia
You can change this setting through SparkContext.hadoopConfiguration, or put the conf/ directory of your Hadoop installation on the CLASSPATH when you launch your app so that it reads the config values from there. Matei On Jul 14, 2014, at 8:06 PM, valgrind_girl 124411...@qq.com wrote: eager
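A minimal sketch of the SparkContext.hadoopConfiguration approach, with an invented output path and assuming rdd is the RDD being saved:
sc.hadoopConfiguration.set("dfs.replication", "2")  // applies to subsequent Hadoop writes from this context
rdd.saveAsTextFile("hdfs:///output/path")            // files are written with replication factor 2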

truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-14 Thread Walrus theCat
Hi, I've got a socketTextStream through which I'm reading input. I have three DStreams, all of which are the same window operation over that socketTextStream. I have a four core machine. As we've been covering lately, I have to give a cores parameter to my StreamingSparkContext: ssc = new

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
Oh right, that could have happened only after Spark 1.0.0. So let me clarify. At some point, you were able to access TwitterUtils from spark shell using Spark 1.0.0+ ? If yes, then what change in Spark caused it to not work any more? TD On Mon, Jul 14, 2014 at 7:52 PM, Nicholas Chammas

Re: Error when testing with large sparse svm

2014-07-14 Thread crater
(1) What is the number of partitions? Is it the number of workers per node? (2) I already set the driver memory pretty big, which is 25g. (3) I am running Spark 1.0.1 in a standalone cluster with 9 nodes; one of them works as the master, the others are workers. -- View this message in context:

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-14 Thread Madabhattula Rajesh Kumar
Hi Team, Does this issue also affect the JavaStreamingContext.textFileStream(hdfsfolderpath) API? Please confirm. If yes, could you please help me to fix this issue? I'm using spark 1.0.0 version. Regards, Rajesh On Tue, Jul 15, 2014 at 5:42 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Oh
