Re: spark ui on yarn

2014-07-12 Thread Shuo Xiang
Hi Koert, Just curious, did you find any information like CANNOT FIND ADDRESS after clicking into some stage? I've seen similar problems due to loss of executors. Best, On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers ko...@tresata.com wrote: I just tested a long-lived application (that we

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-12 Thread Yan Fang
Thank you, Tathagata. That explains it. Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Fri, Jul 11, 2014 at 7:21 PM, Tathagata Das tathagata.das1...@gmail.com wrote: A task slot is equivalent to a core, so one core can only run one task at a time. TD On Fri, Jul 11, 2014 at 1:57

Re: KMeans for large training data

2014-07-12 Thread durin
Thanks, setting the number of partitions to the number of executors helped a lot, and training with 20k entries became much faster. However, when I tried training with 1M entries, after about 45 minutes of calculations I get this: It's stuck at this point. The CPU load for the master is at 100%
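
For reference, a minimal sketch (Scala) of the repartition-before-training pattern discussed in this thread; the executor count, input path, and KMeans parameters are assumptions, not the poster's actual code:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val numExecutors = 8  // hypothetical; set to your actual executor count
    val data = sc.textFile("hdfs:///path/to/training.txt")  // hypothetical path
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .repartition(numExecutors)  // match partitions to executors, as suggested above
      .cache()
    val model = KMeans.train(data, 20, 10)  // k = 20, 10 iterations (assumed values)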

Re: Announcing Spark 1.0.1

2014-07-12 Thread Brad Miller
Hi All, Congrats to the entire Spark team on the 1.0.1 release. In checking out the new features, I noticed that it looks like the Python API docs have been updated, but the title and the header at the top of the page still say Spark 1.0.0. Clearly not a big deal... I just wouldn't want anyone

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-12 Thread M Singh
Thanks TD. BTW - If I have an input file of ~250 GB, is there any guideline on whether to use:
* a single input file (250 GB) (in this case, is there any max upper bound), or
* a split into 1000 files of 250 MB each (the HDFS block size is 250 MB), or
* a multiple of the HDFS block size.
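
Whichever layout is chosen, read parallelism can also be raised at load time; a hedged sketch (the path and split count are placeholders): each 250 MB HDFS block becomes at least one partition, and minPartitions can only increase that count, not reduce it below the block count:

    // ~250 GB at a 250 MB block size yields ~1000 blocks/partitions by default;
    // asking for 2000 minPartitions splits blocks further for more parallelism.
    val input = sc.textFile("hdfs:///data/input", 2000)  // hypothetical path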

Akka Client disconnected

2014-07-12 Thread Srikrishna S
I am running logistic regression with SGD on a problem with about 19M parameters (the kdda dataset from the libsvm library). I consistently see that the nodes on my computer get disconnected, and soon the whole job comes to a grinding halt. 14/07/12 03:05:16 ERROR cluster.YarnClientClusterScheduler:
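
For context, a minimal sketch (Scala) of this kind of job; the path and iteration count are assumptions, not the poster's actual code:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.util.MLUtils

    // kdda is distributed in LIBSVM format; ~19M features
    val points = MLUtils.loadLibSVMFile(sc, "hdfs:///data/kdda")  // hypothetical path
    val model = LogisticRegressionWithSGD.train(points, 100)      // 100 iterations, assumed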

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
Are you using 1.0 or the current master? A bug related to this is fixed in master. On Jul 12, 2014 8:50 AM, Srikrishna S srikrishna...@gmail.com wrote: I am running logistic regression with SGD on a problem with about 19M parameters (the kdda dataset from the libsvm library). I consistently see that

Putting block rdd failed when running example svm on large data

2014-07-12 Thread crater
Hi, I am trying to run the example BinaryClassification (org.apache.spark.examples.mllib.BinaryClassification) on a 202 GB file. I am constantly getting messages like the one below; is this normal, or am I missing something? 14/07/12 09:49:04 WARN BlockManager: Block rdd_4_196 could not be dropped

Confused by groupByKey() and the default partitioner

2014-07-12 Thread Guanhua Yan
Hi: I have trouble understanding the default partitioner (hash) in Spark. Suppose that an RDD with two partitions is created as follows: x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2) Does Spark partition x based on the hash of the key (e.g., "a", "b", "c") by default? (1) Assuming this is

Re: spark ui on yarn

2014-07-12 Thread Koert Kuipers
hey shuo, so far all stage links work fine for me. i did some more testing, and it seems kind of random what shows up on the gui and what does not. some partially cached RDDs make it to the GUI, while some fully cached ones do not. I have not been able to detect a pattern. is the codebase for

Re: Anaconda Spark AMI

2014-07-12 Thread Benjamin Zaitlen
Hi All, Thanks to Jey's help, I have a release AMI candidate for spark-1.0/anaconda-2.0 integration. It's currently limited to availability in US-EAST: ami-3ecd0c56 Give it a try if you have some time. This should *just work* with Spark 1.0: ./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -a

Scalability issue in Spark with SparkPageRank example

2014-07-12 Thread lokesh.gidra
Hello, I ran the SparkPageRank example (the one included in the package) to evaluate the scale-in capability of Spark. I ran experiments on an 8-node, 48-core AMD machine with a local[N] master. But for N > 10, the completion time of the experiment kept increasing, rather than decreasing. When I

Stopping StreamingContext does not kill receiver

2014-07-12 Thread Nick Chammas
From the interactive shell I’ve created a StreamingContext. I call ssc.start() and take a look at http://master_url:4040/streaming/ and see that I have an active Twitter receiver. Then I call ssc.stop(stopSparkContext = false, stopGracefully = true) and wait a bit, but the receiver seems to stay
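
For anyone reproducing this, a minimal sketch (Scala) of the sequence described; the batch interval is an assumption, and the Twitter credentials are taken from twitter4j system properties:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    val ssc = new StreamingContext(sc, Seconds(10))    // 10s batches, assumed
    val tweets = TwitterUtils.createStream(ssc, None)  // None = use twitter4j.oauth.* properties
    tweets.map(_.getText).print()
    ssc.start()
    // ... later: stop streaming gracefully, keeping the SparkContext alive
    ssc.stop(stopSparkContext = false, stopGracefully = true)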

Re: Stopping StreamingContext does not kill receiver

2014-07-12 Thread Tathagata Das
Yes, that's a bug I just discovered. Race condition in the Twitter Receiver; will fix ASAP. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-2464 TD On Sat, Jul 12, 2014 at 3:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: To add a potentially relevant piece of

Re: Stopping StreamingContext does not kill receiver

2014-07-12 Thread Nicholas Chammas
Okie doke. Thanks for filing the JIRA. On Sat, Jul 12, 2014 at 6:45 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Yes, that's a bug I just discovered. Race condition in the Twitter Receiver; will fix ASAP. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-2464 TD On

Re: Confused by groupByKey() and the default partitioner

2014-07-12 Thread Aaron Davidson
Yes, groupByKey() does partition by the hash of the key unless you specify a custom Partitioner. (1) If you were to use groupByKey() when the data was already partitioned correctly, the data would indeed not be shuffled. Here is the associated code; you'll see that it simply checks that the
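
A small sketch (Scala) of the behavior described here; the data mirrors the example from the question, and the partitioner arguments are illustrative:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 4), ("a", 10), ("c", 7)))
    // Pre-partition with the same partitioner groupByKey() would use:
    val partitioned = pairs.partitionBy(new HashPartitioner(2)).cache()
    // groupByKey() finds an identical existing partitioner and skips the shuffle:
    val grouped = partitioned.groupByKey(new HashPartitioner(2))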

Re: Akka Client disconnected

2014-07-12 Thread Srikrishna S
I am using the master that I compiled 2 days ago. Can you point me to the JIRA? On Sat, Jul 12, 2014 at 9:13 AM, DB Tsai dbt...@dbtsai.com wrote: Are you using 1.0 or the current master? A bug related to this is fixed in master. On Jul 12, 2014 8:50 AM, Srikrishna S srikrishna...@gmail.com wrote:

Re: KMeans for large training data

2014-07-12 Thread Aaron Davidson
The "netlib.BLAS: Failed to load implementation" warning only means that the BLAS implementation in use may be slower than a native one. The reason it only shows up at the end is that the library is only used for the finalization step of the KMeans algorithm, so your job should've been wrapping
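
If a native BLAS is wanted instead, the usual remedy (hedged: this follows the netlib-java instructions of that era, and assumes an sbt build) is to add the netlib-java natives to the application's dependencies:

    // build.sbt fragment; "all" is a POM-only artifact, hence pomOnly()
    libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()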

Re: Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Aaron Davidson
If you don't really care about the exact batchedDegree, but rather just want to do operations over some set of elements rather than one at a time, then just use mapPartitions(). Otherwise, if you really do want certain-sized batches and you are able to relax the constraints slightly, one option is to construct

Re: pyspark sc.parallelize running OOM with smallish data

2014-07-12 Thread Aaron Davidson
I think this is probably dying on the driver itself, as you are probably materializing the whole dataset inside your Python driver. How large is spark_data_array compared to your driver memory? On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi mohitja...@gmail.com wrote: I put the same dataset into

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
https://issues.apache.org/jira/browse/SPARK-2156 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Jul 12, 2014 at 5:23 PM, Srikrishna S srikrishna...@gmail.com wrote: I am using the

Re: Putting block rdd failed when running example svm on large data

2014-07-12 Thread crater
Hi Xiangrui, Thanks for the information. Also, is it possible to figure out the execution time per iteration for SVM?

Re: Putting block rdd failed when running example svm on large data

2014-07-12 Thread Aaron Davidson
Also check the web UI for that. Each iteration will have one or more stages associated with it in the driver web UI. On Sat, Jul 12, 2014 at 6:47 PM, crater cq...@ucmerced.edu wrote: Hi Xiangrui, Thanks for the information. Also, is it possible to figure out the execution time per

Supported SQL syntax in Spark SQL

2014-07-12 Thread Nick Chammas
Is there a place where we can find an up-to-date list of supported SQL syntax in Spark SQL? Nick

Large Task Size?

2014-07-12 Thread Kyle Ellrott
I'm working on a patch to MLlib that allows for multiplexing several different model optimizations using the same RDD (SPARK-2372: https://issues.apache.org/jira/browse/SPARK-2372). In testing larger datasets, I've started to see some memory errors (java.lang.OutOfMemoryError and exceeds max

Re: Large Task Size?

2014-07-12 Thread Aaron Davidson
I also did a quick glance through the code and couldn't find anything worrying that should be included in the task closures. The only possibly unsanitary part is the Updater you pass in -- what is your Updater and is it possible it's dragging in a significant amount of extra state? On Sat, Jul
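
An illustrative sketch (Scala, with hypothetical names) of the closure problem being pointed at: an Updater created as an anonymous subclass inside another class captures its enclosing instance, so any large fields ship with every serialized task:

    import org.apache.spark.mllib.optimization.{SimpleUpdater, Updater}

    class Trainer {
      val hugeLookupTable = new Array[Double](10000000)  // ~80 MB of unrelated state

      // Risky: the anonymous subclass keeps an $outer reference to this Trainer,
      // so hugeLookupTable is serialized into every task that uses it.
      val capturingUpdater: Updater = new SimpleUpdater {}

      // Safe: a plain, stateless Updater serializes to a few bytes.
      val cleanUpdater: Updater = new SimpleUpdater
    }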

Re: not getting output from socket connection

2014-07-12 Thread Walrus theCat
Thanks! I thought it would get passed through netcat, but given your email, I was able to follow this tutorial and get it to work: http://docs.oracle.com/javase/tutorial/networking/sockets/clientServer.html On Fri, Jul 11, 2014 at 1:31 PM, Sean Owen so...@cloudera.com wrote: netcat is

Re: Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Mark Hamstra
And if you can relax your constraints even further to only require RDD[List[Int]], then it's even simpler: rdd.mapPartitions(_.grouped(batchedDegree)) On Sat, Jul 12, 2014 at 6:26 PM, Aaron Davidson ilike...@gmail.com wrote: If you don't really care about the batchedDegree, but rather just
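
A self-contained sketch of that one-liner; note that Iterator.grouped() yields Seq, so a toList is added here if RDD[List[Int]] is strictly required, and batches never cross partition boundaries:

    val rdd = sc.parallelize(1 to 10, 2)   // 2 partitions: (1..5) and (6..10)
    val batchedDegree = 3
    val batched = rdd.mapPartitions(_.grouped(batchedDegree).map(_.toList))
    // batched.collect() => List(1,2,3), List(4,5), List(6,7,8), List(9,10)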

Re: spark ui on yarn

2014-07-12 Thread Matei Zaharia
The UI code is the same in both, but one possibility is that your executors were given less memory on YARN. Can you check that? Or otherwise, how do you know that some RDDs were cached? Matei On Jul 12, 2014, at 4:12 PM, Koert Kuipers ko...@tresata.com wrote: hey shuo, so far all stage