Hi Koert,
Just curious, did you find any information like CANNOT FIND ADDRESS
after clicking into some stage? I've seen similar problems due to loss of
executors.
Best,
On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers ko...@tresata.com wrote:
I just tested a long lived application (that we
Thank you, Tathagata. That explains it.
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Fri, Jul 11, 2014 at 7:21 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
A task slot is equivalent to a core. So one core can only run one task
at a time.
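For example (illustrative arithmetic only; the numbers are made up, not from
this thread):

// 4 executors x 2 cores per executor = 8 task slots,
// so at most 8 tasks run concurrently; a larger stage runs its tasks in waves.
val executors = 4
val coresPerExecutor = 2
val taskSlots = executors * coresPerExecutor   // 8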
TD
On Fri, Jul 11, 2014 at 1:57
Thanks, setting the number of partitions to the number of executors helped a
lot and training with 20k entries got a lot faster.
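For anyone following along, roughly what that change looks like (a sketch only;
the names and the executor count are placeholders, not the actual job):

// match the number of partitions to the executors available
val numExecutors = 20                          // placeholder: set to your cluster size
val data = rawData.repartition(numExecutors)   // rawData: the training RDD
data.cache()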
However, when I tried training with 1M entries, after about 45 minutes of
calculations, I get this:
It's stuck at this point. The CPU load for the master is at 100%
Hi All,
Congrats to the entire Spark team on the 1.0.1 release.
In checking out the new features, I noticed that it looks like the
python API docs have been updated, but the title and the header at
the top of the page still say Spark 1.0.0. Clearly not a big
deal... I just wouldn't want anyone
Thanks TD.
BTW, if I have an input file of ~250 GB, is there any guideline on whether to use:
* a single 250 GB input file (in this case, is there any upper bound?), or
* a split into 1000 files of 250 MB each (the HDFS block size is 250 MB), or
* files sized at a multiple of the HDFS block size?
I am running logistic regression with SGD on a problem with about 19M
parameters (the kdda dataset from the libsvm library).
I consistently see that the nodes on my computer get disconnected and
soon the whole job grinds to a halt.
14/07/12 03:05:16 ERROR cluster.YarnClientClusterScheduler:
Are you using 1.0 or current master? A bug related to this is fixed in
master.
On Jul 12, 2014 8:50 AM, Srikrishna S srikrishna...@gmail.com wrote:
I am running logistic regression with SGD on a problem with about 19M
parameters (the kdda dataset from the libsvm library).
I consistently see that
Hi,
I am trying to run the example BinaryClassification
(org.apache.spark.examples.mllib.BinaryClassification) on a 202 GB file. I am
constantly getting messages like the one below. Is this normal, or am I
missing something?
14/07/12 09:49:04 WARN BlockManager: Block rdd_4_196 could not be dropped
Hi:
I have trouble understanding the default partitioner (hash) in Spark.
Suppose that an RDD with two partitions is created as follows:
x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)
Does Spark partition x based on the hash of the key (e.g., "a", "b", "c") by
default?
(1) Assuming this is
hey shuo,
so far all stage links work fine for me.
i did some more testing, and it seems kind of random what shows up on the
gui and what does not. some partially cached RDDs make it to the GUI, while
some fully cached ones do not. I have not been able to detect a pattern.
is the codebase for
Hi All,
Thanks to Jey's help, I have a release AMI candidate for
spark-1.0/anaconda-2.0 integration. It's currently limited to availability
in US-EAST: ami-3ecd0c56
Give it a try if you have some time. This should *just work* with Spark
1.0:
./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -a
Hello,
I ran the SparkPageRank example (the one included in the package) to evaluate
the scale-in capability of Spark. I ran experiments on an 8-node, 48-core
AMD machine with a local[N] master. But for N > 10, the completion time
of the experiment kept increasing, rather than decreasing.
When I
From the interactive shell I’ve created a StreamingContext.
I call ssc.start() and take a look at http://master_url:4040/streaming/ and
see that I have an active Twitter receiver. Then I call
ssc.stop(stopSparkContext
= false, stopGracefully = true) and wait a bit, but the receiver seems to
stay
Yes, that's a bug I just discovered. Race condition in the Twitter Receiver,
will fix asap.
Here is the JIRA https://issues.apache.org/jira/browse/SPARK-2464
TD
On Sat, Jul 12, 2014 at 3:21 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
To add a potentially relevant piece of
Okie doke. Thanks for filing the JIRA.
On Sat, Jul 12, 2014 at 6:45 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Yes, that's a bug I just discovered. Race condition in the Twitter
Receiver, will fix asap.
Here is the JIRA https://issues.apache.org/jira/browse/SPARK-2464
TD
On
Yes, groupByKey() does partition by the hash of the key unless you specify
a custom Partitioner.
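For instance, a minimal Scala sketch (keys and partition counts are just
illustrative; sc is an existing SparkContext):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 4), ("a", 10), ("c", 7)), 2)
val byDefault  = pairs.groupByKey()                        // default: a HashPartitioner on the key
val byExplicit = pairs.groupByKey(new HashPartitioner(2))  // same hashing, partition count pinned to 2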
(1) If you were to use groupByKey() when the data was already partitioned
correctly, the data would indeed not be shuffled. Here is the associated
code; you'll see that it simply checks that the
I am using the master that I compiled 2 days ago. Can you point me to the JIRA?
On Sat, Jul 12, 2014 at 9:13 AM, DB Tsai dbt...@dbtsai.com wrote:
Are you using 1.0 or current master? A bug related to this is fixed in
master.
On Jul 12, 2014 8:50 AM, Srikrishna S srikrishna...@gmail.com wrote:
The "netlib.BLAS: Failed to load implementation" warning only means that the
BLAS implementation in use may be slower than a native one. The reason
why it only shows up at the end is that the library is only used for the
finalization step of the KMeans algorithm, so your job should've been
wrapping
If you don't really care about the batchedDegree, but rather just want to
do operations over some set of elements rather than one at a time, then
just use mapPartitions().
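For example (a rough sketch; processBatch is a placeholder for whatever
per-batch work you need to do):

// operate on everything in a partition at once instead of element by element
val results = rdd.mapPartitions { iter =>
  val elems = iter.toSeq             // all elements of this partition
  Iterator(processBatch(elems))      // processBatch: placeholder function
}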
Otherwise, if you really do want batches of a certain size and you are able to
relax the constraints slightly, one option is to construct
I think this is probably dying on the driver itself, as you are probably
materializing the whole dataset inside your python driver. How large is
spark_data_array compared to your driver memory?
On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi mohitja...@gmail.com wrote:
I put the same dataset into
https://issues.apache.org/jira/browse/SPARK-2156
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Sat, Jul 12, 2014 at 5:23 PM, Srikrishna S srikrishna...@gmail.com wrote:
I am using the
Hi Xiangrui,
Thanks for the information. Also, is it possible to figure out the execution
time per iteration for SVM?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Putting-block-rdd-failed-when-running-example-svm-on-large-data-tp9515p9535.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Also check the web UI for that. Each iteration will have one or more stages
associated with it in the driver web UI.
On Sat, Jul 12, 2014 at 6:47 PM, crater cq...@ucmerced.edu wrote:
Hi Xiangrui,
Thanks for the information. Also, is it possible to figure out the
execution
time per
Is there a place where we can find an up-to-date list of supported SQL
syntax in Spark SQL?
Nick
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Supported-SQL-syntax-in-Spark-SQL-tp9538.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I'm working on a patch to MLlib that allows for multiplexing several
different model optimizations using the same RDD (SPARK-2372:
https://issues.apache.org/jira/browse/SPARK-2372).
In testing larger datasets, I've started to see some memory errors
(java.lang.OutOfMemoryError and exceeds max
I also did a quick glance through the code and couldn't find anything
worrying that should be included in the task closures. The only possibly
unsanitary part is the Updater you pass in -- what is your Updater and is
it possible it's dragging in a significant amount of extra state?
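For contrast, a minimal sketch of an updater that cannot drag in extra state:
a stock MLlib updater instantiated directly, rather than, say, an anonymous
subclass defined inside a large driver class.

import org.apache.spark.mllib.optimization.SquaredL2Updater

// a plain library updater: nothing from the enclosing program is captured
val updater = new SquaredL2Updater()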
On Sat, Jul
Thanks!
I thought it would get passed through netcat, but given your email, I was
able to follow this tutorial and get it to work:
http://docs.oracle.com/javase/tutorial/networking/sockets/clientServer.html
On Fri, Jul 11, 2014 at 1:31 PM, Sean Owen so...@cloudera.com wrote:
netcat is
And if you can relax your constraints even further to only require
RDD[List[Int]], then it's even simpler:
rdd.mapPartitions(_.grouped(batchedDegree))
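Spelled out with explicit types (a sketch; note that grouped() yields Seq, so
add a toList if you strictly need List):

import org.apache.spark.rdd.RDD

val batches: RDD[List[Int]] = rdd.mapPartitions { iter =>
  iter.grouped(batchedDegree).map(_.toList)   // batchedDegree: the desired batch size
}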
On Sat, Jul 12, 2014 at 6:26 PM, Aaron Davidson ilike...@gmail.com wrote:
If you don't really care about the batchedDegree, but rather just
The UI code is the same in both, but one possibility is that your executors
were given less memory on YARN. Can you check that? Or otherwise, how do you
know that some RDDs were cached?
Matei
On Jul 12, 2014, at 4:12 PM, Koert Kuipers ko...@tresata.com wrote:
hey shuo,
so far all stage