Ideally you should be converting the RDD to a SchemaRDD?
Are you creating a UnionRDD to join across DStream RDDs?
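For reference, a minimal sketch of the RDD-to-SchemaRDD conversion in Spark SQL 1.0 (the Event case class and its fields are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; replace with your own fields.
case class Event(id: Long, value: Double)

val sc = new SparkContext(new SparkConf().setAppName("SchemaRDDExample"))
val sqlContext = new SQLContext(sc)
// Brings the implicit RDD[Product] => SchemaRDD conversion into scope.
import sqlContext.createSchemaRDD

val events = sc.parallelize(Seq(Event(1L, 0.5), Event(2L, 1.5)))
events.registerAsTable("events")   // implicitly converted to a SchemaRDD
val highValues = sqlContext.sql("SELECT id FROM events WHERE value > 1.0")
highValues.collect().foreach(println)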
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jul 1, 2014 at 3:11 PM, Honey Joshi
I want to make some minor modifications to SparkMeans.scala, so running the
basic example won't do.
I have also packaged my code into a jar file with sbt. It completes
successfully, but when I try to run it with java -jar myjar.jar I get the same
error:
Exception in thread main
On Wed, July 2, 2014 1:11 am, Mayur Rustagi wrote:
Ideally you should be converting the RDD to a SchemaRDD?
Are you creating a UnionRDD to join across DStream RDDs?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue,
You may be able to mix StreamingListener and SparkListener to get meaningful
information about your task. However, you need to connect a lot of pieces to
make sense of the flow.
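Not a complete recipe, but a rough sketch of wiring up both listeners in Spark 1.0 (the printed fields are chosen only for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class TaskLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println("task " + taskEnd.taskInfo.taskId + " finished in stage " + taskEnd.stageId)
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println("stage " + stage.stageInfo.stageId + " completed")
}

class BatchLogger extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit =
    println("batch completed, processing delay: " + batch.batchInfo.processingDelay)
}

val sc = new SparkContext(new SparkConf().setAppName("listener-demo"))
val ssc = new StreamingContext(sc, Seconds(1))
sc.addSparkListener(new TaskLogger)
ssc.addStreamingListener(new BatchLogger)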
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On
A lot of the RDDs that you create in code may not even be constructed, as the
task layer is optimized in the DAG scheduler. The closest is onUnpersistRDD
in SparkListener.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon,
Got it! Ran the jar with spark-submit. Thanks!
On Wednesday, July 2, 2014 9:16 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
I want to make some minor modifications to SparkMeans.scala, so running the
basic example won't do.
I have also packaged my code into a jar file with sbt. It
Two job contexts cannot share data. Are you collecting the data to the
master and then sending it to the other context?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jul 2, 2014 at 11:57 AM, Honey Joshi
Hi all,
I need to run a complex external process with a lot of dependencies from
Spark. The pipe and addFile functions seem to be my friends, but there
are some issues that I need to solve.
Precisely, the processes I want to run are C++ executables that may depend on
some libraries and
Your executors are going out of memory, so subsequent tasks scheduled on
the scheduler are also failing, hence the lost TID (task ID).
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Jun 30, 2014 at 7:47 PM, Sguj
Hi,
When running a Spark job, the following warning message is displayed and the
job no longer seems to be progressing.
(Detailed log messages are at the bottom of this message.)
---
14/07/02 17:00:14 WARN AbstractNioSelector: Unexpected exception in the
selector loop.
Hi, everyone!
Is it possible to run HiveThriftServer2 based on SparkSQL in YARN now?
Spark version: branch 1.0-jdbc
YARN version: 2.3.0-cdh5.0.0
Also, the machine on which the driver program runs constantly uses about
7~8% of a 100 Mbps network connection.
Is the driver program involved in the reduceByKey() somehow?
BTW, an accumulator is currently used, but the network usage does not drop
even when the accumulator is removed.
Thanks in advance.
It seems that the driver program runs out of memory.
In Windows Task Manager, the driver program's memory constantly grows until
around 3,434,796, then a Java OutOfMemory exception occurs.
(BTW, the driver program runs on a Windows 7 64-bit machine, and the cluster
is on CentOS.)
Why the memory of the
The problem is resolved. I added SPARK_LOCAL_IP=master on both slaves as well.
When I changed this, my slaves started working.
Thank you all for your suggestions.
Thanks Regards,
Meethu M
On Wednesday, 2 July 2014 10:22 AM, Aaron Davidson ilike...@gmail.com wrote:
In your spark-env.sh, do you
You can try setting your HTTP_PROXY environment variable:
export HTTP_PROXY=host:port
But I don't use Maven. If the env variable doesn't work, please search Google
for maven proxy. I am sure there will be a lot of related results.
Sent from my iPhone
On July 2, 2014, at 19:04, Stuti Awasthi
Hi!
Total Scala and Spark noob here with a few questions.
I am trying to modify a few of the examples in the spark repo to fit
my needs, but running into a few problems.
I am making an RDD from Cassandra, which I've finally gotten to work,
and trying to do some operations on it. Specifically I
I ran SparkKMeans with a big file (~7 GB of data) for one iteration with
spark-0.8.0, with this line in .bashrc: export _JAVA_OPTIONS="-Xmx15g -Xms15g
-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails". It finished in a
decent time, ~50 seconds, and I had only a few Full GC messages
Hi,
I am running my Spark job on YARN, using the latest code from the master
branch, synced a few days back. I see this IOException during shuffle (in the
resource manager logs). What could be wrong and how do I debug it? I have seen
this a few times before; I was suspecting that this could be a side effect of
memory
Hi,
I have a non-serializable class, and as a workaround I'm trying to
re-instantiate it at each deserialization. Thus, I created a wrapper class
and overrode the writeObject and readObject methods as follows:
private def writeObject(oos: ObjectOutputStream) {
oos.defaultWriteObject()
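For context, a complete sketch of this pattern; the wrapped Parser class and its field are invented for illustration:

import java.io.{ObjectInputStream, ObjectOutputStream}

// Hypothetical non-serializable dependency.
class Parser(val pattern: String) {
  def parse(s: String): Array[String] = s.split(pattern)
}

class ParserWrapper(pattern: String) extends Serializable {
  // @transient so Java serialization skips it; rebuilt on read.
  @transient private var parser: Parser = new Parser(pattern)

  def get: Parser = parser

  private def writeObject(oos: ObjectOutputStream): Unit = {
    oos.defaultWriteObject()     // writes the serializable fields (pattern)
  }

  private def readObject(ois: ObjectInputStream): Unit = {
    ois.defaultReadObject()      // restores pattern
    parser = new Parser(pattern) // re-instantiate the non-serializable part
  }
}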
Can you elaborate on why you need to configure spark.shuffle.spill to true
again in the config? The default for spark.shuffle.spill is already true
according to the doc
(https://spark.apache.org/docs/0.9.1/configuration.html).
On OOM the tasks were process_local, which I understand is as good as
SparkContext is not serializable and can't be just sent across ;)
2014-06-21 14:14 GMT+02:00 Mayur Rustagi mayur.rust...@gmail.com:
You can terminate a job group from the SparkContext; you'll have to send the
SparkContext across to your task.
On 21 Jun 2014 01:09, Piotr Kołaczkowski
The scripts that Xiangrui mentions set up the classpath. Can you run
./run-example for the provided example successfully?
What you can try is to set SPARK_PRINT_LAUNCH_COMMAND=1 and then call
run-example -- that will show you the exact java command used to run
the example at the start of execution.
Hi list,
I'm benchmarking MLlib for a regression task [1] and get strange results.
Namely, using RidgeRegressionWithSGD it seems the predicted points miss the
intercept:
{code}
val trainedModel = RidgeRegressionWithSGD.train(trainingData, 1000)
...
valuesAndPreds.take(10).map(t => println(t))
Hi Xiangrui,
The issue with aggregating/counting over large feature vectors (as part of
LogisticRegressionWithSGD) continues to exist, but now in another form:
while the execution doesn't freeze (due to SPARK-1112), it now fails at the
second or third gradient descent iteration consistently with
Tathagata Das wrote:
*Answer 1:* Make sure you are using the master as local[n] with n > 1
(assuming you are running it in local mode). The way Spark Streaming
works is that it assigns a core to the data receiver, and so if you
run the program with only one core (i.e., with local or local[1]),
then
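To illustrate the local[n] point, a minimal sketch (the app name and batch interval are arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one core for the receiver, at least one for processing.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingDemo")
val ssc = new StreamingContext(conf, Seconds(1))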
Hi all,
I'm trying to run some transformations on Spark; they work fine on the cluster
(YARN, Linux machines). However, when I try to run them on a local machine
(Windows 7) under a unit test, I get errors:
java.io.IOException: Could not locate executable null\bin\winutils.exe
in the Hadoop
Hi,
I'm trying to install Spark 1.0 on my Hadoop cluster running on EMR. I
didn't have any problem installing the previous versions, but in this
version I couldn't find any 'sbt' folder. However, the README still
suggests using this to install Spark:
./sbt/sbt assembly
which fails:
./sbt/sbt:
I can run it now with the suggested method. However, I have encountered a new
problem that I have not faced before (sent another email with that one but here
it goes again ...)
I ran SparkKMeans with a big file (~7 GB of data) for one iteration with
spark-0.8.0 with this line in .bashrc
Hi, I'm using Spark 1.1.0 standalone with 5 workers and 1 driver, and the Kryo
settings are
When I submit this job, the driver works fine but the workers throw
ClassNotFoundException saying they cannot find HDTMKryoRegistrator.
Any idea about this problem? I googled this but there is only one
Don’t know why the setting does not appear in the last mail:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", new HDTMKryoRegistrator().getClass.getName)
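For reference, a sketch of how these settings usually sit in a full SparkConf; the setJars call (with a hypothetical jar path) is one common way to make a custom registrator visible to the workers, which is a frequent cause of this kind of ClassNotFoundException:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("HDTM")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[HDTMKryoRegistrator].getName)
  // Ship the application jar so executors can load HDTMKryoRegistrator.
  .setJars(Seq("target/scala-2.10/hdtm-assembly.jar"))  // hypothetical path
val sc = new SparkContext(conf)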
On Jul 2, 2014, at 1:03 PM, dash b...@nd.edu wrote:
Hi, I'm using Spark 1.1.0
Thanks Mayur. I will take a look at StreamingListener.
Is there any example you have handy?
On Wed, Jul 2, 2014 at 2:32 AM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
You may be able to mix StreamingListener and SparkListener to get meaningful
information about your task. However, you need to
Hello Spark fans,
I am unable to figure out how Spark chooses which logger to use. I know
that Spark decides upon this at the time of initialization of the Spark
Context. From the Spark documentation it is clear that Spark uses log4j, and
not slf4j, but I have been able to successfully get spark
Hi Konstantin,
We use Hadoop as a library in a few places in Spark. I wonder why the path
includes null though.
Could you provide the full stack trace?
Andrew
2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev
kudryavtsev.konstan...@gmail.com:
Hi all,
I'm trying to run some transformation
Hi Andrew,
it's Windows 7 and I haven't set up any env variables here.
The full stack trace:
14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils
Hi all,
I caught a very confusing exception running Spark 1.0 on HDP 2.1.
While saving an RDD as a text file I got:
14/07/02 10:11:12 WARN TaskSetManager: Loss was due to
java.lang.NullPointerException
java.lang.NullPointerException
at
Hi Yana,
In 0.9.1, spark.shuffle.spill is set to true by default so you shouldn't
need to manually set it.
Here are a few common causes of OOMs in Spark:
- Too few partitions: if one partition is too big, it may cause an OOM if
there is not enough space to unroll the entire partition in memory.
By any chance do you have HDP 2.1 installed? You may need to install the utils
and update the env variables per
http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows
On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev
kudryavtsev.konstan...@gmail.com wrote:
Try looking at the running processes with “ps” to see their full command lines
and see whether any options are different. It seems like in both cases, your
young generation is quite large (11 GB), which doesn’t make a lot of sense
with a heap of 15 GB. But maybe I’m misreading something.
Matei
No, I don’t.
Why do I need to have HDP installed? I don’t use Hadoop at all and I’d like to
read data from the local filesystem.
On Jul 2, 2014, at 9:10 PM, Denny Lee denny.g@gmail.com wrote:
By any chance do you have HDP 2.1 installed? you may need to install the
utils and update the env
You don't actually need it per se - it's just that some of the Spark
libraries reference Hadoop libraries even if they ultimately do not
call them. When I was doing some early builds of Spark on Windows, I
admittedly had Hadoop on Windows running as well and had not run into this
particular
Hi,
in many SQL DBMSs such as MySQL, you can set an offset for the LIMIT clause,
such that LIMIT 5, 10 will return 10 rows, starting from row 5.
As far as I can see, this is not possible in Spark SQL.
The best solution I have to imitate that (using Scala) is converting the RDD
into an Array via
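Not from the thread, but one way to sketch an offset + limit over an RDD without materializing it as an Array is zipWithIndex plus a filter (rdd is a placeholder for the RDD of rows; the index follows partition order, so this is only meaningful if the RDD has a stable ordering):

// Rows 5..14, i.e. the equivalent of MySQL's LIMIT 5, 10.
val offset = 5L
val limit = 10L
val page = rdd.zipWithIndex()
  .filter { case (_, idx) => idx >= offset && idx < offset + limit }
  .map { case (row, _) => row }
page.collect().foreach(println)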
I am upgrading from Spark 0.9.0 to 1.0 and I had a pretty good amount of
code working with internals of MLLib. One of the big changes was the move
from the old jblas.Matrix to the Vector/Matrix classes included in MLLib.
However I don't see how we're supposed to use them for ANYTHING other than
a
I did the second option: re-implemented .toBreeze as .breeze using pimp
classes.
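A rough sketch of such a pimp class, going through the public toArray since toBreeze is private[mllib] (the object and method names here are made up):

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vector

object BreezeConversions {
  implicit class RichMLlibVector(val v: Vector) extends AnyVal {
    // Copies through the public toArray since toBreeze is private[mllib].
    def breeze: BDV[Double] = BDV(v.toArray)
  }
}

// usage:
// import BreezeConversions._
// val bv = someMLlibVector.breeze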
On Wed, Jul 2, 2014 at 5:00 PM, Thunder Stumpges thunder.stump...@gmail.com
wrote:
I am upgrading from Spark 0.9.0 to 1.0 and I had a pretty good amount of
code working with internals of MLLib. One of the big
Thanks. I always hate having to do stuff like this. It seems like they went
a bit overboard with all the private[mllib] declarations... possibly all
in the name of "thou shalt not change your public API". If you don't make
your public API usable, we end up having to work around it anyway...
Oh
In my humble opinion, Spark should've supported linalg a la [1] before it
even started dumping methodologies into MLlib.
[1] http://mahout.apache.org/users/sparkbindings/home.html
On Wed, Jul 2, 2014 at 2:16 PM, Thunder Stumpges thunder.stump...@gmail.com
wrote:
Thanks. I always hate having
Hi,
I would like to set up streaming from a Kafka cluster, reading multiple topics
and then processing each of them differently.
So, I’d create a stream:
val stream = KafkaUtils.createStream(ssc, "localhost:2181", "logs",
  Map("retarget" -> 2, "datapair" -> 2))
And then based on whether it’s “retarget”
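Not from the thread: with this receiver-based createStream, the DStream elements are (key, message) pairs, so the topic is not visible downstream. One common approach is a separate stream per topic, sketched here with the same topic names and placeholder processing:

import org.apache.spark.streaming.kafka.KafkaUtils

val retargetStream = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("retarget" -> 2))
val datapairStream = KafkaUtils.createStream(ssc, "localhost:2181", "logs", Map("datapair" -> 2))

// Drop the key, keep the message, and process each topic independently.
retargetStream.map(_._2).foreachRDD { rdd => println("retarget batch size: " + rdd.count()) }
datapairStream.map(_._2).foreachRDD { rdd => println("datapair batch size: " + rdd.count()) }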
Hi all,
I have a problem using Spark Streaming to accept input data and update a
result.
The input data comes from Kafka, and the output is a report map which is
updated with historical data every minute. My current method is to set the
batch size to 1 minute and use foreachRDD to update
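For keeping state across batches, updateStateByKey is the usual alternative to hand-rolled foreachRDD bookkeeping; a minimal sketch, assuming counts is a DStream of (key, count) pairs from the Kafka input and the checkpoint path is a placeholder:

// State updates require a checkpoint directory.
ssc.checkpoint("checkpoint-dir")   // hypothetical path

val runningTotals = counts.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))   // fold each batch into the running total
}
runningTotals.print()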
Hi,
The below code throws a compilation error, not found: value Sum. Can
someone help me with this? Do I need to add any jars or imports? Even for
Count, the same error is thrown.
val queryResult = sql("select * from Table")
queryResult.groupBy('colA)('colA, Sum('colB) as 'totB).aggregate(Sum
Hi,
http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3cb75376b8-7a57-4161-b604-f919886cf...@gmail.com%3E
This says that the Shark backend will be replaced with the Spark SQL engine in
the future.
Does that mean Spark will continue to support Shark + Spark SQL long term? OR
After some
Yes, they announced at Spark Summit 2014 that Shark is no longer under
development and will be replaced by Spark SQL.
Chester
On Wed, Jul 2, 2014 at 3:53 PM, Subacini B subac...@gmail.com wrote:
Hi,
As of Spark Summit 2014, they mentioned that there will be no active
development on Shark.
Thanks,
Shrikar
On Wed, Jul 2, 2014 at 3:53 PM, Subacini B subac...@gmail.com wrote:
Hi,
Hi all, I recently picked up Spark and am trying to work through a
coding issue that involves the reduceByKey method. After various debugging
efforts, it seems that the reduceByKey method never gets called.
Here's my workflow, which is followed by my code and results:
My parsed data
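For reference (not the poster's code), reduceByKey only fires on an RDD of key/value pairs, so the preceding map must emit 2-tuples; a minimal sketch:

import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val counts = words
  .map(w => (w, 1))          // must produce (key, value) pairs
  .reduceByKey(_ + _)        // merges values per key
counts.collect().foreach(println)   // e.g. (a,3), (b,2), (c,1)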
Spark SQL in Spark 1.1 will include all the functionality in Shark; take a look
at
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html.
We decided to do this because at the end of the day, the only code left in
Shark was the JDBC / Thrift
Hi,
I am trying to write a program that takes input from Kafka topics, does some
processing, and writes the output to an HBase table.
I basically followed the MatricAggregatorHBase example TD created
(https://issues.apache.org/jira/browse/SPARK-944), but the problem is that I
always get
Hello everyone,
I'm having some difficulty reading from my company's private S3 buckets.
I've got an S3 access key and secret key, and I can read the files fine from
a non-Spark Scala routine via AWScala http://github.com/seratch/AWScala
. But trying to read them with the
Hi Konstantin,
Thanks for reporting this. This happens because there are null keys in your
data. In general, Spark should not throw null pointer exceptions, so this
is a bug. I have fixed this here: https://github.com/apache/spark/pull/1288.
For now, you can work around this by special-handling
Hello everyone,
I'm having some difficulty reading from my company's private S3 buckets.
I've got an S3 access key and secret key, and I can read the files fine from
a non-Spark Scala routine via AWScala. But trying to read them with the
SparkContext.textFiles([comma separated s3n://bucket/key
When you use hadoopConfiguration directly, I don’t think you have to replace
the “/“ with “%2f”. Have you tried it without that? Also make sure you’re not
replacing slashes in the URL itself.
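For concreteness, a sketch of setting s3n credentials through hadoopConfiguration instead of embedding them in the URL (the bucket and path are made up, and the keys are read from environment variables here):

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
val lines = sc.textFile("s3n://my-private-bucket/path/to/file")   // hypothetical bucket/key
println(lines.count())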
Matei
On Jul 2, 2014, at 4:17 PM, Brian Gawalt bgaw...@gmail.com wrote:
Hello everyone,
I'm
Hi Christophe,
Make sure you have 3 slashes in the hdfs scheme.
e.g.
hdfs:///server_name:9000/user/user_name/spark-events
and in the spark-defaults.conf as well:
spark.eventLog.dir=hdfs:///server_name:9000/user/user_name/spark-events
Date: Thu, 19 Jun 2014 11:18:51 +0200
From:
Hi Matei,
Do you know how to run the JDBC / Thrift server on YARN?
I did not find any suggestion in the docs.
2014-07-02 16:06 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:
Spark SQL in Spark 1.1 will include all the functionality in Shark; take a
look at
Hi everyone,
Is it possible to join RDDs using composite keys? I would like to join these
two RDDs with RDD1.id1 = RDD2.id1 and RDD1.id2 = RDD2.id2:
RDD1: (id1, id2, scoretype1)
RDD2: (id1, id2, scoretype2)
I want the result to be ResultRDD = (id1, id2, (score1, score2))
Would really appreciate if you
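Not from the thread, but the usual approach is to key both RDDs by the (id1, id2) tuple and join; a sketch with invented data:

import org.apache.spark.SparkContext._   // pair-RDD functions such as join

val rdd1 = sc.parallelize(Seq((1, 10, 0.5), (2, 20, 0.7)))   // (id1, id2, scoretype1)
val rdd2 = sc.parallelize(Seq((1, 10, 0.9), (2, 20, 0.1)))   // (id1, id2, scoretype2)

val resultRDD = rdd1.map { case (id1, id2, s1) => ((id1, id2), s1) }
  .join(rdd2.map { case (id1, id2, s2) => ((id1, id2), s2) })
  .map { case ((id1, id2), (s1, s2)) => (id1, id2, (s1, s2)) }

resultRDD.collect().foreach(println)   // e.g. (1,10,(0.5,0.9))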
Spark displays job status information on port 4040 using JobProgressListener;
does anyone know how to hook into this port and read the details?
Hi,
I am using Mesos to run my Spark tasks. I would be interested to see how
Spark distributes the tasks in the cluster (nodes, partitions) and which
nodes are more or less active and do what kind of tasks, and how long the
transfer of data and jobs takes. Is there any way to get this information
Hello,
I am a newbie to Spark MLlib and ran into a curious case when following the
instructions at the page below.
http://spark.apache.org/docs/latest/mllib-naive-bayes.html
I ran a test program on my local machine using some data.
val spConfig = (new
This is due to a bug in sampling, which was fixed in 1.0.1 and latest
master. See https://github.com/apache/spark/pull/1234 . -Xiangrui
On Wed, Jul 2, 2014 at 8:23 PM, x wasedax...@gmail.com wrote:
Hello,
I am a newbie to Spark MLlib and ran into a curious case when following the
instructions at
Thanks for confirming.
I will check it.
Regards,
xj
On Thu, Jul 3, 2014 at 2:31 PM, Xiangrui Meng men...@gmail.com wrote:
This is due to a bug in sampling, which was fixed in 1.0.1 and latest
master. See https://github.com/apache/spark/pull/1234 . -Xiangrui
On Wed, Jul 2, 2014 at