Hi,
I wanted to set up a standalone cluster on a Windows machine. Unfortunately,
the spark-master.cmd file is not available. Can someone suggest how to proceed,
or is the spark-master.cmd file missing from spark-1.0.0?
Can you try setting SPARK_MASTER_IP in the spark-env.sh file?
Thanks
Best Regards
On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote:
Hi all,
I have one master and two slave nodes; I did not set any IP for the Spark
driver because I thought it uses its default (
Can you also paste a bit more of the stack trace?
Thanks
Best Regards
On Wed, Jul 9, 2014 at 12:05 PM, amin mohebbi aminn_...@yahoo.com wrote:
I have the following in spark-env.sh
SPARK_MASTER_IP=master
SPARK_MASTER_PORT=7077
Best Regards
Hi Srikrishna
The reason for this issue is that you uploaded the assembly jar to HDFS twice.
Pasting your command would allow a better diagnosis.
Tian Yi (田毅)
===
Orange Cloud Platform Product Line
Big Data Products Department
AsiaInfo-Linkage (China) Co., Ltd.
Mobile: 13910177261
Tel: 010-82166322
Fax: 010-82166617
QQ: 20057509
This is exactly what I got:
Spark Executor Command: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M
Hi all,
I need to run a Spark job that takes a set of images as input. I need
something that loads these images as an RDD, but I just don't know how to do
that. Does anyone have any ideas?
Cheers,
Jao
You can use the spark-ec2/bdutil scripts to set it up on the AWS/GCE cloud
quickly.
If you want to set it up on your own then these are the things that you
will need to do:
1. Make sure you have java (7) installed on all machines.
2. Install and configure spark (add all slave nodes in
Besides restarting the Master, is there any other way to clear the
Completed Applications in the Master web UI?
Try this out:
JavaStreamingContext ssc = new JavaStreamingContext(...);
JavaDStream<String> lines = ssc.textFileStream("whatever");
JavaDStream<String> words = lines.flatMap(
    new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    });
Hi Cheney,
I haven't heard of anybody deploying non-secure YARN on top of secure HDFS.
It's conceivable that you might be able to get it to work, but my guess is that
you'd run into some issues. Also, without authentication turned on in YARN, you
could be leaving your HDFS tokens exposed, which others could
Hi Bertrand,
We've updated the document
http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.9.0
This is our working Github repo
https://github.com/sigmoidanalytics/spork/tree/spork-0.9
Feel free to open issues over here
https://github.com/sigmoidanalytics/spork/issues
There isn't currently a way to do this, but it will start dropping
older applications once more than 200 are stored.
On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang hw...@qilinsoft.com wrote:
I have one master and two slave nodes; I did not set any IP for the Spark driver.
My question is: should I set an IP for the Spark driver, and can I host the driver
inside the cluster on the master node? If so, how do I host it? Will it be hosted
automatically on the node we submit the application by
It fulfills a few different functions. The main one is giving users a
way to inject Spark as a runtime dependency separately from their
program and make sure they get exactly the right version of Spark. So
a user can bundle an application and then use spark-submit to send it
to different types of
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote:
Impala is *not* built on map/reduce, though it was built to replace Hive,
which is map/reduce based. It has its own distributed query engine, though
it does load data from HDFS, and is part of the hadoop ecosystem. Impala
Hi Jerry, it's all clear to me now, I will try with something like Apache
DBCP for the connection pool
Thanks a lot for your help!
2014-07-09 3:08 GMT+02:00 Shao, Saisai saisai.s...@intel.com:
Yes, that would be the Java equivalent of using a static class member, but
you should carefully
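For reference, a minimal sketch of the singleton-pool idea under discussion (ConnectionPool, createPool, and save are hypothetical names; this is one common pattern rather than a prescribed implementation):

object ConnectionPool {
  // a Scala object plays the role of a Java static member:
  // one lazily initialized pool per executor JVM
  lazy val pool = createPool()  // hypothetical factory, e.g. backed by Apache DBCP
}

myDStream.foreachRDD(rdd => rdd.foreachPartition { records =>
  val conn = ConnectionPool.pool.getConnection()  // the pool outlives individual batches
  try records.foreach(r => save(conn, r))         // save: hypothetical write helper
  finally conn.close()                            // returns the connection to the pool
})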
Thanks for your input, Koert and DB. Rebuilding with 9.x didn't seem
to work. For now we've downgraded dropwizard to 0.6.2 which uses a
compatible version of jetty. Not optimal, but it works for now.
On Tue, Jul 8, 2014 at 7:04 PM, DB Tsai dbt...@dbtsai.com wrote:
We're doing similar thing to
Hey Mikhail,
I think (hope?) the -em and -dm options were never in an official
Spark release. They were just in the master branch at some point. Did
you use these during a previous Spark release or were you just on
master?
- Patrick
On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov
It seems like the 'Initial job has not accepted any resources' message shows
up for a wide variety of different errors (for example, the obvious one
where you've requested more memory than is available), but also, for
example, in the case where the worker nodes do not have the
appropriate code on their class
Thank you for your response.
Maybe that applies to my case.
In my test case, the types of almost all of the data are either primitive
types, Joda DateTime, or String.
But I'm somewhat disappointed with the speed.
At the least, it should not be slower than the Java default serializer...
Oh well, never mind. The problem is that ResultTask's stageId is immutable
and is used to construct the Task superclass. Anyway, my solution now is to
use this.id for the rddId and to gather all rddIds using a spark listener on
stage completed to clean up for any activity registered for those
Is it possible to run a job that assigns work to every worker in the system?
My workaround right now is to have a Spark listener fire whenever a block
manager is added and to increase a split count by 1. It runs a Spark job
with that split count and hopes that it will at least run on the newest
I am using Naive Bayes in MLlib .
Below I have printed the log of *model.theta* after training on the train data.
You can check that it contains 9 features for 2-class classification.
print numpy.log(model.theta)
[[ 0.31618962 0.16636852 0.07200358 0.05411449 0.08542039 0.17620751
0.03711986
Hi all,
I wondered if you could help me to clarify the following situation:
in the classic example
val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR"))
As I understand it, the data is first read into memory, and after that the
filtering is applied. Is there any way to
Hi,
Spark does that out of the box for you :)
It compresses down the execution steps as much as possible.
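To make that concrete, here is a small sketch (the path is a placeholder); because transformations are lazy, the read and the filter run together in one pipelined pass when an action finally fires:

val file = sc.textFile("hdfs://...")           // lazy: no I/O happens yet
val errors = file.filter(_.contains("ERROR"))  // lazy: still no I/O
errors.count()  // action: records are read and filtered in a single pass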
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jul 9, 2014 at 3:15 PM, Konstantin Kudryavtsev
Hi,
Regarding the Docker scripts: I know I can change the base image easily, but is
there any specific reason why the base image is hadoop_1.2.1? Why is this
preferred to Hadoop 2 (HDP2, CDH5) distributions?
Now that Amazon supports Docker, could this replace the ec2 scripts?
Kind regards
Dimitri
Hi,
Does anyone know if it is possible to call the MetadataCleaner on demand? I.e.,
rather than setting spark.cleaner.ttl and having this run periodically, I'd like to
run it on demand. The problem with periodic cleaning is that it can remove RDDs
that we still require (some calcs are short, others very
Hello,
While trying to run the example below I am getting errors. I have built
Spark using the following command:
$ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly
Running the example using
Not sure I understand why unifying how you submit apps for different
platforms, plus dynamic configuration, cannot be part of SparkConf and
SparkContext.
For the classpath, a simple script similar to hadoop classpath that shows what
needs to be added should be sufficient.
On Spark standalone I can launch
Are there any gaps beyond convenience and code/config separation in using
spark-submit versus SparkConf/SparkContext if you are willing to set your
own config?
If there are any gaps, +1 on having parity within SparkConf/SparkContext
where possible. In my use case, we launch our jobs
The idea is to run a job that uses images as input so that each worker will
process a subset of the images.
On Wed, Jul 9, 2014 at 2:30 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
RDDs can only keep objects. How do you plan to encode these images so that
they can be loaded? Keeping the whole
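One possible approach, sketched below, assumes the image files live on storage reachable from every worker (e.g. HDFS or a shared mount); readBytes and decodeImage are hypothetical helpers, not Spark API:

// distribute the list of file paths, then read and decode on the workers
val paths = sc.parallelize(Seq("hdfs://.../img1.png", "hdfs://.../img2.png"))
val images = paths.map { path =>
  val bytes = readBytes(path)   // hypothetical: fetch the file's raw bytes
  decodeImage(bytes)            // hypothetical: decode into your image type
}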
var sem = 0  // must be a var: the listener below mutates it
sc.addSparkListener(new SparkListener {
  override def onTaskStart(taskStart: SparkListenerTaskStart) {
    sem += 1
  }
})
// where sc is the SparkContext
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
Hi all,
I am currently trying to save to Cassandra after some Spark Streaming
computation.
I call a myDStream.foreachRDD so that I can collect each RDD in the driver
app runtime and inside I do something like this:
myDStream.foreachRDD(rdd => {
var someCol = Seq[MyType]()
foreach(kv => {
Hi Team,
Could you please help me to resolve the below query.
My use case is:
I'm using JavaStreamingContext to read text files from a Hadoop HDFS
directory:
JavaDStream<String> lines_2 =
ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/");
How do I convert the JavaDStream<String> result
Is MyType serializable? Everything inside the foreachRDD closure has to be
serializable.
2014-07-09 14:24 GMT+01:00 RodrigoB rodrigo.boav...@aspect.com:
+1 to be able to do anything via SparkConf/SparkContext. Our app
worked fine in Spark 0.9, but, after several days of wrestling with
uber jars and spark-submit, and so far failing to get Spark 1.0
working, we'd like to go back to doing it ourselves with SparkConf.
As the previous poster said, a
Hi,
For QueueRDD, have a look here.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala
Regards,
Laeeq
On Friday, May 23, 2014 10:33 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Well, it's hard to use text data
Hi,
I am using Spark 1.0.0 with the Ooyala Job Server for a low-latency query
system. Basically, a long-running context is created, which enables running
multiple jobs under the same context, and hence sharing of the data.
It was working fine in 0.9.1. However, in the Spark 1.0 release, the RDDs
created
Hi,
First use foreachRDD and then use collect, as:
DStream.foreachRDD(rdd => {
rdd.collect().foreach({
Also, it's better to use Scala. Less verbose.
Regards,
Laeeq
On Wednesday, July 9, 2014 3:29 PM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com wrote:
Hi Team,
Could you please
Hi,
For QueueRDD, have a look here.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala
Regards,
Laeeq,
PhD candidate,
KTH, Stockholm.
On Sunday, July 6, 2014 10:20 AM, alessandro finamore
alessandro.finam...@polito.it
+1 as well for being able to submit jobs programmatically without using
shell script.
We have also experienced issues submitting jobs programmatically without using
spark-submit. In fact, even in the Hadoop world, I rarely used hadoop jar
to submit jobs from a shell.
On Wed, Jul 9, 2014 at 9:47 AM,
Hi Luis,
Yes, it's actually an output of the previous RDD.
Have you ever used the Cassandra Spark Driver on the driver app? I believe
these limitations go around that - it's designed to save RDDs from the
nodes.
tnks,
Rod
did you explicitly cache the rdd? we cache rdds and share them between jobs
just fine within one context in spark 1.0.x. but we do not use the ooyala
job server...
On Wed, Jul 9, 2014 at 10:03 AM, premdass premdas...@yahoo.co.in wrote:
Yes, I'm using it to count concurrent users from a kafka stream of events
without problems. I'm currently testing it using the local mode but any
serialization problem would have already appeared so I don't expect any
serialization issue when I deploy to my cluster.
2014-07-09 15:39 GMT+01:00
Xichen_tju,
I recently evaluated Storm for a period of months (using three 2U servers, each
with a 2.4 GHz CPU and 24 GB RAM) and was not able to achieve a realistic scale for my
business domain needs. Storm is really only a framework, which allows you to
put in code to do whatever it is you need for a
Hi,
Yes, I am caching the RDDs by calling the cache method.
May I ask how you are sharing RDDs across jobs in the same context? By the RDD
name? I tried printing the RDDs of the Spark context, and when
referenceTracking is enabled, I get an empty list after the cleanup.
Thanks,
Prem
we simply hold on to the reference to the rdd after it has been cached. so
we have a single Map[String, RDD[X]] for cached RDDs for the application
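A minimal sketch of that pattern (the names are illustrative, not from the poster's code):

import org.apache.spark.rdd.RDD

// hold references to cached RDDs so later jobs in the same context can reuse them
val cachedRdds = scala.collection.mutable.Map[String, RDD[_]]()

val users = sc.textFile("hdfs://.../users").cache()
cachedRdds("users") = users  // any job run in this SparkContext can now look it up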
On Wed, Jul 9, 2014 at 11:00 AM, premdass premdas...@yahoo.co.in wrote:
I am trying to get my head around using Spark on YARN from the perspective of
a cluster. I can start a Spark shell with no issues in YARN; it works easily. This
is done in yarn-client mode and it all works well.
In multiple examples, I see instances where people have set up Spark
clusters in standalone
Another +1. For me it's a question of embedding. With
SparkConf/SparkContext I can easily create larger projects with Spark as a
separate service (just like MySQL and JDBC, for example). With spark-submit
I'm bound to Spark as the main framework that defines how my application
should look.
Spark still supports the ability to submit jobs programmatically without
shell scripts.
Koert,
The main reason that the unification can't be a part of SparkContext is
that YARN and standalone support deploy modes where the driver runs in a
managed process on the cluster. In this case, the
The idea behind YARN is that you can run different application types like
MapReduce, Storm and Spark.
I would recommend that you build your Spark jobs in the main method without
specifying how you will deploy them. Then you can use spark-submit to tell Spark
how you want to deploy them, using
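For example, a minimal sketch of a job written that way (the job name and paths are illustrative); note there is no setMaster() call, so spark-submit can decide how and where it runs:

import org.apache.spark.{SparkConf, SparkContext}

object MyJob {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MyJob")  // no master hard-coded here
    val sc = new SparkContext(conf)
    val counts = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(args(1))
  }
}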
Greetings,
The documentation at
http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark
says:
Note that while it is also possible to pass a reference to a method in a
class instance (as opposed to a singleton object), this requires sending
the object that contains
sandy, that makes sense. however i had trouble doing programmatic execution
on yarn in client mode as well. the application-master in yarn came up but
then bombed because it was looking for jars that don't exist (it was looking
in the original file paths on the driver side, which are not available
To add to Ron's answer, this post explains what it means to run Spark
against a YARN cluster, the difference between yarn-client and yarn-cluster
mode, and the reason spark-shell only works in yarn-client mode.
Hi
My setup is local-mode standalone, Spark 1.0.0 release version, Scala
2.10.4.
I made a job that receives serialized objects from a Kafka broker. The objects
are serialized using Kryo.
The code:
val sparkConf = new
SparkConf().setMaster("local[4]").setAppName("SparkTest")
Sandy, I experienced the similar behavior as Koert just mentioned. I don't
understand why there is a difference between using spark-submit and
programmatic execution. Maybe there is something else we need to add to the
spark conf/spark context in order to launch spark jobs programmatically
that
Hello,
I am currently learning Apache Spark and I want to see how it integrates
with an existing Hadoop Cluster.
My current Hadoop configuration is version 2.2.0 without YARN. I have built
Apache Spark (v1.0.0) following the instructions in the README file, only
setting the
Koert,
Yeah I had the same problems trying to do programmatic submission of spark jobs
to my Yarn cluster. I was ultimately able to resolve it by reviewing the
classpath and debugging through all the different things that the Spark Yarn
client (Client.scala) did for submitting to Yarn (like env
I am able to use Client.scala or LauncherExecutor.scala as my programmatic
entry point for Yarn.
Thanks,
Ron
Sent from my iPad
On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote:
Hi Patrick,
I used 1.0 branch, but it was not an official release, I just git pulled
whatever was there and compiled.
Thanks,
Mikhail
By latest branch, do you mean Apache Spark 1.0.0? And what do you mean by
master? Because I am using v1.0.0. - Alex
It seems to me there is a bug in the MLlib Naive Bayes implementation in Spark
0.9.1.
Whom should I report this to, or with whom should I discuss it? I can discuss
this over a call as well.
My Skype ID : rahul.bhijwani
Phone no: +91-9945197359
Thanks,
--
Rahul K Bhojwani
3rd Year B.Tech
Computer
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Ideally, make a JIRA with enough detail to reproduce the error:
https://issues.apache.org/jira/browse/SPARK
and then even more ideally open a PR with a fix:
https://github.com/apache/spark
On Wed, Jul 9, 2014 at 5:57 PM,
It means pulling the code from the latest development branch of the git
repository.
On Jul 9, 2014 9:45 AM, AlexanderRiggers alexander.rigg...@gmail.com
wrote:
Hi everyone,
I am new to Spark and I'm having problems making my code compile. I have
the feeling I might be misunderstanding the functions, so I would be very
glad to get some insight into what could be wrong.
The problematic code is the following:
JavaRDD<Body> bodies = lines.map(l -> {Body b =
Good point. Shows how personal use cases color how we interpret products.
On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote:
Thank you for the link. In that link the following is written:
For those familiar with the Spark API, an application corresponds to an
instance of the SparkContext class. An application can be used for a single
batch job, an interactive session with multiple jobs spaced apart, or a
long-lived
So basically, I have Spark on YARN running (spark-shell). How do I connect
to it with another tool I am trying to test, which expects a spark://IP:7077
URL? If that won't work with spark-shell or yarn-client mode, how do I set
up Spark on YARN to be able to handle that?
Thanks!
On
Spark doesn't currently offer you anything special to do this. I.e. if you
want to write a Spark application that fires off jobs on behalf of remote
processes, you would need to implement the communication between those
remote processes and your Spark application code yourself.
On Wed, Jul 9,
We have maven-enforcer-plugin defined in the pom. I don't know why it
didn't work for you. Could you try rebuilding with Maven 2 and confirm
that there is no error message? If that is the case, please create a
JIRA for it. Thanks! -Xiangrui
On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar
So how do I build the long-lived server continually satisfying requests
described in the Cloudera post? I am very confused by that at this point.
On Wed, Jul 9, 2014 at 12:49 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
At first glance that looks like an error with the class shipping in the
spark shell (i.e. the lines that you type into the spark shell are
compiled into classes and then shipped to the executors, where they run).
Are you able to run other spark examples with closures in the same shell?
Michael
I am using the Spark Streaming and have the following two questions:
1. If more than one output operation is put in the same StreamingContext
(basically, I mean I put all the output operations in the same class), are
they processed one by one in the order they appear in the class? Or are
they
I’m just starting to use the Scala version of Spark’s shell, and I’d like
to add in a jar I believe I need to access Twitter data live, twitter4j
http://twitter4j.org/en/index.html. I’m confused over where and how to
add this jar in.
SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089
1. Multiple output operations are processed in the order they are defined.
That is because, by default, only one output operation is processed at a
time. This *can* be parallelized using the undocumented config parameter
spark.streaming.concurrentJobs, which is by default set to 1.
2. Yes, the output
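To illustrate point 1, a small sketch assuming words is an existing DStream[String] (the output path is a placeholder):

words.print()  // output operation 1: runs first in each batch
words.countByValue().saveAsTextFiles("hdfs://.../counts")  // output operation 2:
// runs after operation 1 in each batch, unless spark.streaming.concurrentJobs > 1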
Great. Thank you!
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Wed, Jul 9, 2014 at 11:45 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Hi Folks:
I am working on an application which uses Spark Streaming (version 1.1.0
snapshot on a standalone cluster) to process text files and save counters in
Cassandra based on fields in each row. I am testing the application in two
modes:
* Process each row and save the counter
Hi Nicholas,
I am using Spark 1.0 and I use this method to specify the additional jars.
First jar is the dependency and the second one is my application. Hope this
will work for you.
./spark-shell --jars
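For example, a hypothetical full invocation (the jar paths and versions are placeholders, not the ones elided above):
./bin/spark-shell --jars /path/to/twitter4j-core-3.0.6.jar,/path/to/my-app.jar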
Hi guys,
I'm a new Spark user. I would like to know whether there is an example of how
to use Spark SQL and Spark Streaming together. My use case is that I want to do
some SQL on the input stream from Kafka.
Thanks!
Best,
Siyuan
Awww ye. That worked! Thank you Sameer.
Is this documented somewhere? I feel there's a slight doc deficiency
here.
Nick
On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote:
Right, the compile error is a casting issue telling me I cannot assign
a JavaPairRDD<Partition, Body> to a JavaPairRDD<Object, Object>. It happens
in the mapToPair() method.
On 9 July 2014 19:52, Sean Owen so...@cloudera.com wrote:
You forgot the compile error!
On Wed, Jul 9, 2014 at 6:14 PM,
Hello,
I have installed Apache Spark v1.0.0 in a machine with a proprietary Hadoop
distribution installed (v2.2.0 without YARN). Because the Hadoop
distribution that I am using relies on a list of jars, I made the
following changes to conf/spark-env.sh:
#!/usr/bin/env bash
export
Hi everybody
We have a Hortonworks cluster with many nodes, and we want to test a deployment
of Spark. What's the recommended path to follow?
I mean, we can compile the sources on the NameNode, but I don't really
understand how to pass the executable jar and configuration to the rest of
the nodes.
Hi Tobias,
I was using Spark 0.9 before and the master I used was yarn-standalone. In
Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not
sure whether it is the reason why more machines do not provide better
scalability. What is the difference between these two modes in
Krishna,
Ok, thank you. I just wanted to make sure that this can be done.
Cheers,
Nick
On Wed, Jul 9, 2014 at 3:30 PM, Krishna Sankar ksanka...@gmail.com wrote:
Nick,
AFAIK, you can compile with yarn=true and still run spark in stand
alone cluster mode.
Cheers
k/
On Wed, Jul 9,
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark on
all the nodes irrespective of Hadoop/YARN.
Cheers
k/
On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote:
I have a Spark app which runs well on local master. I'm now ready to
put it on a
Hi all,
I have a Spark Streaming job running on YARN. It consumes data from Kafka
and groups the data by a certain field. The data size is 480k lines per
minute and the batch size is 1 minute.
For some batches, the program sometimes takes more than 3 minutes to finish
the groupBy operation, which
Hi,
This time, instead of manually starting the worker node using
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
I used the start-slaves script on the master node. I also enabled -v (verbose flag)
in ssh. Here is the output that I see. The log file for the worker node was not
The Hortonworks Tech Preview of Spark is for Spark on YARN. It does not
require Spark to be installed on all nodes manually. When you submit the
Spark assembly jar it will have all its dependencies. YARN will instantiate
Spark App Master Containers based on this jar.
Hi,
Can anyone please shed more light on the PCA implementation in Spark? The
documentation is a bit lacking, as I am not sure I understand the output.
According to the docs, the output is a local matrix with the columns as
principal components and columns sorted in descending order of
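For reference, a minimal sketch of that API (MLlib's RowMatrix; the data values are placeholders):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))
val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(2)  // local matrix; one principal component per column
val projected = mat.multiply(pc)            // rows projected into the principal-component space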
Public service announcement:
If you're trying to do some stream processing on Twitter data, you'll need
version 3.0.6 of twitter4j http://twitter4j.org/archive/. That should
work with the Spark Streaming 1.0.0 Twitter library.
The latest version of twitter4j, 4.0.2, appears to have breaking
Bill,
I haven't worked with Yarn, but I would try adding a repartition() call
after you receive your data from Kafka. I would be surprised if that didn't
help.
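A hedged sketch of that suggestion (the partition count and the parse helper are illustrative, not tested against this job):

val stream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
// spread the received data over more partitions before the expensive shuffle
val spread = stream.repartition(32)
val grouped = spread.map(parse).groupByKey()  // parse: hypothetical (key, value) extractor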
On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi Tobias,
I was using Spark 0.9 before and the master
Hi Tobias,
Now I did the repartition and ran the program again. I found a bottleneck
in the whole program. In the streaming job, there is a stage marked as
*combineByKey at ShuffledDStream.scala:42* in the Spark UI. This stage is
repeatedly executed. However, during some batches, the number of executors
Siyuan,
I do it like this:
// get data from Kafka
val ssc = new StreamingContext(...)
val kvPairs = KafkaUtils.createStream(...)
// we need to wrap the data in a case class for registerAsTable() to succeed
val lines = kvPairs.map(_._2).map(s => StringWrapper(s))
val result = lines.transform((rdd,
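To show roughly where that was heading, a hedged completion of the same pattern under the Spark 1.0-era SQL API (SchemaRDD / registerAsTable); this is a sketch, not the poster's actual code:

case class StringWrapper(s: String)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion: RDD of case classes -> SchemaRDD

val result = lines.transform((rdd, time) => {
  rdd.registerAsTable("lines")           // each batch's RDD becomes a temporary table
  sqlContext.sql("SELECT s FROM lines")  // run SQL over the current batch
})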
Bill,
good to know you found your bottleneck. Unfortunately, I don't know how to
solve this; until now, I have used Spark only with embarrassingly parallel
operations such as map or filter. I hope someone else can provide more
insight here.
Tobias
On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay
Hi guys,
I am a little confused by the checkpointing in Spark Streaming. It
checkpoints the intermediate data for the stateful operations, for sure.
Does it also checkpoint the information of the StreamingContext? Because it
seems we can recreate the SC from the checkpoint after a driver node failure
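A sketch of the recovery pattern in question, assuming the StreamingContext.getOrCreate API (the checkpoint path is a placeholder; older releases used a checkpoint-directory constructor instead):

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(60))
  ssc.checkpoint("hdfs://.../checkpoints")  // stores context metadata as well as stateful data
  // ... set up the DStreams here ...
  ssc
}

// on (re)start: rebuild from the checkpoint if one exists, otherwise create fresh
val ssc = StreamingContext.getOrCreate("hdfs://.../checkpoints", createContext _)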
So I do this from the Spark shell:
// set things up ... snipped
ssc.start()
// let things happen for a few minutes
ssc.stop(stopSparkContext = false, stopGracefully = true)
Then I want to restart the Streaming Context:
ssc.start() // still in the shell; Spark Context is still alive
Which
I don't see why using SparkSubmit.scala as your entry point would be any
different, because all that does is invoke the main class of Client.scala
(e.g. for Yarn) after setting up all the class paths and configuration
options. (Though I haven't tried this myself)
2014-07-09 9:40 GMT-07:00 Ron