If you open up the driver UI (running on port 4040), you can see multiple tasks
per stage running concurrently. If it is a single task and you want to increase
the parallelism, you can simply repartition.
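For example, a minimal sketch (the RDD name and the partition count are just
placeholders, not taken from your job):
val repartitioned = someRdd.repartition(8) // 8 partitions -> up to 8 concurrent tasks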
Thanks
Best Regards
On Tue, May 26, 2015 at 8:27 AM,
Hi All,
Please help me with the below 2 issues:
*Environment:*
I am running my Spark cluster in standalone mode.
I am initializing the Spark context from inside my Tomcat server.
I am setting the below properties in environment.sh in the $SPARK_HOME/conf
directory
I got a similar problem. I'm not sure if your problem is already resolved.
For the record, I solved this type of error by calling
sc.setMaster("yarn-cluster"); If you find the solution, please let us know.
Regards, Mohammad
On Friday, March 6, 2015 2:47 PM, nitinkak001
Try these:
- Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM)
- Enable log rotation:
sparkConf.set("spark.executor.logs.rolling.strategy", "size")
.set("spark.executor.logs.rolling.size.maxBytes", "1024")
.set("spark.executor.logs.rolling.maxRetainedFiles", "3")
You can also look into
we tried to cache a table through
hiveCtx = HiveContext(sc)
hiveCtx.cacheTable("table_name")
as described in Spark 1.3.1's documentation. We are on CDH 5.3.0 with Spark
1.3.1 built with Hadoop 2.6.
The following error message would occur if we tried to cache a table stored as
Parquet with GZIP compression,
though we're not
Hello!
Thank you all for your answers. Akhil's proposed solution works fine.
Thanks.
Florin
On Tue, May 26, 2015 at 3:08 AM, Wesley Miao wesley.mi...@gmail.com wrote:
The reason it didn't work for you is that the function you registered with
someRdd.map will be running on the
Thanks for the suggestion, I will try and post the outcome.
From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Monday, May 25, 2015 12:24 PM
To: Chaudhary, Umesh; user@spark.apache.org
Subject: Re: Websphere MQ as a data source for Apache Spark Streaming
Hi Umesh,
You can write a
An Executor is a JVM instance spawned and running on a Cluster Node (Server
machine). Task is essentially a JVM Thread – you can have as many Threads as
you want per JVM. You will also hear about “Executor Slots” – these are
essentially the CPU Cores available on the machine and granted for use
This is the first time I hear that “one can specify the RAM per task” – the RAM
is granted per Executor (JVM). On the other hand each Task operates on ONE RDD
Partition – so you can say that this is “the RAM allocated to the Task to
process” – but it is still within the boundaries allocated to
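As a rough illustration (the values are only examples, not a recommendation),
the resources are always requested per Executor, never per Task:
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.executor.memory", "4g") // RAM granted to each Executor JVM
  .set("spark.executor.cores", "4")   // "slots": up to 4 concurrent task threads share that RAM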
Hi,
In CF
String path = "data/mllib/als/test.data";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Rating> ratings = data.map(new Function<String, Rating>() {
public Rating call(String s) {
String[] sarray = s.split(",");
return new Rating(Integer.parseInt(sarray[0]),
Integer.parseInt(sarray[1]),
Hi,
I am currently experimenting with linear regression (SGD) (Spark + MLlib,
ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I
do this (for now) by an exhaustive grid search of the step size and the
number of iterations. Currently I am on a dual core that acts as a
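For reference, a minimal grid-search sketch along those lines (trainingData and
validationData are assumed to be RDD[LabeledPoint]s; the grid values are
arbitrary):
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val stepSizes = Seq(0.001, 0.01, 0.1, 1.0)
val iterationCounts = Seq(50, 100, 200)
val results = for (step <- stepSizes; iters <- iterationCounts) yield {
  val model = LinearRegressionWithSGD.train(trainingData, iters, step)
  val mse = validationData.map { p =>
    val err = model.predict(p.features) - p.label
    err * err
  }.mean()
  ((step, iters), mse)
}
val ((bestStep, bestIters), bestMse) = results.minBy(_._2) // lowest validation MSE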
Hello,
I would like to switch from Scala 2.10 to 2.11 for Spark app development.
It seems that the only thing blocking me is a missing
spark-streaming-kafka_2.11 maven package. Any plan to add it or am I just
blind?
Many thanks,
Vladimir
Since Spark can run multiple tasks in one executor, I am curious to know
how Spark manages memory across these tasks. Say one executor takes
1GB memory; if this executor can run 10 tasks simultaneously, then
each task can consume 100MB on average. Do I understand it correctly? It
Thanks Akhil, I checked the job UI again; my app is running
concurrently in all the executors. But some of the tasks got an I/O exception. I
will continue inspecting this.
java.io.IOException: Failed to create local dir in
Yes, I know that one task represents a JVM thread. This is what confused me.
Usually users want to specify the memory at the task level, so how can I do it
if a task is thread-level and multiple tasks run in the same executor? And
I don't even know how many threads there will be. Besides that, if one
I believe you would be restricted by the number of cores you have in your
cluster. Having a worker running without a core is useless.
On Tue, May 26, 2015 at 3:04 PM, canan chen ccn...@gmail.com wrote:
In spark standalone mode, there will be one executor per worker. I am
wondering how many
I think the concept of a task in Spark should be on the same level as a task in
MR. Usually in MR, we need to specify the memory for each mapper/reducer
task. And I believe executor is not a user-facing concept, it's a Spark-internal
concept. For Spark users, they don't need to know the concept of
Hi,
I have a couple of use cases for Apache Spark applications/scripts,
generally of the following form:
*General ETL use case* - more specifically a transformation of a
Cassandra column family containing many events (think event sourcing) into
various aggregated column families.
On Tue, May 26, 2015 at 10:09 AM, algermissen1971
algermissen1...@icloud.com wrote:
Hi,
I am setting up a project that requires Kafka support and I wonder what
the roadmap is for Scala 2.11 Support (including Kafka).
Can we expect to see 2.11 support anytime soon?
The upcoming 1.4
Hi Evo,
Worker is the JVM and an executor runs on the JVM. And after Spark 1.4 you
would be able to run multiple executors on the same JVM/worker.
https://issues.apache.org/jira/browse/SPARK-1706.
Thanks
Arush
On Tue, May 26, 2015 at 2:54 PM, canan chen ccn...@gmail.com wrote:
I think the
In Spark standalone mode, there will be one executor per worker. I am
wondering how many executors I can acquire when I submit an app. Is it greedy
mode (as many as I can acquire)?
That is a much better solution than how I resolved it. I got around it by
placing comma-separated jar paths for all the Hive-related jars in the --jars
clause.
I will try your solution. Thanks for sharing it.
On Tue, May 26, 2015 at 4:14 AM, Mohammad Islam misla...@yahoo.com wrote:
I got a similar
I know it isn't exactly what you are asking for, but you could solve it
like this:
Driver program queries dynamo for the s3 file keys.
sc.textFile each of the file keys and .union them all together to make your
RDD.
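Something along these lines (a sketch; the bucket name is a placeholder and
the Dynamo lookup that produces the keys is left out):
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def rddForKeys(sc: SparkContext, keys: Seq[String]): RDD[String] = {
  val parts = keys.map(key => sc.textFile(s"s3n://my-bucket/$key")) // one RDD per key
  parts.reduce(_ union _) // or sc.union(parts)
}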
You could wrap that up in a function and it wouldn't be too painful to
reuse. I
Hi,
I just configured my cluster to run with 1.4.0-rc2; alas, the dependency
jungle does not let one just download, configure and start. Instead one
will have to fiddle with sbt settings for the next couple of nights:
2015-05-26 14:50:52,686 WARN a.r.ReliableDeliverySupervisor -
the link you sent says multiple executors per node
Worker is just a daemon process launching Executors / JVMs so it can execute tasks
- it does that by cooperating with the master and the driver
There is a one-to-one mapping between Executor and JVM
Sent from Samsung Mobile
It's being added in 1.4
https://repository.apache.org/content/repositories/orgapachespark-1104/org/apache/spark/spark-streaming-kafka_2.11/1.4.0-rc2/
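Once those artifacts are published, picking the library up from sbt should look
roughly like this (the version is an assumption until the release is final):
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.4.0"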
On Tue, May 26, 2015 at 3:14 AM, Petr Novak oss.mli...@gmail.com wrote:
Hello,
I would like to switch from Scala 2.10 to 2.11 for Spark app
Hello all,
I have an accumulator in my Spark Streaming application which counts the number
of events received from Kafka.
From the documentation, it seems the Spark UI has support to display it,
but I am unable to see it on the UI. I am using Spark 1.3.1.
Do I need to call any method (print) or am I missing
Hello,
I am trying to run a Spark job (which runs fine on the master node of the
cluster) on an HDFS Hadoop cluster using YARN. When I run the job, which has
an rdd.saveAsTextFile() line in it, I get the following error:
*SystemError: unknown opcode*
The entire stacktrace has been appended to
You need to make sure to name the accumulator.
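i.e. pass a name when you create it; something like this (a sketch, the variable
and accumulator names are made up):
val kafkaEvents = sc.accumulator(0L, "Events received from Kafka") // the name makes it show up on the stage pages
...
kafkaEvents += 1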
On Tue, May 26, 2015 at 2:23 PM, Snehal Nagmote nagmote.sne...@gmail.com
wrote:
Hello all,
I have accumulator in spark streaming application which counts number of
events received from Kafka.
From the documentation , It seems Spark UI has
Have you looked at https://github.com/spark/sparkjs ?
Cheers
On Tue, May 26, 2015 at 10:17 AM, marcos rebelo ole...@gmail.com wrote:
Hi all,
My first message on this mailing list:
I need to run JavaScript on Spark. Somehow I would like to use the
ScriptEngineManager or any other way that
Is it just me or does that look completely unrelated to
Spark-the-Apache-project?
On Tue, May 26, 2015 at 10:55 AM, Ted Yu yuzhih...@gmail.com wrote:
Have you looked at https://github.com/spark/sparkjs ?
Cheers
On Tue, May 26, 2015 at 10:17 AM, marcos rebelo ole...@gmail.com wrote:
Hi
Most of the 2.11 issues are being resolved in Spark 1.4. For a while, the
Spark project has published maven artifacts that are compiled with 2.11 and
2.10, although the downloads at http://spark.apache.org/downloads.html are
still all for 2.10.
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd
w.r.t. Kafka library, see
https://repository.apache.org/content/repositories/orgapachespark-1104/org/apache/spark/spark-streaming-kafka_2.11/1.4.0-rc2/
FYI
On Tue, May 26, 2015 at 8:33 AM, Ritesh Kumar Singh
riteshoneinamill...@gmail.com wrote:
Yes, recommended version is 2.10 as all the
we are still running into issues with spark-shell not working on 2.11, but
we are running on somewhat older master so maybe that has been resolved
already.
On Tue, May 26, 2015 at 11:48 AM, Dean Wampler deanwamp...@gmail.com
wrote:
Most of the 2.11 issues are being resolved in Spark 1.4. For a
Hi folks, for those of you working with Cassandra: I'm wondering if anyone has
been successful in processing a mix of Cassandra and HDFS data. I have a
dataset which is stored partially in HDFS and partially in Cassandra
(the schema is the same in both places).
I am trying to do the following:
val dfHDFS =
This is likely the case when you run different versions of Python in the
driver and slaves; Spark 1.4 will double-check that (it will be released
soon).
Also, SPARK_PYTHON should be PYSPARK_PYTHON
On Tue, May 26, 2015 at 11:21 AM, Nikhil Muralidhar nmural...@gmail.com wrote:
Hello,
I am trying to run a
Yep, why not use, as you said, a JS engine like Rhino? But then I would
suggest using mapPartitions instead, so there is only one engine per partition.
Broadcasting the script is probably also a good thing to do.
I guess it's for ad hoc transformations passed by a remote client,
otherwise you could simply
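A rough sketch of that idea (assuming both the records and the script are
Strings; the variable names are mine):
import javax.script.ScriptEngineManager

val scriptBc = sc.broadcast(jsSource) // jsSource: the JavaScript text shipped from the client
val transformed = rdd.mapPartitions { iter =>
  val engine = new ScriptEngineManager().getEngineByName("JavaScript") // one engine per partition
  iter.map { record =>
    engine.put("item", record) // expose the current item to the script
    String.valueOf(engine.eval(scriptBc.value))
  }
}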
Hi all
Let me be clear, I'm speaking of Spark (big data, map/reduce, Hadoop, ...
related). I have multiple map/flatMap/groupBy steps and one of them needs to
be a map passing each item through JavaScript code.
2 questions:
- Is this question related to this list?
- Has anyone done something
Hi
I don't know how it works. For example:
val result = joinedData.groupBy("col1", "col2").agg(
count(lit(1)).as("counter"),
min("col3").as("minimum"),
sum(case when endrscp 100 then 1 else 0 end).as("test")
)
How can I do it?
Thanks
Regards.
Miguel.
On Tue, May 26, 2015 at 12:35 AM, ayan guha
I’ve been attempting to run a SparkR translation of a similar Scala job that
identifies words from a corpus not existing in a newline delimited dictionary.
The R code is:
dict <- SparkR:::textFile(sc, src1)
corpus <- SparkR:::textFile(sc, src2)
words <- distinct(SparkR:::flatMap(corpus,
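For comparison, the Scala job being translated presumably looks roughly like
this (only a sketch; the whitespace tokenization is an assumption):
val dict = sc.textFile(src1)
val corpus = sc.textFile(src2)
val words = corpus.flatMap(_.split("\\s+")).distinct()
val unknown = words.subtract(dict) // corpus words not present in the dictionary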
Dear Spark developers and users,
Am I correct in believing that the recommended version of Scala to use with
Spark is currently 2.10? Is there any plan to switch to 2.11 in future? Are
there any advantages to using 2.11 today?
Regards,
Punya
It is Hadoop 2.4.0 with Spark 1.3.0.
I found that the problem only happens if there are multiple nodes. If the cluster
has only one node, it works fine.
For example, if the cluster has a spark-master on machine A and a spark-worker
on machine B, this problem happens. If both spark-master and
Hello,
I am trying to build the Scala docs from the 1.4 branch, but it failed due to:
[error] (sql/compile:compile) java.lang.AssertionError: assertion failed:
List(object package$DebugNode, object package$DebugNode)
I followed the instructions on GitHub
For this, I can give you a SQL solution:
joinedData.registerTempTable('j')
res = ssc.sql('select col1, col2, count(1) counter, min(col3) minimum,
sum(case when endrscp 100 then 1 else 0 end) test from j')
Let me know if this works.
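If you prefer to stay in the DataFrame API, the when/otherwise functions
(available from Spark 1.4) express the same thing. A sketch, assuming the
intended condition is endrscp > 100:
import org.apache.spark.sql.functions._

val result = joinedData.groupBy("col1", "col2").agg(
  count(lit(1)).as("counter"),
  min("col3").as("minimum"),
  sum(when(col("endrscp") > 100, 1).otherwise(0)).as("test") // the '>' is an assumption
)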
On 26 May 2015 23:47, Masf masfwo...@gmail.com wrote:
Hi
I don't
Yes, you are on the right mailing list, for sure :)
Regarding your question, I am sure you are well versed in how Spark
works. Essentially you can run any arbitrary function with a map call and it
will run on remote nodes. Hence you need to install any needed dependency
on all nodes. You can also pass
Hi!
I create a DataFrame using the following method:
JavaRDD<Row> rows = ...
StructType structType = ...
Then apply sqlContext.createDataFrame(rows, structType).
I have a pretty complex schema:
root
 |-- Id: long (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- FirstName: array
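For what it's worth, that kind of nested schema can be built programmatically
along these lines (a minimal Scala sketch; the array element type is an
assumption, since the printout is cut off):
import org.apache.spark.sql.types._

val structType = StructType(Seq(
  StructField("Id", LongType, nullable = true),
  StructField("attributes", StructType(Seq(
    StructField("FirstName", ArrayType(StringType), nullable = true) // element type assumed
  )), nullable = true)
))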
Hi all,
My first message on this mailing list:
I need to run JavaScript on Spark. Somehow I would like to use the
ScriptEngineManager or any other way that makes Rhino do the work for me.
Consider that I have a Structure that needs to be changed by a JavaScript.
I will have a set of Javascript