I'm also pretty interested in how to create custom Sinks in Spark. I'm using it
with Ganglia and the normal metrics from the JVM source do show up. I tried to
create my own metric based on Issac's code, but it does not show up in Ganglia.
Does anyone know where the problem is?
Here's the code snippet:
I meant custom Sources, sorry.
Hi Denes,
I think you can register your customized metrics source into the metrics system
through metrics.properties; you can take metrics.properties.template as a
reference.
Basically you can do as follows if you want to monitor on executors:
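For example, a minimal self-contained source might look like the sketch below (package, class and metric names are only illustrative; note that in some Spark versions the Source trait is private to Spark, so the class may need to live under an org.apache.spark package):
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.spark.metrics.source.Source
class MyAppSource extends Source {
  override val sourceName = "myapp"
  override val metricRegistry = new MetricRegistry()
  // illustrative gauge; report whatever the application needs
  metricRegistry.register(MetricRegistry.name("parsedMessages"), new Gauge[Long] {
    override def getValue: Long = 0L
  })
}
It can then be referenced from conf/metrics.properties so that executors load it, e.g. executor.source.myapp.class=com.example.MyAppSource (the source name and class here are assumptions, not taken from the template).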
Hi, I'm using local mode and read a text file as an RDD using the
JavaSparkContext.textFile() API.
Then I call the cache() method on the resulting RDD.
Looking at the Storage information, I find the RDD has 3 partitions but
only 2 of them have been cached.
Is this normal behavior? I assume all of
Yes, it's normal when there is not enough memory to hold the third partition, as
you can see in your attached picture.
Thanks
Jerry
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, July 22, 2014 3:09 PM
To: user@spark.apache.org
Subject: number of Cached Partitions v.s. Total Partitions
Hi, Yin Huai
I tested again with your code snippet.
It works well in spark-1.0.1.
Here is my code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
case class Record(data_date: String, mobile: String, create_time: String)
val mobile = Record("2014-07-20", "1234567", "2014-07-19")
Hi All,
Can someone help on this?
I'm encountering exactly the same issue in a very similar scenario with the
same spark version.
Thanks
Alessandro
On Fri, Jul 18, 2014 at 8:30 PM, Shannon Quinn squ...@gatech.edu wrote:
Hi all,
I'm dealing with some strange error messages that I *think*
I've just had the same problem.
I'm using
$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client $JOBJAR
--class $JOBCLASS
It's really strange, because the log shows that
14/07/22 16:16:58 INFO ui.SparkUI: Started SparkUI at
http://k1227.mzhen.cn:4040
14/07/22 16:16:58
Ok, thanks for the answers. Unfortunately, there is no sc.getPersistentRDDs
for pyspark.
That's what my problem is:)
What about simply:
dstream.foreachRDD(_.saveAsSequenceFile(...))
?
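A minimal sketch along those lines (paths and types are illustrative) that writes each batch of word counts to its own directory, keyed by batch time so earlier output is not overwritten:
import org.apache.spark.SparkContext._   // implicits that add saveAsSequenceFile to (K, V) RDDs
import org.apache.spark.streaming.Time
// wordCounts: DStream[(String, Int)] from the word-count example
wordCounts.foreachRDD { (rdd, time: Time) =>
  rdd.saveAsSequenceFile(s"hdfs:///tmp/wordcounts-${time.milliseconds}")
}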
On Tue, Jul 22, 2014 at 2:06 AM, Barnaby bfa...@outlook.com wrote:
First of all, I do not know Scala, but I am learning. I'm doing a proof of
concept by streaming content from a socket, counting the words, and writing it
to a
Hi Jerry,
I know that way of registering a metrics source, but it seems to defeat the
whole purpose. I'd like to define a source that is set within the application,
for example the number of parsed messages.
If I register it in the metrics.properties, how can I obtain the instance?
(or instances?)
How can I
Hi spark.
I see there has been some work around graphviz visualization for spark jobs.
1) I'm wondering if anyone is actively maintaining this stuff, and if so what
the best docs are for it - or else, whether there is interest in an upstream
JIRA for updating the graphviz APIs.
2) Also, I am curious
Yeah, I'm starting to understand your purpose. The original design of customized
metrics sources focused on self-contained sources; it seems you need to rely on
an outer variable, so the way you mentioned may be the only way to register it.
Besides, since you cannot see the source in Ganglia, I think you can
Running a simple collect method on a group of Avro objects causes a plain
NullPointerException. Does anyone know what may be wrong?
files.collect()
Press ENTER or type command to continue
Exception in thread "Executor task launch worker-0"
java.lang.NullPointerException
at
For those curious: I was using a KryoRegistrator and it was causing the null
pointer exception. I removed the code and the problem went away.
Do you have a list/array in your Avro record? If yes, this could cause the
problem. I experienced this kind of problem and solved it by providing
custom Kryo ser/de for Avro lists. Also be careful: Spark reuses records,
so if you just read them and then don't copy/transform them, you would end up
with
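A minimal sketch of that copy step, assuming generated specific Avro records (MyAvroRecord is a placeholder class name):
import org.apache.avro.specific.SpecificData
// records: RDD[MyAvroRecord]; deep-copy each record before caching or collecting,
// because the input format may hand back the same object instance repeatedly
val copied = records.map(r => SpecificData.get().deepCopy(r.getSchema, r))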
Hi,
Where can I find the version of Hadoop my cluster is using? I launched my
ec2 cluster using the spark-ec2 script with the --hadoop-major-version=2
option. However, the folder hadoop-native/lib in the master node only
contains files that end in 1.0.0. Does that mean that I have Hadoop version
Hi,
I am analysing application processing on Spark (GraphX), but am a little
confused about some items in the web UI.
1) What is the difference between Duration (Stages - Completed Stages) and
Task Time (Executors)?
For instance, 43 s vs. 5.6 min.
Task Time is approximately Duration multiplied
Using a case class as a key doesn't seem to work properly. [Spark 1.0.0]
A minimal example:
case class P(name: String)
val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect
[Spark shell local mode] res : Array[(P, Int)] =
Hi TD,
Eventually I found that I made a mistake - the RDD I used for join does
not contain any content.
Now it works.
Thanks,
Hawk
On Jul 21, 2014, at 17:58, Tathagata Das wrote:
Could you share your code snippet so that we can take a look?
TD
On Mon, Jul 21, 2014 at 7:23 AM, hawkwang
Just to narrow down the issue, it looks like the problem is in 'reduceByKey'
and derivatives like 'distinct'.
groupByKey() seems to work:
sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect
res: Array[(String, Iterable[Int])] = Array((charly,ArrayBuffer(1)),
(abe,ArrayBuffer(1)),
Hi guys,
Is it possible to generate a single stream RDD which can be updated with
new batch RDD content?
I know that we can use updateStateByKey for aggregation,
but here I just want to keep tracking all the historical original content.
I also noticed that we can save to redis or other storage
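A minimal sketch of keeping all history via updateStateByKey (names are illustrative; the state grows without bound unless you prune it, and updateStateByKey requires ssc.checkpoint to be set):
import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits
// keyedRecords: DStream[(String, String)] of (key, original record)
val history = keyedRecords.updateStateByKey[Seq[String]] {
  (newValues: Seq[String], state: Option[Seq[String]]) =>
    Some(state.getOrElse(Seq.empty[String]) ++ newValues)
}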
I am also having exactly the same problem, calling using pyspark. Has
anyone managed to get this script to work?
--
Martin Goodson | VP Data Science
(0)20 3397 1240
On Wed, Jul 16, 2014 at 2:10 PM, Ian Wilkinson ia...@me.com wrote:
Hi,
I’m trying to run the Spark
I can confirm this bug. The behavior for groupByKey is the same as
reduceByKey - your example is actually grouping on just the name. Try this:
sc.parallelize(ps).map(x => (x, 1)).groupByKey().collect
res1: Array[(P, Iterable[Int])] = Array((P(bob),ArrayBuffer(1)),
(P(bob),ArrayBuffer(1)),
Hi All,
I am getting events from Flume using the following line.
JavaDStream<SparkFlumeEvent> flumeStream = FlumeUtils.createStream(ssc, host,
port);
Each event is a delimited record. I'd like to use some of the transformation
functions like map and reduce on it. Do I need to convert the
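For reference, a minimal sketch with the Scala API of pulling the record text out of the events before applying map/reduce (it assumes the event body is UTF-8 text backed by a heap buffer, and "," is just an example delimiter):
import org.apache.spark.streaming.flume.FlumeUtils
val flumeStream = FlumeUtils.createStream(ssc, host, port)   // DStream[SparkFlumeEvent]
val fields = flumeStream
  .map(e => new String(e.event.getBody.array(), "UTF-8"))   // event body -> record text
  .map(_.split(","))                                        // delimited record -> fields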
Yes, right: sc.parallelize(ps).map(x => (x.name, 1)).groupByKey().collect
An oversight on my side.
Thanks! Gerard.
On Tue, Jul 22, 2014 at 5:24 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
I can confirm this bug. The behavior for groupByKey is the same as
reduceByKey - your
Thanks Sean! I got that working last night similar to how you solved it. Any
ideas about how to monitor that same folder in another script by creating a
stream? I can use sc.sequenceFile() to read in the RDD, but how do I get the
name of the file that got added since there is no
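One possible sketch of watching that output directory from a second program (directory, key/value types and batch interval are assumptions; this only picks up newly added files, it does not report their names):
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(30))
val newCounts = ssc.fileStream[Text, IntWritable,
  SequenceFileInputFormat[Text, IntWritable]]("hdfs:///tmp/wordcounts")
newCounts.print()
ssc.start()
ssc.awaitTermination()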
I created https://issues.apache.org/jira/browse/SPARK-2620 to track this.
Maybe useful to know: this is a regression in Spark 1.0.0. I tested the
same sample code on 0.9.1 and it worked (we have several jobs using case
classes as key aggregators, so it had better work).
-kr, Gerard.
On Tue, Jul 22,
i took this over from parviz.
i recently submitted a new PR for Kinesis Spark Streaming support:
https://github.com/apache/spark/pull/1434
others have tested it with good success, so give it a whirl!
waiting for it to be reviewed/merged. please put any feedback into the PR
directly.
thanks!
On standalone there is still special handling for assigning tasks within
executors. There just isn't special handling for where to place executors,
because standalone generally places an executor on every node.
On Mon, Jul 21, 2014 at 7:42 PM, Haopu Wang hw...@qilinsoft.com wrote:
Sandy,
Hi,
I don't think anybody has been testing importing Impala tables
directly. Is there any chance to export these first, say as
unpartitioned Hive tables, and import those? Just an idea.
Andre
On 07/21/2014 11:46 PM, chutium wrote:
no, something like this
14/07/20 00:19:29 ERROR
Hi,
For running Spark SQL, the datanucleus*.jar files are automatically added to the
classpath. This works fine for Spark standalone mode and yarn-client mode;
however, for yarn-cluster mode I have to explicitly pass these jars using the
--jars option when submitting the job, otherwise the job will fail. Why
I could possibly use the Spark API and write a batch app to provide some per-web-page
stats such as views, uniques, etc. The same can be achieved using
Spark SQL, so I wanted to check:
* What are the best practices and pros/cons of either approach?
* Does SparkSQL require registerAsTable for
Hi Earthson,
Is your problem resolved? The way you submit your application looks alright
to me; spark-submit should be able to parse the combination of --master and
--deploy-mode correctly. I suspect you might have hard-coded yarn-cluster
or something in your application.
Andrew
2014-07-22
Hi Maria,
Having files that end with 1.0.0 means you're on Spark 1.0, not Hadoop 1.0.
You can check your Hadoop version by running $HADOOP_HOME/bin/hadoop
version, where HADOOP_HOME is set to your installation of Hadoop. On
clusters started by the Spark ec2 scripts, this should be
Hi Tobias,
I tried to use 10 as numPartitions. The number of executors allocated is the
number of DStreams. Therefore, it seems the parameter does not spread data
into many partitions. In order to do that, it seems we have to
repartition. If numPartitions will distribute the data to multiple
I don't understand what you're trying to do.
The code will use log4j under the covers. The default configuration
means writing log messages to stderr. In yarn-client mode that is your
terminal screen, in yarn-cluster mode that is redirected to a file by
Yarn. For the executors, that will always
Hi all,
I am currently running a Spark Streaming program which consumes data from
Kafka and does a group-by operation on the data. I am trying to optimize the
running time of the program because it looks slow to me. It seems the stage
named:
combineByKey at ShuffledDStream.scala:42
always
I haven't had a chance to look at the details of this issue, but we have
seen Spark successfully read Parquet tables created by Impala.
On Tue, Jul 22, 2014 at 10:10 AM, Andre Schumacher andre.sc...@gmail.com
wrote:
Hi,
I don't think anybody has been testing importing of Impala tables
Hello All,
Basically I need to edit log4j.properties to filter out some of the
unnecessary logs in Spark in yarn-client mode. I am not sure where I can
find the log4j.properties file (its location). Can anyone help me with this?
I fixed the error with the yarn-client mode issue which I mentioned in my
earlier post. Now I want to edit log4j.properties to filter out some of the
unnecessary logs. Can you let me know where I can find this properties file?
Hi all,
I am running a Spark Streaming job. The job hangs on one stage, which shows
as follows:
Details for Stage 4
Summary Metrics: No tasks have started yet
Tasks: No tasks have started yet
Does anyone have an idea on this?
Thanks!
Bill
Hi there,
I was wondering if anybody could help me find an efficient way to make a
MapReduce program like this:
1) Each map function needs access to some huge files, around 6 GB.
2) These files are READ-ONLY. Actually they are like a huge look-up
table, which will not change
Is the first() being computed locally on the driver program? Maybe it's too hard
to compute with the memory, etc. available there. Take a look at the driver's
log and see whether it has the message "Computing the requested partition
locally".
Matei
On Jul 22, 2014, at 12:04 PM, Nathan Kronenfeld
Hi Sean,
Thanks for clarifying. I re-read SPARK-2420 and now have a better understanding.
From a user perspective, what would you recommend for building Spark with Hive
0.12 / 0.13+ libraries going forward and deploying to a production cluster that
runs an older version of Hadoop (e.g. 2.2 or 2.4)?
You can upload your own log4j.properties using spark-submit's
--files argument.
On Tue, Jul 22, 2014 at 12:45 PM, abhiguruvayya
sharath.abhis...@gmail.com wrote:
I fixed the error with the yarn-client mode issue which i mentioned in my
earlier post. Now i want to edit the log4j.properties to
I tried to map SparkFlumeEvents to RDDs of String as below, but that map and
call are not executed at all. I might be doing this in the wrong way. Any help
would be appreciated.
flumeStream.foreach(new Function<JavaRDD<SparkFlumeEvent>, Void>() {
@Override
public Void
Thanks, I am able to load the file now. Can I turn off specific logs using
log4j.properties? I don't want to see the logs below. How can I do this?
14/07/22 14:01:24 INFO scheduler.TaskSetManager: Starting task 2.0:129 as
TID 129 on executor 3: ** (NODE_LOCAL)
14/07/22 14:01:24 INFO
Now we are storing data directly from Kafka to Parquet.
We are currently using Camus and wanted to know how you went about storing
to Parquet?
The Spark logger names are based on the actual class names. So if you
want to filter out a package's logs, you need to specify the full
package name (e.g. org.apache.spark.storage instead of just
spark.storage).
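For reference, a minimal log4j.properties sketch along those lines (the appender setup follows the spirit of Spark's log4j template; the override below uses the full class name of the noisy TaskSetManager messages shown above and is only an example):
# console appender
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# raise the threshold for the noisy task-start messages
log4j.logger.org.apache.spark.scheduler.TaskSetManager=WARN
As mentioned earlier in the thread, the file can be shipped to the cluster with spark-submit's --files argument.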
On Tue, Jul 22, 2014 at 2:07 PM, abhiguruvayya
sharath.abhis...@gmail.com wrote:
Hi guys,
I'm able to run some Spark SQL examples, but the SQL is static in the code. I
would like to know whether there is a way to read the SQL from somewhere else
(the shell, for example).
I could read the SQL statement from Kafka/ZooKeeper, but I cannot share the SQL
with all the workers. broadcast seems not to work for
Do you mean that the text of the SQL queries is hardcoded in the
code? What do you mean by "cannot share the sql to all workers"?
On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com hsy...@gmail.com wrote:
Hi guys,
I'm able to run some Spark SQL example but the sql is static in the code. I
Hi Folks,
I have been trying to dig up some information about the
possibilities when deploying more than one client process that
consumes Spark.
Let's say I have a Spark cluster of 10 servers and would like to set up 2
additional servers which send requests to it
Sorry, typo. What I meant is sharing. If the SQL is changing at runtime, how
do I broadcast the SQL to all the workers that are doing the SQL analysis?
Best,
Siyuan
On Tue, Jul 22, 2014 at 4:15 PM, Zongheng Yang zonghen...@gmail.com wrote:
Do you mean that the texts of the SQL queries being hardcoded
Can you paste a small code example to illustrate your questions?
On Tue, Jul 22, 2014 at 5:05 PM, hsy...@gmail.com hsy...@gmail.com wrote:
Sorry, typo. What I mean is sharing. If the sql is changing at runtime, how
do I broadcast the sql to all workers that is doing sql analysis.
Best,
On Tue, Jul 22, 2014 at 7:08 AM, Yifan LI iamyifa...@gmail.com wrote:
1) what is the difference between Duration(Stages - Completed Stages)
and Task Time(Executors) ?
Stages are composed of tasks that run on executors. Tasks within a stage
may run concurrently, since there are multiple
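As a rough, purely hypothetical illustration: if a stage's wall-clock Duration is 43 s and on average about 8 tasks are running concurrently across the executors, the summed Task Time would be roughly 43 s × 8 ≈ 344 s ≈ 5.7 min, which is how a 43 s stage can show several minutes of task time.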
Thanks Chen
Hi,
I am trying to create a Spark cluster using the spark-ec2 file under the
spark-1.0.1 directory.
1) I noticed that it is always creating Hadoop version 1.0.4. Is there a way
I can override that? I would like to have Hadoop 2.0.2.
2) I also want to install Oozie along with it. Are there any scripts available
along
For example, this is what I tested, and it works in local mode. What it does is
get both the data and the SQL query from Kafka, run the SQL on each RDD, and
output the result back to Kafka again.
I defined a var called sqlS. In the streaming part, as you can see, I
change the SQL statement if it consumes a sql
Hi,
as far as I know, after the Streaming Context has started, the processing
pipeline (e.g., filter.map.join.filter) cannot be changed. As your SQL
statement is transformed into RDD operations when the Streaming Context
starts, I think there is no way to change the statement that is executed on
I have a sample application pumping out records at 1 per second. The batch
interval is set to 5 seconds. Here’s a list of “observed window intervals” vs.
what was actually set:
window=25, slide=25 : observed-window=25, overlapped-batches=0
window=25, slide=20 : observed-window=20,
But how do they do the interactive SQL in the demo?
https://www.youtube.com/watch?v=dJQ5lV5Tldw
And if it can work in local mode, I think it should be able to work in
cluster mode, correct?
On Tue, Jul 22, 2014 at 5:58 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
as far as I know,
I downloaded Spark 1.0.1, but I cannot find the PowerGraph abstraction
mentioned in the GraphX paper.
What I can find is the Pregel abstraction.
I will take a look at it tomorrow!
TD
On Tue, Jul 22, 2014 at 9:30 AM, Chris Fregly ch...@fregly.com wrote:
i took this over from parviz.
i recently submitted a new PR for Kinesis Spark Streaming support:
https://github.com/apache/spark/pull/1434
others have tested it with good success,
Can you give an idea of the streaming program? What are the rest of the
transformations you are doing on the input streams?
On Tue, Jul 22, 2014 at 11:05 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am currently running a Spark Streaming program, which consumes data from
Kafka and does the
It is caused by a bug in the Spark REPL. I still do not know which part of the
REPL code causes it... I think the people working on the REPL may have a better
idea. Regarding how I found it: based on the exception, it seems we pulled in
some irrelevant stuff, and that import was pretty suspicious.
Thanks,
Yin
On
It could be related to this bug that is currently open.
https://issues.apache.org/jira/browse/SPARK-1312
Here is a workaround. Can you put an inputStream.foreachRDD(rdd => { }) and
try these combos again?
TD
On Tue, Jul 22, 2014 at 6:01 PM, Alan Ngai a...@opsclarity.com wrote:
I have a sample
This is because of the RDD's lazy evaluation! Unless you force a
transformed (mapped/filtered/etc.) RDD to give you back some data (like
RDD.count) or output the data (like RDD.saveAsTextFile()), Spark will not
do anything.
So after the eventData.map(...), if you do take(10) and then print the
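A sketch of forcing that evaluation (Scala API; eventData here stands for the mapped DStream from the question):
eventData.foreachRDD { rdd =>
  // take() is an action, so the preceding map actually runs
  rdd.take(10).foreach(println)
}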
Hello Andrew,
Thank you very much for your great tips. Your solution worked perfectly.
In fact, I was not aware that the right option for local mode is
--driver-memory 1g
Cheers,
Rindra
On Mon, Jul 21, 2014 at 11:23 AM, Andrew Or-2 [via Apache Spark User List]
As far as I understand, even if I could register the custom source, there is
no way to have a cluster-wide variable to pass to it; i.e., the accumulator
can be modified by tasks but only read by the driver, and the broadcast
value is constant.
So it seems this custom metrics/sinks functionality
Hi Xiangrui,
By using your treeAggregate and broadcast patches, the evaluation was
processed successfully.
I hope these patches are merged in the next major release
(v1.1?). Without them, it would be hard to use MLlib for a large dataset.
Thanks,
Makoto
(2014/07/16 15:05),