Hi, I am new to Spark. Is it possible to capture live packets from a
network interface through Spark Streaming? Is there a library or any built-in
classes to bind to the network interface directly?
Hi All,
Is it possible to map and filter a JavaRDD in a single operation?
Thanks
It happens in a single operation itself. You may write them separately, but
the stages are performed together where possible. You will see only one
task in the output of your application.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
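To illustrate the point above, a minimal sketch (the RDD and functions are hypothetical): map and filter are narrow transformations, so Spark pipelines them into a single stage, and each partition is processed in one pass.

    val rdd = sc.parallelize(1 to 100)
    // map and filter are pipelined into one stage: each partition is
    // transformed and filtered in a single pass over the data
    val result = rdd.map(_ * 2).filter(_ > 10)
    result.count() // the web UI shows one stage covering both transforms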
Hi Akhil,
The IP is correct and it is able to start the workers when we start it as a java
command. It's becoming 192.168.125.174:0 when we call it from the scripts.
Thanks & Regards,
Meethu M
On Friday, 27 June 2014 1:49 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
why is it binding to
Hi,
I have a scenario where I have a class X with a constructor parameter of
(RDD, Double). When I initialize the class object with the corresponding
RDD and double value (of name, say, x1) and put it as a vertex attribute
in the graph, I am losing my RDD value. The Double value remains
Thanks for fixing this bug!
The fix version is marked as 1.1.0 (SPARK-1552,
https://issues.apache.org/jira/browse/SPARK-1552). I have tested my code
snippet with Spark 1.0.0 (Scala 2.10.4) and it works. I don't know if it's
important to mention it.
Pierre-Alexandre
Thanks Mayur for the clarification.
I have not used this, only watched a presentation of it at Spark Summit 2013.
https://github.com/radlab/sparrow
https://spark-summit.org/talk/ousterhout-next-generation-spark-scheduling-with-sparrow/
This is pure conjecture from your high scheduling latency and the size of your
cluster, but it seems one way
Hello Sebastien,
it is not working with the 1.0 branch either.
I decided to compile Spark from source precisely because of the
[SPARK-2204] fix, since before that I couldn't get fine-grained mode working
at all.
Now it works fine if the cluster is composed only of Ubuntu 14.04 nodes,
and when I
My experience is that gaining 20 spot instances accounts for a tiny
fraction of the total time of provisioning a cluster with spark-ec2. This
is not (solely) an AWS issue.
--
Martin Goodson | VP Data Science
(0)20 3397 1240
On Thu, Jun 26, 2014 at 10:14 PM, Nicholas
On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia buendia...@gmail.com wrote:
Summingbird is for map/reduce. Dataflow is the third generation of Google's
map/reduce, and it generalizes map/reduce the way Spark does. See more about
this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
Yes, my point
... and to be clear on the point, Summingbird is not limited to MapReduce.
It abstracts over Scalding (which abstracts over Cascading, which is being
moved from MR to Spark) and over Storm for event processing.
On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen so...@cloudera.com wrote:
On Thu, Jun
Hello!
I have just started trying out Spark to see if it fits my needs, but I
am running into some issues when trying to port the
CassandraCQLTest.scala example into Java. The specific errors etc.
that I encounter can be seen here:
I put the settings as you specified in spark-env.sh for the master. When
I run start-all.sh, the web UI shows both the worker on the master
(machine1) and the slave worker (machine2) as ALIVE and ready, with the
master URL at spark://192.168.1.101. However, when I run spark-submit,
it
No joy, unfortunately. Same issue; see my previous email--still crashes
with address already in use.
On 6/27/14, 1:54 AM, sujeetv wrote:
Try to explicitly set the spark.driver.host property to the master's
IP.
Sujeet
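For example, in code (a sketch; the IP below is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("myProg")
      // bind the driver explicitly to the master's address
      .set("spark.driver.host", "192.168.1.101")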
Sorry, the master Spark URL in the web UI is spark://192.168.1.101:5060,
exactly as configured.
On 6/27/14, 9:07 AM, Shannon Quinn wrote:
I put the settings as you specified in spark-env.sh for the master.
When I run start-all.sh, the web UI shows both the worker on the
master (machine1) and
Hi all,
I can start a spark streaming app in Client mode on a Pseudo-standalone
cluster on my local machine.
However, when I try to start it in Cluster mode, it always gets the
following exception on the driver:
Exception in thread "main" akka.ConfigurationException: Could not
start logger due to
Another question: in the foreachRDD I will initialize the JobConf, but at
that point how can I get information from the items?
I have an identifier in the data which identifies the required ES index, so
how can I set a dynamic index in the foreachRDD?
b0c1
I too felt the same, Nick, but I don't have root privileges on the cluster,
unfortunately. Are there any alternatives?
On 27 June 2014 08:04, Nick Pentreath nick.pentre...@gmail.com wrote:
I've not tried this - but numpy is a tricky and complex package with many
dependencies on Fortran/C
Would deploying virtualenv on each directory on the cluster be viable?
The dependencies would get tricky but I think this is the sort of
situation it's built for.
On 6/27/14, 11:06 AM, Avishek Saha wrote:
I too felt the same Nick but I don't have root privileges on the
cluster, unfortunately.
I suppose along those lines, there's also Anaconda:
https://store.continuum.io/cshop/anaconda/
On 6/27/14, 11:13 AM, Nick Pentreath wrote:
Hadoopy uses http://www.pyinstaller.org/ to package things up into an
executable that should be runnable without root privileges. It says it
supports numpy
I got an answer on SO to this question, basically confirming that the
CQLSSTableWriter cannot be used in Spark (at least in the form shown in the
code snippet). DataStax filed a bug on that and it might get solved in a
future version.
As you have observed, a single writer can only be used in serial
Hi Rohit,
Thanks for your message. We are currently on Spark 0.9.1, Cassandra 2.0.6
and Calliope GA (we would love to try the pre-release version if you want
beta testers :-)). Our Hadoop version is CDH4.4 and of course our Spark
assembly is compiled against it.
We have got really interesting
So far Spark Job Server does not work with Spark 1.0:
https://github.com/ooyala/spark-jobserver
So this works only with Spark 0.9 currently:
http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/
Romain
On Tue, Jun 24, 2014 at 9:04 AM,
If for some reason it would be easier to do your mapping and filtering in a
single function, you can also use RDD.flatMap (returning an empty sequence
is equivalent to a filter). But unless you have good reason you should have
a separate map and filter transform, as Mayur said.
On Fri, Jun 27,
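A minimal sketch of that flatMap trick (the predicate and mapping here are made up):

    // keep only even numbers, doubled, in a single transformation:
    // an empty Seq drops the element (the filter), a one-element
    // Seq keeps the mapped value (the map)
    val mappedAndFiltered = rdd.flatMap { x =>
      if (x % 2 == 0) Seq(x * 2) else Seq.empty[Int]
    }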
Hi,
I tried to develop some code to use Logistic Regression, following the code
in BinaryClassification.scala in examples/mllib. My code compiles, but at
runtime it complains that the scopt/OptionParser class cannot be found. I have
the following import statement in my code:
import scopt.OptionParser
Hello Mayur,
Are you using the SparkListener interface Java API? I tried using it but was
unsuccessful, so I need a few more inputs.
For some reason, commenting out spark.driver.host and spark.driver.port
fixed something...and broke something else (or at least revealed another
problem). For reference, the only lines I have in my spark-defaults.conf
now:
spark.app.name myProg
spark.master
Looks like your driver is not able to connect to the remote executor on
machine2/130.49.226.148:60949. Can you check if the master machine can
route to 130.49.226.148?
Sujeet
On Fri, Jun 27, 2014 at 12:04 PM, Shannon Quinn squ...@gatech.edu wrote:
For some reason, commenting out
Apologies; can you advise as to how I would check that? I can certainly
SSH from master to machine2.
On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote:
Looks like your driver is not able to connect to the remote executor
on machine2/130.49.226.148:60949. Can
you check
Hi Siyuan,
Can you try this solution?
http://stackoverflow.com/questions/21943353/akka-2-3-0-fails-to-load-slf4jeventhandler-class-with-java-lang-classnotfounde
Best
Date: Fri, 27 Jun 2014 14:18:59 -0400
Subject: problem when start spark streaming in cluster mode
From: hsy...@gmail.com
To:
Hi:
I am using Spark to stream data to Cassandra and it works fine in local mode.
But when I execute the application in a standalone clustered environment I get
the exception included below (java.lang.NoClassDefFoundError:
org/codehaus/jackson/annotate/JsonClass).
I think this is due to the
I give up; communication must be blocked by the complex EC2 network topology
(though the error information indeed needs some improvement). It doesn't make
sense to run a client thousands of miles away to communicate frequently with
workers. I have moved everything to EC2 now.
Hi Kyle,
A few questions:
1) Did you use `setIntercept(true)`?
2) How many features?
I'm a little worried about the driver's load because the final aggregation
and weights update happen on the driver. Did you check the driver's memory
usage as well?
Best,
Xiangrui
On Fri, Jun 27, 2014 at 8:10 AM,
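For reference, a sketch of what using setIntercept(true) looks like with the non-static API (the training RDD here is hypothetical):

    import org.apache.spark.mllib.classification.SVMWithSGD

    val svm = new SVMWithSGD()
    // the static SVMWithSGD.train(...) helpers do not enable the intercept
    svm.setIntercept(true)
    val model = svm.run(trainingData) // trainingData: RDD[LabeledPoint]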
Try using --executor-memory 12g with spark-submit. Or you can set it
in conf/spark-defaults.conf and rsync it to all workers and then
restart. -Xiangrui
On Fri, Jun 27, 2014 at 1:05 PM, Peng Cheng pc...@uow.edu.au wrote:
I give up, communication must be blocked by the complex EC2 network
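For example (the application class and jar are placeholders):

    spark-submit --executor-memory 12g --class com.example.MyApp myapp.jar

    # or equivalently, one line in conf/spark-defaults.conf:
    # spark.executor.memory 12g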
This will be handy for demos and quick prototyping, as the command-line REPL
doesn't support a lot of editor features; also, you don't need to ssh into
your worker/master if your client is behind a NAT wall. Since the Spark
codebase has a minimalistic design philosophy, I don't think this component
can
Hey Haoming,
Actually akka.loggers has already been set to
akka.event.slf4j.Slf4jLogger. You can check
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala
Regards,
SY
On Fri, Jun 27, 2014 at 3:55 PM, Haoming Zhang haoming.zh...@outlook.com
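In Akka config terms, the setting referred to looks roughly like this:

    akka.loggers = ["akka.event.slf4j.Slf4jLogger"]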
Hi,
I am running a spark streaming job with 1 minute as the batch size. It ran
around 84 minutes and was killed because of the exception with the
following information:
java.lang.Exception: Could not compute split, block input-0-1403893740400
not found
Before it was killed, it was able to
Ok, I found dynamic resources, but I have a frustrating problem. This is the
flow:
kafka -> enrich X -> enrich Y -> enrich Z -> foreachRDD -> save
My problem is: if I do this it doesn't work, the enrich functions are not
called, but if I put in a print it does. For example if I do this:
kafka -> enrich X ->
So a few quick questions:
1) What cluster are you running this against? Is it just local? Have you
tried local[4]?
2) When you say breakpoint, how are you setting this break point? There is
a good chance your breakpoint mechanism doesn't work in a distributed
environment, could you instead cause
b0c1,
could you post your code? I am interested in your solution.
Thanks
Adrian
From: boci [mailto:boci.b...@gmail.com]
Sent: June-26-14 6:17 PM
To: user@spark.apache.org
Subject: Re:
This is a simple ScalaTest. I start a SparkConf, set the master to local
(set the serializer etc.), pull up the Kafka and ES connections, send a message
to Kafka and wait 30 sec for processing.
It runs in IDEA, no magic tricks.
b0c1
I switched which machine was the master and which was the dedicated
worker, and now it works just fine. I discovered machine2 is on my
department's DMZ; machine1 is not. I suspect the departmental firewall
was causing problems. By moving the master to machine2, that seems to
have solved my
Try setting the master to local[4]
On Fri, Jun 27, 2014 at 2:17 PM, boci boci.b...@gmail.com wrote:
This is a simple ScalaTest. I start a SparkConf, set the master to local
(set the serializer etc.), pull up the Kafka and ES connections, send a message
to Kafka and wait 30 sec for processing.
It's
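A sketch of that suggestion (the app name is arbitrary). With plain local the app gets a single core, which a streaming receiver such as the Kafka receiver can occupy entirely, leaving nothing to process the received data; local[4] avoids that:

    import org.apache.spark.SparkConf

    // local[4] = run locally with 4 worker threads, so the Kafka
    // receiver and the processing tasks can run concurrently
    val conf = new SparkConf().setMaster("local[4]").setAppName("kafka-es-test")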
1) I'm using the static SVMWithSGD.train, with no options.
2) I have about 20,000 features (~5,000 samples) that are being attached and
trained against 14,000 different sets of labels (i.e. I'll be doing 14,000
different training runs against the same sets of features, trying to figure
out which
Dean: Some interesting information... Do you know where I can read more about
these coming changes to Scalding/Cascading?
On Jun 27, 2014, at 9:40 AM, Dean Wampler deanwamp...@gmail.com wrote:
... and to be clear on the point, Summingbird is not limited to MapReduce. It
abstracts over
Sorry. Never mind... I guess that's what Summingbird is all about. Never
heard of it.
On Jun 27, 2014, at 7:10 PM, Marco Shaw marco.s...@gmail.com wrote:
Dean: Some interesting information... Do you know where I can read more about
these coming changes to Scalding/Cascading?
On Jun
DataFlow is based on two papers: MillWheel for stream processing and
FlumeJava for programming optimization and abstraction.
Millwheel http://research.google.com/pubs/pub41378.html
FlumeJava http://dl.acm.org/citation.cfm?id=1806638
Here is my blog entry on this
Hello,
I have installed Spark on top of Hadoop + YARN.
When I launch the pyspark shell and try to compute something I get this error:
Error from python worker:
/usr/bin/python: No module named pyspark
The pyspark module should be there, do I have to put an external link to it?
--Sanghamitra.
Hi All:
I was wondering if anybody had bought a ticket for the upcoming Spark
Summit 2014 this coming week and had changed their mind about going.
Let me know, since it has sold out and I can't buy a ticket anymore, I
would be interested in buying it.
Best,
--
Cesar Arevalo
Software Engineer ❘
That would be really cool with IPython, but I'm still wondering if all
language features are supported; namely, I need these 2 in particular:
1. importing classes and ILoop from external jars (so I can point it to
SparkILoop or the Sparkbinding ILoop of Apache Mahout instead of Scala's
default ILoop)
2.
Hi,
According to the research paper below by Matei Zaharia, Spark's creator,
http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf
he says on page 10 that:
Grep is network-bound due to the cost to replicate the input data to
multiple nodes.
So, I guess it can be a good
A simple throughput test is also repartition()ing a large RDD. This also
stresses the disks, though, so you might try to mount your spark temporary
directory as a ramfs.
On Fri, Jun 27, 2014 at 5:57 PM, danilopds danilob...@gmail.com wrote:
Hi,
According to the research paper below by
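A sketch of such a test (size and partition count are arbitrary):

    // repartition() forces a full shuffle, pushing every record across
    // the network and through the shuffle files on local disk
    val big = sc.parallelize(1L to 100000000L, 200)
    big.repartition(200).count()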
I know this is a very trivial question to ask, but I'm a complete newbie to
this stuff, so I don't have any clue about it. Any help is much appreciated.
For example, if I have a class like below, when I run it through the
command line I want to see progress status, something like:
10%
The present trunk is built and tested against HBase 0.94.
I have tried various combinations of versions of HBase 0.96+ and Spark 1.0+
and all end up with
14/06/27 20:11:15 INFO HttpServer: Starting HTTP Server
[error] (run-main-0) java.lang.SecurityException: class
Hi There,
There is an issue with PySpark-on-YARN that requires users to build with
Java 6. The issue has to do with how Java 6 and 7 package jar files
differently.
Can you try building Spark with Java 6 and trying again?
- Patrick
On Fri, Jun 27, 2014 at 5:00 PM, sdeb sangha...@gmail.com wrote:
Hi,
I have a number of questions about using the Kafka receiver of Spark
Streaming. Maybe someone has some more experience with that and can
help me out.
I have set up an environment for getting to know Spark, consisting of
- a Mesos cluster with 3 slave-only nodes and 3 master-and-slave nodes,
- 2 Kafka nodes,