Spark Streaming to capture packets from interface

2014-06-27 Thread swezzz
Hi, I am new to Spark. Is it possible to capture live packets from a network interface through Spark Streaming? Is there a library or any built-in classes to bind to the network interface directly?

Map with filter on JavaRdd

2014-06-27 Thread ajay garg
Hi All, Is it possible to map and filter a JavaRDD in a single operation? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Map-with-filter-on-JavaRdd-tp8401.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Map with filter on JavaRdd

2014-06-27 Thread Mayur Rustagi
It happens in a single operation itself. You may write them separately, but the stages are performed together when possible. You will see only one task in the output of your application. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
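Spark fuses a map followed by a filter into a single stage, so each partition is traversed once. As a rough analogue outside of Spark (a sketch using plain Java streams, not Spark's API; class and method names here are made up for illustration), lazy intermediate operations are likewise fused into one pass, which the interleaved trace below makes visible:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PipelineDemo {
    // Records the order in which the map and filter steps run.
    static List<String> trace() {
        List<String> trace = new ArrayList<>();
        List<Integer> result = Stream.of(1, 2, 3)
                .map(x -> { trace.add("map:" + x); return x * 2; })
                .filter(x -> { trace.add("filter:" + x); return x > 2; })
                .collect(Collectors.toList());
        trace.add("result:" + result);
        return trace;
    }

    public static void main(String[] args) {
        // Steps interleave (map:1, filter:2, map:2, ...): one fused pass,
        // not a full map pass followed by a full filter pass.
        trace().forEach(System.out::println);
    }
}
```

The same intuition applies to the RDD case: writing `.map(...)` and `.filter(...)` as two calls does not cost two passes over the data.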

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-06-27 Thread MEETHU MATHEW
Hi Akhil, The IP is correct and it is able to start the workers when we start it as a java command. It becomes 192.168.125.174:0 when we call it from the scripts. Thanks Regards, Meethu M On Friday, 27 June 2014 1:49 PM, Akhil Das ak...@sigmoidanalytics.com wrote: why is it binding to

Issue in using classes with constructor as vertex attribute in graphx

2014-06-27 Thread harsh2005_7
Hi, I have a scenario where I have a class X with constructor parameters (RDD, Double). When I initialize the class object with the corresponding RDD and double value (of name, say, x1) and put it as a vertex attribute in a graph, I am losing my RDD value. The Double value remains

Re: [GraphX] Cast error when comparing a vertex attribute after its type has changed

2014-06-27 Thread Pierre-Alexandre Fonta
Thanks for having corrected this bug! The fix version is marked as 1.1.0 ( SPARK-1552 https://issues.apache.org/jira/browse/SPARK-1552 ). I have tested my code snippet with Spark 1.0.0 (Scala 2.10.4) and it works. I don't know if it's important to mention it. Pierre-Alexandre

Re: Map with filter on JavaRdd

2014-06-27 Thread ajay garg
Thanks Mayur for the clarification.

Re: Improving Spark multithreaded performance?

2014-06-27 Thread anoldbrain
I have not used this, only watched a presentation of it at Spark Summit 2013. https://github.com/radlab/sparrow https://spark-summit.org/talk/ousterhout-next-generation-spark-scheduling-with-sparrow/ Pure conjecture from your high scheduling latency and the size of your cluster, it seems one way

Re: Fine-grained mesos execution hangs on Debian 7.4

2014-06-27 Thread Fedechicco
Hello Sebastien, it is not working with the 1.0 branch either. I decided to compile spark from source precisely because of the [SPARK-2204] fix, because before that I couldn't get fine-grained working at all. Now it works fine if the cluster is only composed of Ubuntu 14.04 nodes, and when I

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Martin Goodson
My experience is that gaining 20 spot instances accounts for a tiny fraction of the total time of provisioning a cluster with spark-ec2. This is not (solely) an AWS issue. -- Martin Goodson | VP Data Science (0)20 3397 1240 [image: Inline image 1] On Thu, Jun 26, 2014 at 10:14 PM, Nicholas

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Sean Owen
On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia buendia...@gmail.com wrote: Summingbird is for map/reduce. Dataflow is the third generation of Google's map/reduce, and it generalizes map/reduce the way Spark does. See more about this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s Yes, my point

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Dean Wampler
... and to be clear on the point, Summingbird is not limited to MapReduce. It abstracts over Scalding (which abstracts over Cascading, which is being moved from MR to Spark) and over Storm for event processing. On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen so...@cloudera.com wrote: On Thu, Jun

How to use .newAPIHadoopRDD() from Java (w/ Cassandra)

2014-06-27 Thread Martin Gammelsæter
Hello! I have just started trying out Spark to see if it fits my needs, but I am running into some issues when trying to port the CassandraCQLTest.scala example into Java. The specific errors etc. that I encounter can be seen here:

Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
I put the settings as you specified in spark-env.sh for the master. When I run start-all.sh, the web UI shows both the worker on the master (machine1) and the slave worker (machine2) as ALIVE and ready, with the master URL at spark://192.168.1.101. However, when I run spark-submit, it

Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
No joy, unfortunately. Same issue; see my previous email--still crashes with "address already in use". On 6/27/14, 1:54 AM, sujeetv wrote: Try to explicitly set the spark.driver.host property to the master's IP. Sujeet

Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
Sorry, master spark URL in the web UI is *spark://192.168.1.101:5060*, exactly as configured. On 6/27/14, 9:07 AM, Shannon Quinn wrote: I put the settings as you specified in spark-env.sh for the master. When I run start-all.sh, the web UI shows both the worker on the master (machine1) and

Spark RDD member of class loses its value when the class is being used as a graph attribute

2014-06-27 Thread harsh2005_7
Hi, I have a scenario where I have a class X with constructor parameters (RDD, Double). When I initialize the class object with the corresponding RDD and double value (of name, say, x1) and *put it as a vertex attribute in a graph*, I am losing my RDD value. The Double value remains

problem when start spark streaming in cluster mode

2014-06-27 Thread Siyuan he
Hi all, I can start a Spark Streaming app in client mode on a pseudo-standalone cluster on my local machine. However, when I try to start it in cluster mode, it always gets the following exception on the driver. Exception in thread main akka.ConfigurationException: Could not start logger due to

Re: ElasticSearch enrich

2014-06-27 Thread boci
Another question. In the foreachRDD I will initialize the JobConf, but at this point how can I get information from the items? I have an identifier in the data which identifies the required ES index (so how can I set a dynamic index in the foreachRDD)? b0c1

Re: numpy + pyspark

2014-06-27 Thread Avishek Saha
I too felt the same, Nick, but I don't have root privileges on the cluster, unfortunately. Are there any alternatives? On 27 June 2014 08:04, Nick Pentreath nick.pentre...@gmail.com wrote: I've not tried this - but numpy is a tricky and complex package with many dependencies on Fortran/C

Re: numpy + pyspark

2014-06-27 Thread Shannon Quinn
Would deploying virtualenv on each directory on the cluster be viable? The dependencies would get tricky but I think this is the sort of situation it's built for. On 6/27/14, 11:06 AM, Avishek Saha wrote: I too felt the same Nick but I don't have root privileges on the cluster, unfortunately.

Re: numpy + pyspark

2014-06-27 Thread Shannon Quinn
I suppose along those lines, there's also Anaconda: https://store.continuum.io/cshop/anaconda/ On 6/27/14, 11:13 AM, Nick Pentreath wrote: Hadoopy uses http://www.pyinstaller.org/ to package things up into an executable that should be runnable without root privileges. It says it supports numpy

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-27 Thread Gerard Maas
I got an answer on SO on this question, basically confirming that CQLSSTableWriter cannot be used in Spark (at least in the form shown in the code snippet). DataStax filed a bug on that and it might get solved in a future version. As you have observed, a single writer can only be used in serial

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-27 Thread Gerard Maas
Hi Rohit, Thanks for your message. We are currently on Spark 0.9.1, Cassandra 2.0.6 and Calliope GA (Would love to try the pre-release version if you want beta testers :-) Our hadoop version is CDH4.4 and of course our spark assembly is compiled against it. We have got really interesting

Re: Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-06-27 Thread Romain Rigaux
So far Spark Job Server does not work with Spark 1.0: https://github.com/ooyala/spark-jobserver So this works only with Spark 0.9 currently: http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ Romain On Tue, Jun 24, 2014 at 9:04 AM,

Re: Map with filter on JavaRdd

2014-06-27 Thread Daniel Siegmann
If for some reason it would be easier to do your mapping and filtering in a single function, you can also use RDD.flatMap (returning an empty sequence is equivalent to a filter). But unless you have a good reason, you should keep separate map and filter transforms, as Mayur said. On Fri, Jun 27,
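To illustrate the flatMap trick outside of Spark (a sketch using plain Java streams rather than JavaRDD; the class, method names, and the even-numbers example are made up for illustration), a flatMap that emits a singleton for kept elements and an empty stream for dropped ones behaves exactly like filter followed by map:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapFilterDemo {
    // Filter-then-map in one flatMap: keep even numbers, multiply them by 10.
    static List<Integer> viaFlatMap(List<Integer> xs) {
        return xs.stream()
                .flatMap(x -> x % 2 == 0 ? Stream.of(x * 10) : Stream.<Integer>empty())
                .collect(Collectors.toList());
    }

    // The same logic as separate filter and map transforms.
    static List<Integer> viaFilterMap(List<Integer> xs) {
        return xs.stream()
                .filter(x -> x % 2 == 0)
                .map(x -> x * 10)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4);
        System.out.println(viaFlatMap(xs));   // [20, 40]
        System.out.println(viaFilterMap(xs)); // [20, 40]
    }
}
```

With Spark's JavaRDD the shape is the same, except flatMap takes a FlatMapFunction returning an Iterable; as noted above, the separate filter and map version is usually the clearer choice.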

scopt.OptionParser

2014-06-27 Thread SK
Hi, I tried to develop some code to use Logistic Regression, following the code in BinaryClassification.scala in examples/mllib. My code compiles, but at runtime it complains that the scopt/OptionParser class cannot be found. I have the following import statement in my code: import scopt.OptionParser
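A common cause of this error is that scopt is on the compile classpath but missing from the runtime classpath. One hedged fix (the artifact coordinates are scopt's usual ones, but the version below is an assumption; use the one your code compiled against) is to declare the dependency in build.sbt and package it into the application jar, e.g. with sbt-assembly:

```
// build.sbt -- version is an assumption; match it to your build
libraryDependencies += "com.github.scopt" %% "scopt" % "3.2.0"
```

Without a fat jar, the scopt jar would instead need to be passed to the cluster explicitly (e.g. via the spark-submit --jars option).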

Re: Spark job tracker.

2014-06-27 Thread abhiguruvayya
Hello Mayur, Are you using the SparkListener interface Java API? I tried using it but was unsuccessful, so I need a few more inputs.

Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
For some reason, commenting out spark.driver.host and spark.driver.port fixed something...and broke something else (or at least revealed another problem). For reference, the only lines I have in my spark-defaults.conf now: spark.app.name myProg spark.master

Re: Spark standalone network configuration problems

2014-06-27 Thread Sujeet Varakhedi
Looks like your driver is not able to connect to the remote executor on machine2/130.49.226.148:60949. Can you check if the master machine can route to 130.49.226.148? Sujeet On Fri, Jun 27, 2014 at 12:04 PM, Shannon Quinn squ...@gatech.edu wrote: For some reason, commenting out

Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
Apologies; can you advise as to how I would check that? I can certainly SSH from master to machine2. On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote: Looks like your driver is not able to connect to the remote executor on machine2/130.49.226.148:60949 http://130.49.226.148:60949/. Can you check

RE: problem when start spark streaming in cluster mode

2014-06-27 Thread Haoming Zhang
Hi Siyuan, Can you try this solution? http://stackoverflow.com/questions/21943353/akka-2-3-0-fails-to-load-slf4jeventhandler-class-with-java-lang-classnotfounde Best Date: Fri, 27 Jun 2014 14:18:59 -0400 Subject: problem when start spark streaming in cluster mode From: hsy...@gmail.com To:
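The linked answer boils down to making sure Akka's SLF4J event handler and an SLF4J binding are actually on the classpath where the driver runs, which in cluster mode means packaging them with the application jar. A hedged sketch of the relevant build.sbt lines (the version numbers are assumptions and must match the Akka version your Spark build uses):

```
// build.sbt -- versions are assumptions; align them with your Spark's Akka
libraryDependencies += "com.typesafe.akka" %% "akka-slf4j" % "2.3.4"
libraryDependencies += "org.slf4j" % "slf4j-log4j12" % "1.7.5"
```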

jackson-core-asl jar (1.8.8 vs 1.9.x) conflict with the spark-sql (version 1.x)

2014-06-27 Thread M Singh
Hi: I am using Spark to stream data to Cassandra and it works fine in local mode, but when I execute the application in a standalone clustered env I get the exception included below (java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass). I think this is due to the

Re: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-27 Thread Peng Cheng
I give up; communication must be blocked by the complex EC2 network topology (though the error information indeed needs some improvement). It doesn't make sense to run a client thousands of miles away to communicate frequently with workers. I have moved everything to EC2 now.

Re: Improving Spark multithreaded performance?

2014-06-27 Thread Xiangrui Meng
Hi Kyle, A few questions: 1) Did you use `setIntercept(true)`? 2) How many features? I'm a little worried about driver's load because the final aggregation and weights update happen on the driver. Did you check driver's memory usage as well? Best, Xiangrui On Fri, Jun 27, 2014 at 8:10 AM,

Re: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-27 Thread Xiangrui Meng
Try to use --executor-memory 12g with spark-submit. Or you can set it in conf/spark-defaults.properties and rsync it to all workers and then restart. -Xiangrui On Fri, Jun 27, 2014 at 1:05 PM, Peng Cheng pc...@uow.edu.au wrote: I give up, communication must be blocked by the complex EC2 network
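Xiangrui's second option sketched as a config fragment (the file name follows his message; the property key is Spark's standard executor-memory setting):

```
# conf/spark-defaults.properties (sketch)
# equivalent to: spark-submit --executor-memory 12g ...
spark.executor.memory   12g
```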

Integrate spark-shell into officially supported web ui/api plug-in? What do you think?

2014-06-27 Thread Peng Cheng
This will be handy for demo and quick prototyping as the command-line REPL doesn't support a lot of editor features; also, you don't need to ssh into your worker/master if your client is behind a NAT wall. Since the Spark codebase has a minimalistic design philosophy I don't think this component can

Re: problem when start spark streaming in cluster mode

2014-06-27 Thread Siyuan he
Hey Haoming, Actually akka.loggers has already been set to akka.event.slf4j.Slf4jLogger. You can check https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala Regards, SY On Fri, Jun 27, 2014 at 3:55 PM, Haoming Zhang haoming.zh...@outlook.com

Could not compute split, block not found

2014-06-27 Thread Bill Jay
Hi, I am running a spark streaming job with 1 minute as the batch size. It ran around 84 minutes and was killed because of the exception with the following information: *java.lang.Exception: Could not compute split, block input-0-1403893740400 not found* Before it was killed, it was able to

Re: ElasticSearch enrich

2014-06-27 Thread boci
Ok, I found dynamic resources, but I have a frustrating problem. This is the flow: kafka - enrich X - enrich Y - enrich Z - foreachRDD - save. My problem is: if I do this, it doesn't work (the enrich functions are not called), but if I put in a print, it does. For example, if I do this: kafka - enrich X -

Re: ElasticSearch enrich

2014-06-27 Thread Holden Karau
So a few quick questions: 1) What cluster are you running this against? Is it just local? Have you tried local[4]? 2) When you say breakpoint, how are you setting this break point? There is a good chance your breakpoint mechanism doesn't work in a distributed environment, could you instead cause

RE: ElasticSearch enrich

2014-06-27 Thread Adrian Mocanu
b0c1, could you post your code? I am interested in your solution. Thanks Adrian From: boci [mailto:boci.b...@gmail.com] Sent: June-26-14 6:17 PM To: user@spark.apache.org Subject: Re:

Re: ElasticSearch enrich

2014-06-27 Thread boci
This is simply a ScalaTest. I start a SparkConf, set the master to local (set the serializer etc.), pull up the Kafka and ES connections, send a message to Kafka and wait 30 sec for processing. It runs in IDEA, no magic trick. b0c1

Re: Spark standalone network configuration problems

2014-06-27 Thread Shannon Quinn
I switched which machine was the master and which was the dedicated worker, and now it works just fine. I discovered machine2 is on my department's DMZ; machine1 is not. I suspect the departmental firewall was causing problems. By moving the master to machine2, that seems to have solved my

Re: ElasticSearch enrich

2014-06-27 Thread Holden Karau
Try setting the master to local[4] On Fri, Jun 27, 2014 at 2:17 PM, boci boci.b...@gmail.com wrote: This is a simply scalatest. I start a SparkConf, set the master to local (set the serializer etc), pull up kafka and es connection send a message to kafka and wait 30sec to processing. It's

Re: Improving Spark multithreaded performance?

2014-06-27 Thread Kyle Ellrott
1) I'm using the static SVMWithSGD.train, with no options. 2) I have about 20,000 features (~5000 samples) that are being attached and trained against 14,000 different sets of labels (ie I'll be doing 14,000 different training runs against the same sets of features trying to figure out which

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Dean: Some interesting information... Do you know where I can read more about these coming changes to Scalding/Cascading? On Jun 27, 2014, at 9:40 AM, Dean Wampler deanwamp...@gmail.com wrote: ... and to be clear on the point, Summingbird is not limited to MapReduce. It abstracts over

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Sorry. Never mind... I guess that's what Summingbird is all about. Never heard of it. On Jun 27, 2014, at 7:10 PM, Marco Shaw marco.s...@gmail.com wrote: Dean: Some interesting information... Do you know where I can read more about these coming changes to Scalding/Cascading? On Jun

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Khanderao Kand
DataFlow is based on two papers: MillWheel for stream processing and FlumeJava for programming optimization and abstraction. Millwheel http://research.google.com/pubs/pub41378.html FlumeJava http://dl.acm.org/citation.cfm?id=1806638 Here is my blog entry on this

hadoop + yarn + spark

2014-06-27 Thread sdeb
Hello, I have installed Spark on top of Hadoop + YARN. When I launch the pyspark shell and try to compute something I get this error: Error from python worker: /usr/bin/python: No module named pyspark. The pyspark module should be there; do I have to put an external link to it? --Sanghamitra.

Anybody changed their mind about going to the Spark Summit 2014

2014-06-27 Thread Cesar Arevalo
Hi All: I was wondering if anybody had bought a ticket for the upcoming Spark Summit 2014 this coming week and had changed their mind about going. Let me know; since it has sold out and I can't buy a ticket anymore, I would be interested in buying it. Best, -- Cesar Arevalo Software Engineer

Re: Integrate spark-shell into officially supported web ui/api plug-in? What do you think?

2014-06-27 Thread Peng Cheng
That would be really cool with IPython, but I'm still wondering if all language features are supported; namely, I need these 2 in particular: 1. importing a class and ILoop from external jars (so I can point it to SparkILoop or the Sparkbinding ILoop of Apache Mahout instead of Scala's default ILoop) 2.

Re: Interconnect benchmarking

2014-06-27 Thread danilopds
Hi, According to the research paper below by Matei Zaharia, Spark's creator: http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf He says on page 10 that: Grep is network-bound due to the cost to replicate the input data to multiple nodes. So, I guess it can be a good

Re: Interconnect benchmarking

2014-06-27 Thread Aaron Davidson
A simple throughput test is also repartition()ing a large RDD. This also stresses the disks, though, so you might try to mount your spark temporary directory as a ramfs. On Fri, Jun 27, 2014 at 5:57 PM, danilopds danilob...@gmail.com wrote: Hi, According with the research paper bellow of

Re: Spark job tracker.

2014-06-27 Thread abhiguruvayya
I know this is a very trivial question to ask but I'm a complete newbie to this stuff, so I don't have any clue on this. Any help is much appreciated. For example, if I have a class like below, and when I run this through the command line I want to see progress status, something like, 10%

HBase 0.96+ with Spark 1.0+

2014-06-27 Thread Stephen Boesch
The present trunk is built and tested against HBase 0.94. I have tried various combinations of versions of HBase 0.96+ and Spark 1.0+ and all end up with 14/06/27 20:11:15 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class

Re: hadoop + yarn + spark

2014-06-27 Thread Patrick Wendell
Hi There, There is an issue with PySpark-on-YARN that requires users to build with Java 6. The issue has to do with how Java 6 and 7 package jar files differently. Can you try building Spark with Java 6 and trying again? - Patrick On Fri, Jun 27, 2014 at 5:00 PM, sdeb sangha...@gmail.com wrote:

Distribute data from Kafka evenly on cluster

2014-06-27 Thread Tobias Pfeiffer
Hi, I have a number of questions using the Kafka receiver of Spark Streaming. Maybe someone has some more experience with that and can help me out. I have set up an environment for getting to know Spark, consisting of - a Mesos cluster with 3 only-slaves and 3 master-and-slaves, - 2 Kafka nodes,