graphx - mutable?

2014-10-05 Thread ll
I understand that GraphX is built on immutable RDDs. I'm working on an algorithm that requires a mutable graph. Initially, the graph starts with just a few nodes and edges; over time, it adds more and more nodes and edges. What would be the best way to implement this growing graph with GraphX?
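
Since the RDDs backing a Graph are immutable, one common pattern (a sketch; the String/Int vertex and edge attribute types below are illustrative) is to build a fresh Graph each time a batch of new nodes and edges arrives, by unioning the new RDDs with the existing ones:

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // assumes an existing SparkContext `sc`
    var graph: Graph[String, Int] = Graph(
      sc.parallelize(Seq((1L, "a"), (2L, "b"))),   // initial vertices
      sc.parallelize(Seq(Edge(1L, 2L, 1))))        // initial edges

    // Build a new (immutable) graph whenever nodes/edges are added.
    def grow(newVertices: RDD[(VertexId, String)],
             newEdges: RDD[Edge[Int]]): Graph[String, Int] = {
      graph = Graph(graph.vertices.union(newVertices),
                    graph.edges.union(newEdges)).cache()
      graph
    }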

Stuck job works well after rdd.count or rdd.collect

2014-10-05 Thread Kevin Jung
Hi, all. I'm in an unusual situation. The code:

    val cell = dataSet.flatMap(parse(_)).cache
    val distinctCell = cell.keyBy(_._1).reduceByKey(removeDuplication(_, _)).mapValues(_._3).cache
    val groupedCellByLine = distinctCell.map(cellToIterableColumn).groupByKey.cache
    val result = (1
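
If the hang is just lazy evaluation, note that .cache only marks an RDD for caching; nothing is computed or stored until an action such as count or collect runs, which would explain why the job moves once one is added. A minimal sketch (the input path and parsing are stand-ins, not the poster's code):

    // assumes an existing SparkContext `sc`
    val dataSet = sc.textFile("hdfs:///some/input")
    val cell = dataSet.flatMap(_.split(",")).cache()  // lazy: only marks the RDD for caching
    cell.count()  // an action is what actually materializes the cached partitions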

Re: mllib sparse vector/matrix vs. graphx graph

2014-10-05 Thread Xiangrui Meng
It really depends on the type of the computation. For example, if vertices and edges are associated with properties and you want to operate on (vertex-edge-vertex) triplets or use the Pregel API, GraphX is the way to go. -Xiangrui On Sat, Oct 4, 2014 at 9:39 PM, ll wrote: > hi. i am working on a
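
For the triplet style of computation Xiangrui mentions, a small sketch (toy Double attributes; real property types would replace them):

    import org.apache.spark.graphx._

    // assumes an existing SparkContext `sc`
    val vertices = sc.parallelize(Seq((1L, 1.0), (2L, 2.0), (3L, 3.0)))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.5)))
    val graph    = Graph(vertices, edges)

    // Operate directly on (source vertex, edge, destination vertex) triplets.
    val scores = graph.triplets.map(t => t.srcAttr * t.attr + t.dstAttr)
    scores.collect().foreach(println)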

Re: Using GraphX with Spark Streaming?

2014-10-05 Thread Tobias Pfeiffer
Arko, On Sat, Oct 4, 2014 at 1:40 AM, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > > Apologies if this is a stupid question but I am trying to understand > why this can or cannot be done. As far as I understand that streaming > algorithms need to be different from batch algorithms
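
One way to reconcile the two models (a sketch; the socket source and "src dst" edge format are placeholders) is to rebuild an immutable GraphX graph from each micro-batch inside foreachRDD:

    import org.apache.spark.graphx._
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // assumes an existing SparkContext `sc`
    val ssc = new StreamingContext(sc, Seconds(10))
    var graph = Graph.fromEdges(sc.parallelize(Seq(Edge(1L, 2L, 1))), defaultValue = 0)

    // Placeholder stream of "src dst" lines; any DStream of edges would do.
    val edgeStream = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(src, dst) = line.split(" ")
      Edge(src.toLong, dst.toLong, 1)
    }

    // Fold each micro-batch into a fresh graph on the driver.
    edgeStream.foreachRDD { rdd =>
      if (rdd.count() > 0) graph = Graph.fromEdges(graph.edges.union(rdd), 0).cache()
    }
    ssc.start()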

Re: problem with user@spark.apache.org spam filter

2014-10-05 Thread Tobias Pfeiffer
Hi, On Sat, Oct 4, 2014 at 7:32 AM, Andy Davidson wrote: > Any idea why my email was returned with the following error message? > Well, it was classified as spam ;-) I find it rather interesting, though, that a spam filter tells you exactly why you were rejected; haven't seen that before. >

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-05 Thread Andrew Ash
You may also be writing your algorithm in a way that it requires high peak memory usage. An example of this could be using .groupByKey() where .reduceByKey() might suffice instead. Maybe you can express the algorithm in a different way that's more efficient? On Thu, Oct 2, 2014 at 4:30 AM, Sean
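
To make the contrast concrete, a word-count sketch (the data is a toy stand-in):

    import org.apache.spark.SparkContext._  // pair RDD functions on Spark 1.x

    // assumes an existing SparkContext `sc`
    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

    // High peak memory: every value for a key is buffered before summing.
    val countsGrouped = pairs.groupByKey().mapValues(_.sum)

    // Lower peak memory: values are combined map-side during the shuffle.
    val countsReduced = pairs.reduceByKey(_ + _)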

Re: Larger heap leads to perf degradation due to GC

2014-10-05 Thread Andrew Ash
Hi Mingyu, Maybe we should be limiting our heaps to 32GB max and running multiple workers per machine to avoid large GC issues. For a machine with 128GB of memory and 32 cores, this could look like: SPARK_WORKER_INSTANCES=4 SPARK_WORKER_MEMORY=32 SPARK_WORKER_CORES=8 Are people running with large (32GB+) e
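
Written out as a spark-env.sh sketch (SPARK_WORKER_MEMORY normally takes a unit suffix, so the "32" above is presumably meant as 32g):

    # conf/spark-env.sh on each worker machine (sketch)
    SPARK_WORKER_INSTANCES=4   # four worker JVMs per machine
    SPARK_WORKER_MEMORY=32g    # heap per worker, kept under the ~32GB compressed-oops limit
    SPARK_WORKER_CORES=8       # 32 cores / 4 workers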

Re: window every n elements instead of time based

2014-10-05 Thread Andrew Ash
Hi Michael, I couldn't find anything in Jira for it -- https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22window%22%20AND%20component%20%3D%20Streaming Could you or Adrian please file a Jira ticket explaining the functionality and maybe a proposed API? This wi

Re: java.library.path

2014-10-05 Thread Andrew Ash
You're putting those into spark-env.sh? Try setting LD_LIBRARY_PATH as well; that might help. Also, where is the exception coming from? You have to set this for both the cluster and the driver, since they are configured independently. Cheers! Andrew On Sun, Oct 5, 2014 at 1:06 PM, Tom wrote: > H
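
A sketch of the two sides in spark-defaults.conf (the /path/to/native directory is a placeholder):

    # driver side
    spark.driver.extraLibraryPath    /path/to/native
    # executor side
    spark.executor.extraLibraryPath  /path/to/native

    # or equivalently via JVM options
    spark.driver.extraJavaOptions    -Djava.library.path=/path/to/native
    spark.executor.extraJavaOptions  -Djava.library.path=/path/to/native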

Kafka->HDFS to store as Parquet format

2014-10-05 Thread bdev
We are currently using Camus for our Kafka-to-HDFS pipeline, storing the data as SequenceFiles, but I understand Spark Streaming can be used to save as Parquet. As I read about Parquet, the layout is optimized for queries against large file sizes. Are there any options in Spark to specify the block size to help
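
On the Spark side, a sketch with the 1.1-era SQL API (the Event case class stands in for whatever comes off Kafka, and the block size value is just an example; parquet.block.size is a Parquet/Hadoop setting, not a Spark one):

    import org.apache.spark.sql.SQLContext

    case class Event(key: String, value: String)

    // assumes an existing SparkContext `sc`
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    // Parquet reads its row-group (block) size from the Hadoop configuration.
    sc.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)

    val events = sc.parallelize(Seq(Event("k", "v")))  // stand-in for a Kafka batch
    events.saveAsParquetFile("hdfs:///data/events.parquet")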

java.library.path

2014-10-05 Thread Tom
Hi, I am trying to call some C code; let's say the compiled file is /path/code, and it has chmod +x. When I call it directly, it works. Now I want to call it from Spark 1.1. My problem is not building it into Spark, but making sure Spark can find it. I have tried: SPARK_DAEMON_JAVA_OPTS="-Djava.l
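
Since /path/code is a standalone executable rather than a JNI library, one alternative worth noting (a sketch, not necessarily what Tom is after) is RDD.pipe, which runs an external program per partition and streams records through its stdin/stdout:

    // assumes an existing SparkContext `sc`, and that /path/code exists on
    // every worker node, is executable, and reads lines from stdin
    val input  = sc.parallelize(Seq("1", "2", "3"))
    val output = input.pipe("/path/code")
    output.collect().foreach(println)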

Re: New sbt plugin to deploy jobs to EC2

2014-10-05 Thread Felix Garcia Borrego
Hi Shafaq, Sorry for the delay. I've created an example project to show how to declare the dependencies to the plugin: https://github.com/felixgborrego/sbt-spark-ec2-plugin/tree/master/example-spark-ec2. You are right, there was an issue with the resolver for the ssh lib. I've also updated the plug

Re: Spark Streaming writing to HDFS

2014-10-05 Thread Sean Owen
On Sat, Oct 4, 2014 at 5:28 PM, Abraham Jacob wrote:
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
Good. There is also an org.apache.hadoop.mapred.TextO
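
The distinction matters because each package pairs with a different save method; a sketch (output paths are placeholders):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.spark.SparkContext._  // pair RDD functions on Spark 1.x

    // assumes an existing SparkContext `sc`
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
      .map { case (k, v) => (new Text(k), new IntWritable(v)) }

    // new API: org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    pairs.saveAsNewAPIHadoopFile[
      org.apache.hadoop.mapreduce.lib.output.TextOutputFormat[Text, IntWritable]](
      "hdfs:///out/new-api")

    // old API: org.apache.hadoop.mapred.TextOutputFormat
    pairs.saveAsHadoopFile[
      org.apache.hadoop.mapred.TextOutputFormat[Text, IntWritable]]("hdfs:///out/old-api")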

Re: Spark Monitoring with Ganglia

2014-10-05 Thread manasdebashiskar
Have you checked reactive monitoring (https://github.com/eigengo/monitor) or Kamon monitoring (https://github.com/kamon-io/Kamon)? Instrumenting needs absolutely no code change. All you do is weaving. In our environment we use Graphite to get the statsd (you can also get dtrace) events and display it
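
For the Graphite route, Spark also ships a metrics sink that needs no instrumentation at all, only a metrics.properties entry (host and port below are placeholders):

    # conf/metrics.properties (sketch)
    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds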