I understand that GraphX is built on immutable RDDs.
I'm working on an algorithm that requires a mutable graph. Initially, the
graph starts with just a few nodes and edges, and over time it adds more
and more nodes and edges.
What would be the best way to implement this growing graph with GraphX?
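Since GraphX graphs are immutable, one common pattern is to "grow" the graph
by building a new Graph from the old RDDs unioned with the additions. A
minimal sketch (the function name and the Int attribute types are
illustrative assumptions, not from this thread):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Rebuild the graph with extra vertices and edges appended.
def grow(g: Graph[Int, Int],
         newVerts: RDD[(VertexId, Int)],
         newEdges: RDD[Edge[Int]]): Graph[Int, Int] =
  Graph(g.vertices.union(newVerts), g.edges.union(newEdges))

Note that repeated unions keep lengthening the RDD lineage, so periodic
checkpointing of the vertex and edge RDDs may be worth considering.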
Hi, all.
I'm in an unusual situation.
The code,
...
1: val cell = dataSet.flatMap(parse(_)).cache
2: val distinctCell = cell.keyBy(_._1).reduceByKey(removeDuplication(_, _)).mapValues(_._3).cache
3: val groupedCellByLine = distinctCell.map(cellToIterableColumn).groupByKey.cache
4: val result = (1
It really depends on the type of computation. For example, if
vertices and edges are associated with properties and you want to
operate on (vertex-edge-vertex) triplets or use the Pregel API, GraphX
is the way to go. -Xiangrui
On Sat, Oct 4, 2014 at 9:39 PM, ll wrote:
> Hi. I am working on a
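For context, the canonical Pregel example from the GraphX programming guide
is single-source shortest paths. A sketch, assuming an existing graph with
Double edge weights and a placeholder sourceId:

import org.apache.spark.graphx._

val sourceId: VertexId = 1L  // placeholder source vertex
// Distance 0 at the source, infinity everywhere else.
val init = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = init.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),        // vertex program
  triplet =>                                             // send relaxed distances
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b)                               // merge incoming messages
)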
Arko,
On Sat, Oct 4, 2014 at 1:40 AM, Arko Provo Mukherjee <
arkoprovomukher...@gmail.com> wrote:
>
> Apologies if this is a stupid question, but I am trying to understand
> why this can or cannot be done. As far as I understand, streaming
> algorithms need to be different from batch algorithms
Hi,
On Sat, Oct 4, 2014 at 7:32 AM, Andy Davidson wrote:
> Any idea why my email was returned with the following error message?
>
Well, it was classified as spam ;-)
I find it rather interesting, though, that a spam filter tells you exactly
why you were rejected; I haven't seen that before.
>
You may also be writing your algorithm in a way that requires high peak
memory usage. An example of this could be using .groupByKey() where
.reduceByKey() might suffice instead. Maybe you can express the algorithm
in a different way that's more efficient?
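A minimal sketch of that case, with made-up data (assumes an existing
SparkContext named sc): both lines compute per-key sums, but reduceByKey
combines values map-side before the shuffle, while groupByKey must buffer
every value for a key in memory first.

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val low  = pairs.reduceByKey(_ + _)              // combines map-side; low peak memory
val high = pairs.groupByKey().mapValues(_.sum)   // buffers all values per key first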
On Thu, Oct 2, 2014 at 4:30 AM, Sean
Hi Mingyu,
Maybe we should be limiting our heaps to 32GB max and running multiple
workers per machine to avoid GC problems with very large heaps.
For a machine with 128GB of memory and 32 cores, this could look like:
SPARK_WORKER_INSTANCES=4
SPARK_WORKER_MEMORY=32g
SPARK_WORKER_CORES=8
Are people running with large (32GB+) e
Hi Michael,
I couldn't find anything in Jira for it --
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22window%22%20AND%20component%20%3D%20Streaming
Could you or Adrian please file a Jira ticket explaining the functionality
and maybe a proposed API? This wi
You're putting those into spark-env.sh? Try setting LD_LIBRARY_PATH as
well; that might help.
Also, where is the exception coming from? You have to set this properly for
both the cluster and the driver, which are configured independently.
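For reference, a sketch of what that could look like (the library path is a
placeholder):

# in conf/spark-env.sh on every worker node and on the driver machine
export LD_LIBRARY_PATH=/opt/native/libs:$LD_LIBRARY_PATH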
Cheers!
Andrew
On Sun, Oct 5, 2014 at 1:06 PM, Tom wrote:
> H
We are currently using Camus for our Kafka-to-HDFS pipeline, storing data as
SequenceFiles, but I understand Spark Streaming can be used to save as
Parquet. From what I've read about Parquet, the layout is optimized for
queries against large files. Are there any options in Spark to specify the
block size to help
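For what it's worth, a sketch under the assumption that Spark picks up
parquet-mr's parquet.block.size key from the Hadoop configuration
(schemaRdd is a placeholder SchemaRDD from the Spark 1.1 SQL API):

// Set the Parquet row-group ("block") size before writing.
sc.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024) // 256MB
schemaRdd.saveAsParquetFile("hdfs:///data/out.parquet")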
Hi,
I am trying to call some C code; let's say the compiled file is /path/code,
and it has chmod +x. When I call it directly, it works. Now I want to call
it from Spark 1.1. My problem is not building it into Spark, but making sure
Spark can find it.
I have tried:
SPARK_DAEMON_JAVA_OPTS="-Djava.l
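One way to sidestep the lookup problem entirely is to ship the binary with
the job and resolve it on the executors. A sketch, not from the thread;
names are illustrative and input is assumed to be an RDD[String]:

import org.apache.spark.SparkFiles
import scala.sys.process._

sc.addFile("/path/code")                    // distribute the binary to every node
val results = input.mapPartitions { iter =>
  val bin = SparkFiles.get("code")          // executor-local copy of the file
  Seq("chmod", "+x", bin).!                 // addFile may not keep the exec bit
  iter.map(arg => Seq(bin, arg).!!.trim)    // assumes the binary takes one argument
}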
Hi Shafaq,
Sorry for the delay; I've created an example project to show how to declare
the dependencies for the plugin:
https://github.com/felixgborrego/sbt-spark-ec2-plugin/tree/master/example-spark-ec2.
You are right, there was an issue with the resolver for the ssh lib. I've
also updated the plug
On Sat, Oct 4, 2014 at 5:28 PM, Abraham Jacob wrote:
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
Good. There is also an org.apache.hadoop.mapred.TextO
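To make the distinction concrete, a minimal sketch (counts is a placeholder
pair RDD): the new-API class pairs with saveAsNewAPIHadoopFile, while the
old org.apache.hadoop.mapred one pairs with saveAsHadoopFile.

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Write (Text, IntWritable) pairs using the new-API output format.
counts.saveAsNewAPIHadoopFile("hdfs:///out/counts",
  classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]])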
Have you checked reactive monitoring (https://github.com/eigengo/monitor) or
Kamon monitoring (https://github.com/kamon-io/Kamon)?
Instrumenting requires absolutely no code changes; all you do is weaving.
In our environment we use Graphite to collect the statsd (you can also get
dtrace) events and display them