Re: MLlib solving linear regression with sparse inputs

2018-11-05 Thread Robineast
Well I did eventually write this code in Java, and it was very long! See https://github.com/insidedctm/sparse-linear-regression

Re: GraphX subgraph from list of VertexIds

2017-05-12 Thread Robineast
It would be listVertices.contains(vid), wouldn't it?
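In context, a minimal sketch of filtering a graph down to a supplied vertex list (graph and listVertices are assumptions standing in for the poster's values):

    import org.apache.spark.graphx._

    val listVertices: Seq[VertexId] = Seq(1L, 2L, 3L)  // hypothetical
    val sub = graph.subgraph(vpred = (vid, attr) => listVertices.contains(vid))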

Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
From the section on the Pregel API in the GraphX programming guide: '... the Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction *constrained to the topology of the graph*'. Does that answer your question? Did you read the programming guide?

Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
GraphX is not synonymous with Pregel. To quote the GraphX programming guide: 'GraphX exposes a variant of the Pregel API.' There is no compute() function in GraphX - see the Pregel API section of the programming guide.
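To illustrate the variant: instead of overriding a compute() method, you pass three functions to the pregel operator. A minimal sketch of the classic max-value propagation (graph is assumed to be an existing Graph[Int, Int]):

    val maxGraph = graph.pregel(Int.MinValue)(
      (id, attr, msg) => math.max(attr, msg),  // vertex program
      triplet =>                               // send messages along edges
        if (triplet.srcAttr > triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
        else Iterator.empty,
      (a, b) => math.max(a, b)                 // merge incoming messages
    )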

Re: GraphX Pregel API: add vertices and edges

2017-03-23 Thread Robineast
Not that I'm aware of. Where did you read that?

Re: GraphX triplet comparison

2016-12-14 Thread Robineast
You are trying to invoke one RDD action inside another; that won't work. If you want to do what you are attempting, you need to .collect() each triplet RDD to the driver and iterate over that. HOWEVER, you almost certainly don't want to do that if your data are anything other than a trivial size.
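If the data really are trivially small, a sketch of the collect-and-broadcast route (triplet1 and triplet2 stand in for the two triplet RDDs; comparing on the edge attribute is an assumed condition):

    val side1 = sc.broadcast(triplet1.collect())
    val matches = triplet2.filter { t2 =>
      side1.value.exists(t1 => t1.attr == t2.attr)
    }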

Re: GraphX triplet comparison

2016-12-13 Thread Robineast
Not sure what you are asking. What's wrong with:

    triplet1.filter(condition3)
    triplet2.filter(condition3)

Re: Does SparkR or Spark MLlib support nonlinear optimization with nonlinear constraints

2016-11-25 Thread Robineast
I provided an answer to a similar question here: https://www.mail-archive.com/user@spark.apache.org/msg57697.html

Re: GraphX Connected Components

2016-11-08 Thread Robineast
Have you tried this? https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.graphx.GraphLoader$

Re: MLlib solving linear regression with sparse inputs

2016-11-06 Thread Robineast
Here's a way of creating sparse vectors in MLlib:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.rdd.RDD

    val rdd = sc.textFile("A.txt").map(line => line.split(","))
      .map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))
    val pairRdd: RDD[(Int, (Int, Int, Double))] …
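The archived message is truncated there. A sketch of one plausible completion (my assumption: the (row, column, value) entries are keyed and grouped by row, then turned into one sparse vector per row; numCols is a placeholder for the number of columns in A):

    val pairRdd: RDD[(Int, (Int, Int, Double))] = rdd.map(t => (t._1, t))
    val numCols = 100  // hypothetical
    val sparseRows = pairRdd.groupByKey().mapValues { entries =>
      val sorted = entries.toSeq.sortBy(_._2)   // order by column index
      Vectors.sparse(numCols, sorted.map(_._2).toArray, sorted.map(_._3).toArray)
    }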

Re: MLlib solving linear regression with sparse inputs

2016-11-03 Thread Robineast
Any reason why you can't use the built-in linear regression? E.g. http://spark.apache.org/docs/latest/ml-classification-regression.html#regression or http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression
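A minimal sketch following the linked ml docs (training is assumed to be a DataFrame with 'label' and 'features' columns; the features can be sparse vectors):

    import org.apache.spark.ml.regression.LinearRegression

    val lr = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
    val model = lr.fit(training)
    println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")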

Re: Large-scale matrix inverse in Spark

2016-09-29 Thread Robineast
The paper you mention references a Spark-based LU decomposition approach. AFAIK there is no current implementation in Spark, but there is an open JIRA (https://issues.apache.org/jira/browse/SPARK-8514) that covers this - it seems to have gone quiet.

Re: How to modify a collection inside a Spark RDD foreach

2016-06-06 Thread Robineast
It's not that clear what you are trying to achieve - what type is myRDD, and where do k and v come from? Anyway, it seems you want to end up with a map or a dictionary, which is what PairRDD is for, e.g.:

    val rdd = sc.makeRDD(Array("1","2","3"))
    val pairRDD = rdd.map(el => (el.toInt, el))
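If an actual Scala map on the driver is the goal (fine for small data), a follow-up sketch:

    val asMap: scala.collection.Map[Int, String] = pairRDD.collectAsMap()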

Re: Various Apache Spark deployment problems

2016-04-29 Thread Robineast
Do you need two --num-executors settings? Sent from my iPhone

On 29 Apr 2016, at 20:25, Ashish Sharma [via Apache Spark User List] wrote:
> Submit Command1:
>
> spark-submit --class working.path.to.Main \
>   --master yarn \
> …
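For reference, a sketch of the submit command with the flag given once (the executor count 4 is a placeholder; the options elided in the archive are left elided):

    spark-submit --class working.path.to.Main \
      --master yarn \
      --num-executors 4 \
      ...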

Re: How to use GraphX to partition a graph which could assign topologically-close vertices to the same machine?

2016-03-09 Thread Robineast
In GraphX, partitioning relates to edges, not to vertices - vertices are partitioned according to however the RDD that was used to create the graph was partitioned.
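For example, a sketch of repartitioning the edges with one of the built-in strategies (graph is assumed to be an existing Graph[VD, ED]):

    import org.apache.spark.graphx._

    val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)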

Re: Spark standalone peer2peer network

2016-02-23 Thread Robineast
Hi Thomas, I can confirm that I have had this working in the past. I'm pretty sure you don't need password-less SSH for running a standalone cluster manually. Try the instructions at http://spark.apache.org/docs/latest/spark-standalone.html under 'Starting a Cluster Manually'. Do you get the…
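For reference, the manual start amounts to something like this sketch from the linked page (<master-host> is a placeholder):

    # on the master machine
    ./sbin/start-master.sh
    # on each worker machine, pointing at the master's URL
    ./sbin/start-slave.sh spark://<master-host>:7077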

Re: Where to implement synchronization in GraphX Pregel API

2015-12-07 Thread Robineast
Not sure exactly what you're asking, but:
1) if you are asking whether you need to implement synchronisation code - no, that is built into the call to Pregel;
2) if you are asking how synchronisation is implemented in GraphX - the superstep starts and ends with the beginning and end of a while loop in the Pregel implementation.

Re: Failing to execute Pregel shortest path on 22k nodes

2015-12-01 Thread Robineast
1. The for loop is executed in your driver program, so it will send each Pregel request serially to be executed on the cluster.
2. Whilst caching/persisting may improve the runtime, it shouldn't affect the memory bounds - if you ask to cache more than is available, cached RDDs will be dropped out of the cache.

Re: GraphX - How to make a directed graph an undirected graph?

2015-11-26 Thread Robineast
1. GraphX doesn't have a concept of undirected graphs; edges are always specified with a srcId and dstId. However, there is nothing to stop you adding edges that point in the other direction, i.e. if you have an edge srcId -> dstId you can add an edge dstId -> srcId (see the sketch below the list).
2. In general APIs will…
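A minimal sketch of point 1 (g is assumed to be an existing Graph[VD, ED]):

    import org.apache.spark.graphx._

    // add a reversed copy of every edge to simulate an undirected graph
    val reversed = g.edges.map(e => Edge(e.dstId, e.srcId, e.attr))
    val undirected = Graph(g.vertices, g.edges.union(reversed))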

Re: Unable to build Spark 1.5, is the build broken or can anyone successfully build?

2015-10-23 Thread Robineast
Both Spark 1.5 and 1.5.1 have been released, so it certainly shouldn't be a problem.

Re: Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-13 Thread Robineast
What you have done should work. A couple of things to try:
1) You should have a lib directory in your Spark deployment containing a jar file called lib/spark-assembly-1.5.1-hadoop2.6.0.jar. Is it there?
2) Have you set the JAVA_HOME variable to point to your Java 8 deployment? If not, try…

Re: Constant Spark execution time with different # of slaves

2015-10-10 Thread Robineast
Do you have enough partitions of your RDDs to spread across all your processing cores? Are all executors actually processing tasks?
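Quick checks along those lines (a sketch; rdd stands in for one of your RDDs):

    rdd.partitions.size                    // compare against the total cores in the cluster
    val rebalanced = rdd.repartition(sc.defaultParallelism)  // a hypothetical rebalance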

Re: Spark GraphX

2015-10-10 Thread Robineast
Well, it depends on exactly what algorithms are involved in Network Root Cause analysis (not something I'm familiar with). GraphX provides a number of out-of-the-box algorithms like PageRank, connected components, strongly connected components and label propagation, as well as an implementation of the…

Re: Checkpointing in Iterative Graph Computation

2015-10-10 Thread Robineast
One other thought - you need to call SparkContext.setCheckpointDir, otherwise nothing will happen.

Re: Checkpointing in Iterative Graph Computation

2015-10-10 Thread Robineast
You need to checkpoint before you materialize. You'll probably find you only want to checkpoint every 100 or so iterations, otherwise the checkpointing will slow down your application excessively - see the sketch below.
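Putting both replies together, a sketch of periodic checkpointing in an iterative graph computation (initialGraph, numIterations and step() are hypothetical; the directory is a placeholder):

    sc.setCheckpointDir("/tmp/spark-checkpoints")  // required first, or nothing will happen
    var g = initialGraph
    for (i <- 1 to numIterations) {
      g = step(g)               // one iteration of the algorithm
      if (i % 100 == 0) {
        g.checkpoint()          // mark for checkpointing *before* materializing
        g.vertices.count()      // an action that materializes (and writes) the checkpoint
      }
    }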

Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Robineast
GraphX doesn't implement Tinkerpop functionality, but there is an external effort to provide an implementation. See https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4279

Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Robineast
GraphX has a Shortest Paths algorithm implementation which will tell you, for all vertices in the graph, the shortest distance to a specific ('landmark') vertex. The returned value is 'a graph where each vertex attribute is a map containing the shortest-path distance to each reachable landmark'.
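A minimal sketch (graph is assumed to exist and 42L is a placeholder landmark vertex id):

    import org.apache.spark.graphx.lib.ShortestPaths

    val result = ShortestPaths.run(graph, Seq(42L))
    // each vertex attribute is now a Map(landmarkId -> hop distance) for reachable landmarks
    result.vertices.collect.foreach(println)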

Re: Standalone Scala Project

2015-10-01 Thread Robineast
I've eyeballed the sbt file and it looks OK to me. Try 'sbt clean package'; that should sort it out. If not, please supply the full code you are running.

Re: How to find how much data will be trained in MLlib, or how much of the Spark job is completed?

2015-09-29 Thread Robineast
This page gives details on the monitoring available: http://spark.apache.org/docs/latest/monitoring.html. You can get a UI showing Jobs, Stages and Tasks, with an indication of how far the job has progressed. The UI is usually on port 4040 of the machine where you run the Spark driver program.

Re: How to find how much data will be trained in MLlib, or how much of the Spark job is completed?

2015-09-29 Thread Robineast
So you could query the REST API in code, e.g. /applications/[app-id]/stages provides details on the number of active and completed tasks in each stage.
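For example (a sketch; the host and application id are placeholders, and 4040 assumes the default driver UI port):

    curl http://<driver-host>:4040/api/v1/applications/<app-id>/stages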

Spark mailing list confusion

2015-09-29 Thread Robineast
Does anyone have any idea why some topics on the mailing list end up on https://www.mail-archive.com/user@spark.apache.org (e.g. this message thread), but not on http://apache-spark-user-list.1001560.n3.nabble.com? Whilst I get…

Re: Distance metrics in KMeans

2015-09-26 Thread Robineast
There is a Spark Package that provides some alternative distance metrics: http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. I've not used it myself.

Re: GraphX create graph with multiple node attributes

2015-09-26 Thread Robineast
Vertices that aren't connected to anything are perfectly valid, e.g.:

    import org.apache.spark.graphx._
    val vertices = sc.makeRDD(Seq((1L,1),(2L,1),(3L,1)))
    val edges = sc.makeRDD(Seq(Edge(1L,2L,1)))
    val g = Graph(vertices, edges)
    g.vertices.count  // gives 3

Not sure why vertices appear to be…

Re: Calling a method in parallel

2015-09-23 Thread Robineast
The following should give you what you need:

    val results = sc.makeRDD(1 to n).map(X(_)).collect

This should return the results as an array.

Re: why when I double the number of workers, ml LogisticRegression fitting time is not reduced by half?

2015-09-16 Thread Robineast
In principle yes; however, it depends on whether your application is actually utilising the extra resources. Use the task metrics available in the application UI (usually available from the driver machine on port 4040) to find out.

Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Robineast
Do you have LZO configured? See http://stackoverflow.com/questions/14808041/how-to-have-lzo-compression-in-hadoop-mapreduce

Re: GraphX CompactBuffer help

2015-08-28 Thread Robineast
My previous reply got mangled. This should work:

    coon.filter(x => x.exists(el => Seq(1,15).contains(el)))

CompactBuffer is a specialised form of a Scala Iterator.

Re: Spark GraphX

2015-08-23 Thread Robineast
GraphX is a graph analytics engine rather than a graph database. Its typical use case is running large-scale graph algorithms like PageRank, connected components, label propagation and so on. It can be an element of complex processing pipelines that involve other Spark components such as DataFrames…

Re: what determines the task size?

2015-08-21 Thread Robineast
The OP wants to understand what determines the size of the task code that is shipped to each executor so it can run the task. I don't know the answer but would be interested to know too. Sent from my iPhone

Re: Saving and loading MLlib models as standalone (no Hadoop)

2015-08-20 Thread Robineast
You can't serialize models out of Spark and then use them outside of the Spark context. However, there is support for the PMML format - have a look at https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
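A minimal sketch of the PMML route from the linked page (KMeans as an example of a PMML-exportable model; the data file, k, iteration count and output path are placeholders):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("data.txt").map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
    val model = KMeans.train(data, 2, 20)   // k = 2, 20 iterations
    model.toPMML("/tmp/kmeans.xml")         // toPMML() with no args returns the XML as a String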

Re: how to write any data (non-RDD) to a file inside a closure?

2015-08-18 Thread Robineast
Still not sure what you are trying to achieve. If you could post some code that doesn't work, the community can help you understand where the error (syntactic or conceptual) is.

Re: SparkR - GraphX connected components

2015-08-11 Thread Robineast
To be part of a strongly connected component, every vertex must be reachable from every other vertex. Vertex 6 is not reachable from the other members of SCC 0; the same goes for 7. So both 6 and 7 form their own strongly connected components. 6 and 7 are part of the connected components of 0 and 3.

Re: SparkR - GraphX connected components

2015-08-07 Thread Robineast
Hi, the graph returned by SCC (strong_graphs in your code) has vertex data where each vertex in a component is assigned the lowest vertex id of the component. So if you have 6 vertices (1 to 6) and 2 strongly connected components (1 and 3; and 2, 4, 5 and 6), then the strongly connected components are labelled 1 and 2 respectively - see the sketch below.
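A worked sketch of that example (the edge directions are an assumption, chosen so that {1, 3} and {2, 4, 5, 6} each form a cycle):

    import org.apache.spark.graphx._

    val edges = sc.makeRDD(Seq(
      Edge(1L, 3L, 1), Edge(3L, 1L, 1),    // cycle 1 <-> 3
      Edge(2L, 4L, 1), Edge(4L, 5L, 1),
      Edge(5L, 6L, 1), Edge(6L, 2L, 1)))   // cycle 2 -> 4 -> 5 -> 6 -> 2
    val g = Graph.fromEdges(edges, 0)
    g.stronglyConnectedComponents(numIter = 10).vertices.collect
    // e.g. (1,1), (3,1), (2,2), (4,2), (5,2), (6,2)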

Re: Scala problem when using g.vertices.map: 'not a member of type parameter'

2015-06-29 Thread Robineast
I can't see an obvious problem. Could you post the full minimal code that reproduces the problem? Also, which versions of Spark and Scala are you using?

Re: java.lang.UnsupportedOperationException: empty collection

2015-04-28 Thread Robineast
I've tried running your code through spark-shell on both 1.3.0 (pre-built for Hadoop 2.4 and above) and a recently built snapshot of master. Both work fine. Running on OS X Yosemite. What's your configuration?

Re: DAG info

2015-01-02 Thread Robineast
Do you have some example code of what you are trying to do? Robin