RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
With that said, and given the iterative algorithms Spark is advertised for, 
isn't this a bit of an unnecessary restriction? I don't see where the problem 
is. For instance, it is clear that when aggregating you need operations to be 
associative because of the way work is divided and combined. But since forEach 
works on an individual item, the same problem doesn't exist. 

As an example, during a k-means algorithm you have to continually update each 
data item's cluster assignment, along with perhaps its distance from the 
centroid. So if you can't update items in place, you have to create thousands 
upon thousands of RDDs. Does Spark have some kind of trick behind the scenes - 
reuse, fully persistent data structures, or the like? How can it possibly be 
efficient for 'iterative' algorithms when it creates so many RDDs instead of 
one? 
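
For context: the usual Spark idiom avoids in-place updates by deriving a new 
RDD of assignments on each iteration with map. A minimal sketch, assuming a 
running Spark application and made-up record types (none of these names come 
from the thread):

import org.apache.spark.rdd.RDD

// Hypothetical input and output records, for illustration only.
case class Point(id: Long, features: Array[Double])
case class Assignment(contentId: Long, clusterId: Int, distance: Double)

def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// One assignment pass: returns a *new* RDD; the input RDD is untouched.
def assign(points: RDD[Point], centroids: Array[Array[Double]]): RDD[Assignment] =
  points.map { p =>
    val (dist, cid) = centroids.zipWithIndex
      .map { case (c, i) => (euclidean(p.features, c), i) }
      .minBy(_._1)
    Assignment(p.id, cid, dist)
  }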

 From: so...@cloudera.com
 Date: Fri, 5 Dec 2014 14:58:37 -0600
 Subject: Re: Java RDD Union
 To: ronalday...@live.com; user@spark.apache.org
 
 foreach likewise does not modify an existing RDD; it is an action, and does
 not create a new one either. However, in practice, nothing stops you from
 fiddling with the Java objects inside an RDD when you get a reference to
 them in a method like this. This is definitely a bad idea, as there is no
 guarantee that any other operation will see any, some, or all of those
 edits.
 
 On Fri, Dec 5, 2014 at 2:40 PM, Ron Ayoub ronalday...@live.com wrote:
  I tricked myself into thinking it was uniting things correctly. I see I'm
  wrong now.
 
  I have a question regarding your comment that RDDs are immutable. Can you
  change values in an RDD using forEach? Does that violate immutability? I've
  been using forEach to modify RDDs, but perhaps I've tricked myself once
  again into believing it is working. I hold object references, so perhaps it
  is working serendipitously in local mode: the references are in fact not
  changing, but the referents are, and presumably this will no longer work
  when clustering.
 
  Thanks for the comments.
 
  From: so...@cloudera.com
  Date: Fri, 5 Dec 2014 14:22:38 -0600
  Subject: Re: Java RDD Union
  To: ronalday...@live.com
  CC: user@spark.apache.org
 
 
  No, RDDs are immutable. union() creates a new RDD, and does not modify
  an existing RDD. Maybe this obviates the question. I'm not sure what
  you mean about releasing from memory. If you want to repartition the
  unioned RDD, you repartition the result of union(), not anything else.
 
  On Fri, Dec 5, 2014 at 1:27 PM, Ron Ayoub ronalday...@live.com wrote:
   I'm a bit confused regarding the expected behavior of unions. I'm running
   on 8 cores. I have an RDD that is used to collect cluster associations
   (cluster id, content id, distance) for internal clusters as well as leaf
   clusters, since I'm doing hierarchical k-means and need all distances for
   sorting documents appropriately upon examination.
  
   It appears that union simply adds the items in the argument to the RDD
   instance the method is called on, rather than just returning a new RDD.
   If I want to use union this way, as more of an add/append, should I be
   capturing the return value and releasing the original from memory? I need
   some help clarifying the semantics here.
  
   Also, in another related thread someone mentioned coalesce after union.
   Would I need to do the same on the instance RDD I'm calling union on?
  
   Perhaps a method such as append would be useful and clearer.
 

Re: Java RDD Union

2014-12-06 Thread Sean Owen
I guess a major problem with this is that you lose fault tolerance.
You have no way of recreating the local state of the mutable RDD if a
partition is lost.

Why would you need thousands of RDDs for k-means? It's a few per iteration.

An RDD is more bookkeeping than data structure in itself. RDDs don't
inherently take up resources unless you mark them to be persisted.
You're paying the cost of copying objects to create one RDD from the
next, but that's mostly it.
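
A practical corollary for iterative jobs: persist the current iteration's RDD,
unpersist the previous one, and checkpoint occasionally so the lineage chain
stays short. A rough sketch of such a driver loop; `step`, the types, and the
checkpoint cadence are assumptions, not anything prescribed in this thread:

import org.apache.spark.rdd.RDD

// Hypothetical driver loop; `step` stands in for one assignment pass.
def iterate[T](initial: RDD[T], step: RDD[T] => RDD[T], iterations: Int): RDD[T] = {
  var current = initial.persist()
  for (i <- 1 to iterations) {
    val next = step(current).persist()
    next.count()            // force computation before releasing the old cache
    current.unpersist()     // the old RDD was only bookkeeping plus cache
    current = next
    if (i % 10 == 0)
      current.checkpoint()  // truncate lineage now and then
                            // (requires sc.setCheckpointDir to be set)
  }
  current
}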

On Sat, Dec 6, 2014 at 6:28 AM, Ron Ayoub ronalday...@live.com wrote:
 With that said, and given the iterative algorithms Spark is advertised for,
 isn't this a bit of an unnecessary restriction? I don't see where the problem
 is. For instance, it is clear that when aggregating you need operations to be
 associative because of the way work is divided and combined. But since
 forEach works on an individual item, the same problem doesn't exist.

 As an example, during a k-means algorithm you have to continually update each
 data item's cluster assignment, along with perhaps its distance from the
 centroid. So if you can't update items in place, you have to create thousands
 upon thousands of RDDs. Does Spark have some kind of trick behind the
 scenes - reuse, fully persistent data structures, or the like? How can it
 possibly be efficient for 'iterative' algorithms when it creates so many RDDs
 instead of one?




RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
Hierarchical k-means requires a massive number of iterations, whereas flat 
k-means does not; but I've found flat clustering to be generally useless, 
since in most UIs it is nice to be able to drill down into more and more 
specific clusters. If you have 100 million documents and your branching factor 
is 8 (8-secting k-means), then you will be picking a cluster to split and 
iterating thousands of times. Per split you iterate maybe 6 or 7 times to get 
new cluster assignments, and there will ultimately be 5,000 to 50,000 splits, 
depending on the split criterion, cluster variances, etc... 
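
A quick back-of-envelope with those numbers (purely illustrative):

// ~6-7 assignment passes per split, ~5,000-50,000 splits overall:
val low  = 5000L * 6     //  ~30,000 assignment passes
val high = 50000L * 7    // ~350,000 assignment passes
// Without in-place updates, each pass derives a new RDD, hence the
// 'thousands upon thousands' of RDDs over the run.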
In this case fault tolerance doesn't matter. I've found that the distributed 
aspect of RDDs is what I'm looking for; I don't care about or need the 
resilience part as much. It is a one-off algorithm that can just be run again 
if something goes wrong. Once the data is created, it is done with Spark. 

But anyway, that is the very thing Spark is advertised for. 

 From: so...@cloudera.com
 Date: Sat, 6 Dec 2014 06:39:10 -0600
 Subject: Re: Java RDD Union
 To: ronalday...@live.com
 CC: user@spark.apache.org
 
 I guess a major problem with this is that you lose fault tolerance.
 You have no way of recreating the local state of the mutable RDD if a
 partition is lost.
 
 Why would you need thousands of RDDs for k-means? It's a few per iteration.
 
 An RDD is more bookkeeping than data structure in itself. RDDs don't
 inherently take up resources unless you mark them to be persisted.
 You're paying the cost of copying objects to create one RDD from the
 next, but that's mostly it.
 
 On Sat, Dec 6, 2014 at 6:28 AM, Ron Ayoub ronalday...@live.com wrote:
  With that said, and given the iterative algorithms Spark is advertised for,
  isn't this a bit of an unnecessary restriction? I don't see where the
  problem is. For instance, it is clear that when aggregating you need
  operations to be associative because of the way work is divided and
  combined. But since forEach works on an individual item, the same problem
  doesn't exist.
 
  As an example, during a k-means algorithm you have to continually update
  each data item's cluster assignment, along with perhaps its distance from
  the centroid. So if you can't update items in place, you have to create
  thousands upon thousands of RDDs. Does Spark have some kind of trick behind
  the scenes - reuse, fully persistent data structures, or the like? How can
  it possibly be efficient for 'iterative' algorithms when it creates so many
  RDDs instead of one?
 

Java RDD Union

2014-12-05 Thread Ron Ayoub
I'm a bit confused regarding the expected behavior of unions. I'm running on 8 
cores. I have an RDD that is used to collect cluster associations (cluster id, 
content id, distance) for internal clusters as well as leaf clusters, since 
I'm doing hierarchical k-means and need all distances for sorting documents 
appropriately upon examination. 

It appears that union simply adds the items in the argument to the RDD 
instance the method is called on, rather than just returning a new RDD. If I 
want to use union this way, as more of an add/append, should I be capturing 
the return value and releasing the original from memory? I need some help 
clarifying the semantics here. 

Also, in another related thread someone mentioned coalesce after union. Would 
I need to do the same on the instance RDD I'm calling union on? 

Perhaps a method such as append would be useful and clearer.

Re: Java RDD Union

2014-12-05 Thread Sean Owen
No, RDDs are immutable. union() creates a new RDD, and does not modify
an existing RDD. Maybe this obviates the question. I'm not sure what
you mean about releasing from memory. If you want to repartition the
unioned RDD, you repartition the result of union(), not anything else.
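
Concretely, a small sketch of those semantics, assuming a SparkContext sc as
in the spark-shell session further down (the tuples here are made up):

// (clusterId, contentId, distance) rows, as in the original question.
val internal = sc.parallelize(Seq((1, 10L, 0.3), (1, 11L, 0.7)))
val leaves   = sc.parallelize(Seq((2, 10L, 0.1)))

val all = internal.union(leaves)  // new RDD; `internal` and `leaves` unchanged

// If the partition count should change, do it on the *result* of union():
val compact = all.coalesce(8)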

On Fri, Dec 5, 2014 at 1:27 PM, Ron Ayoub ronalday...@live.com wrote:
 I'm a bit confused regarding the expected behavior of unions. I'm running on
 8 cores. I have an RDD that is used to collect cluster associations (cluster
 id, content id, distance) for internal clusters as well as leaf clusters,
 since I'm doing hierarchical k-means and need all distances for sorting
 documents appropriately upon examination.

 It appears that union simply adds the items in the argument to the RDD
 instance the method is called on, rather than just returning a new RDD. If I
 want to use union this way, as more of an add/append, should I be capturing
 the return value and releasing the original from memory? I need some help
 clarifying the semantics here.

 Also, in another related thread someone mentioned coalesce after union.
 Would I need to do the same on the instance RDD I'm calling union on?

 Perhaps a method such as append would be useful and clearer.




Re: Java RDD Union

2014-12-05 Thread Sameer Farooqui
Hi Ron,

Out of curiosity, why do you think that union is modifying an existing RDD
in place? In general all transformations, including union, will create new
RDDs, not modify old RDDs in place.

Here's a quick test:

scala> val firstRDD = sc.parallelize(1 to 5)
firstRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at
parallelize at <console>:12

scala> val secondRDD = sc.parallelize(1 to 3)
secondRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at
parallelize at <console>:12

scala> firstRDD.collect()
res1: Array[Int] = Array(1, 2, 3, 4, 5)

scala> secondRDD.collect()
res2: Array[Int] = Array(1, 2, 3)

scala> val newRDD = firstRDD.union(secondRDD)
newRDD: org.apache.spark.rdd.RDD[Int] = UnionRDD[4] at union at <console>:16

scala> newRDD.collect()
res3: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3)

scala> firstRDD.collect()
res4: Array[Int] = Array(1, 2, 3, 4, 5)

scala> secondRDD.collect()
res5: Array[Int] = Array(1, 2, 3)


On Fri, Dec 5, 2014 at 2:27 PM, Ron Ayoub ronalday...@live.com wrote:

 I'm a bit confused regarding the expected behavior of unions. I'm running on
 8 cores. I have an RDD that is used to collect cluster associations (cluster
 id, content id, distance) for internal clusters as well as leaf clusters,
 since I'm doing hierarchical k-means and need all distances for sorting
 documents appropriately upon examination.

 It appears that union simply adds the items in the argument to the RDD
 instance the method is called on, rather than just returning a new RDD. If I
 want to use union this way, as more of an add/append, should I be capturing
 the return value and releasing the original from memory? I need some help
 clarifying the semantics here.

 Also, in another related thread someone mentioned coalesce after union.
 Would I need to do the same on the instance RDD I'm calling union on?

 Perhaps a method such as append would be useful and clearer.



Re: Java RDD Union

2014-12-05 Thread Sean Owen
foreach likewise does not modify an existing RDD; it is an action, and does
not create a new one either. However, in practice, nothing stops you from
fiddling with the Java objects inside an RDD when you get a reference to
them in a method like this. This is definitely a bad idea, as there is no
guarantee that any other operation will see any, some, or all of those
edits.
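
To illustrate the hazard, a small sketch, assuming a SparkContext sc; the
mutable class is invented for the example:

// A mutable record, as one might use from the Java API.
class Doc(var clusterId: Int) extends Serializable

val docs = sc.parallelize(Seq(new Doc(0), new Doc(0)))

// Fragile: mutates the objects inside the RDD. This can appear to work in
// local mode, where driver and executors share a JVM and the references
// point at the same objects, but nothing guarantees that later operations
// or a recomputed partition will see the edits.
docs.foreach(d => d.clusterId = 1)

// Reliable: derive a new RDD carrying the updated values instead.
val reassigned = docs.map(d => new Doc(1))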

On Fri, Dec 5, 2014 at 2:40 PM, Ron Ayoub ronalday...@live.com wrote:
 I tricked myself into thinking it was uniting things correctly. I see I'm
 wrong now.

 I have a question regarding your comment that RDDs are immutable. Can you
 change values in an RDD using forEach? Does that violate immutability? I've
 been using forEach to modify RDDs, but perhaps I've tricked myself once
 again into believing it is working. I hold object references, so perhaps it
 is working serendipitously in local mode: the references are in fact not
 changing, but the referents are, and presumably this will no longer work
 when clustering.

 Thanks for the comments.

 From: so...@cloudera.com
 Date: Fri, 5 Dec 2014 14:22:38 -0600
 Subject: Re: Java RDD Union
 To: ronalday...@live.com
 CC: user@spark.apache.org


 No, RDDs are immutable. union() creates a new RDD, and does not modify
 an existing RDD. Maybe this obviates the question. I'm not sure what
 you mean about releasing from memory. If you want to repartition the
 unioned RDD, you repartition the result of union(), not anything else.

 On Fri, Dec 5, 2014 at 1:27 PM, Ron Ayoub ronalday...@live.com wrote:
  I'm a bit confused regarding the expected behavior of unions. I'm running
  on 8 cores. I have an RDD that is used to collect cluster associations
  (cluster id, content id, distance) for internal clusters as well as leaf
  clusters, since I'm doing hierarchical k-means and need all distances for
  sorting documents appropriately upon examination.
 
  It appears that union simply adds the items in the argument to the RDD
  instance the method is called on, rather than just returning a new RDD.
  If I want to use union this way, as more of an add/append, should I be
  capturing the return value and releasing the original from memory? I need
  some help clarifying the semantics here.
 
  Also, in another related thread someone mentioned coalesce after union.
  Would I need to do the same on the instance RDD I'm calling union on?
 
  Perhaps a method such as append would be useful and clearer.


