GitHub user ankurdave opened a pull request:
https://github.com/apache/spark/pull/1297
Add IndexedRDD, an efficient updatable key-value store
This PR adds IndexedRDD, an efficient key-value store built on RDDs.
IndexedRDD extends `RDD[(Long, V)]` by enforcing key uniqueness and
pre-indexing the entries for efficient joins and point lookups, updates, and
deletions.
It is implemented by (1) hash-partitioning the entries by key, (2)
maintaining a hash index within each partition, and (3) using purely functional
(immutable and efficiently updatable) data structures to enable efficient
modifications and deletions.
As a result, this PR also introduces package-private implementations of
three purely functional data structures: ImmutableVector,
ImmutableLongOpenHashSet, and ImmutableBitSet.
GraphX will be the first user of IndexedRDD. This PR uses IndexedRDD to
replace the current implementation of VertexRDD.
Here is a usage example for IndexedRDD:
```scala
import org.apache.spark.rdd.IndexedRDD
// Create an RDD of key-value pairs with Long keys
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
the entries
val indexed = IndexedRDD(rdd).cache()
// Perform some point updates. Note that the original IndexedRDD is not
modified
val indexed2 = indexed.put(100L, 1).put(1234L, 10873).cache()
// Perform some point lookups
indexed2.get(1234L) // => Some(10873)
indexed.get(1234L) // => Some(0)
// Efficiently join derived IndexedRDD with original
val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2
!= 0)
indexed3.collect // => Array((100L, 1), (1234L, 10873))
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ankurdave/spark IndexedRDD
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1297.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1297
----
commit 6e56da08e6f7b60b290448a495b25c72a0b39878
Author: Ankur Dave <[email protected]>
Date: 2014-05-29T03:24:03Z
Start work on IndexedRDD interface
commit b17cf6af9924064bc9574e150cc18fd658ededc7
Author: Ankur Dave <[email protected]>
Date: 2014-06-10T22:10:26Z
Fix a bug
commit 087f9d1e27a22979bc010dbef6931e7e137e173c
Author: Ankur Dave <[email protected]>
Date: 2014-06-11T21:12:28Z
Make sure to specialize everywhere
commit 337e9c417b2ba57f711b7113d3edcd3d4e5468db
Author: Ankur Dave <[email protected]>
Date: 2014-06-12T01:27:37Z
Fix SPARK-2011 on IndexedRDD
commit 35fbc27cf1816dcf1f7357d9dd5f3fe3f9603c5c
Author: Ankur Dave <[email protected]>
Date: 2014-06-16T18:22:25Z
Add IndexedRDDBenchmark
commit b5e5a7895ab161670483d5d7af253a72216307a7
Author: Ankur Dave <[email protected]>
Date: 2014-06-13T02:48:03Z
Set numPartitions properly
commit 5854c60a52068a2203ed161b807bcdb6d3ac417b
Author: Ankur Dave <[email protected]>
Date: 2014-06-13T21:03:57Z
Avoid integer overflow in benchmark
commit 878f825f1834737f01ae208be2409b162fb44cfa
Author: Ankur Dave <[email protected]>
Date: 2014-06-13T21:25:56Z
Force puts
commit 68b253f36db466b5377d877866839242e579acd1
Author: Ankur Dave <[email protected]>
Date: 2014-06-13T22:20:17Z
Only force once
commit 475e67c2b4e98b85677eb531b578a0e43d27b41f
Author: Ankur Dave <[email protected]>
Date: 2014-06-16T18:16:47Z
Compare against vanilla RDD in IndexedRDDBenchmark
commit c2dceb75a7ad4f951138936b8567f8a259682aca
Author: Ankur Dave <[email protected]>
Date: 2014-06-16T19:09:42Z
Use count as well as foreach to demonstrate effect of boxing
commit 33729ebe39ffcb8a525d0e99c2a3fc3e084f37fe
Author: Ankur Dave <[email protected]>
Date: 2014-06-11T21:00:09Z
Switch to Vector as a purely functional value store
commit d4e7a7b753131c8f7dcf2068417a073816fc1371
Author: Ankur Dave <[email protected]>
Date: 2014-06-12T22:42:45Z
Implement multiget
commit ae877698297527a51c7722f8a793af496d83db6d
Author: Ankur Dave <[email protected]>
Date: 2014-06-12T22:47:59Z
Only run tasks on partition with requested keys
commit 9e2cf473ba896344c9ac84359935d3638189035a
Author: Ankur Dave <[email protected]>
Date: 2014-06-12T22:51:59Z
Implement get in terms of multiget
commit b6104dc3a8fbd7f807f79eb4dd18078f38bb6e68
Author: Ankur Dave <[email protected]>
Date: 2014-06-12T23:40:49Z
Implement multiput
commit 87fbcad133d563fa1c91fb0bd10f8b4f209c0358
Author: Ankur Dave <[email protected]>
Date: 2014-06-12T23:48:46Z
Implement put and fix mutation bug in multiput
commit 4fed1a227cb7461b31a08a5958e566ab77e3b3f8
Author: Ankur Dave <[email protected]>
Date: 2014-06-14T01:42:52Z
Even when indexes are different, zip optimistically with fallback
Also, switch the argument order of diff so it keeps values from the left
side (`self`). This is consistent with all the other operations.
commit 67f75c642c5f6e968b71b0591c1c34414591b4ec
Author: Ankur Dave <[email protected]>
Date: 2014-06-16T18:22:18Z
Fix a bug
commit 98f809d5cf9d58099f14b04647b2a506041814e7
Author: Ankur Dave <[email protected]>
Date: 2014-06-16T21:37:32Z
Add scaling microbenchmarks for get and put
commit 3f985f0f203c34ce561fcf35aae82baf594cc320
Author: Ankur Dave <[email protected]>
Date: 2014-06-16T22:00:22Z
Test varying read-write mixtures
commit 0c495b52ad6fc0b654f5a148ab580a21bc056650
Author: Ankur Dave <[email protected]>
Date: 2014-06-16T22:06:11Z
Test scaling for IndexedRDDPartition creation
commit 87d41da0864faaa70ba98d6b260bb1ab61c880eb
Author: Ankur Dave <[email protected]>
Date: 2014-06-16T22:26:00Z
Generate more datapoints for microbenchmarks
commit a4f0bc97e75691dcc439e913affd7332366f48c0
Author: Ankur Dave <[email protected]>
Date: 2014-06-20T19:13:37Z
WIP ImmutableLongOpenHashSet
commit 644d66c8c085595ef6c065f5f8f654490b1a51d3
Author: Ankur Dave <[email protected]>
Date: 2014-06-20T20:50:44Z
Finish ImmutableLongOpenHashSet and write test
commit 822b36198643d745b608187ea6fd1cf43f8c2d95
Author: Ankur Dave <[email protected]>
Date: 2014-06-21T02:28:53Z
WIP ImmutableVector
commit 14b30a7dcff64a2ff2dec4e65d94d8d625f3ab63
Author: Ankur Dave <[email protected]>
Date: 2014-06-23T20:35:08Z
Add VectorIterator
commit b915362d1f7794368a8936d0fccb6e3105db8ff2
Author: Ankur Dave <[email protected]>
Date: 2014-06-23T20:38:09Z
Make ImmutableVector classes private
commit 9232e917abb5609a13847af0f6c96925aee14c9d
Author: Ankur Dave <[email protected]>
Date: 2014-06-23T20:55:33Z
Add VectorNode.updated
commit 7440f4497080a5844cab5933cb74b827eb1cdf34
Author: Ankur Dave <[email protected]>
Date: 2014-06-23T21:43:08Z
Use ImmutableVector everywhere
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---