[GitHub] spark pull request: Add IndexedRDD, an efficient updatable key-val...

ankurdave Thu, 03 Jul 2014 19:57:12 -0700

GitHub user ankurdave opened a pull request:

    https://github.com/apache/spark/pull/1297


    Add IndexedRDD, an efficient updatable key-value store

    This PR adds IndexedRDD, an efficient key-value store built on RDDs. 
IndexedRDD extends `RDD[(Long, V)]` by enforcing key uniqueness and 
pre-indexing the entries for efficient joins and point lookups, updates, and 
deletions.
    
    It is implemented by (1) hash-partitioning the entries by key, (2) 
maintaining a hash index within each partition, and (3) using purely functional 
(immutable and efficiently updatable) data structures to enable efficient 
modifications and deletions.
    
    As a result, this PR also introduces package-private implementations of 
three purely functional data structures: ImmutableVector, 
ImmutableLongOpenHashSet, and ImmutableBitSet.
    
    GraphX will be the first user of IndexedRDD. This PR uses IndexedRDD to 
replace the current implementation of VertexRDD.
    
    Here is a usage example for IndexedRDD:
    
    ```scala
    import org.apache.spark.rdd.IndexedRDD
    
    // Create an RDD of key-value pairs with Long keys
    val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
    // Construct an IndexedRDD from the pairs, hash-partitioning and indexing 
the entries
    val indexed = IndexedRDD(rdd).cache()
    
    // Perform some point updates. Note that the original IndexedRDD is not 
modified
    val indexed2 = indexed.put(100L, 1).put(1234L, 10873).cache()
    
    // Perform some point lookups
    indexed2.get(1234L) // => Some(10873)
    indexed.get(1234L) // => Some(0)
    
    // Efficiently join derived IndexedRDD with original
    val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 
!= 0)
    indexed3.collect // => Array((100L, 1), (1234L, 10873))
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ankurdave/spark IndexedRDD

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1297.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1297
    
----
commit 6e56da08e6f7b60b290448a495b25c72a0b39878
Author: Ankur Dave <[email protected]>
Date:   2014-05-29T03:24:03Z

    Start work on IndexedRDD interface

commit b17cf6af9924064bc9574e150cc18fd658ededc7
Author: Ankur Dave <[email protected]>
Date:   2014-06-10T22:10:26Z

    Fix a bug

commit 087f9d1e27a22979bc010dbef6931e7e137e173c
Author: Ankur Dave <[email protected]>
Date:   2014-06-11T21:12:28Z

    Make sure to specialize everywhere

commit 337e9c417b2ba57f711b7113d3edcd3d4e5468db
Author: Ankur Dave <[email protected]>
Date:   2014-06-12T01:27:37Z

    Fix SPARK-2011 on IndexedRDD

commit 35fbc27cf1816dcf1f7357d9dd5f3fe3f9603c5c
Author: Ankur Dave <[email protected]>
Date:   2014-06-16T18:22:25Z

    Add IndexedRDDBenchmark

commit b5e5a7895ab161670483d5d7af253a72216307a7
Author: Ankur Dave <[email protected]>
Date:   2014-06-13T02:48:03Z

    Set numPartitions properly

commit 5854c60a52068a2203ed161b807bcdb6d3ac417b
Author: Ankur Dave <[email protected]>
Date:   2014-06-13T21:03:57Z

    Avoid integer overflow in benchmark

commit 878f825f1834737f01ae208be2409b162fb44cfa
Author: Ankur Dave <[email protected]>
Date:   2014-06-13T21:25:56Z

    Force puts

commit 68b253f36db466b5377d877866839242e579acd1
Author: Ankur Dave <[email protected]>
Date:   2014-06-13T22:20:17Z

    Only force once

commit 475e67c2b4e98b85677eb531b578a0e43d27b41f
Author: Ankur Dave <[email protected]>
Date:   2014-06-16T18:16:47Z

    Compare against vanilla RDD in IndexedRDDBenchmark

commit c2dceb75a7ad4f951138936b8567f8a259682aca
Author: Ankur Dave <[email protected]>
Date:   2014-06-16T19:09:42Z

    Use count as well as foreach to demonstrate effect of boxing

commit 33729ebe39ffcb8a525d0e99c2a3fc3e084f37fe
Author: Ankur Dave <[email protected]>
Date:   2014-06-11T21:00:09Z

    Switch to Vector as a purely functional value store

commit d4e7a7b753131c8f7dcf2068417a073816fc1371
Author: Ankur Dave <[email protected]>
Date:   2014-06-12T22:42:45Z

    Implement multiget

commit ae877698297527a51c7722f8a793af496d83db6d
Author: Ankur Dave <[email protected]>
Date:   2014-06-12T22:47:59Z

    Only run tasks on partition with requested keys

commit 9e2cf473ba896344c9ac84359935d3638189035a
Author: Ankur Dave <[email protected]>
Date:   2014-06-12T22:51:59Z

    Implement get in terms of multiget

commit b6104dc3a8fbd7f807f79eb4dd18078f38bb6e68
Author: Ankur Dave <[email protected]>
Date:   2014-06-12T23:40:49Z

    Implement multiput

commit 87fbcad133d563fa1c91fb0bd10f8b4f209c0358
Author: Ankur Dave <[email protected]>
Date:   2014-06-12T23:48:46Z

    Implement put and fix mutation bug in multiput

commit 4fed1a227cb7461b31a08a5958e566ab77e3b3f8
Author: Ankur Dave <[email protected]>
Date:   2014-06-14T01:42:52Z

    Even when indexes are different, zip optimistically with fallback
    
    Also, switch the argument order of diff so it keeps values from the left
    side (`self`). This is consistent with all the other operations.

commit 67f75c642c5f6e968b71b0591c1c34414591b4ec
Author: Ankur Dave <[email protected]>
Date:   2014-06-16T18:22:18Z

    Fix a bug

commit 98f809d5cf9d58099f14b04647b2a506041814e7
Author: Ankur Dave <[email protected]>
Date:   2014-06-16T21:37:32Z

    Add scaling microbenchmarks for get and put

commit 3f985f0f203c34ce561fcf35aae82baf594cc320
Author: Ankur Dave <[email protected]>
Date:   2014-06-16T22:00:22Z

    Test varying read-write mixtures

commit 0c495b52ad6fc0b654f5a148ab580a21bc056650
Author: Ankur Dave <[email protected]>
Date:   2014-06-16T22:06:11Z

    Test scaling for IndexedRDDPartition creation

commit 87d41da0864faaa70ba98d6b260bb1ab61c880eb
Author: Ankur Dave <[email protected]>
Date:   2014-06-16T22:26:00Z

    Generate more datapoints for microbenchmarks

commit a4f0bc97e75691dcc439e913affd7332366f48c0
Author: Ankur Dave <[email protected]>
Date:   2014-06-20T19:13:37Z

    WIP ImmutableLongOpenHashSet

commit 644d66c8c085595ef6c065f5f8f654490b1a51d3
Author: Ankur Dave <[email protected]>
Date:   2014-06-20T20:50:44Z

    Finish ImmutableLongOpenHashSet and write test

commit 822b36198643d745b608187ea6fd1cf43f8c2d95
Author: Ankur Dave <[email protected]>
Date:   2014-06-21T02:28:53Z

    WIP ImmutableVector

commit 14b30a7dcff64a2ff2dec4e65d94d8d625f3ab63
Author: Ankur Dave <[email protected]>
Date:   2014-06-23T20:35:08Z

    Add VectorIterator

commit b915362d1f7794368a8936d0fccb6e3105db8ff2
Author: Ankur Dave <[email protected]>
Date:   2014-06-23T20:38:09Z

    Make ImmutableVector classes private

commit 9232e917abb5609a13847af0f6c96925aee14c9d
Author: Ankur Dave <[email protected]>
Date:   2014-06-23T20:55:33Z

    Add VectorNode.updated

commit 7440f4497080a5844cab5933cb74b827eb1cdf34
Author: Ankur Dave <[email protected]>
Date:   2014-06-23T21:43:08Z

    Use ImmutableVector everywhere

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Add IndexedRDD, an efficient updatable key-val...

Reply via email to