[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145698#comment-14145698
 ] 

Ankur Dave commented on SPARK-2365:
-----------------------------------

[~imranr] Thanks for the comments and encouragement! I agree with all your 
points, and I've made subtasks for some of them:

1. We should definitely extract an IndexedRDD interface separate from the 
current purely functional implementation (SPARK-3669). Right now I can think of 
two alternative implementations: a non-updatable one for better read 
performance (SPARK-3672), and one based on log-structured updates (SPARK-3670). 
The latter might perform better when updates touch many different leaf nodes, 
which is likely when an update hits more than 1/32 (≈ 3%) of keys.
2. The best way to save to HDFS is probably to use saveAsTextFile, which will 
just save the key-value pairs and not the index. As you suggested, to load 
without a shuffle we'll need a way to give Spark an assumed partitioner. It 
would be nice if Spark supported that directly, but we could implement it in 
IndexedRDD if necessary.
3. Right, an inner join would be the best way to do a bulk multiget.
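
As a rough illustration of point 3, here is a minimal sketch in plain Python 
(not Spark code) of a bulk multiget expressed as an inner join against 
hash-partitioned, indexed data. The names NUM_PARTITIONS, partition_of, 
build_partitions and multiget are hypothetical, and the modulo partitioning 
scheme is an assumption made for the example, not IndexedRDD's actual scheme.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for the example


def partition_of(key: int) -> int:
    """Assumed hash-partitioning scheme: integer key modulo partition count."""
    return key % NUM_PARTITIONS


def build_partitions(pairs):
    """Group (key, value) pairs into one dict (hash index) per partition."""
    parts = [dict() for _ in range(NUM_PARTITIONS)]
    for key, value in pairs:
        parts[partition_of(key)][key] = value
    return parts


def multiget(parts, keys):
    """Bulk lookup written as an inner join.

    The requested keys are partitioned the same way as the data, so each
    partition only probes its own index. Keys with no match are dropped,
    which is the inner-join behaviour.
    """
    wanted = defaultdict(set)
    for key in keys:
        wanted[partition_of(key)].add(key)

    result = {}
    for pid, wanted_keys in wanted.items():
        index = parts[pid]
        for key in wanted_keys:
            if key in index:
                result[key] = index[key]
    return result
```

For example, multiget(build_partitions([(1, "a"), (5, "c")]), [1, 5, 99]) 
returns the entries for keys 1 and 5 and silently omits 99.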

To give an update on the overall status, for now IndexedRDD will be an external 
library that users can pull in rather than a part of Spark core. I'll move it 
from a pull request into a separate repository soon (SPARK-3673), though I hope 
to continue using the Spark JIRA for issue tracking.

> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
>                 Key: SPARK-2365
>                 URL: https://issues.apache.org/jira/browse/SPARK-2365
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, Spark Core
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>         Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
>
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but a 
> lookup currently requires iterating over an entire partition to find the 
> desired element. Point updates similarly require copying an entire iterator. 
> Joins are also expensive, requiring a shuffle and local hash joins.
>
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
>
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
>
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.
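
The three-step design in the issue description above can be sketched in plain 
Python as follows. This is only a toy model under stated assumptions: the 
class IndexedStoreSketch and its methods are hypothetical and are not the 
IndexedRDD API, keys are partitioned by a simple modulo, and immutability is 
approximated by copy-on-write at partition granularity rather than by the 
finer-grained purely functional data structures the proposal describes.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


class IndexedStoreSketch:
    """Toy stand-in for the proposed design; not the real IndexedRDD API."""

    def __init__(self, partitions: Tuple[Dict[int, Any], ...]):
        # Partitions are treated as immutable once a store is constructed.
        self._partitions = partitions

    @classmethod
    def from_pairs(cls, pairs: Iterable[Tuple[int, Any]],
                   num_partitions: int = 4) -> "IndexedStoreSketch":
        parts = [dict() for _ in range(num_partitions)]
        for key, value in pairs:
            # (1) hash-partition by key; (2) the dict acts as the hash index.
            # A later duplicate overwrites an earlier one, so keys stay unique.
            parts[key % num_partitions][key] = value
        return cls(tuple(parts))

    def get(self, key: int) -> Optional[Any]:
        """Point lookup: probe one partition's index, no full scan."""
        return self._partitions[key % len(self._partitions)].get(key)

    def put(self, key: int, value: Any) -> "IndexedStoreSketch":
        """(3) Return a new store; only the touched partition is copied."""
        pid = key % len(self._partitions)
        new_part = dict(self._partitions[pid])
        new_part[key] = value
        return self._replace(pid, new_part)

    def delete(self, key: int) -> "IndexedStoreSketch":
        """Return a new store without the key; the old store is unchanged."""
        pid = key % len(self._partitions)
        new_part = {k: v for k, v in self._partitions[pid].items() if k != key}
        return self._replace(pid, new_part)

    def _replace(self, pid: int, new_part: Dict[int, Any]) -> "IndexedStoreSketch":
        # Untouched partitions are shared between the old and new versions.
        parts = self._partitions[:pid] + (new_part,) + self._partitions[pid + 1:]
        return IndexedStoreSketch(parts)
```

Because put and delete return new stores, an older version remains readable 
after an update, which is the property that purely functional structures are 
meant to provide at lower copying cost than this sketch.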



