[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145698#comment-14145698 ]
Ankur Dave commented on SPARK-2365:
-----------------------------------

[~imranr] Thanks for the comments and encouragement! I agree with all your points, and I've made subtasks for some of them:

1. We should definitely extract an IndexedRDD interface separate from the current purely-functional implementation (SPARK-3669). Right now I can think of two alternative implementations: a non-updatable one for better read performance (SPARK-3672), and one based on log-structured updates which might perform better when updates touch many different leaf nodes, which is likely when an update hits more than 1/32 (≈ 3%) of keys (SPARK-3670).

2. The best way to save to HDFS is probably to use saveAsTextFile, which will just save the key-value pairs and not the index. As you suggested, to load without a shuffle we'll need a way to give Spark an assumed partitioner. It would be nice if Spark supported that directly, but we could implement it in IndexedRDD if necessary.

3. Right, an inner join would be the best way to do a bulk multiget.

To give an update on the overall status, for now IndexedRDD will be an external library that users can pull in rather than a part of Spark core. I'll move it from a pull request into a separate repository soon (SPARK-3673), though I hope to continue using the Spark JIRA for issue tracking.

> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
>                 Key: SPARK-2365
>                 URL: https://issues.apache.org/jira/browse/SPARK-2365
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, Spark Core
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>         Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This
> imposes minimal requirements on the storage layer, which only needs to
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface.
> Efficient support for point lookups would enable serving data out of RDDs,
> but it currently requires iterating over an entire partition to find the
> desired element. Point updates similarly require copying an entire iterator.
> Joins are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key
> uniqueness and pre-indexing the entries for efficient joins and point
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2)
> maintaining a hash index within each partition, and (3) using purely
> functional (immutable and efficiently updatable) data structures to enable
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a
> limited form of this functionality in VertexRDD. We envision a variety of
> other uses for IndexedRDD, including streaming updates to RDDs, direct
> serving from RDDs, and as an execution strategy for Spark SQL.
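
The three implementation steps in the quoted description (hash-partitioning by key, a hash index within each partition, and immutable updates) can be illustrated with a small Spark-free Python sketch. Everything in it is hypothetical and for illustration only: the names IndexedPartitions, partition_of, and NUM_PARTITIONS do not come from the IndexedRDD code, key % n stands in for a real hash partitioner, and copying one partition's dict per write is a crude stand-in for the purely functional data structures the proposal describes, which would also share structure inside a partition.

```python
from typing import Dict, Iterable, Optional, Tuple

NUM_PARTITIONS = 4  # hypothetical partition count, chosen only for the example


def partition_of(key: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Step 1: assign a key to a partition (modulo stands in for a hash partitioner)."""
    return key % num_partitions


class IndexedPartitions:
    """Toy model of the proposal: an immutable tuple of per-partition hash indexes."""

    def __init__(self, partitions: Iterable[Dict[int, str]]):
        self._partitions: Tuple[Dict[int, str], ...] = tuple(partitions)

    @classmethod
    def from_pairs(cls, pairs, num_partitions: int = NUM_PARTITIONS):
        parts = [dict() for _ in range(num_partitions)]
        for key, value in pairs:
            # Step 2: each partition keeps its own hash index (a dict here).
            # A later pair with the same key overwrites the earlier one,
            # which keeps keys unique.
            parts[partition_of(key, num_partitions)][key] = value
        return cls(parts)

    def get(self, key: int) -> Optional[str]:
        """Point lookup: one partition, one hash probe, no scan."""
        part = self._partitions[partition_of(key, len(self._partitions))]
        return part.get(key)

    def _with_partition(self, i: int, new_part: Dict[int, str]) -> "IndexedPartitions":
        # Step 3 (simplified): build a new version that reuses every
        # untouched partition object and swaps in only the changed one.
        parts = self._partitions[:i] + (new_part,) + self._partitions[i + 1:]
        return IndexedPartitions(parts)

    def put(self, key: int, value: str) -> "IndexedPartitions":
        i = partition_of(key, len(self._partitions))
        new_part = dict(self._partitions[i])  # copy only the affected partition
        new_part[key] = value
        return self._with_partition(i, new_part)

    def delete(self, key: int) -> "IndexedPartitions":
        i = partition_of(key, len(self._partitions))
        new_part = dict(self._partitions[i])
        new_part.pop(key, None)  # deleting a missing key is a no-op
        return self._with_partition(i, new_part)


# Usage: the old version stays readable after an update.
store = IndexedPartitions.from_pairs([(1, "a"), (2, "b"), (5, "c")])
updated = store.put(2, "B")
old_value, new_value = store.get(2), updated.get(2)
```

In this sketch a write costs time proportional to the size of one partition, because the whole dict is copied. A persistent tree or trie would copy only the path to the changed entry, which is the property the description refers to as "efficiently updatable".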