[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

Akshat Aranya (JIRA) Thu, 16 Oct 2014 09:11:55 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173904#comment-14173904
 ]


Akshat Aranya commented on SPARK-2365:
--------------------------------------

This looks great!  I have been using IndexedRDD for a while, to great effect.  
I have one suggestion: it would be nice to override setName() in IndexedRDDLike

{code}
override def setName(_name: String): this.type = {
  partitionsRDD.setName(_name)
  this
}
{code}

so that the IndexedRDD shows up with friendly names in the storage UI, just 
like regular, cached RDDs do.


> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
>                 Key: SPARK-2365
>                 URL: https://issues.apache.org/jira/browse/SPARK-2365
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, Spark Core
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>         Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

Reply via email to