[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272048#comment-15272048
 ] 

Jyoti Misra commented on SPARK-2365:
------------------------------------

We have migrated our application to Spark, and all of our use cases work very 
well except for updating RDDs.
Ankur's IndexedRDD is a ray of hope for improving the performance of this use 
case as well.

However, we have not been able to use it from Spark's Java API, and the 
examples we have found online are all in Scala.

When we try to convert a Java RDD to an IndexedRDD 
(https://github.com/amplab/spark-indexedrdd), we get a ClassCastException.
Is there any way to do this conversion?

Below is the code snippet:

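JavaPairRDD<String, String> mappedRDD = lines.flatMapToPair(
    new PairFlatMapFunction<String, String, String>() {
        @Override
        public Iterable<Tuple2<String, String>> call(String arg0) throws Exception {
            // Split each line into a key and the remainder of the line.
            String[] arr = arg0.split(" ", 2);
            System.out.println("length " + arr.length);
            List<Tuple2<String, String>> results = new ArrayList<Tuple2<String, String>>();
            // Emit a pair only when the line has both a key and a value.
            if (arr.length == 2) {
                results.add(new Tuple2<String, String>(arr[0], arr[1]));
            }
            return results;
        }
    });

// collectAsMap() returns a java.util.Map on the driver, not an RDD,
// so this cast compiles but fails at runtime.
IndexedRDD<String, String> test =
    (IndexedRDD<String, String>) mappedRDD.collectAsMap();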

The cast above throws a ClassCastException at runtime.

We also tried the following:
IndexedRDD<String,String> test = new IndexedRDD<String,String>(mappedRDD.rdd());
This line fails at compile time with the error: The constructor 
IndexedRDD<String,String>(JavaPairRDD<String,String>) is undefined

We are using Spark version 1.4.1:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.4.1</version>
</dependency>

We would appreciate any help on this.

> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
>                 Key: SPARK-2365
>                 URL: https://issues.apache.org/jira/browse/SPARK-2365
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, Spark Core
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>         Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.
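
The three implementation steps in the description above (hash-partition by 
key, keep a hash index in each partition, and return a new version on every 
change) can be illustrated with a minimal plain-Java sketch. This is not Spark 
code and not the IndexedRDD API: the class IndexedStoreSketch and all of its 
methods are hypothetical names chosen for illustration. It also assumes a 
simple copy-on-write HashMap per partition, which copies the whole touched 
partition on each update and is therefore much less efficient than the purely 
functional structures the proposal has in mind.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, Spark-free illustration; not the IndexedRDD implementation. */
final class IndexedStoreSketch<V> {
    // (2) One hash index per partition; each map is never mutated after creation.
    private final List<Map<Long, V>> partitions;

    private IndexedStoreSketch(List<Map<Long, V>> partitions) {
        this.partitions = partitions;
    }

    static <V> IndexedStoreSketch<V> empty(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<Map<Long, V>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(Collections.<Long, V>emptyMap());
        }
        return new IndexedStoreSketch<>(parts);
    }

    // (1) Hash-partition by key: every key maps to exactly one partition.
    private int partitionOf(long key) {
        return Math.floorMod(Long.hashCode(key), partitions.size());
    }

    /** Point lookup: touches one partition's index, no scan over all entries. */
    V get(long key) {
        return partitions.get(partitionOf(key)).get(key);
    }

    /** (3) Point update: returns a new version and leaves this one unchanged. */
    IndexedStoreSketch<V> put(long key, V value) {
        int p = partitionOf(key);
        Map<Long, V> updated = new HashMap<>(partitions.get(p));
        updated.put(key, value); // the map enforces key uniqueness
        return withPartition(p, updated);
    }

    /** Point deletion: also returns a new version. */
    IndexedStoreSketch<V> delete(long key) {
        int p = partitionOf(key);
        Map<Long, V> updated = new HashMap<>(partitions.get(p));
        updated.remove(key);
        return withPartition(p, updated);
    }

    // Untouched partitions are shared between the old and new versions.
    private IndexedStoreSketch<V> withPartition(int p, Map<Long, V> updated) {
        List<Map<Long, V>> parts = new ArrayList<>(partitions);
        parts.set(p, Collections.unmodifiableMap(updated));
        return new IndexedStoreSketch<>(parts);
    }

    public static void main(String[] args) {
        IndexedStoreSketch<String> v0 = IndexedStoreSketch.empty(4);
        IndexedStoreSketch<String> v1 = v0.put(42L, "a");
        IndexedStoreSketch<String> v2 = v1.put(42L, "b");
        // Each version keeps its own view of key 42.
        System.out.println(v0.get(42L) + " " + v1.get(42L) + " " + v2.get(42L));
    }
}
```

Because every update returns a new object, older versions stay readable after 
a change, which mirrors how RDDs themselves are immutable.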



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
