[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272048#comment-15272048 ]
Jyoti Misra commented on SPARK-2365:
------------------------------------

We have migrated our application to Spark, and all of our use cases work very well except updating RDDs. Ankur's IndexedRDD is a ray of hope for improving the performance of this use case as well, but we have not been able to use it from Spark on Java, and the examples cited on websites are all in Scala.

When we try to convert a Java RDD to an IndexedRDD (https://github.com/amplab/spark-indexedrdd), we get a ClassCastException. Is there any way to do this conversion?

Below is the code snippet:

    JavaPairRDD<String, String> mappedRDD = lines.flatMapToPair(
        new PairFlatMapFunction<String, String, String>() {
            @Override
            public Iterable<Tuple2<String, String>> call(String arg0) throws Exception {
                String[] arr = arg0.split(" ", 2);
                System.out.println("length" + arr.length);
                List<Tuple2<String, String>> results = new ArrayList<Tuple2<String, String>>();
                results.addAll(results);
                return results;
            }
        });

    IndexedRDD<String,String> test = (IndexedRDD<String,String>) mappedRDD.collectAsMap();

The cast above throws a ClassCastException.

We also tried the code below:

    IndexedRDD<String,String> test = new IndexedRDD<String,String>(mappedRDD.rdd());

That line gives a compile-time error:

    The constructor IndexedRDD<String,String>(JavaPairRDD<String,String>) is undefined

We are using Spark version 1.4.1:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.4.1</version>
    </dependency>

We would appreciate any help on this.

> Add IndexedRDD, an efficient updatable key-value store
> ------------------------------------------------------
>
>                 Key: SPARK-2365
>                 URL: https://issues.apache.org/jira/browse/SPARK-2365
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX, Spark Core
>            Reporter: Ankur Dave
>            Assignee: Ankur Dave
>         Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This
> imposes minimal requirements on the storage layer, which only needs to
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient
> support for point lookups would enable serving data out of RDDs, but it
> currently requires iterating over an entire partition to find the desired
> element. Point updates similarly require copying an entire iterator. Joins
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key
> uniqueness and pre-indexing the entries for efficient joins and point
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2)
> maintaining a hash index within each partition, and (3) using purely
> functional (immutable and efficiently updatable) data structures to enable
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a
> limited form of this functionality in VertexRDD. We envision a variety of
> other uses for IndexedRDD, including streaming updates to RDDs, direct
> serving from RDDs, and as an execution strategy for Spark SQL.
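
The three implementation points in the description above (hash-partitioning by key, a hash index per partition, immutable updates) can be illustrated with a small single-JVM sketch in plain Java. This is not the IndexedRDD implementation and uses no Spark or spark-indexedrdd API: the class ToyIndexedStore and all of its methods are hypothetical names made up for illustration. It also simplifies point (3): each update copies the whole affected partition (copy-on-write) and shares the untouched partitions, whereas the proposal calls for purely functional structures that make even that per-partition step cheap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy illustration of the ideas in the proposal.
 * Hypothetical class; not part of Spark or spark-indexedrdd.
 */
final class ToyIndexedStore<V> {
    // One hash index per partition; a map is never mutated once stored here.
    private final List<Map<Long, V>> partitions;

    private ToyIndexedStore(List<Map<Long, V>> partitions) {
        this.partitions = partitions;
    }

    /** Creates an empty store with the given number of partitions. */
    static <V> ToyIndexedStore<V> empty(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<Map<Long, V>> parts = new ArrayList<>(numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            parts.add(Collections.<Long, V>emptyMap());
        }
        return new ToyIndexedStore<>(parts);
    }

    // (1) Hash-partition by key: a given key always lands in the same partition.
    private int partitionOf(long key) {
        return Math.floorMod(Long.hashCode(key), partitions.size());
    }

    // (2) A point lookup consults one partition's hash index; nothing is scanned.
    V get(long key) {
        return partitions.get(partitionOf(key)).get(key);
    }

    // (3) A point update returns a new store and leaves this one unchanged.
    // Simplification: the affected partition is copied in full.
    ToyIndexedStore<V> put(long key, V value) {
        int p = partitionOf(key);
        Map<Long, V> updated = new HashMap<>(partitions.get(p));
        updated.put(key, value); // map semantics keep keys unique
        return withPartition(p, updated);
    }

    // Deletion follows the same copy-on-write pattern as put.
    ToyIndexedStore<V> delete(long key) {
        int p = partitionOf(key);
        Map<Long, V> updated = new HashMap<>(partitions.get(p));
        updated.remove(key);
        return withPartition(p, updated);
    }

    // Builds a new store that shares every partition except partition p.
    private ToyIndexedStore<V> withPartition(int p, Map<Long, V> updated) {
        List<Map<Long, V>> parts = new ArrayList<>(partitions);
        parts.set(p, Collections.unmodifiableMap(updated));
        return new ToyIndexedStore<>(parts);
    }

    public static void main(String[] args) {
        ToyIndexedStore<String> v1 =
            ToyIndexedStore.<String>empty(4).put(1L, "a").put(2L, "b");
        ToyIndexedStore<String> v2 = v1.put(1L, "z").delete(2L);
        System.out.println(v1.get(1L) + " " + v1.get(2L)); // prints "a b"
        System.out.println(v2.get(1L) + " " + v2.get(2L)); // prints "z null"
    }
}
```

Because every update produces a new store, the older version (v1 in the demo) stays readable after later updates, which is the property the proposal relies on for immutable RDD semantics.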