[ https://issues.apache.org/jira/browse/SPARK-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075854#comment-15075854 ]
Sean Owen commented on SPARK-1061:
----------------------------------

Is this still live?

> allow Hadoop RDDs to be read w/ a partitioner
> ---------------------------------------------
>
>                 Key: SPARK-1061
>                 URL: https://issues.apache.org/jira/browse/SPARK-1061
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>
> Using partitioners to get narrow dependencies can save tons of time on a
> shuffle. However, after saving an RDD to HDFS and then reloading it, all
> partitioner information is lost. This means that you can never get a narrow
> dependency when loading data from Hadoop.
> I think we could get around this by:
> 1) having a modified version of HadoopRDD that keeps track of the original
> part files (or maybe just prevents splits altogether ...)
> 2) adding an "assumePartition(partitioner: Partitioner, verify: Boolean)"
> function to RDD. It would create a new RDD with exactly the same data, which
> just pretends that the given partitioner has been applied to it. If
> verify=true, it could add a mapPartitionsWithIndex step to check that each
> record is in the right partition.
> http://apache-spark-user-list.1001560.n3.nabble.com/setting-partitioners-with-hadoop-rdds-td976.html
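For illustration, here is a minimal Scala sketch of what idea 2) could look like. None of this exists in Spark's public API: the class name AssumedPartitionedRDD is hypothetical, and the approach is only sound if the reloaded RDD has exactly the same partitions, in the same order, as the RDD that was originally saved (which is what idea 1) would have to guarantee).

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: wrap an existing pair RDD and report the given
// partitioner without moving any data. Only sound if every input partition
// holds exactly the keys that `part` would assign to it.
class AssumedPartitionedRDD[K: ClassTag, V: ClassTag](
    prev: RDD[(K, V)],
    part: Partitioner,
    verify: Boolean)
  extends RDD[(K, V)](prev) {

  // The whole trick: downstream operations (join, cogroup, reduceByKey with
  // the same partitioner, ...) now see a partitioner and can plan narrow
  // dependencies instead of a shuffle.
  override val partitioner: Option[Partitioner] = Some(part)

  // Reuse the parent's partitions one-to-one; no data is repartitioned.
  override def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] = {
    val iter = prev.iterator(split, context)
    if (verify) {
      // Optional safety net: fail fast if a key sits in the wrong partition,
      // rather than silently computing an incorrect join.
      iter.map { case kv @ (k, _) =>
        require(part.getPartition(k) == split.index,
          s"key $k is in partition ${split.index} but belongs in ${part.getPartition(k)}")
        kv
      }
    } else {
      iter
    }
  }
}
{code}

Joining such an RDD against another RDD that uses the same partitioner would then yield narrow dependencies on both sides, since Spark skips the shuffle when both join inputs expose the same partitioner.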