[
https://issues.apache.org/jira/browse/SPARK-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Imran Rashid reassigned SPARK-1061:
-----------------------------------
Assignee: Imran Rashid
> allow Hadoop RDDs to be read w/ a partitioner
> ---------------------------------------------
>
> Key: SPARK-1061
> URL: https://issues.apache.org/jira/browse/SPARK-1061
> Project: Spark
> Issue Type: New Feature
> Reporter: Imran Rashid
> Assignee: Imran Rashid
>
> Using partitioners to get narrow dependencies can save tons of time by
> avoiding a shuffle. However, after saving an RDD to HDFS and then reloading
> it, all partitioner information is lost. This means you can never get a
> narrow dependency when loading data from Hadoop.
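> For concreteness, a small sketch of the problem (pairs and other are assumed
> to be key-value RDDs, sc the SparkContext, and the path is made up):
> {code}
> import org.apache.spark.HashPartitioner
>
> val parted = pairs.partitionBy(new HashPartitioner(8)).cache()
> parted.join(other)    // narrow on parted's side: it is not re-shuffled
>
> parted.saveAsObjectFile("hdfs:///tmp/parted")
> val reloaded = sc.objectFile[(Int, Int)]("hdfs:///tmp/parted")
> reloaded.partitioner  // None -- the partitioner information did not survive
> reloaded.join(other)  // so this shuffles reloaded all over again
> {code}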
> I think we could get around this by:
> 1) having a modified version of HadoopRDD that keeps track of the original
> part file (or maybe just prevents splits altogether ...)
> 2) adding an "assumePartition(partitioner: Partitioner, verify: Boolean)"
> function to RDD. It would create a new RDD with the exact same data, which
> just pretends that the given partitioner has been applied to it. If
> verify = true, it could add a mapPartitionsWithIndex step to check that each
> record is in the right partition. A sketch of this idea follows the link
> below.
> http://apache-spark-user-list.1001560.n3.nabble.com/setting-partitioners-with-hadoop-rdds-td976.html
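> A minimal sketch of what option 2 could look like, written as an RDD
> subclass. Everything here (the class name AssumePartitionRDD, its fields) is
> hypothetical and only illustrates the idea; a real version would also need
> to handle serialization of the parent RDD properly:
> {code}
> import scala.reflect.ClassTag
>
> import org.apache.spark.{Partition, Partitioner, SparkException, TaskContext}
> import org.apache.spark.rdd.RDD
>
> class AssumePartitionRDD[K: ClassTag, V: ClassTag](
>     prev: RDD[(K, V)],
>     part: Partitioner,
>     verify: Boolean)
>   extends RDD[(K, V)](prev) {
>
>   require(part.numPartitions == prev.partitions.length,
>     "the assumed partitioner must match the existing number of partitions")
>
>   // Same data and partitions as the parent; we only *claim* a partitioner.
>   override val partitioner: Option[Partitioner] = Some(part)
>
>   override protected def getPartitions: Array[Partition] = prev.partitions
>
>   override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] = {
>     val iter = prev.iterator(split, context)
>     if (!verify) iter
>     else iter.map { case kv @ (k, _) =>
>       // Optional check: every key must actually belong to this partition.
>       if (part.getPartition(k) != split.index) {
>         throw new SparkException(s"Key $k is in partition ${split.index} " +
>           s"but the assumed partitioner maps it to ${part.getPartition(k)}")
>       }
>       kv
>     }
>   }
> }
> {code}
> Combined with option 1 (loading part files without splitting them), assuming
> the original partitioner on the reloaded RDD would restore the narrow
> dependencies.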