[
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-11004.
-------------------------------
Resolution: Won't Fix
For the moment I'd like to consider this concluded, but as in all things, it
can be reopened to address a specific change.
> MapReduce Hive-like join operations for RDDs
> --------------------------------------------
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce
> operations for the very largest RDD joins?
> MapReduce is able to handle extremely large table joins in a stable,
> predictable way, with graceful failures and recovery. I have applications
> that can join two tables without error in Hive, but the same tables,
> when converted into RDDs, fail to join in Spark. (I am using the same
> cluster, and have experimented with all of the memory configurations,
> disk-only persistence, checkpointing, etc., sketched below; the RDDs are
> simply too big for Spark on my cluster.)
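> For reference, a minimal sketch of the workarounds mentioned above
> (disk-only persistence plus checkpointing before the join), using only the
> existing RDD API; the RDD names and checkpoint path are placeholders:
> {code}
> import org.apache.spark.storage.StorageLevel
>
> // Spill both sides to disk and truncate lineage before the join.
> // myRDD1 and myRDD2 stand in for the two large pair RDDs.
> sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
>
> val left  = myRDD1.persist(StorageLevel.DISK_ONLY)
> val right = myRDD2.persist(StorageLevel.DISK_ONLY)
> left.checkpoint()   // materialize to reliable storage
> right.checkpoint()
>
> // The join itself still goes through Spark's shuffle, which is where
> // the 2GB block limit and memory pressure bite on very large inputs.
> val joined = left.join(right)
> {code}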
> So, Spark is usually able to handle fairly large RDD joins, but it
> occasionally runs into problems when the tables are just too big (e.g. the
> notorious 2GB shuffle block limit, memory problems, etc.). There are so many
> parameters to tune (number of partitions, number of cores, memory per core,
> etc.; see the example below) that it is difficult to guarantee stability on
> a shared cluster (say, running YARN) alongside other jobs.
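> As an illustration of that tuning surface, here are the kinds of knobs
> involved; the values are arbitrary placeholders, not recommendations:
> {code}
> import org.apache.spark.SparkConf
>
> // A configuration like this has to be re-derived for every large join;
> // the values below are arbitrary examples only.
> val conf = new SparkConf()
>   .set("spark.executor.instances", "50")     // executors on YARN
>   .set("spark.executor.cores", "4")          // cores per executor
>   .set("spark.executor.memory", "8g")        // memory per executor
>   .set("spark.default.parallelism", "2000")  // partitions for shuffles/joins
> {code}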
> Could a feature be added to Spark that would use disk-only MapReduce commands
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run
> MapReduce, and then convert the results of the join back into a standard RDD.
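> To make that concrete, here is a hypothetical sketch of what such an
> operation could look like. mapReduceJoin does not exist anywhere in Spark;
> the body below stands in for a real MapReduce sort-merge join by spilling
> both inputs to HDFS object files and reading them back with disk-only
> storage:
> {code}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.storage.StorageLevel
> import scala.reflect.ClassTag
>
> object MapReduceJoinSketch {
>   // Hypothetical API sketch only -- nothing here exists in Spark today.
>   implicit class MapReduceJoinOps[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
>     def mapReduceJoin[W: ClassTag](other: RDD[(K, W)],
>                                    scratchDir: String): RDD[(K, (V, W))] = {
>       val sc = self.sparkContext
>       // 1. Materialize both inputs to reliable storage (the MapReduce handoff).
>       self.saveAsObjectFile(s"$scratchDir/left")
>       other.saveAsObjectFile(s"$scratchDir/right")
>       // 2. A real implementation would launch a disk-based MapReduce
>       //    sort-merge join here; as a stand-in, read the files back and
>       //    join with disk-only persistence.
>       val left  = sc.objectFile[(K, V)](s"$scratchDir/left").persist(StorageLevel.DISK_ONLY)
>       val right = sc.objectFile[(K, W)](s"$scratchDir/right").persist(StorageLevel.DISK_ONLY)
>       // 3. The result comes back as a standard RDD of joined pairs.
>       left.join(right)
>     }
>   }
> }
> {code}
> With something like this in place, the call site would read
> myRDD1.mapReduceJoin(myRDD2, "hdfs:///tmp/join-scratch") instead of
> myRDD1.join(myRDD2).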
> This might add stability for Spark jobs that deal with extremely large
> data, and would let developers mix and match Spark and MapReduce operations
> in the same program, rather than writing Hive scripts and stringing together
> separate Spark and MapReduce programs, which incurs very large overhead
> converting RDDs to Hive tables and back again.
> Although in-memory operations are where most of Spark's speed gains lie,
> sometimes going disk-only may help with stability!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]