[
https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-11004.
-------------------------------
Resolution: Won't Fix
For the moment I'd like to consider this concluded, but as in all things, it
can be reopened to address a specific change.
> MapReduce Hive-like join operations for RDDs
> --------------------------------------------
>
> Key: SPARK-11004
> URL: https://issues.apache.org/jira/browse/SPARK-11004
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce
> operations for the very largest RDD joins?
> MapReduce is able to handle extremely large table joins in a stable,
> predictable way, with graceful failures and recovery. I have applications
> that can join two tables without error in Hive, but the same tables,
> when converted into RDDs, fail to join in Spark. (I am using the same
> cluster, and have experimented with all of the memory configurations,
> disk-only persistence, checkpointing, etc., sketched below; the RDDs are
> simply too big for Spark on my cluster.)
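> For reference, a minimal sketch of the workarounds mentioned above
> (disk-only persistence plus checkpointing before the join), using only the
> existing RDD API; the RDD names and checkpoint path are placeholders:
> {code}
> import org.apache.spark.storage.StorageLevel
>
> // Spill both sides to disk and truncate lineage before the join.
> // myRDD1 and myRDD2 stand in for the two large pair RDDs.
> sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
>
> val left  = myRDD1.persist(StorageLevel.DISK_ONLY)
> val right = myRDD2.persist(StorageLevel.DISK_ONLY)
> left.checkpoint()   // materialize to reliable storage
> right.checkpoint()
>
> // The join itself still goes through Spark's shuffle, which is where
> // the 2GB block limit and memory pressure bite on very large inputs.
> val joined = left.join(right)
> {code}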
> So, Spark is usually able to handle fairly large RDD joins, but it
> occasionally runs into problems when the tables are just too big (e.g. the
> notorious 2GB shuffle block limit, memory problems, etc.). There are so many
> parameters to tune (number of partitions, number of cores, memory per core,
> etc.; see the example below) that it is difficult to guarantee stability on
> a shared cluster (say, running YARN) alongside other jobs.
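> As an illustration of that tuning surface, here are the kinds of knobs
> involved; the values are arbitrary placeholders, not recommendations:
> {code}
> import org.apache.spark.SparkConf
>
> // A configuration like this has to be re-derived for every large join;
> // the values below are arbitrary examples only.
> val conf = new SparkConf()
>   .set("spark.executor.instances", "50")     // executors on YARN
>   .set("spark.executor.cores", "4")          // cores per executor
>   .set("spark.executor.memory", "8g")        // memory per executor
>   .set("spark.default.parallelism", "2000")  // partitions for shuffles/joins
> {code}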
> Could a feature be added to Spark that would use disk-only MapReduce commands
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run
> MapReduce, and then convert the results of the join back into a standard RDD.
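> To make that concrete, here is a hypothetical sketch of what such an
> operation could look like. mapReduceJoin does not exist anywhere in Spark;
> the body below stands in for a real MapReduce sort-merge join by spilling
> both inputs to HDFS object files and reading them back with disk-only
> storage:
> {code}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.storage.StorageLevel
> import scala.reflect.ClassTag
>
> object MapReduceJoinSketch {
>   // Hypothetical API sketch only -- nothing here exists in Spark today.
>   implicit class MapReduceJoinOps[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
>     def mapReduceJoin[W: ClassTag](other: RDD[(K, W)],
>                                    scratchDir: String): RDD[(K, (V, W))] = {
>       val sc = self.sparkContext
>       // 1. Materialize both inputs to reliable storage (the MapReduce handoff).
>       self.saveAsObjectFile(s"$scratchDir/left")
>       other.saveAsObjectFile(s"$scratchDir/right")
>       // 2. A real implementation would launch a disk-based MapReduce
>       //    sort-merge join here; as a stand-in, read the files back and
>       //    join with disk-only persistence.
>       val left  = sc.objectFile[(K, V)](s"$scratchDir/left").persist(StorageLevel.DISK_ONLY)
>       val right = sc.objectFile[(K, W)](s"$scratchDir/right").persist(StorageLevel.DISK_ONLY)
>       // 3. The result comes back as a standard RDD of joined pairs.
>       left.join(right)
>     }
>   }
> }
> {code}
> With something like this in place, the call site would read
> myRDD1.mapReduceJoin(myRDD2, "hdfs:///tmp/join-scratch") instead of
> myRDD1.join(myRDD2).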
> This might add stability for Spark jobs that deal with extremely large
> data, and would let developers mix and match Spark and MapReduce operations
> in the same program, rather than writing Hive scripts and stringing together
> separate Spark and MapReduce programs, which incurs very large overhead
> converting RDDs to Hive tables and back again.
> Although in-memory operations are where most of Spark's speed gains lie,
> sometimes going disk-only may help with stability!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]