[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949076#comment-14949076 ]

Glenn Strycker edited comment on SPARK-11004 at 10/8/15 6:13 PM:
-----------------------------------------------------------------

Currently we can do the following from within a Linux wrapper script:

1) run Spark through creating the 2 RDDs,
2) save both to Hive tables "table1" and "table2",
3) end my Spark program,
4) run Hive Beeline on a "select * from table1 join table2 on columnname" and 
insert the result into a new 3rd table,
5) start a new Spark program to continue on from where the first program ended 
in step 3, loading the output of the Hive join into the results RDD (a rough 
sketch of these phases follows below).
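
For illustration, here is a minimal sketch of that wrapper-script workflow 
using the Spark 1.x HiveContext API.  The table and column names follow the 
steps above; the input paths, schema, and parse() helper are assumed purely 
for the example.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical schema for the two RDDs being joined.
case class Record(columnname: String, value: String)

object PhaseOneSaveTables {
  def parse(line: String): Record = {
    val fields = line.split("\t")
    Record(fields(0), fields(1))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("phase-1-build-rdds"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Steps 1-2: build the two RDDs and persist them as Hive tables.
    sc.textFile("hdfs:///data/input1").map(parse).toDF().write.saveAsTable("table1")
    sc.textFile("hdfs:///data/input2").map(parse).toDF().write.saveAsTable("table2")

    // Step 3: end this Spark program.
    sc.stop()

    // Step 4 runs outside Spark, from the wrapper script, e.g.:
    //   beeline -u jdbc:hive2://<host>:10000 -e \
    //     "CREATE TABLE table3 AS SELECT * FROM table1 JOIN table2
    //        ON (table1.columnname = table2.columnname)"

    // Step 5: a second Spark program then reloads the joined result as an RDD:
    //   val resultsRDD = new HiveContext(sc2).table("table3").rdd
  }
}
{code}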

Rather than incurring all of this overhead, it would be pretty cool if we could 
run the map and reduce jobs the way a Hive join would, but on our existing 
Spark RDDs, without needing to involve Hive or wrapper scripts, doing 
everything from within a single Spark program.

I believe the main stability gain comes from Hive performing everything on disk 
instead of in memory.  Since we can already checkpoint RDDs to disk, perhaps 
this ticket request could be accomplished by adding a feature to RDDs that 
would operate on the checkpointed files rather than on the in-memory RDDs.
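
For reference, something like the following sketch approximates this with the 
existing RDD API today (the checkpoint directory, input paths, and partition 
count are assumed).  It pushes the inputs onto disk before the join, but the 
shuffle itself still runs through Spark, so it does not give the disk-only 
MapReduce behaviour being requested here.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskBackedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("disk-backed-join"))
    sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")  // assumed path

    // Key both inputs on the join column (first tab-separated field here).
    val rdd1 = sc.textFile("hdfs:///data/input1").map(l => (l.split("\t")(0), l))
    val rdd2 = sc.textFile("hdfs:///data/input2").map(l => (l.split("\t")(0), l))

    // Keep the inputs on disk rather than in executor memory, and checkpoint
    // them so the join reads materialised files instead of recomputing lineage.
    Seq(rdd1, rdd2).foreach { rdd =>
      rdd.persist(StorageLevel.DISK_ONLY)
      rdd.checkpoint()
      rdd.count()  // force materialisation before joining
    }

    // The shuffle still goes through Spark, so the 2GB shuffle-block limit and
    // memory pressure are only mitigated, not eliminated, by this approach.
    val joined = rdd1.join(rdd2, 2000)  // many partitions to keep blocks small
    joined.saveAsTextFile("hdfs:///data/joined")
  }
}
{code}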

We're just looking for a stability gain and the ability to increase the size of 
the RDDs we can operate on.  Right now our Hive Beeline scripts are 
out-performing Spark for these super-massive table joins.



> MapReduce Hive-like join operations for RDDs
> --------------------------------------------
>
>                 Key: SPARK-11004
>                 URL: https://issues.apache.org/jira/browse/SPARK-11004
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Glenn Strycker
>
> Could a feature be added to Spark that would use disk-only MapReduce 
> operations for the very largest RDD joins?
> MapReduce is able to handle incredibly large table joins in a stable, 
> predictable way with graceful failures and recovery.  I have applications 
> that are able to join 2 tables without error in Hive, but these same tables, 
> when converted into RDDs, are unable to join in Spark (I am using the same 
> cluster, and have played around with all of the memory configurations, 
> persisting to disk, checkpointing, etc., and the RDDs are just too big for 
> Spark on my cluster).
> So, Spark is usually able to handle fairly large RDD joins, but occasionally 
> runs into problems when the tables are just too big (e.g. the notorious 2GB 
> shuffle limit issue, memory problems, etc.).  There are so many parameters to 
> adjust (number of partitions, number of cores, memory per core, etc.) that it 
> is difficult to guarantee stability on a shared cluster (say, running YARN) 
> with other jobs.
> Could a feature be added to Spark that would use disk-only MapReduce commands 
> to do very large joins?
> That is, instead of myRDD1.join(myRDD2), we would have a special operation 
> myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run 
> MapReduce, and then convert the results of the join back into a standard RDD.
> This might add stability for Spark jobs that deal with extremely large data, 
> and enable developers to mix-and-match some Spark and MapReduce operations in 
> the same program, rather than writing Hive scripts and stringing together 
> Spark and MapReduce programs, which incurs extremely large overhead 
> converting RDDs to Hive tables and back again.
> Despite memory-level operations being where most of Spark's speed gains lie, 
> sometimes using disk-only may help with stability!
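
For concreteness, the operation proposed above might have a shape roughly like 
the following.  This is an entirely hypothetical sketch; no such mapReduceJoin 
method exists in Spark, and the body here simply falls back to an ordinary 
Spark join as a placeholder.

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Entirely hypothetical enrichment of pair RDDs illustrating the proposed API.
object MapReduceJoinSyntax {
  implicit class MapReduceJoinOps[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
    /** Proposed behaviour: checkpoint both RDDs to disk, run the join as a
      * disk-only MapReduce job, and expose the result as an ordinary RDD. */
    def mapReduceJoin[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = {
      // Placeholder only: today this can do no better than a normal Spark join.
      self.join(other)
    }
  }
}

// Usage, as described in the ticket:
//   import MapReduceJoinSyntax._
//   val results = myRDD1.mapReduceJoin(myRDD2)
{code}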


