RE: Any Replicated RDD in Spark?

2014-11-06 Thread Shuai Zheng
-Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Wednesday, November 05, 2014 6:27 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: Any Replicated RDD in Spark? If you start with an RDD, you do have to collect to the driver and broadcast to do
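
A minimal sketch of the path quoted above (start with a small RDD, collect it back to the driver, then broadcast it), written for spark-shell where sc is predefined; the HDFS path and two-column record layout are assumptions for illustration, not from the original messages:

    import org.apache.spark.SparkContext._   // pair-RDD implicits, needed on Spark 1.x

    // Small side starts life as an RDD of "key,value" text lines (hypothetical path).
    val smallRdd = sc.textFile("hdfs:///data/small/part-*")
      .map(_.split(",", 2))
      .map(cols => (cols(0), cols(1)))

    // Bring it back to the driver as a hash map (it must fit in driver memory),
    // then broadcast one read-only copy to every executor for local lookups.
    val smallTable: scala.collection.Map[String, String] = smallRdd.collectAsMap()
    val smallTableBc = sc.broadcast(smallTable)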

RE: Any Replicated RDD in Spark?

2014-11-05 Thread Shuai Zheng
(in theory, either way works, but in the real world, which one is better?). Regards, Shuai -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Monday, November 03, 2014 4:15 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: Any Replicated RDD in Spark? You

RE: Any Replicated RDD in Spark?

2014-11-05 Thread Shuai Zheng
-Original Message- From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: Wednesday, November 05, 2014 3:32 PM To: 'Matei Zaharia' Cc: 'user@spark.apache.org' Subject: RE: Any Replicated RDD in Spark? Nice. Then I have another question: if I have a file (or a set of files: part-0, part-1

Re: Any Replicated RDD in Spark?

2014-11-05 Thread Matei Zaharia
RE: Any Replicated RDD in Spark? Nice. Then I have another question: if I have a file (or a set of files: part-0, part-1, which might be a few hundred MB of CSV up to 1-2 GB, created by another program), and need to create a hashtable from it, then broadcast it to each node to allow queries (a map-side join). I
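
One possible sketch of the scenario asked about here: build the hash table on the driver straight from the lookup file, then broadcast it so each task can probe it locally during the map-side join. The file name and key,value column layout are assumptions, and the whole table still has to fit in memory once loaded and broadcast:

    import scala.io.Source

    // Build the lookup table on the driver from a local copy of the CSV.
    // ("lookup.csv" is a made-up name; a file in the hundreds-of-MB range must
    // still fit comfortably in driver and executor memory.)
    val table: Map[String, String] = Source.fromFile("lookup.csv")
      .getLines()
      .map { line =>
        val Array(k, v) = line.split(",", 2)
        k -> v
      }
      .toMap

    // Ship one read-only copy to each executor; tasks then query it locally.
    val tableBc = sc.broadcast(table)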

Any Replicated RDD in Spark?

2014-11-03 Thread Shuai Zheng
Hi All, I have spent the last two years on Hadoop but am new to Spark. I am planning to move one of my existing systems to Spark to get some enhanced features. My question is: if I try to do a map-side join (something similar to the Replicated keyword in Pig), how can I do it? Is there any way to declare a

Re: Any Replicated RDD in Spark?

2014-11-03 Thread Matei Zaharia
You need to use broadcast followed by flatMap or mapPartitions to do map-side joins (in your map function, you can look at the hash table you broadcast and see what records match it). Spark SQL also does it by default for tables smaller than the spark.sql.autoBroadcastJoinThreshold setting (by
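
A minimal sketch of the broadcast-plus-mapPartitions pattern described in this reply, runnable in spark-shell (where sc is predefined); the sample data and key/value types are made up for illustration:

    // Small side: a hash table that fits in memory, shipped once per executor.
    val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c")
    val lookupBc = sc.broadcast(lookup)

    // Large side: a pair RDD that stays partitioned where it is, with no shuffle.
    val large = sc.parallelize(Seq((1, 10.0), (2, 20.0), (4, 40.0)))

    // Map-side join: each partition probes the broadcast table locally and keeps
    // only the records that have a match (an inner join in this sketch).
    val joined = large.mapPartitions { iter =>
      val table = lookupBc.value
      iter.flatMap { case (id, v) => table.get(id).map(label => (id, (v, label))) }
    }

    joined.collect().foreach(println)   // prints (1,(10.0,a)) and (2,(20.0,b))

With Spark SQL the same idea is applied automatically: tables smaller than the spark.sql.autoBroadcastJoinThreshold setting are broadcast to every node instead of being shuffled, so no manual broadcast is needed there.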