Re: Join with large data set

2014-10-17 Thread Sonal Goyal
Hi Ankur,

If your RDDs share common keys, you can partition both datasets with a
custom partitioner based on those keys, so the join avoids a shuffle and
performs better.
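
To see why this helps, here is a minimal plain-Python sketch of the idea (not actual Spark code; the names `hash_partitioner`, `partition_by_key`, and `local_join` are made up for illustration). When both datasets are bucketed by the same hash of the key, records with the same key always land in the same partition, so each partition pair can be joined locally with no cross-partition traffic:

```python
# Toy model of co-partitioned join: both sides use the same partitioner,
# so matching keys end up in the same partition index.

NUM_PARTITIONS = 4  # hypothetical partition count

def hash_partitioner(key, num_partitions=NUM_PARTITIONS):
    """Mimics the idea of a hash partitioner: key -> partition index."""
    return hash(key) % num_partitions

def partition_by_key(pairs, num_partitions=NUM_PARTITIONS):
    """Bucket (key, value) pairs into partitions by key hash."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash_partitioner(key, num_partitions)].append((key, value))
    return parts

def local_join(left_part, right_part):
    """Join one partition pair entirely in memory."""
    right_index = {}
    for key, value in right_part:
        right_index.setdefault(key, []).append(value)
    return [(k, (lv, rv))
            for k, lv in left_part
            for rv in right_index.get(k, [])]

# Example: application data and reference data, both keyed by zip code.
app_data = [("94107", "alice"), ("10001", "bob"), ("94107", "carol")]
ref_data = [("94107", "San Francisco"), ("10001", "New York")]

app_parts = partition_by_key(app_data)
ref_parts = partition_by_key(ref_data)

# Because both sides share the partitioner, a partition-by-partition
# join finds every match without moving any data.
joined = []
for left, right in zip(app_parts, ref_parts):
    joined.extend(local_join(left, right))
```

In Spark terms, calling `partitionBy` with the same partitioner on both RDDs (and caching them) plays the role of `partition_by_key` above.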

HTH

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





Re: Join with large data set

2014-10-17 Thread Ankur Srivastava
Hi Sonal

Thank you for the response, but since we are joining against reference
data, different partitions of the application data would need to join with
the same reference data, so I am not sure a Spark join is a good fit here.

E.g., our application data has persons with a zip code, and the reference
data has attributes of each zip code (city, state, etc.). Person objects in
different partitions of the Spark cluster may refer to the same zip, so if I
partition our application data by zip there will be a lot of shuffling, and
then later our application code would have to repartition by another key,
shuffling the whole application data yet again.

I think it will not be a good idea.
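
The double-shuffle concern above can be made concrete with a plain-Python toy (hypothetical data; `repartition` and `partition_of` are illustrative names, not Spark APIs). Repartitioning by zip for the join, then again by person id for the rest of the pipeline, moves most records across partitions twice:

```python
# Toy illustration of two consecutive repartitions, counting moved records.

NUM_PARTITIONS = 4

def partition_of(key):
    """Hash a key to a partition index."""
    return hash(key) % NUM_PARTITIONS

def repartition(partitions, key_fn):
    """Reassign every record by key_fn; count cross-partition moves."""
    moved = 0
    new_parts = [[] for _ in range(NUM_PARTITIONS)]
    for part_idx, part in enumerate(partitions):
        for rec in part:
            target = partition_of(key_fn(rec))
            if target != part_idx:
                moved += 1
            new_parts[target].append(rec)
    return new_parts, moved

# Records are (person_id, zip); 100 people spread over 4 partitions.
people = [[(i, f"zip{i % 10}") for i in range(p, 100, 4)]
          for p in range(4)]

# Shuffle 1: repartition by zip so the join with reference data is local.
by_zip, moved_first = repartition(people, key_fn=lambda r: r[1])

# Shuffle 2: repartition back by person id for the rest of the pipeline.
by_person, moved_second = repartition(by_zip, key_fn=lambda r: r[0])

print(moved_first + moved_second, "records moved across partitions in total")
```

After the first repartition every record with the same zip sits in one partition (which is what the join needs), but the second repartition undoes that layout, which is the cost being weighed in this thread.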

Thanks
Ankur



Join with large data set

2014-10-16 Thread Ankur Srivastava
Hi,

I have an RDD holding my application data, and it is huge. I want to join it
with reference data that is also too huge to fit in memory, so I do not want
to use a broadcast variable.

What other options do I have to perform such joins?

I am using Cassandra as my data store, so should I just query Cassandra to
get the reference data I need?

Also, when I join two RDDs, will it result in a full RDD scan, or will Spark
hash-partition the two RDDs so that records with the same keys end up on the
same node?

Thanks
Ankur