subject:"Join with large data set"

Re: Join with large data set

2014-10-17 Thread Sonal Goyal

Hi Ankur, if your RDDs have common keys, you can look at partitioning both of your datasets using a custom partitioner based on those keys, so that you can avoid shuffling and optimize join performance. HTH. Best Regards, Sonal, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal
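The co-partitioning idea suggested above can be illustrated with a minimal pure-Python sketch. This is not Spark code and not necessarily what Sonal had in mind: `partition_of`, `partition_by_key`, `copartitioned_join`, the partition count, and the sample records are all hypothetical names and values made up for illustration. The point it shows is that when both datasets are split by the same key-to-partition function, partition i of one side only ever needs to meet partition i of the other. In Spark this would roughly correspond to calling `partitionBy` with the same partitioner on both pair RDDs before `join`.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count, chosen only for illustration


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Deterministic partitioner: the same key always maps to the same partition."""
    # Summing the UTF-8 bytes is stable across runs (built-in hash() of str is salted).
    return sum(str(key).encode("utf-8")) % num_partitions


def partition_by_key(pairs, num_partitions=NUM_PARTITIONS):
    """Split (key, value) pairs into partitions using partition_of."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, num_partitions)][key].append(value)
    return partitions


def copartitioned_join(left_parts, right_parts):
    """Inner join in which partition i of one side only meets partition i of the other."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        for key, left_values in left.items():
            for left_value in left_values:
                for right_value in right.get(key, []):
                    joined.append((key, (left_value, right_value)))
    return joined


# Hypothetical sample data, not taken from the thread.
application = [("k1", "app-a"), ("k2", "app-b"), ("k1", "app-c"), ("k9", "app-d")]
reference = [("k1", "ref-x"), ("k2", "ref-y"), ("k3", "ref-z")]

result = sorted(
    copartitioned_join(partition_by_key(application), partition_by_key(reference))
)
print(result)
```

Because matching keys always land in the same partition index on both sides, no record has to be compared against a different partition, which is the data movement a shuffle would otherwise perform.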

Re: Join with large data set

2014-10-17 Thread Ankur Srivastava

Hi Sonal, thank you for the response, but since we are joining to reference data, different partitions of the application data would need to join with the same reference data, and thus I am not sure if a Spark join would be a good fit for this. E.g., our application data has a person with a zip code and then the

Join with large data set

2014-10-16 Thread Ankur Srivastava

Hi, I have an RDD which is my application data and is huge. I want to join this with reference data which is also too huge to fit in memory, and thus I do not want to use a broadcast variable. What other options do I have to perform such joins? I am using Cassandra as my data store, so should I just