Hi Ankur,
If your RDDs share common keys, you can partition both datasets by key
using the same custom partitioner, so that the join itself avoids a
shuffle and performs better.
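To illustrate the idea, here is a minimal plain-Python sketch (not Spark
code) of a co-partitioned join. The names partition_of, partition_by_key
and join_copartitioned, the partition count, and the sample zip-code data
are all made up for illustration. In Spark itself, the rough equivalent
would be calling partitionBy with the same partitioner on both pair RDDs
before the join.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Deterministic hash, so both datasets agree on which partition owns a key.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_by_key(pairs, num_partitions=NUM_PARTITIONS):
    # Split (key, value) pairs into buckets using the shared partitioner.
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def join_copartitioned(left_parts, right_parts):
    # Partition i on the left only needs partition i on the right,
    # so no records have to move between partitions at join time.
    joined = []
    for left, right in zip(left_parts, right_parts):
        lookup = defaultdict(list)
        for key, value in right:
            lookup[key].append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                joined.append((key, (value, other)))
    return joined


# Made-up sample data: application records and reference records keyed by zip code.
app_data = [("94105", "alice"), ("10001", "bob"), ("94105", "carol")]
ref_data = [("94105", "San Francisco"), ("10001", "New York"), ("60601", "Chicago")]

result = join_copartitioned(partition_by_key(app_data), partition_by_key(ref_data))
```

Because both sides use the same partitioner, a given zip code always lands
in the same partition index on both sides, and each partition can be
joined locally.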
HTH
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
Hi Sonal,
Thank you for the response, but since we are joining against reference
data, different partitions of the application data would need to join with
the same reference data, so I am not sure a Spark join would be a good fit
for this.
E.g. our application data has a person with a zip code and then the
Hi,
I have an RDD which is my application data and is huge. I want to join this
with reference data which is also too large to fit in memory, and thus I do
not want to use a broadcast variable.
What other options do I have to perform such joins?
I am using Cassandra as my data store, so should I just