The `leftOuterJoin` and `join` APIs are super slow in Spark — 100x slower than Hadoop.
On 14-Jul-2015, at 10:59 PM, Wush Wu wush...@gmail.com wrote:
I don't understand.
By the way, the `joinWithCassandraTable` does improve my query time
from 40 mins to 3 mins.
2015-07-15
Dear Sujit,
Thanks for your suggestion.
After testing, `joinWithCassandraTable` does the trick, just as you
mentioned.
rdd2 only queries the rows whose keys appear in rdd1.
Best,
Wush
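The behavior Wush describes — rdd2 fetching only the rows whose keys appear in rdd1, rather than scanning the whole table — can be sketched with plain Scala collections. This is only an in-memory model of what `joinWithCassandraTable` effectively does; the `Map` stands in for the Cassandra table, and all keys and values here are made up for illustration:

```scala
// Model of joinWithCassandraTable semantics: for each key on the Spark side,
// look up only the matching rows on the table side, instead of scanning all
// ~3 billion rows and shuffling both sides of a full join.
// The Map stands in for the Cassandra table; keys/values are illustrative.
val rdd1Keys = Seq("k1", "k3")
val cassandraTable = Map("k1" -> "row1", "k2" -> "row2", "k3" -> "row3")

// Only "k1" and "k3" are looked up; "k2" is never touched.
val joined = rdd1Keys.flatMap(k => cassandraTable.get(k).map(v => (k, v)))
```

With 33,000 keys against 3 billion rows, doing 33,000 targeted lookups instead of a full scan plus shuffle is exactly where the 40-min-to-3-min improvement comes from.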
2015-07-16 0:00 GMT+08:00 Sujit Pal sujitatgt...@gmail.com:
Hi Wush,
One option may be to try a replicated join. Since your rdd1 is small, read
it into a collection and broadcast it to the workers, then filter your
larger rdd2 against the collection on the workers.
-sujit
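Sujit's replicated-join suggestion can be sketched with plain Scala collections. This is a sketch of the pattern only — in real Spark code the small side would be wrapped in `sc.broadcast(rdd1.collectAsMap())` and the lookup would run inside a transformation on rdd2; the data below is made up for illustration:

```scala
// Replicated (broadcast / map-side) join: collect the small side into a Map,
// ship it to every worker, and look up each record of the large side locally,
// avoiding a shuffle of the large side entirely.
val rdd1 = Seq(("a", 1), ("b", 2))                         // small side (~33,000 rows)
val rdd2 = Seq(("a", 10), ("b", 20), ("c", 30), ("d", 40)) // large side (~3 billion rows)

val small = rdd1.toMap // in Spark: sc.broadcast(rdd1.collectAsMap())
// Keep only large-side records whose key exists on the small side.
val joined = rdd2.flatMap { case (k, v) => small.get(k).map(x => (k, (x, v))) }
```

The design point is that only the 33,000-record side ever moves over the network, once per worker, instead of shuffling 3 billion records by key.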
On Tue, Jul 14, 2015 at 11:33 PM, Deepak Jain deepuj...@gmail.com wrote:
Dear all,
I am trying to join two RDDs, named rdd1 and rdd2.
rdd1 is loaded from a text file with about 33,000 records.
rdd2 is loaded from a Cassandra table which has about 3 billion records.
I tried the following code:
```scala
val rdd1: RDD[(String, XXX)] = sc.textFile(...).map(...)
import com.datastax.spark.connector._
val rdd2 = sc.cassandraTable(...)
val joined = rdd1.leftOuterJoin(rdd2)
```
Dear all,
I have found a post discussing the same thing:
https://groups.google.com/a/lists.datastax.com/forum/#!searchin/spark-connector-user/join/spark-connector-user/q3GotS-n0Wk/g-LPTteCEg0J
The solution is to use `joinWithCassandraTable`, and the documentation
is here:
2015-07-15 13:19 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
I have explored Spark joins for the last few months (you can search my posts)
and it's frustratingly useless.
On Tue, Jul 14, 2015 at 9:35 PM, Wush Wu wush...@gmail.com wrote:
Dear all,
I have found a post discussing the same thing: