Hi,
I am looking for a suitable topic for my Master's degree project (something
along the lines of scalability problems and improvements for Spark Streaming),
and it seems that introducing a grouped RDD (for example: instead of storing
"Spark", "Spark", "Spark", store ("Spark", 3)) could:
1. Reduce the memory needed for an RDD (roughly, memory use would scale with
the percentage of unique messages).
2. Improve performance (no need to apply a function several times to the same
message).
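
To make the idea concrete, here is a rough sketch in plain Python rather than
Spark itself. The names to_grouped, map_grouped and expand are hypothetical and
are not part of any existing Spark API; the sketch also assumes the function
being applied is deterministic and has no side effects, otherwise calling it
once per distinct message would change the result.

```python
from collections import Counter


def to_grouped(messages):
    """Collapse duplicates into (message, count) pairs, in first-seen order."""
    return list(Counter(messages).items())


def map_grouped(grouped, fn):
    """Apply fn once per distinct message and carry the count along.

    Counts are re-merged because fn may map different messages
    to the same result.
    """
    merged = Counter()
    for message, count in grouped:
        merged[fn(message)] += count
    return list(merged.items())


def expand(grouped):
    """Recover the flat form; the original interleaving of elements is lost."""
    return [message for message, count in grouped for _ in range(count)]


batch = ["Spark", "Spark", "Spark", "Storm", "Spark"]
grouped = to_grouped(batch)                # 2 pairs instead of 5 elements
lowered = map_grouped(grouped, str.lower)  # str.lower is called twice, not 5 times
```

Note that the grouped form does not keep the order of elements within a batch,
so it would only suit operations that do not depend on that order.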
Can I create a ticket and introduce an API for grouped RDDs? Does this make sense?
I would also very much appreciate any criticism and ideas.