Hello devs,

I have found myself in a situation where Spark does sub-optimal 
computations on my RDDs, and I was wondering whether a patch to improve 
performance for this scenario would be a welcome addition to Spark or not.

The scenario happens when trying to cogroup two RDDs that are sorted by key and 
share the same partitioner. CoGroupedRDD will correctly detect that the RDDs 
have the same partitioner and will therefore create narrow cogroup split 
dependencies, as opposed to shuffle dependencies. This is great because it 
prevents any shuffling from happening. However, the cogroup is unable to detect 
that the RDDs are sorted in the same way, and will still insert all elements of 
both RDDs into a map in order to join the elements with the same key.

When both RDDs are sorted using the same order, the cogroup can just join by 
doing a single pass over the data (since the data is ordered by key, you can 
just keep iterating until you find a different key). This would greatly reduce 
the memory requirements for this kind of operation.
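
To illustrate the single pass, here is a minimal sketch in plain Scala with no 
Spark dependency. The names SortedCoGroup and cogroupSorted are made up for this 
example and are not part of Spark; the sketch assumes both iterators are 
already sorted by key under the same Ordering.

```scala
import scala.collection.mutable.ArrayBuffer

object SortedCoGroup {
  // Hypothetical helper: merge-style cogroup of two iterators that are
  // both sorted by key under the same `ord`.
  def cogroupSorted[K, V, W](
      left: Iterator[(K, V)],
      right: Iterator[(K, W)])(implicit ord: Ordering[K]): Iterator[(K, (Seq[V], Seq[W]))] =
    new Iterator[(K, (Seq[V], Seq[W]))] {
      private val l = left.buffered
      private val r = right.buffered

      def hasNext: Boolean = l.hasNext || r.hasNext

      def next(): (K, (Seq[V], Seq[W])) = {
        if (!hasNext) throw new NoSuchElementException("next on empty iterator")
        // The next key to emit is the smaller of the two head keys.
        val key =
          if (!r.hasNext) l.head._1
          else if (!l.hasNext) r.head._1
          else ord.min(l.head._1, r.head._1)
        val vs = ArrayBuffer.empty[V]
        val ws = ArrayBuffer.empty[W]
        // Keep iterating on each side until a different key shows up.
        while (l.hasNext && ord.equiv(l.head._1, key)) vs += l.next()._2
        while (r.hasNext && ord.equiv(r.head._1, key)) ws += r.next()._2
        (key, (vs.toSeq, ws.toSeq))
      }
    }
}
```

Only the values of the key currently being emitted are buffered, so memory use 
is bounded by the largest group rather than by the size of the partition.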

Adding this to Spark would require adding an “ordering” member to RDD of type 
Option[Ordering], similar to how the “partitioner” field works. That way, the 
sorting operations could populate this field, and the operations that could 
benefit from this knowledge (cogroup, join, groupByKey, etc.) could read it and 
change their behavior accordingly.
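
As a rough sketch of the idea, using a toy stand-in rather than Spark's actual 
RDD class: KeyedData, keyOrdering, mapRows and cogroupStrategy are all 
hypothetical names introduced only for this example.

```scala
object OrderingFieldSketch {
  // Hypothetical toy stand-in for a keyed RDD; not Spark's RDD class.
  final case class KeyedData[K, V](
      rows: Seq[(K, V)],
      keyOrdering: Option[Ordering[K]] = None) {

    // A sort populates the field, much as partitioning populates "partitioner".
    def sortByKey(implicit ord: Ordering[K]): KeyedData[K, V] =
      KeyedData(rows.sortBy(_._1), Some(ord))

    // Assumption: an operation that may rewrite keys has to reset the field.
    def mapRows[K2, V2](f: ((K, V)) => (K2, V2)): KeyedData[K2, V2] =
      KeyedData(rows.map(f), None)
  }

  sealed trait Strategy
  case object MergePass extends Strategy    // single pass over sorted data
  case object HashMapBased extends Strategy // existing map-based behavior

  // Consumers read the field and only take the fast path when both sides
  // report the same ordering.
  def cogroupStrategy[K](a: KeyedData[K, _], b: KeyedData[K, _]): Strategy =
    (a.keyOrdering, b.keyOrdering) match {
      case (Some(x), Some(y)) if x == y => MergePass
      case _                            => HashMapBased
    }
}
```

Comparing orderings with == is deliberately conservative here: two orderings 
that behave identically but are different instances would fall back to the 
map-based path.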

Do you think this would be a good addition to Spark?



