Hi Matt,
I strongly believe that we need tests here. Nathan assured me that nothing on the GPU is faster than sort+reduce-by-key, since both are highly optimized. I think they will be hard to beat, and my initial timings support this. I am willing to be wrong, but I am not willing to overengineer based on supposition.
Fair enough. Is a brute-force implementation for P1 elements sufficient as a baseline for discussion?
Best regards, Karli
