Hey,

I'm reimplementing a few Spark batch jobs as Akka Streams.

I got stuck on the last one. It takes two PairRDD[Key, Value] inputs, cogroups 
them by Key (yielding an RDD[(Key, Seq[Value])]), and then processes the 
Seq[Value] for each Key that is present in both original PairRDDs, which is a 
rather "batchy" operation. Moreover, the Key cardinality is high: roughly 50% 
of the keys are unique.
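
For context, the Spark side is essentially this (toy data and names, just a 
sketch to show the shape of the job, not the real code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CogroupSketch extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("cogroup-sketch").setMaster("local[*]"))

  // Toy stand-ins for the two real PairRDD[Key, Value] inputs.
  val left  = sc.parallelize(Seq("a" -> 1, "b" -> 2))
  val right = sc.parallelize(Seq("a" -> 3, "c" -> 4))

  // cogroup yields RDD[(Key, (Iterable[V], Iterable[W]))]; keep only keys
  // that occur in both inputs, then flatten the two sides into one Seq.
  val grouped = left.cogroup(right)
    .filter { case (_, (ls, rs)) => ls.nonEmpty && rs.nonEmpty }
    .mapValues { case (ls, rs) => (ls ++ rs).toSeq }

  // The per-key "batchy" processing needs the whole Seq[Value] at once.
  grouped.collect().foreach { case (key, values) => println(s"$key -> $values") }
}
```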

So if I merged those two Sources and used groupBy, it would create as many 
SubFlows as there are distinct Keys, which could be up to 5 million.
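
What I had in mind on the Akka Streams side is roughly this sketch (toy 
inputs; the Either tagging to emulate cogroup's "present in both" check and 
the parallelism of 64 are just placeholders):

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

object GroupBySketch extends App {
  implicit val system: ActorSystem = ActorSystem("groupby-sketch")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  // Toy stand-ins for the two real sources, tagged with Either so we can
  // later keep only the keys that occur on both sides (like cogroup does).
  val left: Source[(String, Either[Int, Int]), NotUsed] =
    Source(List("a" -> 1, "b" -> 2)).map { case (k, v) => (k, Left(v)) }
  val right: Source[(String, Either[Int, Int]), NotUsed] =
    Source(List("a" -> 3, "c" -> 4)).map { case (k, v) => (k, Right(v)) }

  left.merge(right)
    .groupBy(maxSubstreams = 5000000, _._1)  // one substream per distinct key
    // Collect every value of a key into a single Seq; this is where I can't
    // use reduce-like ops, since I need the whole Seq when the stream ends.
    .fold(("", Seq.empty[Either[Int, Int]])) { case ((_, acc), (k, v)) => (k, acc :+ v) }
    .mergeSubstreamsWithParallelism(64)      // question 3: what should this be?
    .collect {
      // Keep only keys present in both inputs, as the Spark cogroup job did.
      case (k, vs) if vs.exists(_.isLeft) && vs.exists(_.isRight) =>
        (k, vs.map(_.fold(identity, identity)))
    }
    .runWith(Sink.foreach { case (k, values) => println(s"$k -> $values") })
}
```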

So my questions are:

1) Is there another way to do this? Note that I cannot use reduce-like ops; 
I need the Seq[Value] physically present when the stream ends.
2) If not, is it OK to have ~5M tiny SubFlows?
3) What should the parallelism be for this kind of groupBy operation in 
mergeSubstreamsWithParallelism?
