Hey, I'm reimplementing a few Spark batch jobs as Akka Streams.
I got stuck on the last one. It takes two PairRdd[Key,Value] and cogroups them by Key, which returns an Rdd[Key,Seq[Value]], and then it processes the Seq[Value] for each of the unique Keys that are present in both original PairRdds, which is a rather "batchy" operation. Moreover, the Key cardinality is high: roughly 50% of the keys are unique. So if I merged those two Sources and used groupBy, it would create as many SubFlows as there are distinct Keys, which could be up to 5 million.

So my questions are:

1) Is there another way to do this? Note that I cannot use reduce-like ops; I need the Seq[Value] physically present when the stream ends.
2) If not, is it OK to have ~5 million tiny SubFlows?
3) What should the parallelism be for this kind of groupBy operation in mergeSubstreamsWithParallelism?

--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/d/optout.
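To make the setup concrete, here is a minimal sketch of the merge + groupBy version I have in mind. The Key/Value types, the sample Sources, and the parallelism value of 64 are placeholders, not my real data; the "keys present in both sides" filter is also omitted for brevity:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

object CogroupSketch extends App {
  implicit val system: ActorSystem = ActorSystem("cogroup-sketch")

  // Placeholder types standing in for the real Key/Value of the Spark job.
  type Key   = Int
  type Value = String

  // Placeholder inputs; in reality these are the two large Sources.
  val left:  Source[(Key, Value), _] = Source(List(1 -> "a", 2 -> "b"))
  val right: Source[(Key, Value), _] = Source(List(1 -> "x", 3 -> "y"))

  // Upper bound on the number of substreams, per the ~5M distinct keys.
  val maxDistinctKeys = 5000000

  left
    .merge(right)
    .groupBy(maxDistinctKeys, _._1)        // one SubFlow per distinct Key
    .fold(Option.empty[Key] -> Seq.empty[Value]) {
      // materialize the full Seq[Value] for this Key (no reduce-like op)
      case ((_, acc), (k, v)) => Some(k) -> (acc :+ v)
    }
    .mergeSubstreamsWithParallelism(64)    // the parallelism value is question 3
    .runWith(Sink.foreach {
      case (Some(k), vs) => println(s"$k -> $vs")
      case _             => // empty substream, nothing to emit
    })
}
```

This is exactly the shape I'm worried about: the fold has to hold every Seq[Value] until its substream completes, and groupBy has to keep one substream alive per key.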
