You have the keys before and after reduceByKey. You want to do something based on the key "within" reduceByKey? It just calls combineByKey, so you can use that method directly for lower-level control over the merging.
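For example, one way to get that control is to make the combiner type an Option, short-circuiting to None once a key can be ruled out. A minimal local sketch of that idea follows; the threshold and the running-sum criterion are hypothetical, and in real Spark code you'd pass the three functions to pairRdd.combineByKey and then drop the Nones with a flatMap:

```scala
// Sketch only: hypothetical threshold/criterion. In Spark you would pass
// createCombiner, mergeValue, and mergeCombiners to
// pairRdd.combineByKey(...), then flatMapValues(identity) to drop the Nones.

val threshold = 100 // hypothetical: drop a key once its running sum exceeds this

// Combiner type C = Option[Int]: Some(runningSum) while the key is still
// alive, None once it has been filtered out.
def createCombiner(v: Int): Option[Int] =
  if (v <= threshold) Some(v) else None

def mergeValue(c: Option[Int], v: Int): Option[Int] =
  c.flatMap(sum => if (sum + v <= threshold) Some(sum + v) else None)

// If either side is already None, the key stays filtered.
def mergeCombiners(a: Option[Int], b: Option[Int]): Option[Int] =
  for (x <- a; y <- b; if x + y <= threshold) yield x + y

// Local simulation of a map-side combine over one key's values:
def combineLocally(vs: Seq[Int]): Option[Int] =
  vs.tail.foldLeft(createCombiner(vs.head))(mergeValue)

println(combineLocally(Seq(10, 20, 30))) // Some(60): key survives
println(combineLocally(Seq(60, 70)))     // None: only (key, None) is shuffled
```

The point of the None marker is that once the map-side combine rules a key out, only the tiny (key, None) pair crosses the network instead of the full combined value.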
Whether it's possible depends, I suppose, on what you mean to filter on. If it's just a property of the key, can't you filter before reduceByKey? If it's a property of the key's final value, don't you need to wait for the reduction to finish? Or are you saying that you know a key should be filtered based on its value partway through the merge?

In that case, I suppose you can use combineByKey with a mergeValue function that turns an input type A into some Option[B]: it outputs None once your criterion is reached, and your combine function returns None if either argument is None. That doesn't save 100% of the work, but it may mean you only shuffle (key, None) for some keys, if the map-side combine already worked out that the key would be filtered. Afterwards, run a flatMap or similar to turn the surviving Option[B] values back into B.

On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> Hi,
>
> Before I send out the keys for network shuffle, in reduceByKey after map +
> combine are done, I would like to filter the keys based on some threshold...
>
> Is there a way to get the key, value after map+combine stages so that I can
> run a filter on the keys ?
>
> Thanks.
> Deb

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org