You have the keys both before and after reduceByKey. Do you want to do
something based on the key "within" reduceByKey? It just calls
combineByKey, so you can use that method for lower-level control over
the merging.

Whether it's possible depends, I suppose, on what you mean to filter on.
If it's just a property of the key, can't you just filter before
reduceByKey? If it's a property of the key's value, don't you need to
wait for the reduction to finish? Or are you saying that you know a
key should be filtered based on its value partway through the merge?
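
To make the first two cases concrete, here is a minimal sketch. The
names keepKey and keepTotal, the "tmp_" prefix, the cutoff of 3 and the
sample data are all made up for illustration; they are not from your
job. On older Spark releases you may also need
import org.apache.spark.SparkContext._ to get reduceByKey on a pair RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterAroundReduce {
  // Hypothetical criteria, purely for illustration.
  def keepKey(key: String): Boolean = !key.startsWith("tmp_")
  def keepTotal(total: Int): Boolean = total >= 3

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("filter-around-reduce").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data.
      val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1), ("tmp_x", 7)))

      val reduced = pairs
        .filter { case (k, _) => keepKey(k) }           // property of the key: filter before the shuffle
        .reduceByKey(_ + _)
        .filter { case (_, total) => keepTotal(total) } // property of the reduced value: filter after

      reduced.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Only the first filter reduces what gets shuffled; the second one can
only run once the per-key totals exist.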

I suppose you can use combineByKey with a combiner type of Option[B]:
createCombiner and mergeValue turn input values of type A into an
Option[B] and output None once your criterion is met, and
mergeCombiners returns None if either argument is None. It doesn't
save 100% of the work, but it may mean you only shuffle (key,None) for
some keys if the map-side combine already worked out that the key
would be filtered.

Afterwards, run a flatMap or something similar to turn the Option[B]
values back into B, dropping the keys whose value is None.
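
A rough sketch of the combineByKey idea, under some assumptions of mine:
values are non-negative Ints, B is a running Long sum, and a key is
dropped as soon as its sum exceeds a made-up Threshold. The object and
method names are hypothetical. It only gives the right answer when the
criterion can never "un-trigger" once reached, which is why I assume
non-negative values here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ThresholdCombine {
  // Hypothetical cutoff: a key is filtered once its running sum exceeds this.
  val Threshold = 10L

  // None means "this key has already been filtered".
  private def check(sum: Long): Option[Long] =
    if (sum > Threshold) None else Some(sum)

  // createCombiner: first value seen for a key within a partition.
  def create(v: Int): Option[Long] = check(v.toLong)

  // mergeValue: fold one more value into the map-side combiner.
  def mergeValue(acc: Option[Long], v: Int): Option[Long] =
    acc.flatMap(sum => check(sum + v))

  // mergeCombiners: None if either side is None, or if the merged sum is too large.
  def mergeCombiners(x: Option[Long], y: Option[Long]): Option[Long] =
    for (a <- x; b <- y; sum <- check(a + b)) yield sum

  def sumUnderThreshold(pairs: RDD[(String, Int)]): RDD[(String, Long)] =
    pairs
      .combineByKey[Option[Long]](create _, mergeValue _, mergeCombiners _)
      // Turn Option[B] back into B, dropping the filtered keys.
      .flatMap { case (k, opt) => opt.map(sum => (k, sum)).toList }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("threshold-combine").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data.
      val pairs = sc.parallelize(Seq(("a", 4), ("a", 3), ("b", 8), ("b", 5)))
      sumUnderThreshold(pairs).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Keys that trip the threshold on the map side still get shuffled, but
only as (key,None) rather than with a full combiner.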

On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> Hi,
>
> Before I send out the keys for network shuffle, in reduceByKey after map +
> combine are done, I would like to filter the keys based on some threshold...
>
> Is there a way to get the key, value after map+combine stages so that I can
> run a filter on the keys ?
>
> Thanks.
> Deb
