Hello Scalding users,

I've got a question about optimizing my flows. The number of reducers per 
step is easy to tune, but there are very few tools for controlling the 
number of mappers per step. I often use map-only steps with expensive 
computation (e.g. with crosses or hashJoins), which is why I need good 
control over my mappers. I know two ways to control the number of mappers, 
and both have disadvantages for me. The first is the split.{minsize, 
maxsize} job arguments, but those affect the whole flow; I can't change 
them per step. The second is shard (which I personally like), but shard 
triggers an extra map-reduce step, and we have software that monitors job 
efficiency and complains if it thinks a job abuses resources. Shard jobs 
that split the data into very small chunks are always a red flag for this 
monitoring software.
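
To make that concrete, here is roughly what I mean (the shard width of 512 
and the source/sink names are made up for illustration):

  // The split-size knobs are Hadoop properties set once for the whole
  // flow, e.g. on the command line, so they cannot vary per step:
  //   hadoop jar my-flow.jar com.twitter.scalding.Tool MyJob --hdfs \
  //     -Dmapreduce.input.fileinputformat.split.maxsize=67108864
  // shard gives per-step control, but costs an extra MR step:
  import com.twitter.scalding._

  class ShardedCrossJob(args: Args) extends Job(args) {
    val big: TypedPipe[String]  = TypedPipe.from(TextLine(args("big")))
    val tiny: TypedPipe[String] = TypedPipe.from(TextLine(args("tiny")))

    big
      .shard(512)  // extra MR step; its 512 outputs feed ~512 mappers below
      .cross(tiny) // the expensive map-only step now runs on ~512 mappers
      .map { case (b, t) => b + "\t" + t }
      .write(TypedTsv[String](args("output")))
  }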

What I very often end up doing to appease this software is attaching my 
expensive map operation to the reduce side of the shard step. For example, 
if my next operation is a cross, which would trigger a new MR job, I load 
the dataset I cross with into memory using .toIterableExecution and replace 
the "cross" call with a "map" call. I don't like using this pattern just to 
make the tracking software happy.
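
Concretely, the workaround looks something like this (names are again made 
up, and strictly speaking it is a flatMap when the in-memory side has more 
than one row):

  import com.twitter.scalding._

  object CrossTrick {
    // Instead of big.shard(512).cross(tiny), which starts a new map-only
    // step, pull the small side into memory and do the cross as a map:
    def crossInsideShard(big: TypedPipe[String],
                         tiny: TypedPipe[Int],
                         output: String): Execution[Unit] =
      tiny.toIterableExecution.flatMap { smallSide =>
        val inMemory = smallSide.toList // small dataset, held in memory
        big
          .shard(512)                                 // the only MR step
          .flatMap { b => inMemory.map(t => (b, t)) } // "cross" as a map,
                                                      // fused into the
                                                      // shard's reduce
          .writeExecution(TypedTsv[(String, Int)](output))
      }
  }

Everything stays in one MR job, so the monitor is quiet, but of course this 
only works while the small side fits in memory.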

Are there any better alternative patterns that I might be overlooking?

Thanks,
Kostya
