Hello Scalding users,
I've got a question about optimizing my flows. Tuning the number of reducers
per step is easy, but there are very few tools for controlling the number of
mappers per step. I often have map-only steps that do expensive computation
(e.g. crosses or hashJoins), which is why I need good control over my
mappers. I know two ways to control the number of mappers, and both have
drawbacks for me. The first is the split.{minsize, maxsize} job arguments,
but those affect the whole flow; I can't change them per step. The second is
shard (which I personally like), but shard triggers an extra map-reduce step,
and we have software that monitors job efficiency and complains when it
thinks a job is wasting resources. Shard jobs that split the data into very
small chunks are always a red flag for this monitoring software.
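
To make the trade-off concrete, here is a minimal sketch of both knobs in one
Scalding job. The property name is the standard Hadoop 2 split-size setting;
the job name, paths, and numeric values are made up for illustration:

import com.twitter.scalding._

class MapperControlJob(args: Args) extends Job(args) {

  // Knob 1: a smaller max split size means more input splits and hence
  // more mappers, but it applies to every step of this flow.
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map(
      "mapreduce.input.fileinputformat.split.maxsize" ->
        (32L * 1024 * 1024).toString)

  // Knob 2: shard(n) forces a shuffle that writes ~n parts, so the next
  // step (here the cross, which starts a new MR job) gets ~n mappers.
  // The price is the extra identity map-reduce step that the monitoring
  // software flags.
  TypedPipe.from(TextLine(args("events")))
    .shard(256)
    .cross(TypedPipe.from(TextLine(args("dims"))))
    .write(TypedTsv[(String, String)](args("output")))
}
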
What I very often end up doing to appease this software is attaching my
expensive map operation to the reduce step of the shard. For example, if my
next operation is a cross, which would trigger a new MR job, I instead load
the dataset I cross with into memory using .toIterableExecution and replace
the "cross" call with a "map" call. I don't like using this pattern just to
keep the tracking software happy.
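
For reference, this is roughly what that pattern looks like with the
Execution API. It is only a sketch under the assumptions above (the small
side fits in memory; names, paths, and the shard count are hypothetical):

import com.twitter.scalding._

object CrossAsMapApp extends ExecutionApp {

  def job: Execution[Unit] =
    // Materialize the small side on the submitter.
    TypedPipe.from(TypedTsv[String]("/dims"))
      .toIterableExecution
      .flatMap { dims =>
        val dimList = dims.toList // assumed small enough to ship in the closure

        TypedPipe.from(TypedTsv[String]("/events"))
          .shard(512)
          // Inline the work that .cross(dimPipe) would have done, so the
          // planner attaches it to the reduce side of the shard step
          // instead of starting a separate job.
          .flatMap { event => dimList.map(dim => (event, dim)) }
          .writeExecution(TypedTsv[(String, String)]("/out"))
      }
}

The output is the same as shard followed by cross, but the expensive work
runs inside the shard step itself, so the monitoring tool no longer sees a
shard-only job that just reshuffles tiny files.
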
Are there any better alternative patterns that I might be overlooking?
Thanks,
Kostya