There really isn't a great way to do this. You have found the tools we usually recommend; the underlying issue is that Hadoop itself is not designed to be tuned this way.
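For what it's worth, if the main problem with split.{minsize, maxsize} is that it applies to the whole flow, it can sometimes be scoped to a single Execution instead. A rough sketch only: the property names are the standard Hadoop 2 ones, the pipe and output names are made up, and I'm assuming Execution.withConfig and Config's + for raw key/value overrides work the way I remember:

  import com.twitter.scalding.{ Config, Execution, TypedPipe, TypedTsv }

  object SplitSizeSketch {
    // Stand-in for an expensive per-record computation.
    def expensiveFn(s: String): String = s.reverse

    // Hypothetical map-only step that we want more mappers for.
    def expensiveStep(input: TypedPipe[String]): Execution[Unit] =
      input.map(expensiveFn).writeExecution(TypedTsv[String]("hypothetical/output"))

    // Shrink the max split size (in bytes) for just this Execution so Hadoop
    // schedules more mappers, without touching the rest of the flow.
    // Older Hadoop versions use mapred.max.split.size instead.
    def tuned(input: TypedPipe[String]): Execution[Unit] =
      Execution.withConfig(expensiveStep(input)) { conf =>
        conf + ("mapreduce.input.fileinputformat.split.maxsize" -> (64L << 20).toString)
      }
  }

It is still fairly coarse, since it covers everything that Execution runs, but it does not leak into the other jobs in the flow.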
It would be nice to be able to say: use exactly N mappers for this job. That is not always possible, because the input formats themselves have a say in how the data is partitioned (actually, as far as I know, they have complete control over it). It also goes a *bit* against the idea of a mapper, which is meant to be the trivially parallelizable portion of your code; as such, you basically want as many as possible to minimize latency. Because of the fixed startup cost per mapper, there is some optimal number of mappers that minimizes total cost, if you knew the trade-off between startup cost and job cost. Anyway, we don't have anything great for this right now.

On Sat, Aug 27, 2016 at 10:30 Kostya Salomatin <[email protected]> wrote:

> Hello scalding users,
>
> I've got a question about optimizing my flows. One can easily tune the
> number of reducers per step, but there are very few tools to control the
> number of mappers per step. I often have map-only steps with expensive
> computation (e.g. crosses or hashJoins), which is why I need good control
> over my mappers. I know two ways to control the number of mappers, and
> both have disadvantages for me. The first is via the split.{minsize,
> maxsize} job arguments, but that affects the whole flow; I can't change
> it per job. The second is via shard (which I personally like), but shard
> triggers an extra map-reduce step, and we have software that monitors job
> efficiency and complains if it thinks a job is abusing resources. Shard
> jobs that split the data into very small chunks are always a red flag for
> this monitoring software.
>
> What I very often end up doing to trick this software is to attach my
> expensive map operation to the reduce step of the shard. For example, if
> my next operation is a cross, which triggers a new MR job, I load the
> dataset I cross with into memory using .toIterableExecution and replace
> the "cross" call with a "map" call. I don't like using this pattern just
> to make the tracking software happy.
>
> Are there any better alternative patterns that I may be overlooking?
>
> Thanks,
> Kostya
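To spell out the pattern described above for anyone who finds this thread later, here is a rough sketch with made-up pipe and output names; it replaces a cross (which would start a separate MR step) with an in-memory flatMap via .toIterableExecution:

  import com.twitter.scalding.{ Execution, TypedPipe, TypedTsv }

  object CrossViaToIterableSketch {
    // Instead of big.cross(small), load the small side into memory and do
    // the "cross" inside the map phase of the step that already reads `big`.
    def crossInMap(big: TypedPipe[String], small: TypedPipe[Int]): Execution[Unit] =
      small.toIterableExecution.flatMap { smallSide =>
        val inMemory = smallSide.toList // must fit in the mappers' memory
        big
          .flatMap { b => inMemory.map(s => (b, s)) }
          .writeExecution(TypedTsv[(String, Int)]("hypothetical/output"))
      }
  }

As the description above implies, this only works when the small side genuinely fits in memory, and it ties the expensive map-side work to whatever step already reads `big`.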
