I think you *can* tune the min/max split size by using the sourceConfInit
method in Sources and applying the setting there. It may not work, though;
I'm not sure whether this particular setting can be configured per-source.
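
If you wanted to try it, the mechanism would look roughly like the sketch
below. This is untested, the class name is made up, and it assumes
Cascading 2.x's Hfs tap plus the old-style Hadoop split keys -- and, again,
it's not clear Hadoop will actually honor these bounds per-source:

    import cascading.flow.FlowProcess
    import cascading.scheme.hadoop.TextLine
    import cascading.tap.hadoop.Hfs
    import org.apache.hadoop.mapred.JobConf

    // Hypothetical tap that pins the split bounds for this input only.
    class SplitTunedHfs(path: String, minSplit: Long, maxSplit: Long)
        extends Hfs(new TextLine(), path) {
      override def sourceConfInit(process: FlowProcess[JobConf],
                                  conf: JobConf): Unit = {
        // Old-style keys; newer Hadoop also reads
        // mapreduce.input.fileinputformat.split.{minsize,maxsize}.
        conf.setLong("mapred.min.split.size", minSplit)
        conf.setLong("mapred.max.split.size", maxSplit)
        super.sourceConfInit(process, conf)
      }
    }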

It is up to the InputFormat to decide how many mappers to use, in part
because some file formats are not arbitrarily splittable -- there is often a
limit on how finely a file can be split, and some formats (like gzipped
data) can't be split at all. So that's where that comes from.
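
For the splittable formats, the mapper count falls out of Hadoop's split
size computation, which (as a sketch of what FileInputFormat does) is
roughly:

    // Each split is roughly this size, so the number of mappers is
    // about totalInputBytes / splitSize:
    def computeSplitSize(blockSize: Long, minSize: Long, maxSize: Long): Long =
      math.max(minSize, math.min(maxSize, blockSize))

That's why raising the min split size lowers the mapper count, and
lowering the max split size raises it.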

On Sun, Aug 28, 2016 at 12:11 PM, P. Oscar Boykin <[email protected]>
wrote:

> There really isn't a great way to do this. You have found the tools we
> usually recommend. The deeper issue is that Hadoop isn't really designed
> to let you tune this.
>
> It would be nice to be able to say: use exactly N mappers for this job.
> This is not always possible because the input formats themselves have
> something to say about how data is partitioned (well, actually, they have
> complete control over that, as far as I know).
>
> Lastly, it goes a *bit* against the idea of a mapper, which should be the
> trivially parallelizable portion of your code. As such, you basically want
> as many as possible to minimize latency. But mappers also have fixed
> startup costs, so there is some optimal number of mappers that minimizes
> total cost, if you knew the trade-off between startup cost and job cost.
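>
> As a toy model (not from this thread; s and W are numbers you would have
> to measure yourself): if each mapper costs s seconds of fixed startup and
> the job has W seconds of total work, then
>
>     // Latency shrinks as n grows; total machine time grows with n.
>     def latency(n: Int, s: Double, w: Double): Double = s + w / n
>     def totalMachineTime(n: Int, s: Double, w: Double): Double = n * s + w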
>
> Anyway, we don't have anything so great right now.
>
> On Sat, Aug 27, 2016 at 10:30 Kostya Salomatin <[email protected]>
> wrote:
>
>> Hello scalding users,
>>
>> I've got a question about optimizing my flows. One can easily tune the
>> number of reducers per step, but there are very few tools to control the
>> number of mappers per step. I often use map-only steps with expensive
>> computation (e.g. with crosses or hashJoins), which is why I need good
>> control over my mappers. I know two ways to control the number of
>> mappers, and both have disadvantages for me. The first is via the
>> split.{minsize, maxsize} job arguments, but those affect the whole flow;
>> I can't change them per job. The second is via shard (which I personally
>> like -- sketched below), but shard triggers an extra map-reduce step, and
>> we have software that monitors job efficiency and complains if it thinks
>> a job is abusing resources. Shard jobs that split the data into very
>> small chunks are always a red flag for this monitoring software.
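>>
>> For reference, the shard pattern looks like this (the count is whatever
>> parallelism you want for the following map):
>>
>>     import com.twitter.scalding.typed.TypedPipe
>>
>>     // shard forces a shuffle into n parts, so the expensive map that
>>     // follows runs with roughly n tasks:
>>     def reshardForExpensiveMap[A](pipe: TypedPipe[A], n: Int): TypedPipe[A] =
>>       pipe.shard(n)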
>>
>> What I end up doing very often, to trick this software, is to try to
>> attach my expensive map operation to the reduce step of the shard. For
>> example, if my next operation is a cross, which would trigger a new MR
>> job, I load the dataset I cross with into memory using
>> .toIterableExecution and replace the "cross" call with a "map" call
>> (sketched below). I don't like using this pattern just to make the
>> tracking software happy.
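>>
>> In code, the replacement looks roughly like this (the helper name is
>> mine, just for illustration):
>>
>>     import com.twitter.scalding.Execution
>>     import com.twitter.scalding.typed.TypedPipe
>>
>>     // Instead of big.cross(small), which triggers a new MR job, pull
>>     // the small side into memory and do the cross inside a map:
>>     def crossInMapper[A, B](big: TypedPipe[A],
>>                             small: TypedPipe[B]): Execution[TypedPipe[(A, B)]] =
>>       small.toIterableExecution.map { sm =>
>>         val smallList = sm.toList
>>         big.flatMap { a => smallList.map { b => (a, b) } }
>>       }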
>>
>> Are there any better alternative patterns that I might have overlooked?
>>
>> Thanks,
>> Kostya
>>



-- 
Alex Levenson
@THISWILLWORK
