I think you *can* tune the min/max split size by using the sourceConfInit method in Sources and applying the setting there. It may not work though -- I'm not sure whether this particular setting can be configured per-source or not.
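For what it's worth, a rough sketch of what that might look like at the Cascading tap level (the class name and wiring are made up, and this assumes the Cascading 2.x Hadoop API; the keys are the Hadoop 2 ones, older clusters use mapred.min.split.size / mapred.max.split.size). You'd still need to plug a tap like this in yourself, e.g. from your Source's createTap:

    import cascading.flow.FlowProcess
    import cascading.scheme.Scheme
    import cascading.tap.hadoop.Hfs
    import org.apache.hadoop.mapred.{ JobConf, OutputCollector, RecordReader }

    // Hypothetical sketch: an Hfs tap whose sourceConfInit pins the split-size
    // knobs, so only input read through this tap is affected.
    class SplitSizedHfs(
      scheme: Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _],
      path: String,
      minSplitBytes: Long,
      maxSplitBytes: Long
    ) extends Hfs(scheme, path) {

      override def sourceConfInit(process: FlowProcess[JobConf], conf: JobConf): Unit = {
        super.sourceConfInit(process, conf)
        // Hadoop 2 keys; older clusters use mapred.min.split.size / mapred.max.split.size
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", minSplitBytes)
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplitBytes)
      }
    }

The same two keys set in the job configuration (or passed with -D on the command line) are the flow-wide version that the split.{minsize, maxsize} arguments mentioned below control.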
It is up to the InputFormat to decide how many mappers, in part because some file types are not actually arbitrarily splittable -- there is often a limit to how finely you can split a file, and some formats (like gzipped data) can't be split at all. So that's where that comes from.

On Sun, Aug 28, 2016 at 12:11 PM, P. Oscar Boykin <[email protected]> wrote:

> There really is not such a great way to do this. You have found the tools
> we usually recommend. This is more an issue of Hadoop not being that easy
> to tune for this.
>
> It would be nice to be able to say: use exactly N mappers for this job.
> This is not always possible because the input formats themselves have
> something to say about how data is partitioned (well, actually, they have
> complete control over that, as far as I know).
>
> Lastly, it goes a *bit* against the idea of a mapper, which should be the
> trivially parallelizable portion of your code. As such, you basically want
> as many as possible to minimize latency. Due to fixed startup costs, there
> is some optimal number of mappers that minimizes total cost, if you knew
> the trade-off between startup cost and job cost.
>
> Anyway, we don't have anything so great right now.
>
> On Sat, Aug 27, 2016 at 10:30, Kostya Salomatin <[email protected]> wrote:
>
>> Hello scalding users,
>>
>> I've got a question about optimizing my flows. One can easily tune the
>> number of reducers per step, but there are very few tools to control the
>> number of mappers per step. I often use map-only steps with expensive
>> computation (e.g. with crosses or hashJoins), which is why I need good
>> control over my mappers. I know two ways to control the number of
>> mappers, and both have disadvantages for me. The first is via the
>> split.{minsize, maxsize} job arguments, but that affects the whole flow;
>> I can't change it per job. The second is via shard (which I personally
>> like), but shard triggers an extra MapReduce step, and we have software
>> that monitors job efficiency and complains if it thinks a job abuses
>> resources. Shard jobs that split the data into very small chunks are
>> always a red flag for this monitoring software.
>>
>> What I very often end up doing, to trick this software, is to try to
>> attach my expensive map operation to the reduce step of the shard. For
>> example, if my next operation is a cross, which would trigger a new MR
>> job, I load the dataset I cross with into memory using
>> .toIterableExecution and replace the "cross" call with a "map" call. I
>> don't like using this pattern just to make the monitoring software happy.
>>
>> Are there any better alternative patterns that I might be overlooking?
>>
>> Thanks,
>> Kostya
>>

--
Alex Levenson
@THISWILLWORK
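A rough sketch of the pattern Kostya describes -- replacing a cross (which would start its own MR step) with a map over an in-memory copy of the small side via toIterableExecution. The names crossAsMap, bigPipe and smallPipe are made up for illustration:

    import com.twitter.scalding.Execution
    import com.twitter.scalding.typed.TypedPipe

    // Sketch: instead of bigPipe.cross(smallPipe), materialize the small side
    // once with toIterableExecution and do the "cross" inside a flatMap, so
    // the expensive work rides along on an existing map phase.
    def crossAsMap[A, B](
      bigPipe: TypedPipe[A],
      smallPipe: TypedPipe[B]
    ): Execution[TypedPipe[(A, B)]] =
      smallPipe.toIterableExecution.map { smallSide =>
        val cached = smallSide.toList // shipped to the map tasks in the closure
        bigPipe.flatMap { a => cached.map { b => (a, b) } }
      }

To actually run it you would flatMap the resulting Execution into a write, e.g. crossAsMap(big, small).flatMap(_.writeExecution(sink)).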
