On Fri, Jun 11, 2010 at 7:33 PM, Jeff Eastman <[email protected]> wrote:
> complete enough for 'neophyte users' and 'regular users' and that only
> 'power users' should be using the -D abstractions (and with that accepting
> any idiosyncrasies that may result since we cannot guarantee how they may
> interact).
That's a reasonable rule. All you really need to specify is input and output, and Hadoop's defaults should work reasonably from there. So I view this as an argument to create --input and --output, and that's done.

> Since the degree of parallelism obtained is often a function of the number
> of mappers/reducers specified, and since the degree of parallelism is
> something our 'regular users' would reasonably need to control, perhaps
> replacing the --numReducers options with --desiredParallelism (or something)
> and having reasonable defaults on that for our neophytes would be better.
> Then the implementation could take the user's desires into account and
> internally manage the numbers of map and reduce tasks where it makes sense
> to do so.

On this flag in particular: it's an appealing idea, but how do the details work? For example, the recommender jobs run at least four MapReduce phases, each of which has a fairly different best parallelism setting. The big, last phase should be parallelized as much as possible; the early phases would just be slowed down by using too many mappers. What would the neophyte user do with this flag? Presumably the neophyte just wants it set to "optimal" or "as much as is reasonable", and that's basically what Hadoop is already doing, better than the user can determine. Encouraging the non-power-user to set the number of mappers and reducers also invites them to hurt performance.

Do we have evidence the other way, that users regularly need to control this to achieve the best performance? I personally never set it and let Hadoop base it on the file splits and blocks and such, which is a pretty good heuristic.
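
To make that concrete, here is roughly what "just specify input and output and leave parallelism to Hadoop" looks like at the Job API level. This is a minimal sketch, not actual Mahout driver code; the class name, the identity mapper/reducer, and the bare-bones argument handling are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalDriver {

  public static void main(String[] args) throws Exception {
    // In a real driver these would come from --input / --output;
    // everything else is left to Hadoop's defaults.
    String input = args[0];
    String output = args[1];

    Configuration conf = new Configuration();
    Job job = new Job(conf, "minimal-example");
    job.setJarByClass(MinimalDriver.class);

    // Identity mapper and reducer, only to keep the sketch self-contained.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // The number of map tasks is derived by Hadoop from the input splits
    // (file sizes and block boundaries), not set here. Reduce tasks come
    // from the job/cluster configuration (e.g. -D mapred.reduce.tasks=N);
    // a power user could call job.setNumReduceTasks(n), but the point is
    // that regular users shouldn't have to.
    FileInputFormat.addInputPath(job, new Path(input));
    FileOutputFormat.setOutputPath(job, new Path(output));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

That is the whole "neophyte" surface: input path, output path, done. Anything beyond that (split sizes, reduce counts) is exactly the territory where the -D escape hatch already serves the power user.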
