This is definitely a map-increase job. I could try a combiner, but I don't think it would help. My keys are small compared to my values, and the values must be kept separate when they are accumulated in the reducer--they can't be combined into some smaller form, i.e. they are more like bitmaps than word counts. So the only I/O a combiner would save me is the duplication of the (relatively small) keys plus Hadoop's per-pair overhead, which is going to be swamped by the values themselves.
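
For contrast, here is a minimal sketch of the word-count case, where a combiner does pay off because many values collapse into a single partial sum per key. The class and type names are illustrative, not from my actual job:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is just a Reducer run on the map side. Summing is safe here
// because addition is associative and commutative, so partial sums are
// still correct when the real reducer adds them up again.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();  // N pairs for this key shrink to 1 pair
        }
        context.write(key, new IntWritable(sum));
    }
}

It would be wired in with job.setCombinerClass(SumCombiner.class). In my case the loop body would have to re-emit every value unchanged, which is exactly why I don't expect a combiner to buy me much.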
On Thu, Sep 29, 2011 at 4:29 PM, Lance Norskog <[email protected]> wrote:

> When in doubt, go straight to the owner of a fact. The operating system is
> what really knows disk I/O.
>
> "my mapper job--which may write multiple <key, value> pairs for each one it
> receives--is writing too many" - ah, a map-increase job :) This is what
> Combiners are for: to keep explosions of data from hitting the network by
> combining on the mapper machine.
>
> On Thu, Sep 29, 2011 at 4:15 PM, W.P. McNeill <[email protected]> wrote:
>
> > I have a problem where certain Hadoop jobs take prohibitively long to
> > run. My hypothesis is that I am generating more I/O than my cluster can
> > handle and I need to substantiate this. I am looking closely at the
> > Map-Reduce framework counters because I think they contain the
> > information I need, but I don't understand what the various File System
> > Counters are telling me. Is there a pointer to a list of exactly what
> > all these counters mean? (So far my online research has only turned up
> > other people asking the same question.)
> >
> > In particular, I suspect that my mapper job--which may write multiple
> > <key, value> pairs for each one it receives--is writing too many, and
> > the values are too large, but I'm not sure how to test this
> > quantitatively.
> >
> > Specific questions:
> >
> > 1. I assume "Map input records" is the total of all <key, value> pairs
> >    coming into the mappers and "Map output records" is the total of all
> >    <key, value> pairs written by the mappers. Is this correct?
> > 2. What is "Map output bytes"? Is this the total number of bytes in all
> >    the <key, value> pairs written by the mappers?
> > 3. How would I calculate a corresponding "Map input bytes"? Why doesn't
> >    that counter exist?
> > 4. What is the relationship among the FILE|HDFS_BYTES_READ|WRITTEN
> >    counters? What exactly do they mean, and how do they relate to the
> >    "Map output bytes" counter?
> > 5. Sometimes the FILE bytes read and written values are an order of
> >    magnitude larger than the corresponding HDFS values, and sometimes
> >    it's the other way around. How do I go about interpreting this?
>
> --
> Lance Norskog
> [email protected]
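
For anyone who finds this thread later: one quick way to see exactly which counters your job recorded is to dump them all after it finishes. Here is a minimal sketch against the new (org.apache.hadoop.mapreduce) API; group and display names have shifted between Hadoop releases, so treat the specifics as assumptions and check your version:

import java.io.IOException;

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CounterDump {
    // Call this after job.waitForCompletion(...) returns.
    public static void dump(Job job) throws IOException {
        Counters counters = job.getCounters();
        // Walk every group the framework recorded, including the
        // "FileSystemCounters" group (FILE/HDFS_BYTES_READ/WRITTEN) and
        // the "Map-Reduce Framework" group ("Map input records", etc.).
        for (CounterGroup group : counters) {
            for (Counter counter : group) {
                System.out.printf("%s\t%s\t%d%n",
                        group.getDisplayName(),
                        counter.getDisplayName(),
                        counter.getValue());
            }
        }
    }
}

As a rough rule of thumb, FILE bytes read/written that dwarf the HDFS counters usually point at map-side spills and the shuffle, which go through local disk rather than HDFS.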
